Arc Forum
Http-gets periodically cause system fork error, possible solutions??
1 point by thaddeus 5070 days ago | 5 comments
So I have this web-scraping routine, for my personal stock screener app, that runs a function doing about 2000 or so http-get requests sequentially with a little processing time between each. Every once in a while I get a system fork error (I did a little research and it appears to be a case of the underlying OS getting wedged).

My setup: Arc3.1 (customized), Linode Ubuntu Jaunty, Apache with mod_proxy to pass data to the arc server.

I have a few ideas that I think might help (or make it worse - lol).

1. I have already done some work to move towards nginx (thanks to paslecam)... I could make this a priority to move into production.

2. I could thread the http requests to run concurrently (rough sketches of this and #3 follow the list). The idea is that somehow the threads would release memory faster/incrementally, although I don't see why a sequential routine wouldn't gc well enough. (I also don't fully understand whether network traffic eats RAM such that arc gc starts to be a factor - or not.)

3. I could slow down the routine to only do an http-get every x seconds or so, but if that turns out to be the fix, I'm worried my server isn't robust enough to handle increased traffic.

4. I notice my CPU usage sits at 96% or so. I do quite a few file writes (2 writes and 2 deletes per request), so I am guessing my CPU is spending a lot of time on those. I could load the results of each request directly into memory instead of saving a file and then loading it into memory, and I could also save all the results at the end instead of saving incrementally.
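
To make (2) and (3) concrete, here are rough sketches, with a hypothetical (fetch u) standing in for the actual http-get call:

  ; (2) fire each request off in its own thread
  (each u urls
    (thread (fetch u)))

  ; (3) stay sequential, but throttle with a pause between requests
  (each u urls
    (fetch u)
    (sleep 5))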

Just wondering, based upon your experiences, which factors make sense to focus on (in retrospect #4 is starting to look like the best option).

[Edit - new error from tonight's run, haven't seen this one until now: ... PLT Scheme virtual machine has run out of memory; aborting Aborted]



2 points by thaddeus 5068 days ago | link

Just as an update (if anyone is interested!).

I've moved 3 of my 4 domains over to the new server (Ubuntu 10.04 + Nginx). While I don't have any real evidence, or even enough traffic, to prove nginx is better than Apache, I am quite happy with the move. I didn't want to start spewing benchmarks, since the configuration settings are different[1], but I will say this: my first run at setting up Apache, as a total newb, took 3+ days of far too much complexity and troubleshooting, and ended in poor results. I have accomplished so much more with nginx in 3-4 hours. I know I am spewing a bunch of stuff without much real evidence, but for anyone else using Apache as a front-end server I highly recommend considering nginx.

As for my original problem: I stopped using the http-get library and instead use wget, and with one other change (downloading all the files first) I've cut the process time from 2 hours to 48 minutes. So far I've received no errors. My CPU still churns at 96%.
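
For anyone curious, the wget piece is just shelled out from arc, roughly like this (a sketch - the helper name and flags are illustrative, not my exact command):

  ; sketch: call out to wget via system instead of lib/http-get
  (def wget-to (url dir)
    (system (string "wget -q -P " dir " " url)))

Downloading everything up front means the parsing pass never has to wait on the network.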

[1] I hadn't applied rewrite rules with my Apache setup, but with nginx it was dead simple to separate what I needed relayed to Arc from what to serve directly:

  location / {
    proxy_pass http://127.0.0.1:9000;
  }

  location /css {
    root /home/arc/static/;
    expires max;
  }

  location /js {
    root /home/arc/static/;
    expires max;
  }

  location /images {
    root /home/arc/static/;
    expires max;
  }

I'm sure I could have consolidated those with a regex location, something like:

  location ~ ^/(css|js|images)/ {
    root /home/arc/static/;
    expires max;
  }

but I've only used nginx a little bit. Plus there are so many more options/tweaks I have been able to easily incorporate without any headaches.

-----

1 point by conanite 5069 days ago | link

Just in case gc is an issue - you might make use of (memory) to log memory usage and better determine whether that's part of the problem. Is the system fork error raised inside the arc process or in an external process? I don't understand in (3) how slowing down the routine results in increased traffic - I'm missing some of the picture there.
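
For instance, something like this around the loop (a rough sketch - (fetch u) is a made-up stand-in for whatever does the http-get):

  (each u urls
    (prn "before " u ": " (memory))   ; bytes in use before the request
    (fetch u)
    (prn "after " u ": " (memory)))   ; compare to see what each pass retains

If the after-numbers climb steadily from one iteration to the next, something is being retained between requests.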

-----

1 point by thaddeus 5069 days ago | link

Thanks for replying.

> Is the system fork error raised inside the arc process?

Hmmm, good question. I am pretty sure it dropped out of arc and then provided a system error.

> I don't understand in (3) how slowing down the routine results in increased traffic.

It's not that I think it will increase traffic, just that if I have to slow the requests down to somehow release connections or memory, then it's not really a solution, since I would like to be capable of handling future increased traffic (to much larger degrees than my 2000 connections for stocks :).

I think it's just my poor understanding of how threading, memory allocation, networking & gc work. Somehow I have it in my head that the process is not gc'ing memory or releasing connections from previous iterations before it moves on to the next one, or that it's not releasing the underlying OS thread for the get request before moving on. That's probably all wrong, correct?

The process is pretty simple: download a file, load the data from the file into memory (assign it to a variable), parse the data and do the math, write the results to file, write the progress to file, wipe the variable. Rinse, Wash, & Repeat :)
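
In code terms each iteration is roughly this shape (fetch, datafile, crunch, and resultfile are made-up names for my download, path, and parsing helpers):

  (each tkr tickers
    (fetch tkr)                                  ; download the page to a file
    (let data (filechars (datafile tkr))         ; load the file into memory
      (writefile (crunch data) (resultfile tkr)) ; do the math, save the results
      (writefile tkr "progress")))               ; record progress

In theory data goes out of scope at the end of each pass, so it should all be gc-able.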

I am just moving over to my new Ubuntu 10.04 Linode with Nginx. Setup is complete, and the data is copying over. Then I plan to run some benchmarks without changing any code - do a before and after to see the difference.

From there I'll start looking at that old http-get library to see if it's somehow not releasing memory (yikes!)

-----

1 point by aw 5069 days ago | link

Is http-get something called from Arc, or is it something you're using to fetch pages from your Arc server?

-----

1 point by thaddeus 5069 days ago | link

http-get is the arc library I use...

http://github.com/nex3/arc/tree/arc2.master/lib/http-get/

I'm using the save-page function.

[Edit - Sorry, in retrospect I don't know why I didn't reference it to begin with; I just naturally assume everyone knows what I am talking about. Lol!]

..... I suppose I could just use wget or curl, which would at least rule in/out that library!

Thanks for getting me thinking on this!

-----