I am running Scrapyd and I'm encountering a weird issue when launching 4 spiders at the same time.

2012-02-06 15:27:17+0100 [HTTPChannel,0,127.0.0.1] 127.0.0.1 - - [06/Feb/2012:14:27:16 +0000] "POST /schedule.json HTTP/1.1" 200 62 "-" "python-requests/0.10.1"
2012-02-06 15:27:17+0100 [HTTPChannel,1,127.0.0.1] 127.0.0.1 - - [06/Feb/2012:14:27:16 +0000] "POST /schedule.json HTTP/1.1" 200 62 "-" "python-requests/0.10.1"
2012-02-06 15:27:17+0100 [HTTPChannel,2,127.0.0.1] 127.0.0.1 - - [06/Feb/2012:14:27:16 +0000] "POST /schedule.json HTTP/1.1" 200 62 "-" "python-requests/0.10.1"
2012-02-06 15:27:17+0100 [HTTPChannel,3,127.0.0.1] 127.0.0.1 - - [06/Feb/2012:14:27:16 +0000] "POST /schedule.json HTTP/1.1" 200 62 "-" "python-requests/0.10.1"
2012-02-06 15:27:18+0100 [Launcher] Process started: project='thz' spider='spider_1' job='abb6b62650ce11e19123c8bcc8cc6233' pid=2545 
2012-02-06 15:27:19+0100 [Launcher] Process finished: project='thz' spider='spider_1' job='abb6b62650ce11e19123c8bcc8cc6233' pid=2545 
2012-02-06 15:27:23+0100 [Launcher] Process started: project='thz' spider='spider_2' job='abb72f8e50ce11e19123c8bcc8cc6233' pid=2546 
2012-02-06 15:27:24+0100 [Launcher] Process finished: project='thz' spider='spider_2' job='abb72f8e50ce11e19123c8bcc8cc6233' pid=2546 
2012-02-06 15:27:28+0100 [Launcher] Process started: project='thz' spider='spider_3' job='abb76f6250ce11e19123c8bcc8cc6233' pid=2547 
2012-02-06 15:27:29+0100 [Launcher] Process finished: project='thz' spider='spider_3' job='abb76f6250ce11e19123c8bcc8cc6233' pid=2547 
2012-02-06 15:27:33+0100 [Launcher] Process started: project='thz' spider='spider_4' job='abb7bb8e50ce11e19123c8bcc8cc6233' pid=2549 
2012-02-06 15:27:35+0100 [Launcher] Process finished: project='thz' spider='spider_4' job='abb7bb8e50ce11e19123c8bcc8cc6233' pid=2549 

I already have these settings for Scrapyd:

[scrapyd]
max_proc = 10

Why isn't Scrapyd running the spiders at the same time, as quickly as they are scheduled?
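For reference, the four `schedule.json` POSTs in the log come from a scheduling script that looks roughly like this (a sketch; the endpoint is the Scrapyd default, and the project/spider names are the ones from the log above):

```python
import requests

# Schedule four spiders back-to-back against the local Scrapyd instance.
for spider in ["spider_1", "spider_2", "spider_3", "spider_4"]:
    response = requests.post(
        "http://localhost:6800/schedule.json",
        data={"project": "thz", "spider": spider},
    )
    print(response.text)  # e.g. {"status": "ok", "jobid": "..."}
```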

– Sjaak Trekhaak

2 Answers


I've solved it by editing scrapyd/app.py at line 30: I changed `timer = TimerService(5, poller.poll)` to `timer = TimerService(0.1, poller.poll)`.
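For context, the relevant spot in scrapyd/app.py looks roughly like this (a paraphrased excerpt; `poller` is the queue poller instance that scrapyd's application setup creates just above this line):

```python
from twisted.application.internet import TimerService

# Before: scrapyd only checks its queues for pending jobs every 5 seconds,
# which is why at most one spider per ~5 s gets launched in the log above.
# timer = TimerService(5, poller.poll)

# After: poll every 0.1 seconds so queued jobs start almost immediately.
timer = TimerService(0.1, poller.poll)
```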

EDIT: The comment below by AliBZ regarding the configuration settings describes a better way to change the polling frequency.
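Concretely, AliBZ's configuration approach (quoted in the comment below) amounts to something like this (a sketch; the config file location varies between installs):

```ini
[scrapyd]
max_proc = 10
# How often, in seconds, scrapyd polls its queues for jobs to launch;
# the default of 5 explains the ~5-second gaps between launches.
poll_interval = 0.1
```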

– Sjaak Trekhaak
  • According to [scrapyd](https://github.com/scrapy/scrapyd/blob/master/scrapyd/app.py), you can add `poll_interval = 0.1` to your scrapyd config file located at `/etc/scrapyd/conf.d/000-default`. – AliBZ Mar 29 '14 at 18:48

In my experience with scrapyd, it doesn't run a spider immediately when you schedule one. It usually waits a little, until the current spider is up and running, and then starts the next spider process (`scrapy crawl`).

So, scrapyd launches processes one by one until the `max_proc` count is reached.

From your log I see that each of your spiders runs for only about 1 second. I think you would see all your spiders running at the same time if each of them ran for at least 30 seconds.

– warvariuc
  • Yep, that's what I noticed as well. I've implemented a `subprocess.Popen` call to scrape instantly, as results have to be displayed instantly; something like the sketch after these comments. I was hoping to speed up Scrapyd's scheduler somehow :) – Sjaak Trekhaak Feb 07 '12 at 09:36
  • I think what scrapyd currently does is logical. It doesn't want to overload the system by starting many spiders simultaneously - it doesn't know whether the spider you are scheduling is heavy or light. That's why it runs spiders one by one. You can study the scrapyd code and maybe you'll find something to tweak. If you find the answer useful, please upvote. – warvariuc Feb 07 '12 at 12:00
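The `subprocess.Popen` workaround mentioned in the first comment above might look roughly like this (a sketch, assuming it runs from the Scrapy project directory so `scrapy crawl` can find the spiders):

```python
import subprocess

# Bypass Scrapyd's polling scheduler entirely: start every crawl as its
# own process immediately. Spider names are the ones from the question.
spiders = ["spider_1", "spider_2", "spider_3", "spider_4"]
processes = [subprocess.Popen(["scrapy", "crawl", name]) for name in spiders]

# Wait for all crawls to finish before exiting.
for proc in processes:
    proc.wait()
```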