
Context

I am running scrapyd 1.1 + scrapy 0.24.6 with a single "selenium-scrapy hybrid" spider that crawls over many domains according to its parameters. The development machine that hosts the scrapyd instance is an OSX Yosemite machine with 4 cores, and this is my current configuration:

[scrapyd]
max_proc_per_cpu = 75
debug = on
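
For reference, when max_proc is not set explicitly, scrapyd derives its process cap as max_proc_per_cpu times the number of CPUs. A minimal sketch of that arithmetic (the variable names are mine, not scrapyd's code):

import multiprocessing

max_proc_per_cpu = 75                      # value from the [scrapyd] section above
cpus = multiprocessing.cpu_count()         # 4 on this machine
max_proc = max_proc_per_cpu * cpus         # 300, matching the startup log below
print(max_proc)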

Output when scrapyd starts:

2015-06-05 13:38:10-0500 [-] Log opened.
2015-06-05 13:38:10-0500 [-] twistd 15.0.0 (/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python 2.7.9) starting up.
2015-06-05 13:38:10-0500 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
2015-06-05 13:38:10-0500 [-] Site starting on 6800
2015-06-05 13:38:10-0500 [-] Starting factory <twisted.web.server.Site instance at 0x104b91f38>
2015-06-05 13:38:10-0500 [Launcher] Scrapyd 1.0.1 started: max_proc=300, runner='scrapyd.runner'

EDIT:

Number of cores:

python -c 'import multiprocessing; print(multiprocessing.cpu_count())' 
4

Problem

I would like this setup to process 300 jobs simultaneously for a single spider, but scrapyd processes only 1 to 4 at a time regardless of how many jobs are pending:

(screenshot: Scrapyd console with jobs)

EDIT:

CPU usage is not overwhelming:

(screenshot: CPU usage on OSX)

Tested on Ubuntu

I have also tested this scenario on an Ubuntu 14.04 VM, and the results are more or less the same: a maximum of 5 jobs ran at once during execution, there was no overwhelming CPU consumption, and roughly the same time was taken to execute the same number of tasks.

gerosalesc
  • Could you check if the multiprocessing module is counting your CPU cores correctly? This command should print 4: `python -c 'import multiprocessing; print(multiprocessing.cpu_count())'` – Elias Dorneles Jun 07 '15 at 02:55
  • @elias 4 indeed, I will also add processor usage to the post – gerosalesc Jun 08 '15 at 14:08
  • 1
    You can see from the logs that you will be allowed up to 300 processes, so I suspect you're hitting a bottleneck elsewhere. Are you suffering from the fact that scrapyd only schedules one spider at a time on a project? See http://stackoverflow.com/questions/11390888/running-multiple-spiders-using-scrapyd – Peter Brittain Jun 20 '15 at 17:22
  • 1
    @PeterBrittain I found the clue to the solution in that related question: it was the POLL_INTERVAL. Want the bounty? – gerosalesc Jun 24 '15 at 18:51
  • Thanks! If you're offering, I won't turn it down at this stage in my membership... I'll post an answer now. – Peter Brittain Jun 24 '15 at 21:28

2 Answers


The logs show that you are allowed up to 300 processes, so the limit must lie further up the chain. My original suggestion was that the cause was the serialization of spiders within your project, as covered by Running multiple spiders using scrapyd.

Subsequent investigation showed that the limiting factor was in fact the poll interval.
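
As a back-of-envelope sketch of why the poll interval caps concurrency, assuming (as I understand scrapyd's poller to behave) that at most one pending job is started per poll, and using an illustrative job duration:

import math

poll_interval = 5.0    # seconds, scrapyd's default
job_duration = 20.0    # assumed average length of one crawl job (illustrative only)
max_concurrent = int(math.ceil(job_duration / poll_interval))
print(max_concurrent)  # 4, in line with the 1 to 4 simultaneous jobs observed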

Peter Brittain

My problem was that my jobs lasted a shorter time than the POLL_INTERVAL default value, which is 5 seconds, so not enough tasks were polled before the previous ones finished. Changing this setting to a value lower than the average duration of a crawl job will help scrapyd poll more jobs for execution.
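
For illustration, the change amounts to something like the following in scrapyd's configuration file (0.5 is only an example value; pick something below your typical job duration):

[scrapyd]
max_proc_per_cpu = 75
poll_interval = 0.5
debug = on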

gerosalesc