I've written a working crawler using scrapy, and now I want to control it through a Django webapp. Specifically, I want to (a sketch of what I mean follows the list):
- Set one or several `start_urls`
- Set one or several `allowed_domains`
- Set `settings` values
- Start the spider
- Stop / pause / resume a spider
- Retrieve some stats while the spider is running
- Retrieve some stats after the spider has completed.
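To make the first three points concrete, here is a minimal sketch of the kind of spider I'd like to drive (the spider name and the comma-separated argument convention are just illustrations, assuming a recent scrapy; spider arguments are passed to `__init__`):

```python
import scrapy

class ConfigurableSpider(scrapy.Spider):
    name = "configurable"

    def __init__(self, start_urls=None, allowed_domains=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Spider arguments arrive as strings, so comma-separated
        # lists are split here.
        if start_urls:
            self.start_urls = start_urls.split(",")
        if allowed_domains:
            self.allowed_domains = allowed_domains.split(",")

    def parse(self, response):
        # Placeholder: my real spider does the actual scraping here.
        yield {"url": response.url, "title": response.css("title::text").get()}
```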
At first I thought scrapyd was made for this, but after reading the docs, it seems to be more of a daemon for managing 'packaged spiders', aka 'scrapy eggs'; all the values (`start_urls`, `allowed_domains`, `settings`) still seem to have to be hardcoded in the egg itself, so it doesn't look like a solution to my question, unless I missed something.
I also looked at this question: How to give URL to scrapy for crawling?
But the best answer for providing multiple urls is qualified by the author himself as an 'ugly hack', involving some Python subprocess and complex shell handling, so I don't think the solution is to be found there. Also, it may work for `start_urls`, but it doesn't seem to allow `allowed_domains` or `settings`.
Then I had a look at the scrapy webservice:
It seems to be a good solution for retrieving stats. However, it still requires a running spider, and gives no way to change `settings`.
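For what it's worth, stats are reachable in-process when the spider closes; a minimal sketch, assuming a recent scrapy and its signals API:

```python
import scrapy
from scrapy import signals

class StatsAwareSpider(scrapy.Spider):
    name = "stats_aware"

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Call spider_closed when the crawl finishes.
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        # Final stats: item count, response count, finish reason, ...
        self.logger.info("Stats: %s", self.crawler.stats.get_stats())
```

But that only gets me stats after the fact, from inside the spider, not from the webapp while it runs.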
There are several questions on this subject, and none of them seems satisfactory:
- using-one-scrapy-spider-for-several-websites: this one seems outdated, as scrapy has evolved a lot since 0.7
- creating-a-generic-scrapy-spider: no accepted answer, and the discussion still revolves around tweaking shell parameters.
I know that scrapy is used in production environments, and a tool like scrapyd shows that there are definitely ways to handle these requirements (I can't imagine that the scrapy eggs scrapyd deals with are generated by hand!)
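For completeness, scrapy can also be driven in-process; a minimal sketch (assuming a recent scrapy, and reusing the `ConfigurableSpider` sketched above). The catch is that the Twisted reactor can only be started once per process, which seems awkward inside a long-running Django app:

```python
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={"DOWNLOAD_DELAY": 2})
process.crawl(
    ConfigurableSpider,                 # the spider sketched above
    start_urls="http://example.com",
    allowed_domains="example.com",
)
process.start()  # blocks until the crawl finishes
```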
Thanks a lot for your help.