
I've written a working crawler using Scrapy; now I want to control it through a Django webapp, that is to say:

  • Set one or several start_urls
  • Set one or several allowed_domains
  • Set settings values
  • Start the spider
  • Stop / pause / resume a spider
  • Retrieve some stats while the spider is running
  • Retrieve some stats after the spider is complete

At first I thought scrapyd was made for this, but after reading the docs, it seems to be more of a daemon for managing 'packaged spiders', aka 'scrapy eggs', and that all the settings (start_urls, allowed_domains, settings) must still be hardcoded in the egg itself; so it doesn't look like a solution to my question, unless I missed something.

I also looked at this question: How to give URL to scrapy for crawling?; but the best answer for providing multiple URLs is qualified by its author himself as an 'ugly hack', involving a Python subprocess and complex shell handling, so I don't think the solution is to be found there. Also, while it may work for start_urls, it doesn't seem to allow allowed_domains or settings.

Then I took a look at the Scrapy web service: it seems to be a good solution for retrieving stats. However, it still requires a running spider, and it offers no way to change settings.

There are several questions on this subject, but none of them seems satisfactory.

I know that Scrapy is used in production environments, and a tool like scrapyd shows that there are definitely ways to handle these requirements (I can't imagine that the scrapy eggs scrapyd deals with are generated by hand!).

Thanks a lot for your help.

arno
  • Scrapy eggs are created with the `deploy` command; maybe you can check out the [Django Dynamic Scraper](https://github.com/holgerd77/django-dynamic-scraper) for hints on how to integrate Scrapy spider control into Django. – Steven Almeroth Nov 03 '12 at 20:41
  • Have you looked at [scrapy tool](http://doc.scrapy.org/en/latest/topics/commands.html) or the [slybot project](https://github.com/scrapy/slybot) for inspiration? – jah Nov 05 '12 at 17:26
  • My answer http://stackoverflow.com/questions/9814827/creating-a-generic-scrapy-spider/13054768#13054768 may help – Supreet Sethi Dec 14 '12 at 07:51
  • You could run the spider as a normal python library: http://stackoverflow.com/questions/15564844/locally-run-all-of-the-spiders-in-scrapy/#15580406. – Steven Almeroth Aug 11 '13 at 15:54

4 Answers


> At first I thought scrapyd was made for this, but after reading the docs, it seems to be more of a daemon for managing 'packaged spiders', aka 'scrapy eggs', and that all the settings (start_urls, allowed_domains, settings) must still be hardcoded in the egg itself; so it doesn't look like a solution to my question, unless I missed something.

I don't agree with the above statement: start_urls need not be hard-coded; they can be passed to the spider dynamically. You should be able to pass them as arguments, like this:

curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider -d setting=DOWNLOAD_DELAY=2 -d arg1=val1
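
Since the question is about controlling all this from a Django webapp, the same endpoint can also be called from Python instead of curl. Here is a minimal sketch, assuming scrapyd runs on localhost:6800 with the project/spider names from the command above, and using the requests library (an assumption; any HTTP client works). cancel.json and listjobs.json are scrapyd's own endpoints for stopping a job and polling job state:

import requests

SCRAPYD = 'http://localhost:6800'

# Schedule a run; 'setting' overrides a Scrapy setting, and any extra
# parameter (arg1 here) reaches the spider's __init__ as a keyword argument.
resp = requests.post(SCRAPYD + '/schedule.json', data={
    'project': 'myproject',
    'spider': 'somespider',
    'setting': 'DOWNLOAD_DELAY=2',
    'arg1': 'val1',
})
job_id = resp.json()['jobid']

# Stop the job again (scrapyd's cancel.json endpoint)
requests.post(SCRAPYD + '/cancel.json', data={'project': 'myproject', 'job': job_id})

# Poll pending/running/finished jobs (scrapyd's listjobs.json endpoint)
print(requests.get(SCRAPYD + '/listjobs.json', params={'project': 'myproject'}).json())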

Or you should be able to retrieve the URLs from a database or a file. I get them from a database like this:

import urllib

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from myproject.items import MovieItem  # assumed location of the item class


class WikipediaSpider(BaseSpider):
    name = 'wikipedia'
    allowed_domains = ['wikipedia.com']
    start_urls = []

    def __init__(self, name=None, url=None, **kwargs):
        item = MovieItem()
        item['spider'] = self.name
        # You can pass a specific url to retrieve
        if url:
            if name is not None:
                self.name = name
            elif not getattr(self, 'name', None):
                raise ValueError("%s must have a name" % type(self).__name__)
            self.__dict__.update(kwargs)
            self.start_urls = [url]
        else:
            # If there is no specific URL, get the links from the database
            wikiliks = None  # <-- CODE TO RETRIEVE THE LINKS FROM DB -->
            if wikiliks is None:
                print "**************************************"
                print "No Links to Query"
                print "**************************************"
                return None

            for link in wikiliks:
                # SOME PROCESSING ON THE LINK GOES HERE
                self.start_urls.append(urllib.unquote_plus(link[0]))

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Remaining parse code goes here
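
For reference, the url and name arguments above can be supplied on the command line, e.g. `scrapy crawl wikipedia -a url=...`, or as extra parameters to scrapyd's schedule.json as shown earlier; spider arguments are forwarded to the spider's `__init__` as keyword arguments.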
kiran.koduru

For changing settings programmatically and running the scraper from within an app, here's what I got:

import os

from scrapy.crawler import CrawlerProcess
from myproject.spiders import MySpider
from scrapy.utils.project import get_project_settings

# Point Scrapy at the right settings module before loading the project settings
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myproject.my_settings_module'
scrapy_settings = get_project_settings()
scrapy_settings.set('CUSTOM_PARAM', custom_value)  # custom_value comes from your app
scrapy_settings.set('ITEM_PIPELINES', {})  # don't write jsons or anything like that
scrapy_settings.set('DOWNLOADER_MIDDLEWARES', {
    'myproject.middlewares.SomeMiddleware': 100,
})
process = CrawlerProcess(scrapy_settings)
process.crawl(MySpider, start_urls=start_urls)
process.start()
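
One caveat if this runs inside a long-lived Django process: CrawlerProcess starts the Twisted reactor, which cannot be restarted once it has stopped, so a second crawl in the same process will fail. A common workaround, sketched below with the names from the snippet above, is to run each crawl in its own process:

import multiprocessing

def run_crawl(urls):
    # Run the whole crawl in a fresh child process, so the Django
    # process itself never starts (or blocks on) the Twisted reactor.
    process = CrawlerProcess(scrapy_settings)
    process.crawl(MySpider, start_urls=urls)
    process.start()

p = multiprocessing.Process(target=run_crawl, args=(start_urls,))
p.start()
p.join()  # omit join() if you want the crawl to run in the background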
Amichai Schreiber

This is actually really simple!

from mypackage.spiders import MySpider
from scrapy.crawler import CrawlerProcess

results = []

class MyPipeline(object):
    """A custom pipeline that stores scrape results in 'results'."""
    def process_item(self, item, spider):
        results.append(dict(item))
        return item  # pipelines should return the item so later pipelines still see it

process = CrawlerProcess({
    # An example of a custom setting
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'ITEM_PIPELINES': {'__main__.MyPipeline': 1},   # Hook in our custom pipeline above
})

start_urls = [
    'http://example.com/page1', 
    'http://example.com/page2',
]
process.crawl(MySpider, start_urls=start_urls)
process.start() # the script will block here until the crawling is finished

# Do something with the results
print results
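
To cover the 'stats after the spider is complete' part of the question too: in reasonably recent Scrapy versions (an assumption about the version in use), you can create the Crawler yourself instead of passing the spider class straight to process.crawl(), and read its stats collector once process.start() returns:

crawler = process.create_crawler(MySpider)
process.crawl(crawler, start_urls=start_urls)
process.start()  # blocks until the crawl is finished

# The stats collector holds counters such as 'item_scraped_count',
# 'downloader/request_count' and 'finish_reason'.
print(crawler.stats.get_stats())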
Humphrey

I think you need to look at this:

http://django-dynamic-scraper.readthedocs.org/en/latest/

This does something similar to what you want. It also uses Celery for task scheduling. You can look at its code to see what the author is doing; I think it will be easy to modify it to do what you want.

It also has good docs on how to set up the interface with Django.

Mirage