
I am trying to build an application using Flask and Scrapy, and I need to pass a list of URLs to the spider. I tried the following:

In the spider's __init__:

self.start_urls = ["http://www.google.com/patents/" + x for x in u]

In the Flask method:

u = ["US6249832", "US20120095946"]
os.system("rm static/s.json; scrapy crawl patents -d u=%s -o static/s.json" % u)

I know a similar thing can be done by reading the required URLs from a file, but can I pass the list of URLs directly for crawling?

Sumit Gera

1 Answer


Override the spider's __init__() method:

class MySpider(Spider):
    name = 'my_spider'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)

        endpoints = kwargs.get('start_urls').split(',')
        self.start_urls = ["http://www.google.com/patents/" + x for x in endpoints]

And pass the list of endpoints as a comma-separated string through the -a command-line argument:

scrapy crawl patents -a start_urls="US6249832,US20120095946" -o static/s.json
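On the Flask side, it is safer to build the command as an argument list than to interpolate a Python list into a shell string with os.system. A minimal sketch, assuming the spider and output file names from the question; the helper build_cmd is hypothetical, introduced here only to keep the command construction testable:

```python
import subprocess


def build_cmd(patent_numbers, out="static/s.json"):
    """Build the scrapy invocation for the given patent numbers."""
    return [
        "scrapy", "crawl", "patents",
        # Join the numbers into the comma-separated string the spider splits on.
        "-a", "start_urls=" + ",".join(patent_numbers),
        "-o", out,
    ]


# In the Flask view (runs the crawl, raises on a non-zero exit code):
# subprocess.run(build_cmd(["US6249832", "US20120095946"]), check=True)
```

Passing a list to subprocess.run avoids shell quoting issues entirely, and check=True surfaces crawl failures as exceptions instead of silently ignoring them.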

Note that you can also run Scrapy from a script.

alecxe
  • This looks like a very promising solution; it doesn't even require storing patent numbers in a list. Thanks. – Sumit Gera Feb 16 '15 at 17:23