
I tried something like:

import requests

payload = {"project": settings['BOT_NAME'],
           "spider": crawler_name,
           "start_urls": ["http://www.foo.com"]}
response = requests.post("http://192.168.1.41:6800/schedule.json",
                         data=payload)

And when I check the logs, I see this error:

  File "/usr/lib/pymodules/python2.7/scrapy/spider.py", line 53, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 26, in __init__
    self._set_url(url)
  File "/usr/lib/pymodules/python2.7/scrapy/http/request/__init__.py", line 61, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
exceptions.ValueError: Missing scheme in request url: h

Looks like only the first letter of "http://www.foo.com" is used as request.url, and I really have no idea why.

Update

Maybe `start_urls` should be a string instead of a list containing one element, so I also tried:

"start_urls": "http://www.foo.com"

and

"start_urls": [["http://www.foo.com"]]

only to get the same error.

timfeirg

1 Answer


You could modify your spider to receive a url argument and append that to start_urls on init.

class MySpider(Spider):

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # set an instance attribute rather than appending to a
        # class-level list, which would be shared across instances
        self.start_urls = [kwargs.get('url')]

    def parse(self, response):
        # do stuff
        pass

Your payload will now be:

payload = {
    "project": settings['BOT_NAME'],
    "spider": crawler_name,
    "url": "http://www.foo.com"
}
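For context on why the original payload failed, here is a minimal sketch (standard library only, no scrapyd required, and this is my reading of the behavior rather than anything stated in the scrapyd docs): `requests` form-encodes a `data=` payload, which flattens a single-element list into a plain `key=value` pair, and scrapyd then hands that value to the spider as a plain string argument.

```python
from urllib.parse import urlencode

# doseq=True mimics how `requests` form-encodes list values in `data=`
payload = {"start_urls": ["http://www.foo.com"]}
body = urlencode(payload, doseq=True)
print(body)  # start_urls=http%3A%2F%2Fwww.foo.com -- the list structure is gone
```

So by the time the spider sees it, `start_urls` is just a string.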
marven
  • I initially wanted this so I wouldn't have to override lots of methods in CrawlSpider; it isn't flexible enough to stop when I want, so I've made up my mind to override it. However, your answer looks right, so I'll accept it without testing it out. Lastly, could you enlighten me on why the url didn't get passed? I tried to dig into the code but it's still too sophisticated for me. – timfeirg Aug 25 '14 at 10:35
  • `start_urls` becomes a string when it should be a list of strings. In the `start_requests` method of the base `Spider`, the urls in `start_urls` are iterated with a for loop (i.e. `for url in self.start_urls`), and when `start_urls` is a string instead of a list of strings, the loop yields a single character instead of a valid url. – marven Aug 25 '14 at 11:16
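The failure mode marven describes (iterating a string yields single characters) can be reproduced without scrapy at all, using the same placeholder url from the question:

```python
start_urls = "http://www.foo.com"  # a string, not a list of strings

# mirrors `for url in self.start_urls` in Spider.start_requests
urls = [url for url in start_urls]
print(urls[0])  # 'h' -- hence "Missing scheme in request url: h"
```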