
I'm quite unsure, given the information available, which class I should be inheriting from for a crawling spider.

My example below attempts to start with an authentication page and then crawl all logged-in pages. As per the console output posted below, it authenticates fine, but it doesn't output even the first page to JSON and halts after the first 200-status page.

I get this (a newline followed by a left square bracket):

JSON file

[

Console output

DEBUG: Crawled (200) <GET https://www.mydomain.com/users/sign_in> (referer: None)
DEBUG: Redirecting (302) to <GET https://www.mydomain.com/> from <POST https://www.mydomain.com/users/sign_in>
DEBUG: Crawled (200) <GET https://www.mydomain.com/> (referer: https://www.mydomain.com/users/sign_in)
DEBUG: am logged in
INFO: Closing spider (finished)

When running this:

scrapy crawl MY_crawler -o items.json

Using this spider:

import scrapy
from scrapy.contrib.spiders.init import InitSpider
from scrapy.contrib.spiders import Rule
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors import LinkExtractor
from cmrcrawler.items import MycrawlerItem

class MyCrawlerSpider(InitSpider):
    name = "MY_crawler"
    allowed_domains = ["mydomain.com"]
    login_page = 'https://www.mydomain.com/users/sign_in'
    start_urls = [
        "https://www.mydomain.com/",
    ]

    rules = (
        # trailing comma required so this is a one-element tuple, not a grouped expression
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def init_request(self):
        # InitSpider hook: issue the login request before anything else
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        # 'authxpath' is a placeholder for the real authenticity-token XPath
        auth_token = response.xpath('authxpath').extract()[0]

        return FormRequest.from_response(
            response,
            formdata={'user[email]': '***', 'user[password]': '***', 'authenticity_token': auth_token},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if "Signed in successfully" in response.body:
            self.log("am logged in")
            self.initialized()

        else:
            self.log("couldn't login")
            print response.body

    def parse_item(self, response):
        item = MycrawlerItem()

        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()[0]

        yield item

– ljs.dev

Comments

  • The reason the crawler stops is usually that it's not able to find any valid links to continue crawling. Check that the input to parse_item actually is a logged-in version of the site, if it gets called. I'm also not sure about having `yield` and not `return` in parse_item, as you're not in an iterable context. – MatsLindh Jul 17 '14 at 21:58
  • Your `rules` attribute only means something for `CrawlSpider`, not for `InitSpider`. I suggest you subclass `CrawlSpider` but override the initial requests to plug in the login steps; for example, see http://stackoverflow.com/a/5857202 and http://stackoverflow.com/a/22569515 – paul trmbrth Jul 18 '14 at 10:44
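
A rough, untested sketch of what that last suggestion could look like: subclass `CrawlSpider`, override `start_requests` so the login request goes out first, and only hand the start URLs back to the crawler once the login check passes. The form field names, the placeholder `authxpath` expression, and the "Signed in successfully" check are carried over from the question; the rest is an assumption about how the pieces could fit together.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.http import Request, FormRequest
from cmrcrawler.items import MycrawlerItem


class MyCrawlerSpider(CrawlSpider):
    name = "MY_crawler"
    allowed_domains = ["mydomain.com"]
    login_page = 'https://www.mydomain.com/users/sign_in'
    start_urls = ["https://www.mydomain.com/"]

    rules = (
        # CrawlSpider (unlike InitSpider) actually applies these rules
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def start_requests(self):
        # fetch the login page first instead of the start_urls
        return [Request(url=self.login_page, callback=self.login)]

    def login(self, response):
        # 'authxpath' is the question's placeholder for the token XPath
        auth_token = response.xpath('authxpath').extract()[0]
        return FormRequest.from_response(
            response,
            formdata={'user[email]': '***',
                      'user[password]': '***',
                      'authenticity_token': auth_token},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if "Signed in successfully" in response.body:
            self.log("am logged in")
            # requests issued without an explicit callback go to
            # CrawlSpider's built-in parse(), which applies the rules
            for url in self.start_urls:
                yield Request(url)
        else:
            self.log("couldn't login")

    def parse_item(self, response):
        item = MycrawlerItem()
        item['url'] = response.url
        item['title'] = response.xpath('//title/text()').extract()[0]
        yield item

Whether items actually reach items.json still depends on parse_item receiving logged-in pages with extractable links, as the first comment points out.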

0 Answers