33

In my previous question, I wasn't very specific about my problem (scraping from an authenticated session with Scrapy), in the hope of being able to deduce the solution from a more general answer. I should probably rather have used the word crawling.

So, here is my code so far:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import FormRequest


class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/login/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'), callback='parse_item', follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        if not "Hi Herman" in response.body:
            return self.login(response)
        else:
            return self.parse_item(response)

    def login(self, response):
        # post the credentials to the login form on this page
        return [FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.parse)]

    def parse_item(self, response):
        i = MyItem()  # assumes an Item subclass (here called MyItem) imported from items.py
        i['url'] = response.url

        # ... do more things

        return i

As you can see, the first page I visit is the login page. If I'm not authenticated yet (in the parse function), I call my custom login function, which posts to the login form. Then, if I am authenticated, I want to continue crawling.

The problem is that the parse function I overrode in order to log in no longer makes the calls necessary to scrape any further pages (I'm assuming). And I'm not sure how to go about saving the Items that I create.

Anyone done something like this before? (Authenticate, then crawl, using a CrawlSpider) Any help would be appreciated.

Herman Schaaf
  • How does the server know that you are authenticated? Make `CrawlSpider` pass appropriate cookies or other authentication tokens. – jfs May 02 '11 at 13:19
  • @J.F. Sebastian: From what I read somewhere in the Scrapy docs, Scrapy does this automatically, unless you switch that option off. – Herman Schaaf May 02 '11 at 13:25
  • This answer references github for the self.initialized() function, but that URL no longer works. Anyone know where I might find that? – fitzgeraldsteele Mar 19 '12 at 21:30
  • @fitzgeraldsteele: I've just fixed all the broken links in my answer. – Acorn Jun 08 '12 at 17:13

4 Answers

57

Do not override the parse function in a CrawlSpider:

When you are using a CrawlSpider, you shouldn't override the parse function. There's a warning in the CrawlSpider documentation here: http://doc.scrapy.org/en/0.14/topics/spiders.html#scrapy.contrib.spiders.Rule

This is because with a CrawlSpider, parse (the default callback of any request) sends the response to be processed by the Rules.
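
For illustration, here is a minimal sketch (not from the question; the spider name, domain and parse_page callback are placeholders) of how a CrawlSpider is normally wired up, with parse left untouched and a Rule dispatching matched links to a separately named callback:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class NormalCrawlSpider(CrawlSpider):
    name = 'normal'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    rules = (
        # parse() is not overridden, so CrawlSpider can use it internally;
        # every matched link is handled by parse_page instead
        Rule(SgmlLinkExtractor(allow=r'-\w+\.html$'),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # extract whatever you need from the response here
        self.log("Visited %s" % response.url)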


Logging in before crawling:

In order to have some kind of initialisation before a spider starts crawling, you can use an InitSpider (which inherits from a CrawlSpider), and override the init_request function. This function will be called when the spider is initialising, and before it starts crawling.

In order for the Spider to begin crawling, you need to return self.initialized().

You can read the code that's responsible for this here (it has helpful docstrings).


An example:

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

class MySpider(InitSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    login_page = 'http://www.example.com/login'
    start_urls = ['http://www.example.com/useful_page/',
                  'http://www.example.com/another_useful_page/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'),
             callback='parse_item', follow=True),
    )

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "Hi Herman" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            return self.initialized()
        else:
            self.log("Bad times :(")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_item(self, response):
        # Scrape data from the page here and return/yield your Items
        pass

Saving items:

Items your Spider returns are passed along to the Item Pipeline, which is responsible for doing whatever you want done with the data. I recommend you read the documentation: http://doc.scrapy.org/en/0.14/topics/item-pipeline.html
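
As a rough sketch (the item, pipeline and project names below are placeholders of mine, not from the answer): an Item declares the scraped fields, a pipeline class receives every returned Item through process_item, and the pipeline is enabled in settings.py.

# items.py
from scrapy.item import Item, Field

class PageItem(Item):
    url = Field()

# pipelines.py
class SavePagePipeline(object):
    def process_item(self, item, spider):
        # do whatever you want with the scraped data here
        spider.log("Pipeline received item for %s" % item['url'])
        return item  # pass the item on to any later pipelines

# settings.py
ITEM_PIPELINES = ['myproject.pipelines.SavePagePipeline']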

If you have any problems or questions regarding Items, don't hesitate to pop open a new question and I'll do my best to help.

Acorn
  • Regarding the chosen answer, this line will not work: return FormRequest.from_response(response, formdata={'name': 'herman', 'password': 'password'}, callback=self.check_login_response) In this instance, response is not defined. How are you handling that in a working solution? I'm stuck on this issue. – johndavidback Jun 06 '12 at 19:35
  • @johndavidback: I've just updated my answer to fix all the broken links and the code. Does the solution work for you now? – Acorn Jun 08 '12 at 17:15
  • InitSpider no longer inherits from CrawlSpider https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/init.py – tatsuhirosatou Feb 25 '13 at 17:09
  • This is a fantastic answer – Jamie S Jul 21 '14 at 07:53
  • @tatsuhirosatou, the new InitSpider didn't work for me, or I didn't use it right. @Acorn, returning `self.initialized()` got it working for me, see http://stackoverflow.com/questions/11271928/scrapy-init-self-initialized-not-initializing?answertab=votes#tab-top – Mikeumus May 09 '15 at 19:43
  • Lastly, using `def parse()` got it scraping for me, see http://stackoverflow.com/questions/12591849/xpath-error-spider-error-processing?answertab=votes#tab-top – Mikeumus May 09 '15 at 19:54
  • @Acorn per the answers below I believe your line should be `return self.initialized()` rather than just calling it as you have in your code. – YPCrumble Apr 22 '16 at 18:28
  • Thanks @Mikeumus ... fixed the answer for those who just want to copy & paste xD – madzohan Mar 19 '18 at 09:23
  • `.contrib` is no longer required. I know I need to log in, but if I remove the rule the initialization never happens. In my case I don't see the need for a rule and I'm not sure what it would be anyway. No login occurs without the rule, though. – pferrel Jun 19 '18 at 00:08
4

In order for the above solution to work, I had to make CrawlSpider inherit from InitSpider instead of BaseSpider, by changing the following in the Scrapy source code, in the file scrapy/contrib/spiders/crawl.py:

  1. add `from scrapy.contrib.spiders.init import InitSpider`
  2. change `class CrawlSpider(BaseSpider)` to `class CrawlSpider(InitSpider)`

Otherwise the spider wouldn't call the init_request method.
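
Roughly, the modified scrapy/contrib/spiders/crawl.py then looks like this (a sketch showing only the two changed lines; everything else in the file stays the same):

# scrapy/contrib/spiders/crawl.py (locally patched copy)
from scrapy.contrib.spiders.init import InitSpider  # 1. added import

class CrawlSpider(InitSpider):  # 2. was: class CrawlSpider(BaseSpider)
    # ... rest of the class unchanged ...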

Is there an easier way?

viniciusnz
  • This really should be a separate question, and not an answer, but thinking back quickly, I think it might depend on the method you use to invoke your crawlspider (via the commandline). Otherwise, I'd have to go have a look at the code again to see. – Herman Schaaf Jan 06 '12 at 05:21
  • Hi, thanks for the response. Well, complementing the answer with the above worked for me and I could get the init_request method to execute, whereas without doing that it didn't work. I start my spiders with "scrapy crawl spider_name". Best, – viniciusnz Jan 07 '12 at 05:18
  • Just to clarify, I had to change the inheritance chain from A) BaseSpider->InitSpider (where methods are) + BaseSpider->CrawlSpider to B) BaseSpider->InitSpider->CrawlSpider so my CrawlSpider could override the init_request method – viniciusnz Jan 07 '12 at 05:20
  • I had to do the same kind of hack (in fact I took the code of CrawlSpider to make my own crawler using authentication). @Acorn, can you tell us which scrapy version you were using? – Maxime Jan 12 '12 at 14:16
  • @Maxime: You're right, the init_request functionality is only in InitSpider, my bad. – Acorn Jun 08 '12 at 17:22
2

Just adding to Acorn's answer above. Using his method, my script was not parsing the start_urls after the login; it was exiting after a successful login in check_login_response. I could see I had the generator, though. I needed to use

return self.initialized()

then the parse function was called.
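
In context, the change sits in the check_login_response callback from Acorn's answer (a sketch of just that method, inside the spider class):

def check_login_response(self, response):
    if "Hi Herman" in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        # returning (not just calling) self.initialized() is what hands
        # control back to InitSpider so the start_urls get scheduled
        return self.initialized()
    else:
        self.log("Bad times :(")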

knowingpark
2

If what you need is HTTP authentication, use the provided middleware hooks.

In settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
}

and in your spider class add the properties:

http_user = "user"
http_pass = "pass"
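
Put together, a minimal sketch (the spider name and domain are placeholders, not from the original answer); HttpAuthMiddleware reads the http_user/http_pass attributes and adds a Basic auth header to the spider's requests:

from scrapy.spider import BaseSpider

class ProtectedSpider(BaseSpider):
    name = 'protected'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    # credentials picked up by HttpAuthMiddleware (Basic access authentication)
    http_user = "user"
    http_pass = "pass"

    def parse(self, response):
        self.log("Fetched %s behind HTTP auth" % response.url)
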
bdargan
  • form authentications using basic auth will send a www-authenticate header with the username/password. so generally that should just work. – bdargan Mar 20 '12 at 23:10