
So I've read through Crawling with an authenticated session in Scrapy and I am getting hung up. I am 99% sure that my parse code is correct; I just don't believe the login is redirecting and succeeding.

I am also having an issue with check_login_response(): I am not sure which page it is checking, though checking for "Sign Out" would make sense.




====== UPDATED ======

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from linkedpy.items import LinkedPyItem

class LinkedPySpider(InitSpider):
    name = 'LinkedPy'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ["http://www.linkedin.com/csearch/results?type=companies&keywords=&pplSearchOrigin=GLHD&pageKey=member-home&search=Search#facets=pplSearchOrigin%3DFCTD%26keywords%3D%26search%3DSubmit%26facet_CS%3DC%26facet_I%3D80%26openFacets%3DJO%252CN%252CCS%252CNFR%252CF%252CCCR%252CI"]

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        # from_response fills in the login <form> found on the page, keeping
        # any hidden inputs it contains, and submits to the form's action URL.
        return FormRequest.from_response(response,
                    formdata={'session_key': 'user@email.com', 'session_password': 'somepassword'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are successfully logged in."""
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            return self.initialized()  # ****THIS LINE FIXED THE LAST PROBLEM*****
        else:
            self.log("\n\n\nFailed, Bad times :(\n\n\n")
            # Something went wrong; we couldn't log in, so nothing happens.

    def parse(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//ol[@id='result-set']/li")
        items = []
        for site in sites:
            item = LinkedPyItem()
            item['title'] = site.select('h2/a/text()').extract()
            item['link'] = site.select('h2/a/@href').extract()
            items.append(item)
        return items
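
(For reference, a spider like this would normally be run from the Scrapy project directory with `scrapy crawl LinkedPy`, i.e. using the spider's name attribute rather than the class name.)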



The issue was resolved by adding `return` in front of `self.initialized()`. Without the return, the requests produced by `initialized()` are discarded, so the crawl never proceeds to `start_urls`.

Thanks Again! -Mark

  • What happens when you run the above code? – Acorn Jun 08 '12 at 18:20
  • `'request_depth_max': 1, 'scheduler/memory_enqueued': 3, 'start_time': datetime.datetime(2012, 6, 8, 18, 31, 48, 252601)} 2012-06-08 14:31:49-0400 [LinkedPy] INFO: Spider closed (finished) 2012-06-08 14:31:49-0400 [scrapy] INFO: Dumping global stats:{}` – Gates Jun 08 '12 at 18:38
  • This sort of information should be put in your original question rather than comments. – Acorn Jun 09 '12 at 11:01
  • @Acorn I will update my post above now; let's see if we can figure out what's going on. – Gates Jun 11 '12 at 15:54
  • Does `SgmlLinkExtractor` apply to `login_page` (or the page after it loads) or `start_urls`? – Gates Jun 11 '12 at 16:35
  • The rules are used to define how links should be extracted from crawled pages: that is, the pages defined in `start_urls` and all other pages reached while crawling from them. – Acorn Jun 11 '12 at 16:58
  • @Acorn Okay, that makes more sense. Can you help with this then? I want to crawl all the results of the pages in the search. I still cannot figure out how to get it to go to the search page and crawl that. Is it because the Rules are blocking it? – Gates Jun 11 '12 at 17:24
  • @Acorn I've changed many things and I cannot get it to work, any ideas? – Gates Jun 12 '12 at 19:43
  • @Gates where did you get that `linkedpy` library? – Vipul May 13 '14 at 10:06

1 Answer

class LinkedPySpider(BaseSpider):

should be:

class LinkedPySpider(InitSpider):

Also, you shouldn't override the parse function, as I mentioned in my answer here: https://stackoverflow.com/a/5857202/crawling-with-an-authenticated-session-in-scrapy

If you don't understand how to define the rules for extracting links, just have a proper read through the documentation:
http://readthedocs.org/docs/scrapy/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
http://readthedocs.org/docs/scrapy/en/latest/topics/link-extractors.html#topics-link-extractors
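
For reference, here is a minimal sketch of what a rules-based spider might look like, using the old scrapy.contrib paths from the question. The spider name, domain, allow pattern, and the parse_item callback are all illustrative, not taken from the original code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/results']

    # Each Rule describes which links to follow from crawled pages and
    # which callback handles the downloaded responses.
    rules = (
        Rule(SgmlLinkExtractor(allow=r'/companies/'),  # illustrative pattern
             callback='parse_item',  # note: not 'parse'
             follow=True),
    )

    def parse_item(self, response):
        # CrawlSpider reserves parse() for its own link-following logic,
        # so item extraction goes in a differently named callback.
        self.log("Crawled %s" % response.url)

Note that rules are a CrawlSpider feature; InitSpider on its own only handles the initialization request before the start_urls are crawled.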

Acorn
  • That did help. I see a log of Success. **But** I am not sure the `def parse(self, response):` is actually running. I tried putting a self.log() in there and nothing returned. – Gates Jun 08 '12 at 18:28
  • It seems `parse()` should be `parse_item()`. – Gates Jun 08 '12 at 19:06
  • There is a GOOD chance the problem has to do with the above and `allow=r'-\w+.html$'`, as I do not know what this is. – Gates Jun 08 '12 at 19:08
  • (Updated based off these changes) – Gates Jun 11 '12 at 16:02