In my previous question, I wasn't very specific about my problem (scraping with an authenticated session with Scrapy), in the hope of being able to deduce the solution from a more general answer. I should probably have used the word crawling instead.
So, here is my code so far:
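from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector


class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/login/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'), callback='parse_item', follow=True),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # The greeting only appears once the session is authenticated.
        if "Hi Herman" not in response.body:
            return self.login(response)
        else:
            return self.parse_item(response)

    def login(self, response):
        # Submit the login form found on the page, then come back to parse().
        return [FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.parse)]

    def parse_item(self, response):
        # i is the item being populated; its creation is not shown here.
        i['url'] = response.url
        # ... do more things
        return i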
As you can see, the first page I visit is the login page. If I'm not authenticated yet (in the parse function), I call my custom login function, which posts to the login form. Then, if I am authenticated, I want to continue crawling.
The problem is that the parse function, which I overrode in order to log in, no longer makes the calls needed to scrape any further pages (I'm assuming). I'm also not sure how to go about saving the Items that I create.
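To illustrate what I suspect is happening, here is a toy model with no Scrapy in it. ToyCrawlSpider, RuleSpider, OverridingSpider and the page dictionary are all made up for this sketch. The assumption being modelled is that CrawlSpider does its rule matching inside parse, so overriding parse switches the rules off.

```python
# Toy model only: these classes imitate the shape of the problem, they are not Scrapy.


class ToyCrawlSpider:
    """Hypothetical stand-in for a rule-driven spider."""

    rules = ()  # pairs of (substring to look for in a link, callback name)

    def parse(self, page):
        # Assumption: the base class uses parse() itself to apply the rules.
        scheduled = []
        for link in page["links"]:
            for pattern, callback in self.rules:
                if pattern in link:
                    scheduled.append((link, callback))
        return scheduled


class RuleSpider(ToyCrawlSpider):
    # Leaves parse() alone, so links matching the rule get scheduled.
    rules = ((".html", "parse_item"),)


class OverridingSpider(ToyCrawlSpider):
    rules = ((".html", "parse_item"),)

    def parse(self, page):
        # Overriding parse() replaces the rule handling above entirely.
        if "Hi Herman" not in page["body"]:
            return [("login", "parse")]
        return [(page["url"], "parse_item")]


# A made-up page that is already authenticated and contains two links.
page = {
    "url": "http://www.domain.com/",
    "body": "Hi Herman",
    "links": ["/a-one.html", "/about"],
}

followed = RuleSpider().parse(page)
overridden = OverridingSpider().parse(page)
```

In this model, followed contains the matching .html link, while overridden only contains the current page: the link is never scheduled, which matches what I'm seeing. Whether the real CrawlSpider behaves this way is exactly what I'm unsure about.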
Has anyone done something like this before (authenticate, then crawl, using a CrawlSpider)? Any help would be appreciated.