
I am trying to scrape a website that requires authentication to access the information I am interested in. I followed the directions here, but for some reason my spider does not get past the login form page.

EDIT: Scrapy version used: 1.0.1

Here is the code of my spider:

# -*- coding: utf-8 -*-

import time
import re
import scrapy
from SST.items import WebsiteItem
from scrapy.spiders.init import InitSpider


class WebsiteSpider(InitSpider):
    name = 'WebOfScience'
    login_page = "https://website.com/userLogin.do"
    search_page = "http://apps.website.com"
    start_urls = ["http://apps.website.com"]

    def __init__(self, username="", password="", *args, **kwargs):
        """Store the credentials passed via -a username=... -a password=..."""
        super(WebsiteSpider, self).__init__(*args, **kwargs)
        self.http_user = username
        self.http_pass = password

    def init_request(self):
        """We initialize the spider by logging in."""
        return scrapy.Request(url=self.login_page,
                              callback=self.login, dont_filter=True)

    def login(self, response):
        """This function takes care of the login form."""
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': self.http_user,
                      'password': self.http_pass},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """This function checks that the login was successful."""
        # Redirects are already resolved by the time this callback runs;
        # the sleep only blocks Scrapy's reactor for 10 seconds.
        time.sleep(10)

        if re.search("loginFailed", response.url):
            self.logger.error("Login has failed.")
        # Otherwise, we should be on the right page to start crawling.
        elif "Basic Search" in response.body:
            self.logger.info("Crawling starts now!")
            # Hand control back to InitSpider, which then schedules
            # the requests for start_urls.
            return self.initialized()

    def parse(self, response):
        """We start parsing the results."""
        links = response.xpath("//a[contains(@id, 'RECORD')]")
        for link in links:
            # do whatever we have to do
            pass
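
To see which requests and cookies are actually exchanged during the login, Scrapy's cookie debugging can be turned on; a minimal sketch, using two standard settings in the project's settings.py:

# settings.py -- standard Scrapy settings that make the login traffic
# visible in the log:
COOKIES_DEBUG = True     # log every Cookie / Set-Cookie header exchanged
LOG_LEVEL = 'DEBUG'      # show each request and 302 redirect after the POST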

EDIT: The program never outputs "Login has failed." or "Crawling starts now!"; this is where the spider shuts down.

**EDIT:** Here is the code of the form I'm trying to fill out.

<form name="userLoginForm" method="POST" onsubmit="return RegisterUserLogin()">
<p> &nbsp;</p>
<div align="center">
<table width="31%" border="0">
    <tbody><tr>
      <td width="53%" nowrap align="right">
        <p align="right">Enter your <A HREF="../html/help.htm#ID Number" title="User ID">User ID</A>:
      </p></td>
      <td width="50%" align="center">  
      <INPUT class="inputInput" id="j_username" name="j_username" TYPE=TEXT SIZE="20" value="" maxlength="100">
      </td>
    </tr>
    <tr>
      <td width="53%" nowrap align="right">
        <p align="right">
          Enter your <A HREF="../html/help.htm#PW" title="Password">Password</A>:
      </p></td>
      <td width="50%" align="center">  
      <INPUT class="inputInput" id="j_password" name="j_password" TYPE=PASSWORD SIZE="20"> 
       </td>
    </tr>
    <tr>
      <td width="103%" colspan="2" align="center">
        <p align="center"><font size="2"><input id="rememberme" name="rememberme" type="checkbox" value="ON" title="Remember Password"> Remember Password</font></p>
        <p align="center"><a href="forgotPassword.do" TITLE="Forgot My Password"><font size="2">Forgot My Password</font></a>
      </p></td>
    </tr>
    </tbody></table>
    <input id="j_auth_type" name="j_auth_type" type="hidden" value="UNP" />
</div>
<p></p>
<center><input value="Submit" type="Submit" title="Submit" >&nbsp; 
<input type="button" value="Clear" title="Clear the form" onclick="clearValues();"> 
<input id="userType" name="userType" type="hidden" value="user" />
<p></p>
</center>

<p><center>
[<a href="../html/custsupp.htm" title="Support">Support</a>]
[<a href="../html/help.htm" title="Help">Help</a>]
</center>
</p>
<hr>
</form>
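
Note that the real field names are j_username and j_password (not username and password), and that there are hidden j_auth_type and userType inputs. For illustration, here is a sketch of a login callback that targets this form by name; the formname and field names come from the HTML above, and whether the server accepts a plain POST without the RegisterUserLogin() JavaScript is an open question:

    def login(self, response):
        """Sketch: target userLoginForm by name; from_response() should
        pick up the hidden j_auth_type and userType inputs automatically."""
        return scrapy.FormRequest.from_response(
            response,
            formname='userLoginForm',  # <form name="userLoginForm"> above
            formdata={'j_username': self.http_user,
                      'j_password': self.http_pass},
            callback=self.check_login_response)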

Any ideas why the redirects are not being followed by Scrapy?

Thank you!

  • The directions you are following are from 2011. Maybe have a look at the latest [scrapy FAQ](http://doc.scrapy.org/en/1.0/topics/request-response.html#using-formrequest-from-response-to-simulate-a-user-login) on how to achieve login. And maybe you can edit your question to add your Scrapy version and which requests are made and which ones are not. – Frank Martin Aug 12 '15 at 15:13
  • Thanks for your answer @FrankMartin. I added the Scrapy version above. I am not sure what you mean by which requests are made and which are not; could you explain a bit more? Basically, after the login attempts, the spider always gets sent back to the login form with no mention of a previous login: it's not a login failure. – Geoffrey Negiar Aug 14 '15 at 07:49
  • You have some logging statements in your code. Which of those show up in the log? Does your code reach the "Crawling starts now!"? My guess is that your FormRequest.from_response is missing a form-selection parameter; without one, the wrong form may be picked and its hidden extra fields won't be transmitted to the login page. Use formname, formxpath or formnumber (see [docs](http://doc.scrapy.org/en/1.0/topics/request-response.html#scrapy.http.FormRequest)) – Frank Martin Aug 14 '15 at 09:07
  • @FrankMartin I added the details you asked for. There is a "hidden" field in the code of the form I want to fill out. The submit button also seems to call a JS script. How should I address this? – Geoffrey Negiar Aug 14 '15 at 12:38
  • Inspect the POST parameters with your browser's developer tools and mimic those with your Scrapy request. See [this SO post](http://stackoverflow.com/questions/16195788/firebug-inspect-post-from-webpage) for doing that with Firefox. – Frank Martin Aug 14 '15 at 12:56
  • @FrankMartin, there seem to be quite a lot of requests (and cookies) made. There are 3 different POST requests when I use my browser to log in. How can I simulate this? – Geoffrey Negiar Aug 14 '15 at 13:19
  • I cannot know that (because I don't have an account for the website), and I don't even know if it is possible at all to mimic the needed requests with Scrapy (because of the JavaScript usage). You really need to have a close look with the developer tools at what's going on and try to simulate that with Scrapy. (Look for the POST request with the 'j_username' and 'j_password' values ... that should be the one you need to create with Scrapy.) – Frank Martin Aug 14 '15 at 13:31
  • @FrankMartin I tried doing that. I still get rerouted to the login page. The 2 other POST requests are made to another URL after redirection. There are also a bunch of GET requests; should I try to simulate those too? – Geoffrey Negiar Aug 14 '15 at 13:59
  • Hmmm ... redirection after login is not unusual, and Scrapy normally is smart enough to follow those redirections. I don't think it makes sense to simulate other requests if your login request fails. Search and read the Scrapy / JavaScript stuff (Selenium/PhantomJS) and try to go with that. Sorry that there is no easy solution. – Frank Martin Aug 14 '15 at 14:14
  • @FrankMartin Thank you for your help, I am checking out Selenium now! – Geoffrey Negiar Aug 17 '15 at 09:39
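
For reference, the Selenium route mentioned in the last comment could look roughly like this; the login URL and element IDs come from the question above, while the driver choice and the cookie hand-off are assumptions:

# Rough sketch of the Selenium route (assumes Firefox is available and
# that the form HTML above is what the browser actually renders).
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://website.com/userLogin.do")
driver.find_element_by_id("j_username").send_keys("my_user")
driver.find_element_by_id("j_password").send_keys("my_password")
# Clicking the real Submit button runs the page's RegisterUserLogin()
# JavaScript, which a plain HTTP POST would skip.
driver.find_element_by_xpath("//input[@value='Submit']").click()
# If the crawl should continue in Scrapy, the session cookies can be
# copied over from the browser:
cookies = {c['name']: c['value'] for c in driver.get_cookies()}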

0 Answers