
I want Scrapy to crawl pages where the link to the next page looks like this:

<a href="#" onclick="return gotoPage('2');"> Next </a>

Will Scrapy be able to interpret the JavaScript in that link?

With the LiveHTTPHeaders extension I found out that clicking Next generates a POST with a really huge piece of "garbage" starting like this:

encoded_session_hidden_map=H4sIAAAAAAAAALWZXWwj1RXHJ9n

I am trying to build my spider on the CrawlSpider class, but I can't really figure out how to code it. With BaseSpider I used the parse() method to process the first URL, which happens to be a login form, where I did a POST with:

def logon(self, response):
    login_form_data = {'email': 'user@example.com',
                       'password': 'mypass22',
                       'action': 'sign-in'}
    return [FormRequest.from_response(response, formnumber=0,
                                      formdata=login_form_data,
                                      callback=self.submit_next)]

And then I defined submit_next() to say what to do next. I can't figure out how to tell CrawlSpider which method to use on the first URL.

All requests in my crawl, except the first one, are POST requests. They alternate between two types of request: posting some data, and clicking "Next" to go to the next page.

Steven Almeroth
ria
    Give me some more context. Scrapy can't interpret the JavaScript, but you may be able to mimic the POST message that the JavaScript sends, if you can find that encoded_session_hidden_map as some hidden form field or something. – Joshkunz Mar 24 '11 at 08:20

2 Answers


The actual methodology will be as follows:

  1. POST your request to reach the page (as you are doing)
  2. Extract the link to the next page from that particular response
  3. Issue a plain Request for the next page if possible, or use FormRequest again if applicable

All this has to be streamlined with the server's response mechanism, e.g.:

  • You can try using dont_click=True in FormRequest.from_response
  • Or you may want to handle the redirect (302) coming from the server (in which case you will have to specify in the meta that you want the redirect response sent to your callback as well)

Now, how to figure it all out: use a web debugger like Fiddler, or the Firefox plugin Firebug, or simply hit F12 in IE 9, and check that the requests a user actually makes on the website match the way you are crawling the webpage.

Jason Sundram
Orochi

I built a quick crawler that executes JS via Selenium. Feel free to copy/modify it: https://github.com/rickysahu/seleniumjscrawl

Ricky Sahu