
I am new to Python and web crawling. I intend to scrape the links in the top stories of a website. I was told to look at its Ajax requests and send similar ones. The problem is that all the requests for the links go to the same URL: http://www.marketwatch.com/newsviewer/mktwheadlines. My question is how to extract links from an infinite-scrolling box like this. I am using Beautiful Soup, but I think it's not suitable for this task. I am also not familiar with Selenium or JavaScript. I do know how to scrape certain requests with Scrapy, though.

mk_sch

1 Answer


It is indeed an AJAX request. If you take a look at the network tab in your browser inspector:

[screenshot: the Network tab of the browser inspector showing the POST request and its parameters]

You can see that it's making a POST request to download the URLs of the articles.
Every value here is self-explanatory except maybe docId and timestamp. docId seems to indicate which box to pull articles for (there are multiple boxes on the page), and it appears to be the id attached to the <li> element under which the article URLs are stored.

Fortunately, in this case POST and GET are interchangeable, and the timestamp parameter doesn't seem to be required. So you can actually view the results in your browser by right-clicking the URL in the inspector and selecting "copy location with parameters":

http://www.marketwatch.com/newsviewer/mktwheadlines?blogs=true&commentary=true&docId=1275261016&premium=true&pullCount=100&pulse=true&rtheadlines=true&topic=All%20Topics&topstories=true&video=true

This example has the timestamp parameter removed and pullCount increased to 100, so simply request it and it will return 100 article URLs.
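As suggested in the comments below, you can fetch this URL with the requests package and parse the response with Beautiful Soup. Here is a minimal sketch: the query parameters are copied from the URL above, but the assumption that the article links sit under <li> elements comes from the description above, so adjust the selectors to whatever the response actually contains:

```python
import requests
from bs4 import BeautifulSoup

# Query parameters taken from the URL above (timestamp omitted, pullCount raised to 100).
params = {
    "blogs": "true",
    "commentary": "true",
    "docId": "1275261016",
    "premium": "true",
    "pullCount": "100",
    "pulse": "true",
    "rtheadlines": "true",
    "topic": "All Topics",
    "topstories": "true",
    "video": "true",
}

response = requests.get("http://www.marketwatch.com/newsviewer/mktwheadlines", params=params)
soup = BeautifulSoup(response.text, "html.parser")

# Collect every link found under <li> elements (assumed structure of the response).
links = [a["href"] for li in soup.find_all("li") for a in li.find_all("a", href=True)]
print(len(links), "links found")
print(links[:10])
```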

You can dig around further to reverse engineer how the website does it and what every keyword is for, but this is a good start.

Granitosaurus
  • Thank you Granitosaurus for your very helpful hints. I couldn't find the inspector in my Google Chrome to get the URL you mentioned. Sorry, as I am new to this area. Also, which packages in Python should I use to get URLs in infinite-scrolling containers? Can I combine the requests package with Beautiful Soup, or is there a better solution? – mk_sch Oct 27 '16 at 09:28
  • If you are using scrapy then use scrapy Requests for requests and scrapy Selectors for parsing the HTML (a rough sketch follows after this comment thread). If you are not using scrapy then requests + Beautiful Soup or lxml works fine. Regarding inspection tools, you should be able to access them with F12 or Ctrl+Shift+C. – Granitosaurus Oct 27 '16 at 09:38
  • I am very thankful. I can get the Developer Tools in Chrome and see your screenshot, but I don't know where to right-click and select "copy location with parameters" to get http://www.marketwatch.com/newsviewer/mktwheadlines?blogs=true&commentary=true&docId=1275261016&premium=true&pullCount=100&pulse=true&rtheadlines=true&topic=All%20Topics&topstories=true&video=true :( – mk_sch Oct 27 '16 at 09:56
  • I am digging into your answer; it's very helpful. Could you tell me how you got the URL that you posted in your answer? I couldn't find it. – mk_sch Nov 04 '16 at 12:18
  • @farshidbalan oh sorry for the late reply. It might differ between developer tools, but usually you can right-click the URL in the Network tab and it should have a few different "copy" options. – Granitosaurus Nov 05 '16 at 00:26
  • It's very kind of you; you've helped me a lot :) I tried in Chrome and Firefox, and there are different "copy" options like copy URL, copy request headers, copy response, etc., but none of them gives me the link that you posted. That link helped me a lot to figure out how to write a request for scrolling down. Having more links would help me more :) – mk_sch Nov 05 '16 at 07:25
  • @farshidbalan oh, for reference I'm using Firebug for Firefox as my inspector, which is a bit dated but still my favorite inspector. For Chrome I found this related question: http://stackoverflow.com/questions/9057445/copy-location-with-parameters-for-google-chrome-developer-tools – Granitosaurus Nov 05 '16 at 10:07
  • Thanks a lot dear Granitosaurus, I couldn't find it, but that's fine :) I will work on the link that you already posted. – mk_sch Nov 05 '16 at 10:55
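For completeness, here is a rough sketch of the Scrapy route mentioned in the comments above. The spider name and the li a::attr(href) selector are assumptions for illustration, not the actual structure of the MarketWatch response; only the start URL is taken from the answer:

```python
import scrapy


class HeadlinesSpider(scrapy.Spider):
    """Sketch of a spider that hits the AJAX endpoint and yields article links."""

    name = "mktw_headlines"  # hypothetical spider name
    start_urls = [
        "http://www.marketwatch.com/newsviewer/mktwheadlines"
        "?blogs=true&commentary=true&docId=1275261016&premium=true"
        "&pullCount=100&pulse=true&rtheadlines=true"
        "&topic=All%20Topics&topstories=true&video=true"
    ]

    def parse(self, response):
        # Extract every href found under <li> elements (assumed response structure).
        for href in response.css("li a::attr(href)").extract():
            yield {"url": href}
```

Saved as spider.py, this could be run with `scrapy runspider spider.py -o links.json` to dump whatever links the selector matches.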