
I am parsing this URL to get links from one of the boxes with infinite scroll. Here is my code for sending the requests to the website to get the next 10 links:

import requests
from bs4 import BeautifulSoup
import json

baseUrl = 'http://www.marketwatch.com/news/headline/getheadlines'
parameters2 = {
    'ticker': 'XOM',
    'countryCode': 'US',
    'docType': '2007',
    'sequence': '6e09aca3-7207-446e-bb8a-db1a4ea6545c',
    'messageNumber': '1830',
    'count': '10',
    'channelName': '',
    'topic': ' ',
    '_': '1479539628362'}
html2 = requests.get(baseUrl, params=parameters2)
html3 = json.loads(html2.text)  # list of 10 headline items
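
For reference, each element of that list is a dict describing one headline. The exact field names are not documented above, so a quick sketch to inspect them (it assumes nothing about the keys; it just prints whatever comes back):

if html3:
    print(sorted(html3[0].keys()))  # list the fields of the first headline item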

In the corresponding HTML, there is an element like:

 <li class="loading">Loading more headlines...</li>

that indicates there are more items to be loaded by scrolling down, but I don't know how to use the JSON file to write a loop that gets more links. My first try was to use BeautifulSoup and write the following code to get the links and ids:

url = 'http://www.marketwatch.com/investing/stock/xom'
r = requests.get(url).text
soup = BeautifulSoup(r, 'lxml')
pressReleaseBox = soup.find('div', attrs={'id': 'prheadlines'})

and then check whether there are more links to scrape before getting the next JSON file:

loadingMore = pressReleaseBox.find('li', attrs={'class': 'loading'})
while loadingMore is not None:
    # get the links from the json file and load more links
    pass

I don't know how to implement the commented part. Do you have any idea about it? I am not obliged to use BeautifulSoup; any other working library will be fine.

mk_sch
  • I'd take another approach: Open the website, open your browser's devtools and switch to the "network" tab, then scroll down on the page until it loads more headlines. Watch the devtools' network tab, and you'll see some requests to another URL. You can play with the date/time parameter of that URL to retrieve the headlines, and they'll come out as JSON (even easier to parse than HTML). – chrki Nov 19 '16 at 09:34
  • Thank you chrki, that seems like an interesting alternative, but I have no idea how to work with the date/time to scrape the data, as `messageNumber` changes for each scrolled page and counts down by `count: 10`. If I can get data for 3 years, I am done. Your idea is nice, but could you guide me on how to implement it, please? – mk_sch Nov 19 '16 at 09:49
  • Even when I removed the date/time parameter, the JSON file did not change. – mk_sch Nov 19 '16 at 09:53
  • Basically: retrieve the JSON file, look for the oldest date and time, and insert that date and time into the 2nd request. Then repeat that over and over until you've reached the limit of what you need. You can also change the `&count=` parameter in the URL to something higher, say 100, to retrieve more news at a time. I might take a look later. **Edit:** Or change one of the other parameters; you are right, changing the date does nothing. Someone posted an answer. – chrki Nov 19 '16 at 09:55
  • Thank you chrki, I am going to implement the answer and also take your advice – mk_sch Nov 19 '16 at 10:52

1 Answer


Here is how you can load more JSON files (a runnable sketch follows these steps):

  1. Get the last JSON file and extract the value of the key UniqueId from its last item.
    1. If the value looks something like e5a00f51-8821-4fbc-8ac6-e5f64b5eb0f2:8499:
      1. extract e5a00f51-8821-4fbc-8ac6-e5f64b5eb0f2 as sequence
      2. extract 8499 as messageNumber
      3. let docId be empty
    2. If the value looks something like 1222712881:
      1. let sequence be empty
      2. let messageNumber be empty
      3. extract 1222712881 as docId
  2. Put the parameters sequence, messageNumber, and docId into your parameters2.
  3. Use requests.get(baseUrl, params=parameters2) to get your next JSON file.
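
A minimal sketch of that loop, assuming the response is a JSON list whose items carry a UniqueId key in the two formats described above (the parameter names are taken from the question; verify the response fields against a real request):

import requests

baseUrl = 'http://www.marketwatch.com/news/headline/getheadlines'
parameters2 = {
    'ticker': 'XOM',
    'countryCode': 'US',
    'docType': '2007',
    'sequence': '6e09aca3-7207-446e-bb8a-db1a4ea6545c',
    'messageNumber': '1830',
    'count': '10',
    'channelName': '',
    'topic': ' ',
    '_': '1479539628362'}

for batch in range(5):  # demo: fetch 5 batches of 10 headlines
    items = requests.get(baseUrl, params=parameters2).json()
    if not items:
        break  # nothing more to load

    # step 1: read UniqueId from the last item of this batch
    lastId = items[-1]['UniqueId']
    if ':' in lastId:
        # e.g. 'e5a00f51-8821-4fbc-8ac6-e5f64b5eb0f2:8499'
        sequence, messageNumber = lastId.split(':', 1)
        docId = ''
    else:
        # e.g. '1222712881'
        sequence, messageNumber = '', ''
        docId = lastId

    # step 2: put sequence, messageNumber and docId into parameters2
    parameters2.update({'sequence': sequence,
                        'messageNumber': messageNumber,
                        'docId': docId})
    # step 3: the requests.get(...) at the top of the loop then
    # fetches the next JSON file with these parameters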
Anyany Pan
  • @Anyany: Thank you very much, it seems to work well; I'll implement it. – mk_sch Nov 19 '16 at 10:51
  • I followed your clear hints and I get the first JSON file, but the problem is that when I change docId, messageNumber, or any other key, the output is exactly the same and I cannot send the next request to get new links. Am I doing anything wrong? – mk_sch Nov 23 '16 at 06:59
  • @farshidbalan from the HTML page, there are `<li>` nodes which contain a `data-uniqueid` attribute. Its value comes from `UniqueId`, so extract its value to initialize `sequence` and `messageNumber`, or `docId`. And set `channelName` to `/news/latest/company/us/xom`. I have successfully gotten JSON files; if you still cannot make it work, just ask me for the code. – Anyany Pan Nov 24 '16 at 02:52
  • @Anyany: I deeply appreciate your help. Thanks to your kind help, I could get the next JSON by changing the messageNumber. After 4 requests, all requests use only docId. I am new to programming, and I don't know, after sending the first request (with docId) and getting the response, how to extract the new docId (they seem to be random in each request) and then send another request. docId only exists inside the HTML file, and I have no idea how to connect the HTML to the JSON. – mk_sch Nov 24 '16 at 06:45
  • @Anyany: I also asked the question on Stack Overflow, as some other people may have similar problems :) http://stackoverflow.com/questions/40780492/gettin-html-element-and-sending-new-json-requests-in-python – mk_sch Nov 24 '16 at 07:40
  • @farshidbalan I have posted my complete solution to your problem. Hope it helps. – Anyany Pan Nov 24 '16 at 08:12
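
Following up on the `data-uniqueid` comment above, here is a sketch of seeding the first request from the HTML page. It assumes each headline `<li>` inside the `prheadlines` box carries that attribute, as the comment describes (the URL and selectors come from the question; verify them against the live page):

import requests
from bs4 import BeautifulSoup

html = requests.get('http://www.marketwatch.com/investing/stock/xom').text
soup = BeautifulSoup(html, 'lxml')
pressReleaseBox = soup.find('div', attrs={'id': 'prheadlines'})

# collect the data-uniqueid values; the last one seeds sequence and
# messageNumber (or docId) for the first getheadlines request
uniqueIds = [li['data-uniqueid']
             for li in pressReleaseBox.find_all('li', attrs={'data-uniqueid': True})]
print(uniqueIds[-1])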