I am parsing this URL to get links from one of the boxes with infinite scroll. Here is my code for sending the request to the website to get the next 10 links:
import requests
from bs4 import BeautifulSoup
import urllib2
import urllib
import extraction
import json
from json2html import *
baseUrl = 'http://www.marketwatch.com/news/headline/getheadlines'
parameters2 = {
    'ticker': 'XOM',
    'countryCode': 'US',
    'docType': '2007',
    'sequence': '6e09aca3-7207-446e-bb8a-db1a4ea6545c',
    'messageNumber': '1830',
    'count': '10',
    'channelName': '',
    'topic': ' ',
    '_': '1479539628362'}
html2 = requests.get(baseUrl, params = parameters2)
html3 = json.loads(html2.text) # array of size 10
In the corresponding HTML, there is an element like:
<li class="loading">Loading more headlines...</li>
which indicates that more items will be loaded by scrolling down, but I don't know how to use the JSON response to write a loop that gets more links. My first try was to use Beautiful Soup with the following code to get the links and ids:
url = 'http://www.marketwatch.com/investing/stock/xom'
r = urllib.urlopen(url).read()
soup = BeautifulSoup(r, 'lxml')
pressReleaseBox = soup.find('div', attrs={'id':'prheadlines'})
and then, as long as there are more links to scrape, get the next JSON response:
loadingMore = pressReleaseBox.find('li',attrs={'class':'loading'})
while loadingMore is not None:
    # get the links from json file and load more links
I don't know how to implement the commented part. Do you have any idea how to do it? I am not obliged to use BeautifulSoup; any other working library will be fine.
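To make the question more concrete, here is a rough sketch of the kind of loop I have in mind. It rests on an assumption I have not verified: that the last item of each JSON batch contains values that can be sent back as the next `sequence` and `messageNumber`. The key names `SequenceId` and `MessageNumber`, and the helper names `fetch_page`, `next_params` and `collect_headlines`, are hypothetical placeholders and would have to be replaced after inspecting the real response.

```python
BASE_URL = 'http://www.marketwatch.com/news/headline/getheadlines'


def fetch_page(params):
    """Request one batch of headlines and return the decoded JSON list."""
    import requests  # imported here so the loop itself can be tried offline
    response = requests.get(BASE_URL, params=params)
    response.raise_for_status()
    return response.json()


def next_params(params, last_item,
                sequence_key='SequenceId', number_key='MessageNumber'):
    """Build the parameters for the following request.

    ASSUMPTION: the key names are hypothetical; check the actual JSON
    to see which fields identify the last headline of a batch.
    """
    updated = dict(params)
    updated['sequence'] = last_item[sequence_key]
    updated['messageNumber'] = last_item[number_key]
    return updated


def collect_headlines(params, fetch=fetch_page, advance=next_params,
                      max_pages=50):
    """Keep requesting batches until an empty one comes back.

    max_pages is a safety cap so the loop cannot run forever if the
    server keeps returning data.
    """
    items = []
    for _ in range(max_pages):
        batch = fetch(params)
        if not batch:  # nothing more to load
            break
        items.extend(batch)
        params = advance(params, batch[-1])
    return items
```

With something like this, `collect_headlines(parameters2)` would replace the `while loadingMore` loop entirely, since the stopping condition comes from the JSON response rather than from the `li.loading` element in the HTML.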