I am scraping https://www.shiksha.com/b-tech/colleges/b-tech-colleges-mumbai-all to collect information about colleges.
On the webpage, only one course name is shown below each college; the remaining courses are loaded by JavaScript behind a link such as "+13 More Courses".
So I don't get their details when I use requests.get(url).
How can I scrape these details using Requests and BeautifulSoup? I use Jupyter Notebook (Anaconda) as my IDE.
I have heard about Selenium but have not used it. Since Selenium is a bit heavy, is there a lighter alternative that loads all the JavaScript content at once?
I have also heard about the Splash framework. If anyone knows how to integrate it with Python Requests and BeautifulSoup, please answer.
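For reference, Splash is commonly combined with Requests and BeautifulSoup through its HTTP API: a running Splash instance loads the page, executes its JavaScript, and the `render.html` endpoint returns the resulting HTML. The sketch below assumes a Splash instance is listening at `http://localhost:8050` (for example, started through Docker). The helper names `splash_params`, `fetch_rendered_html`, and `course_sections`, the `wait` value, and the `tpl-curse-dtls` selector are illustrative assumptions and have not been checked against the site.

```python
import requests
from bs4 import BeautifulSoup

# Assumption: a Splash instance is running locally on its default port.
SPLASH_RENDER_URL = "http://localhost:8050/render.html"


def splash_params(page_url, wait=2.0):
    """Build the query parameters for Splash's render.html endpoint."""
    # "url" is the page to render; "wait" is how many seconds Splash
    # waits after the page load before returning the HTML.
    return {"url": page_url, "wait": wait}


def fetch_rendered_html(page_url, wait=2.0, timeout=60):
    """Ask Splash to render the page and return the HTML as text."""
    response = requests.get(
        SPLASH_RENDER_URL, params=splash_params(page_url, wait), timeout=timeout
    )
    response.raise_for_status()
    return response.text


def course_sections(html):
    """Return the <section> elements that carry the assumed course class."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.select("section.tpl-curse-dtls")
```

It would be used as `course_sections(fetch_rendered_html(url))`. If the extra courses are only requested after a click on the "More Courses" link, a plain render may still not contain them.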
Things I have tried
1. PyQt
Reference: https://www.youtube.com/watch?v=FSH77vnOGqU
I imported different modules than the video does, because of the PyQt version that ships with Anaconda.
import sys
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebKitWidgets import QWebPage
import requests
from bs4 import BeautifulSoup

class Client(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        # Quit the event loop once the page reports that loading has finished
        self.loadFinished.connect(self.on_page_load)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # blocks until on_page_load() calls quit()

    def on_page_load(self):
        self.app.quit()

url = "https://www.shiksha.com/b-tech/colleges/b-tech-colleges-mumbai-all"
client_response = Client(url)
src = client_response.mainFrame().toHtml()  # HTML after the page has loaded
soup = BeautifulSoup(src, "lxml")
tpm = soup.find_all("section", {"class": "tpl-curse-dtls more_46905_0"})
print(tpm)
Output: []
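One possible reason for the empty list, separate from whether the JavaScript ran: when `find_all` is given a `class` string that contains a space, BeautifulSoup compares it against the element's whole `class` attribute, so an extra class or a different order produces no match. The sketch below uses a made-up HTML fragment (the `hidden` class is an assumption for illustration; the real markup may differ) to show the difference between that exact-string match and a CSS selector.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration only.
html = '<section class="tpl-curse-dtls more_46905_0 hidden"><p>Course</p></section>'
soup = BeautifulSoup(html, "html.parser")

# Exact-string comparison against the full class attribute: nothing matches,
# because the attribute here is "tpl-curse-dtls more_46905_0 hidden".
exact = soup.find_all("section", {"class": "tpl-curse-dtls more_46905_0"})

# A CSS selector tests each class on its own, whatever else is present.
by_selector = soup.select("section.tpl-curse-dtls.more_46905_0")

# Searching for a single class with find_all also matches.
by_single_class = soup.find_all("section", class_="tpl-curse-dtls")

print(len(exact), len(by_selector), len(by_single_class))
```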
2. json() in Requests Module
import requests
from bs4 import BeautifulSoup
url="https://www.shiksha.com/b-tech/colleges/b-tech-colleges-mumbai-all"
r=requests.get(url)
a=r.json()
OUTPUT: JSONDecodeError: Expecting value: line 3 column 1 (char 3)
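This error is what `r.json()` raises when the response body is not JSON; the listing URL presumably returns an ordinary HTML page. A small check like the following sketch makes that visible before decoding. The helper name `parse_json_or_none` is hypothetical, and the sample strings are invented.

```python
import json


def parse_json_or_none(body_text):
    """Return the decoded JSON value, or None if the text is not valid JSON."""
    try:
        return json.loads(body_text)
    except json.JSONDecodeError:
        return None


# An HTML body, like the one a normal page URL returns, is not JSON.
print(parse_json_or_none("\n\n<!DOCTYPE html><html></html>"))

# A JSON body decodes to ordinary Python objects.
print(parse_json_or_none('{"courses": [231298, 231294]}'))
```

With Requests this would be called as `parse_json_or_none(r.text)`; `r.headers.get("Content-Type")` also shows what kind of body the server sent.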
3. json.loads() from json module
Inspection Details on clicking
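import json
import requests

j_url = 'https://www.shiksha.com//nationalCategoryList/NationalCategoryList/loadMoreCourses/'

def j_data(url=j_url):
    # tp is defined earlier in the notebook (not shown); presumably a list of BeautifulSoup tags
    dt = tp[0].find_all("input", {"id": "remainingCourseIds_46905"})
    output = dt[0]['value']  # remaining course ids; not used below
    data = {
        'courseIds': '231298,231294,231304,231306',
        'loadedCourseCount': 0
        #'page':page
    }
    response = requests.post(url, data=data)
    # Note: r is the page response from attempt 2, not `response` from the POST above
    return json.loads(r.content)

print(j_data())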
OUTPUT: JSONDecodeError: Expecting value: line 3 column 1 (char 3)
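A minimal sketch of the same idea with the course ids read from the hidden input instead of being hard-coded, and with the POST response itself returned. The helper names `remaining_course_ids` and `load_more_courses` are hypothetical. The `X-Requested-With` header is an assumption about what the endpoint may expect, and whether the endpoint answers with JSON or with an HTML fragment is not known here, so the raw text is returned for inspection.

```python
import requests
from bs4 import BeautifulSoup

LOAD_MORE_URL = (
    "https://www.shiksha.com//nationalCategoryList/NationalCategoryList/loadMoreCourses/"
)


def remaining_course_ids(html, college_id):
    """Read the comma-separated course ids from the hidden input of one college."""
    soup = BeautifulSoup(html, "html.parser")
    field = soup.find("input", {"id": f"remainingCourseIds_{college_id}"})
    return field["value"] if field is not None else None


def load_more_courses(session, course_ids, loaded_count=0):
    """POST the ids to the load-more endpoint and return the raw response text."""
    payload = {"courseIds": course_ids, "loadedCourseCount": loaded_count}
    # Assumption: the server may look for the header browsers send with AJAX calls.
    headers = {"X-Requested-With": "XMLHttpRequest"}
    response = session.post(LOAD_MORE_URL, data=payload, headers=headers, timeout=30)
    response.raise_for_status()
    return response.text
```

It would be called with a `requests.Session()` that has already fetched the listing page, so that any cookies the site sets are sent along with the POST.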
I also could not try dryscrape, since it is not available for Windows.