
I am scraping https://www.shiksha.com/b-tech/colleges/b-tech-colleges-mumbai-all to collect information about colleges.

On the page, only one course name is shown under each college; the rest are loaded by JavaScript when you click a link such as "+13 More Courses".

So I don't get that information when I use requests.get(url).

How can I scrape these details using Requests and BeautifulSoup? I use Jupyter Notebook (Anaconda) as my IDE.

I have heard of Selenium but am not familiar with it. Since Selenium is a bit heavy, is there a lighter alternative that loads all the JavaScript content at once?

I have also heard of the Splash framework. If anyone knows how to integrate it with Python Requests and BeautifulSoup, please answer.

Things I have tried

1. PyQt

Reference: https://www.youtube.com/watch?v=FSH77vnOGqU

I imported different libraries from those in the video, to match the PyQt version that ships with Anaconda.

import sys
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebKitWidgets import QWebPage
import requests
from bs4 import BeautifulSoup

class Client(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self.on_page_load)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()
    def on_page_load(self):
        self.app.quit()

url="https://www.shiksha.com/b-tech/colleges/b-tech-colleges-mumbai-all"
client_response=Client(url)
src=client_response.mainFrame().toHtml()
soup = BeautifulSoup(src,"lxml")
tpm = soup.find_all("section",{"class":"tpl-curse-dtls more_46905_0"})
print(tpm)

Output: []

2. json() in Requests Module

import requests
from bs4 import BeautifulSoup

url="https://www.shiksha.com/b-tech/colleges/b-tech-colleges-mumbai-all"

r=requests.get(url)

a=r.json()

Output: JSONDecodeError: Expecting value: line 3 column 1 (char 3)

3. json.loads() from json module

Inspection Details on clicking

import json

j_url='https://www.shiksha.com//nationalCategoryList/NationalCategoryList/loadMoreCourses/'

def j_data(url=j_url):

    dt = tp[0].find_all("input",{"id":"remainingCourseIds_46905"})

    output = dt[0]['value']

    data = {
        'courseIds': '231298,231294,231304,231306',
        'loadedCourseCount': 0
        #'page':page
        }
    response = requests.post(url, data=data)
    return json.loads(r.content)
print(j_data())

Output: JSONDecodeError: Expecting value: line 3 column 1 (char 3)

dryscrape is not available for Windows.

ou_ryperd
Abhay

1 Answer


You don't need to know what the JavaScript does. Just click the link and watch the network request in your browser's inspector.

In your specific case, the JavaScript sends a POST request to "/nationalCategoryList/NationalCategoryList/loadMoreCourses/". If you send the same request yourself, you get back a new HTML string, which you can parse with BeautifulSoup to get the data you need.
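Here is a rough sketch of that idea, not a tested scraper for this site. The endpoint comes from the question, and the payload keys `courseIds` and `loadedCourseCount` are copied from your own attempt, so treat them as assumptions about what the server expects. The helper names (`build_payload`, `link_texts`, `fetch_more_courses`) are made up for illustration, and collecting the text of every `<a>` tag is only a placeholder for whatever selector fits the real markup.

```python
import requests
from bs4 import BeautifulSoup

# Endpoint taken from the question. The payload keys used below are copied from
# the asker's attempt and are assumptions about what the server expects.
MORE_COURSES_URL = (
    "https://www.shiksha.com/nationalCategoryList/NationalCategoryList/loadMoreCourses/"
)


def build_payload(course_ids, loaded_count=0):
    """Build the form data for the POST request from a list of course ids."""
    return {
        "courseIds": ",".join(str(cid) for cid in course_ids),
        "loadedCourseCount": loaded_count,
    }


def link_texts(html_fragment):
    """Return the non-empty visible text of every link in an HTML fragment."""
    soup = BeautifulSoup(html_fragment, "html.parser")
    texts = (a.get_text(strip=True) for a in soup.find_all("a"))
    return [text for text in texts if text]


def fetch_more_courses(course_ids, loaded_count=0):
    """POST the payload and return the link texts from the HTML that comes back."""
    response = requests.post(
        MORE_COURSES_URL,
        data=build_payload(course_ids, loaded_count),
        timeout=30,
    )
    response.raise_for_status()
    # The reply is HTML, so read response.text instead of calling response.json().
    return link_texts(response.text)
```

You would call it as `fetch_more_courses([231298, 231294, 231304, 231306])`. Note that the function parses the `response` of the POST itself; in your third attempt, `json.loads(r.content)` reads `r` rather than `response`.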

There is one extra step: the POST request needs a payload that specifies its parameters. You should be able to find these parameters in the original page. Once you find them, look at the surrounding HTML elements and extract them either with BeautifulSoup or with a regular expression.
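As an illustration of both options, the sketch below pulls the course ids out of hidden inputs whose `id` starts with `remainingCourseIds_`, the pattern that appears in your own code (`remainingCourseIds_46905`). I am assuming every college on the page has such an input and that its `value` is a comma-separated list; the function names are hypothetical, and the regex version also assumes the `id` attribute comes before `value` inside the tag.

```python
import re
from bs4 import BeautifulSoup

# Prefix seen in the question's code; the numeric suffix is assumed to
# identify the college the remaining courses belong to.
ID_PREFIX = "remainingCourseIds_"


def remaining_ids_bs4(page_html):
    """Map each numeric suffix to its list of course ids, using BeautifulSoup."""
    soup = BeautifulSoup(page_html, "html.parser")
    result = {}
    for tag in soup.find_all("input", id=re.compile(r"^remainingCourseIds_\d+$")):
        suffix = tag["id"][len(ID_PREFIX):]
        result[suffix] = [cid for cid in tag.get("value", "").split(",") if cid]
    return result


def remaining_ids_regex(page_html):
    """Same result with a regular expression (assumes id precedes value)."""
    pattern = re.compile(r'id="remainingCourseIds_(\d+)"[^>]*?value="([^"]*)"')
    return {
        suffix: [cid for cid in value.split(",") if cid]
        for suffix, value in pattern.findall(page_html)
    }
```

The BeautifulSoup version does not care about attribute order or quoting, so it is the safer default; the regex is fine for a quick check but breaks more easily if the markup changes.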

I hope it helps.

Maokai