
I'm trying to scrape data from this review site. The script first goes through the first page, checks whether there is a 2nd page, and then goes to it too. The problem is with getting to the 2nd page: the page takes time to update, and I still get the first page's data instead of the 2nd page's.

For example, if you go here, you will see how long it takes to load the page 2 data.

I tried putting a timeout or sleep, but it didn't work. I'd prefer a solution with minimal package/browser dependencies (like webdriver.PhantomJS()), as I need to run this code in my employer's environment and I'm not sure what I can use. Thank you!!

from urllib.request import Request, urlopen
from time import sleep
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}

# softwareadvice holds the URL of the review page
req = Request(softwareadvice, headers=headers)

web_byte = urlopen(req, timeout=10).read()
webpage = web_byte.decode('utf-8')
parsed_html = BeautifulSoup(webpage, features="lxml")

# look for the "next page" arrow
true = parsed_html.find('div', {'class': ['Grid-cell--1of12 pagination-arrows pagination-arrows-right']})

while true:
    true = parsed_html.find('div', {'class': ['Grid-cell--1of12 pagination-arrows pagination-arrows-right']})
    if not true:
        break

    req = Request(softwareadvice + '?review.page=2', headers=headers)
    sleep(10)
    webpage = urlopen(req, timeout=10)
    sleep(10)
    webpage = webpage.read().decode('utf-8')
    parsed_html = BeautifulSoup(webpage, features="lxml")

2 Answers


The reviews are loaded from an external source via an Ajax request. You can use this example to load them:

import re
import json
import requests
from bs4 import BeautifulSoup


url = "https://www.softwareadvice.com/sms-marketing/twilio-profile/reviews/"
api_url = (
    "https://pkvwzofxkc.execute-api.us-east-1.amazonaws.com/production/reviews"
)

params = {
    "q": "s*|-s*",
    "facet.gdm_industry_id": '{"sort":"bucket","size":200}',
    "fq": "(and product_id: '{}' listed:1)",
    "q.options": '{"fields":["pros^5","cons^5","advice^5","review^5","review_title^5","vendor_response^5"]}',
    "size": "50",
    "start": "50",
    "sort": "completeness_score desc,date_submitted desc",
}

# get product id
soup = BeautifulSoup(requests.get(url).content, "html.parser")
a = soup.select_one('a[href^="https://reviews.softwareadvice.com/new/"]')
id_ = int("".join(re.findall(r"\d+", a["href"])))

params["fq"] = params["fq"].format(id_)

for start in range(0, 3):  # <-- increase the number of pages here
    params["start"] = 50 * start

    data = requests.get(api_url, params=params).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    # print some data:
    for h in data["hits"]["hit"]:
        if "review" in h["fields"]:
            print(h["fields"]["review"])
            print("-" * 80)

Prints:

After 2 years using Twilio services, mainly phone and messages, I can say I am so happy I found this solution to handle my communications. It is so flexible,  Although it has been a little bit complicated sometimes to self-learn about online phoning systems it saved me from a lot of hassles I wanted to avoid. The best benefit you get is the ultra efficient support service
--------------------------------------------------------------------------------
An amazingly well built product -- we rarely if ever had reliability issues -- the Twilio Functions were an especially useful post-purchase feature discovery -- so much so that we still use that even though we don't do any texting.  We also sometimes use FracTEL, since they beat Twilio on pricing 3:1 for 1-800 texts *and* had MMS 1-800 support long before Twilio. 
--------------------------------------------------------------------------------
I absolutely love using Twilio, have had zero issues in using the SIP and text messaging on the platform.
--------------------------------------------------------------------------------
Authy by Twilio is a run-of-the-mill 2FA app. There's nothing special about it. It works when you're not switching your hardware.
--------------------------------------------------------------------------------
We've had great experience with Twilio. Our users sign up for text notification and we use Twilio to deliver them information. That experience has been well-received by customers. There's more to Twilio than that but texting is what we use it for. The system barely ever goes down and always shows us accurate information of our usage.
--------------------------------------------------------------------------------

...and so on.
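
If you don't want to hard-code the number of pages, here is a minimal sketch that keeps requesting until all reviews are fetched. It assumes the response reports the total match count under `data["hits"]["found"]` (the usual CloudSearch-style shape); verify with `print(json.dumps(data, indent=4))` first:

# Sketch: page through all reviews automatically.
# Assumes data["hits"]["found"] holds the total number of matches
# (the usual CloudSearch-style response shape) -- verify with
# print(json.dumps(data, indent=4)) before relying on it.
start, page_size = 0, 50

while True:
    params["start"] = start
    params["size"] = page_size
    data = requests.get(api_url, params=params).json()

    for h in data["hits"]["hit"]:
        if "review" in h["fields"]:
            print(h["fields"]["review"])
            print("-" * 80)

    start += page_size
    if start >= data["hits"]["found"]:  # no more pages left
        break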
Andrej Kesely
  • Thanks for the answer. The response you got is still the 1st page, I think. How can I get the 2nd, 3rd, ...? – ju hu May 11 '21 at 22:54
  • @juhu No, this will get all pages. There is a `for`-loop that increases the `start` variable, and `requests` will load additional reviews on each iteration of this loop. – Andrej Kesely May 11 '21 at 22:57
  • Oh ok, thanks, will try it out. What are `api_url` and `params`, btw? Specific to this example? What if I want to do something similar for another product/company? – ju hu May 11 '21 at 23:02
  • @juhu The `api_url` should be the same. The only thing that changes is the product id (I parse it into the variable `id_`). – Andrej Kesely May 11 '21 at 23:05
  • Mmm. Is this the most straightforward way to do it? Can't I just use some sleep/timeout kind of thing? Also, I would probably want the number of pages to be figured out by the system, not input manually. Appreciate your answers, btw :) – ju hu May 11 '21 at 23:11
  • @juhu If you uncomment `print(json.dumps(data, indent=4))`, you will see there is also a total number of reviews. So you will only need to get this number. – Andrej Kesely May 11 '21 at 23:14
  • Do timeout/sleep not work for waiting before parsing? Having trouble generalizing with your code :( – ju hu May 11 '21 at 23:34
  • You could use `sleep()`, but the issue is efficiency: you would have to pick a fixed delay, it would always take that long, and the page might still not be loaded after the sleep, so you would have to make sure. Otherwise you can use it, so I guess you could just test it out yourself. – Matiiss May 11 '21 at 23:47
  • But actually, if you were to use `selenium`, it has a function (`await` or something like that, I think) that can be set to wait until the page loads. I don't know much about it, though, so you will have to look at the selenium docs. – Matiiss May 11 '21 at 23:50
  • How do you know it's loaded through an Ajax request? – ju hu May 18 '21 at 03:58

I have been scraping many types of websites, and I think that in the world of scraping there are roughly two types of websites.

The first is "URL-based" websites (i.e., you send a request with a URL and the server responds with HTML tags from which elements can be directly extracted); the second is "JavaScript-rendered" websites (i.e., the only response you get is JavaScript, and you can see HTML tags only after it has run).

In the former case, you can freely navigate through the website with bs4. In the latter case, you cannot always rely on URLs as a rule of thumb.

The site you are trying to scrape is built with Angular.js, which is based on client-side rendering. So the response you get is JavaScript code, not HTML tags with the page content in them. You have to run that code to get the content.

About the code you posted:

req = Request(softwareadvice, headers={'User-Agent': 'Mozilla/5.0'})

web_byte = urlopen(req, timeout=10).read()  # the response is JavaScript, not the page content you want

webpage = web_byte.decode('utf-8')

All you can get is the JavaScript code that must be run to produce the HTML elements. That is why you get the same page (response) every time.
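
One quick way to check this yourself is to fetch the raw HTML and search for a piece of text that you can see in the browser. This is just a sketch; the marker string here is taken from a review shown in the other answer and is only an example:

import requests

url = "https://www.softwareadvice.com/sms-marketing/twilio-profile/reviews/"
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text

# A review sentence that is visible in the browser. If it is missing from
# the raw HTML, the reviews are rendered client-side / fetched via Ajax.
print("ultra efficient support service" in html)  # likely False here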

So, what to do? Is there any way to run JavaScript within bs4? I don't think there is an appropriate way to do that. Instead, you can use selenium: with it you can literally wait until the page fully loads, click buttons and anchors, and get the page content at any time.

Headless browsers in selenium might work for you, meaning you don't have to watch a controlled browser window open on your computer.
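
For example, here is a rough headless-Chrome sketch. It assumes Chrome and chromedriver are available; the pagination selector is taken from your code, and `div.review-card` is a hypothetical placeholder you would replace with the site's real review selector:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # no visible browser window
driver = webdriver.Chrome(options=options)
driver.get("https://www.softwareadvice.com/sms-marketing/twilio-profile/reviews/")

wait = WebDriverWait(driver, 20)

# Wait until the "next page" arrow from your code has actually been rendered.
arrow = wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, "div.pagination-arrows-right")))

# Remember an element from page 1, click, then wait for it to go stale,
# i.e. for page 2 to replace it.
old = driver.find_element(By.CSS_SELECTOR, "div.review-card")  # hypothetical selector
arrow.click()
wait.until(EC.staleness_of(old))

# Now the rendered HTML really contains page 2; parse it as before.
parsed_html = BeautifulSoup(driver.page_source, features="lxml")
driver.quit()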

Here are some links that might be of help to you.

scrape html generated by javascript with python

https://sadesmith.com/2018/06/15/blog/scraping-client-side-rendered-data-with-python-and-selenium

Thanks for reading.

Nikita