
I am working on a school project and want to get all the user reviews of superhero movies from IMDB.

First, I am trying to get all the user reviews of just one movie.

The user-review page consists of 25 user reviews and a 'load more' button. I have already managed to write code that clicks the 'load more' button, but I am stuck on the second part: getting all the user reviews into a list.

I already tried to use BeautifulSoup to find all 'content' parts on the page, but my list remains empty.

from bs4 import BeautifulSoup
from selenium import webdriver
import time

testurl = "https://www.imdb.com/title/tt0357277/reviews?ref_=tt_urv"
patience_time1 = 60
XPATH_loadmore = "//*[@id='load-more-trigger']"
XPATH_grade = "//*[@class='review-container']/div[1]"
list_grades = []

driver = webdriver.Firefox()
driver.get(testurl)

# This is the part in which I open all 'load more' buttons.
while True:
    try:
        loadmore = driver.find_element_by_id("load-more-trigger")
        time.sleep(2)
        loadmore.click()
        time.sleep(5)
    except Exception as e:
        print(e)
        break
    print("Complete")
    time.sleep(10)

    # When the whole page is loaded, I want to get all 'content' parts.
    soup = BeautifulSoup(driver.page_source)
    content = soup.findAll("content")
    list_content = [c.text_content() for c in content]

driver.quit()

I expect to get a list of all content of the review-containers on the website. However, my list remains empty.

Marieke
  • Did you take a look at what requests happen when you click load more? It could be way easier to replicate the request instead. – antfuentes87 May 22 '19 at 16:32
  • i'm seeing `name 'webdriver' is not defined` when running your code locally. can you provide a `requirements.txt`? – XoXo May 22 '19 at 16:36
  • @Jeff Xiao I imported the following modules: from selenium import webdriver from selenium.webdriver.common.keys import Keys from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.support import expected_conditions as EC from selenium.webdriver.common.by import By from selenium.common.exceptions import NoSuchElementException import time – Marieke May 22 '19 at 17:16
  • @Marieke i've added an answer. another note is you may need to tweak the sleep time, currently it's unnecessarily long on my machine. – XoXo May 22 '19 at 17:40
  • 1
    https://stackoverflow.com/search?q=imdb+review – QHarr May 22 '19 at 17:48
  • Try searching on `[web-scraping] [beautifulsoup] infinite`. Good luck. – shellter May 22 '19 at 19:20

1 Answer


You are using BeautifulSoup 4, correct?

Method names changed from BeautifulSoup 3 to 4 (see the documentation).

Also, `find_all` takes the tag name and an optional `class_` parameter for the CSS class (see this SO answer).

So your code should be using the new name:

    # content = soup.findAll("content")
    content = soup.find_all('div', class_=['text','show-more__control'])

Also, use `get_text()` in your list comprehension:

    # list_content = [c.text_content() for c in content]
    list_content = [tag.get_text() for tag in content]
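
Together, the two fixes can be checked against a static fragment; the markup below is just an illustrative stand-in for IMDB's, not the real page:

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for one review-container, not IMDB's actual HTML
html = """
<div class="review-container">
  <div class="text show-more__control">Great movie!</div>
  <div class="actions">Helpful?</div>
</div>
"""

soup = BeautifulSoup(html, features="html.parser")
# class_ may be a list: a div matches if it carries any of these classes
content = soup.find_all('div', class_=['text', 'show-more__control'])
print([tag.get_text() for tag in content])  # -> ['Great movie!']
```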

Lastly, provide a parser when creating the soup (see the documentation):

    soup = BeautifulSoup(driver.page_source, features="html.parser")

Otherwise you will encounter this UserWarning:

    SO56261323.py:36: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
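
Put together, the parsing step might look like this sketch. It would be fed `driver.page_source` after your load-more loop finishes; note the `text`/`show-more__control` class names are assumed from IMDB's markup at the time and may change:

```python
from bs4 import BeautifulSoup

def extract_reviews(page_source):
    """Return the review texts found in a reviews page source.

    The 'text'/'show-more__control' classes match IMDB's review markup
    at the time of writing; inspect the page if the list comes back empty.
    """
    soup = BeautifulSoup(page_source, features="html.parser")
    content = soup.find_all('div', class_=['text', 'show-more__control'])
    return [tag.get_text() for tag in content]

# In your script this would be called once, after the while loop:
#     list_content = extract_reviews(driver.page_source)
sample = ('<div class="review-container">'
          '<div class="text show-more__control">An enjoyable superhero romp.</div>'
          '</div>')
print(extract_reviews(sample))  # -> ['An enjoyable superhero romp.']
```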

XoXo