
My goal is to get a list of the names of all the new items that have been posted on https://www.prusaprinters.org/prints during the full 24 hours of a given day.

Through a bit of reading I've learned that I should be using Selenium because the site I'm scraping is dynamic (loads more objects as the user scrolls).

Trouble is, I can't seem to get anything but an empty list from `webdriver.find_elements_by_*` with any of the suffixes listed at https://selenium-python.readthedocs.io/locating-elements.html.

On the site, inspecting the element I want the title of shows `class="name"` and `class="clamp-two-lines"` (see screenshot), but I can't seem to get back a list of all the elements on the page with the name class or the clamp-two-lines class.

[screenshot: prusaprinters inspect element]

Here's the code I have so far (the lines commented out are failed attempts):

from timeit import default_timer as timer
start_time = timer()
print("Script Started")

import bs4, selenium, smtplib, time
from bs4 import BeautifulSoup 
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(r'D:\PortableApps\Python Peripherals\chromedriver.exe')

url = 'https://www.prusaprinters.org/prints'
driver.get(url)
# foo = driver.find_elements_by_name('name')
# foo = driver.find_elements_by_xpath('name')
# foo = driver.find_elements_by_class_name('name')
# foo = driver.find_elements_by_tag_name('name')
# foo = [i.get_attribute('href') for i in driver.find_elements_by_css_selector('[id*=name]')]
# foo = [i.get_attribute('href') for i in driver.find_elements_by_css_selector('[class*=name]')]
# foo = [i.get_attribute('href') for i in driver.find_elements_by_css_selector('[id*=clamp-two-lines]')]
# foo = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="printListOuter"]//ul[@class="clamp-two-lines"]/li')))
print(foo)
driver.quit()

print("Time to run: " + str(round(timer() - start_time,4)) + "s")

My research:

  1. Selenium only returns an empty list
  2. Selenium find_elements_by_css_selector returns an empty list
  3. Web Scraping Python (BeautifulSoup, Requests)
  4. Get HTML Source of WebElement in Selenium WebDriver using Python
  5. How to get Inspect Element code in Selenium WebDriver
  6. https://chrisalbon.com/python/web_scraping/monitor_a_website/
  7. https://www.codementor.io/@gergelykovcs/how-and-why-i-built-a-simple-web-scrapig-script-to-notify-us-about-our-favourite-food-fcrhuhn45
  8. https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_dynamic_websites.htm
TempleGuard527
  • your last try looks right except it's a span tag not "ul/li"... that will return the element, and then use text() to get the text. – pcalkins Jan 22 '20 at 21:37

2 Answers


This is the XPath for the names of the items:

.//div[@class='print-list-item']/div/a/h3/span
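To see what that XPath selects, here is a minimal sketch run against a simplified, assumed fragment of the page's markup (the real structure is not verified here). It uses the standard library's `xml.etree.ElementTree`, whose limited XPath support happens to cover this expression:

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for two print-list cards; the nesting is
# assumed from the XPath in the answer, not taken from the live site.
snippet = """
<div id="printListOuter">
  <div class="print-list-item">
    <div><a href="/prints/1"><h3><span>Benchy</span></h3></a></div>
  </div>
  <div class="print-list-item">
    <div><a href="/prints/2"><h3><span>Calibration Cube</span></h3></a></div>
  </div>
</div>
"""

root = ET.fromstring(snippet)
# The same path expression the answer suggests for Selenium:
names = [span.text
         for span in root.findall(".//div[@class='print-list-item']/div/a/h3/span")]
print(names)  # ['Benchy', 'Calibration Cube']
```

With Selenium the equivalent call would be `driver.find_elements_by_xpath(...)` with this same expression, followed by `.text` on each returned element.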
Pratik
    Pratik, thank you for posting this. It helped me to understand what xpath really means! Both answers give the same output, but I'm picking the other answer because he included how to print the output. Without that step, I still had confusing strings I didn't understand. Thank you! – TempleGuard527 Jan 22 '20 at 21:57

To get the text, wait for visibility of the elements. The CSS selector for the titles is `#printListOuter h3`:

titles = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '#printListOuter h3')))

for title in titles:
    print(title.text)

Shorter version:

wait = WebDriverWait(driver, 10)
titles = [title.text for title in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '#printListOuter h3')))]
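For completeness, this selector can be dropped into the question's own script. Below is a sketch assuming the Selenium 3-style API and the chromedriver path from the question; `scrape_titles` and `dedupe_keep_order` are hypothetical helper names, and the Selenium imports are deferred inside the function so the pure helper can be exercised without a browser:

```python
def scrape_titles(driver, timeout=10):
    """Wait until the print titles are visible, then return their text."""
    # Imports deferred so this module loads even where Selenium isn't installed.
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By

    elements = WebDriverWait(driver, timeout).until(
        EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '#printListOuter h3')))
    return [el.text for el in elements]

def dedupe_keep_order(titles):
    """Drop duplicate titles while preserving first-seen order
    (useful if scrolling re-yields already-seen items)."""
    seen = set()
    return [t for t in titles if not (t in seen or seen.add(t))]

# Usage (not run here):
#   from selenium import webdriver
#   driver = webdriver.Chrome(r'D:\PortableApps\Python Peripherals\chromedriver.exe')
#   try:
#       driver.get('https://www.prusaprinters.org/prints')
#       for title in dedupe_keep_order(scrape_titles(driver)):
#           print(title)
#   finally:
#       driver.quit()
```

The `try`/`finally` around the driver ensures Chrome is closed even if the wait times out.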
Sers
  • Thanks for your help. So that's why `titles = driver.find_elements_by_css_selector('#printListOuter h3')` doesn't work? (I tried it after seeing your answer). It doesn't wait for the page to load? – TempleGuard527 Jan 22 '20 at 21:59
  • Does it work? The wait is only for visibility of the specified elements, not for the complete page to load – Sers Jan 22 '20 at 22:02
  • Your answer works perfectly! As an experiment, I tried driver.find_elements_by_css_selector('#printListOuter h3') without the Wait command, and that failed. – TempleGuard527 Jan 23 '20 at 15:24