I'm trying to spider a page for links with a specific CSS class, using Selenium for Python 3. For some reason it just stops when it should loop around again. Here is the function:
    from bs4 import BeautifulSoup
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    def spider_me_links(driver, max_pages, links):
        page = 1  # NOTE: Change this to start with a different page.
        while page <= max_pages:
            url = "https://www.example.com/home/?sort=title&p=" + str(page)
            driver.get(url)

            # Timeout after 2 seconds, and duration 5 seconds between polls.
            wait = WebDriverWait(driver, 120, 5000)
            wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'card-details')))

            # Obtain source text
            source_code = driver.page_source
            soup = BeautifulSoup(source_code, 'lxml')

            print("findAll:", len(soup.findAll('a', {'class': 'card-details'})))  # returns 12 at every loop iteration
            links += soup.findAll('a', {'class': 'card-details'})
            page += 1
The two lines I think I have wrong are the following:

    # Timeout after 2 seconds, and duration 5 seconds between polls.
    wait = WebDriverWait(driver, 120, 5000)
    wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'card-details')))
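From the docs, my understanding is that the constructor is WebDriverWait(driver, timeout, poll_frequency, ignored_exceptions), with both timeout and poll_frequency given in seconds, so my comment and my arguments probably don't match. This is the variant I've been considering instead (the a.card-details selector is only my guess at targeting the class rather than a tag name):

    # Wait up to 120 seconds, polling every 5 seconds, for the link elements to become visible.
    wait = WebDriverWait(driver, 120, 5)
    wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'a.card-details')))

I haven't confirmed whether that alone would stop the loop from hanging.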
That is the point where I'm waiting for content that gets loaded dynamically with Ajax, and the content itself loads fine. If I don't wrap it in the function and don't run the above two lines, I'm able to grab the <a> tags, but inside the loop it just gets stuck.
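For comparison, this is roughly what I mean by the working version without the explicit wait (I've put a time.sleep(5) in as a stand-in for the Ajax delay, not something I want to rely on):

    import time

    driver.get(url)
    time.sleep(5)  # crude fixed pause instead of an explicit wait
    soup = BeautifulSoup(driver.page_source, 'lxml')
    links += soup.findAll('a', {'class': 'card-details'})  # this finds the links fine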
I looked at the documentation for the selenium.webdriver.support.expected_conditions module (the EC object in my code above), and I'm fairly unsure which method I should use to make sure the content has been loaded before scraping it with BS4.