
I am trying to scrape a website using Selenium's headless Firefox driver in Python.

I read all the anchors on the webpage and go through them one by one, but I want the browser to wait for the Ajax calls on each page to finish before moving on to the next one.
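
Conceptually, something like the following is what I have in mind for the waiting part (a rough sketch; the jQuery check is an assumption on my part, since I haven't confirmed the pages use jQuery, otherwise only the readyState check would apply):

from selenium.webdriver.support.ui import WebDriverWait

def wait_for_ajax(driver, timeout=10):
    # First wait for the document itself to finish loading
    WebDriverWait(driver, timeout).until(
        lambda d: d.execute_script("return document.readyState") == "complete"
    )
    # Then wait for any pending jQuery Ajax requests to settle
    # (jQuery.active only exists if the page actually uses jQuery)
    WebDriverWait(driver, timeout).until(
        lambda d: d.execute_script("return window.jQuery === undefined || jQuery.active === 0")
    )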

My code is the following:

import time 
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

caps = DesiredCapabilities().FIREFOX
caps["pageLoadStrategy"] = "eager"  #  complete

options = Options()
options.add_argument("--headless")

url = "http://localhost:3000/"

# Using Selenium's webdriver to open the page
driver = webdriver.Firefox(desired_capabilities=caps, firefox_options=options)
driver.get(url)
urls = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.TAG_NAME, "a")))

links = []

for url in urls:
    links.append(url.get_attribute("href"))

for link in links:
    print('navigating to: ' + link)
    driver.get(link)
    body = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.TAG_NAME, "p")))
    driver.execute_script("window.scrollTo(0,1000);")
    print(body)    
    driver.back()

driver.quit()

The line print(body) was added for testing purposes, and it returned incomprehensible text instead of the actual HTML of the page. Here's a part of the printed text:

[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="fb183e8b-ce36-47e7-a03e-d3aeea376304", element="e7dfa6b2-1ddf-438d-b562-1e2ac8416e07")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="fb183e8b-ce36-47e7-a03e-d3aeea376304", element="6fe1ffb0-17a8-4b64-9166-691478a0bbd4")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="fb183e8b-ce36-47e7-a03e-d3aeea376304", element="1f510a00-a587-4ae8-9ecf-dd4c90081a5a")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="fb183e8b-ce36-47e7-a03e-d3aeea376304", element="c1bfb1cd-5ccf-42b6-ad4c-c1a70486cc98")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="fb183e8b-ce36-47e7-a03e-d3aeea376304", element="be44db09-3948-48f1-8505-937db509a157")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="fb183e8b-ce36-47e7-a03e-d3aeea376304", element="68f3c9f2-80b0-493e-a47f-ad69caceaa06")>, 

What is causing this?

Everything (content-related) in the pages I'm scraping is static.


2 Answers


Try this:

for node in body: 
    print(node.get_attribute('innerHTML')) 

This will print the innerHTML of each element as a string.
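
As a side note, if what you actually want is the full HTML of each page rather than individual elements, driver.page_source returns it as a string (this is beyond what the question strictly needs, and 'link' below is the loop variable from the question's code):

driver.get(link)           # 'link' as in the loop from the question
print(driver.page_source)  # the full HTML of the current page as a string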

– Andrei Suvorkov

Note that `body` is actually a `list`. The working code should be `for node in body: print(node.get_attribute('innerHTML'))` – Andersson Jul 13 '18 at 13:20

Given your current code, the output you are seeing is expected.

presence_of_all_elements_located(locator)

presence_of_all_elements_located() is the expectation for checking that at least one element matching the locator is present on the page. It uses a locator strategy to find the elements and returns a list of WebElements once they are located.
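
For illustration (using the same <p> locator as in the question), the returned value is a Python list, so you index or iterate it rather than printing the list object itself:

paragraphs = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.TAG_NAME, "p"))
)
print(len(paragraphs))      # number of matched <p> elements
print(paragraphs[0].text)   # visible text of the first matched element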

As you invoked:

body = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.TAG_NAME, "p")))

Now body contains a list of WebElements, so when you invoke:

print(body) 

The references of the elements are printed to the console as follows:

[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="fb183e8b-ce36-47e7-a03e-d3aeea376304", element="e7dfa6b2-1ddf-438d-b562-1e2ac8416e07")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="fb183e8b-ce36-47e7-a03e-d3aeea376304", element="6fe1ffb0-17a8-4b64-9166-691478a0bbd4")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="fb183e8b-ce36-47e7-a03e-d3aeea376304", element="1f510a00-a587-4ae8-9ecf-dd4c90081a5a")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="fb183e8b-ce36-47e7-a03e-d3aeea376304", element="c1bfb1cd-5ccf-42b6-ad4c-c1a70486cc98")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="fb183e8b-ce36-47e7-a03e-d3aeea376304", element="be44db09-3948-48f1-8505-937db509a157")>, <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="fb183e8b-ce36-47e7-a03e-d3aeea376304", element="68f3c9f2-80b0-493e-a47f-ad69caceaa06")>]

A lot depends on what exactly you want to print. Since you are collecting the elements with the <p> tag, you probably want the text within them. In that case you can scroll each element into view and then print its innerHTML as follows:

body = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.TAG_NAME, "p")))
for element in body:
    driver.execute_script("return arguments[0].scrollIntoView(true);", element)
    print(element.get_attribute("innerHTML"))
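
If only the rendered text is needed rather than the markup, the .text property of each element is an alternative; a minimal variant of the same loop:

body = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.TAG_NAME, "p")))
for element in body:
    print(element.text)
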
– DebanjanB