
I want to remove the <br> HTML tag while scraping a page, but replace doesn't seem to work. I'm not sure whether there is another or better way to do it using Selenium and Python. Thank you in advance.

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome("drivers/chromedriver")

driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")

state_drop = driver.find_element_by_id("state")
state = Select(state_drop)
state.select_by_visible_text("New Hampshire")

driver.find_element_by_id("city").send_keys("Moultonborough")
driver.find_element_by_id("name").send_keys("Moultonborough Academy")
driver.find_element_by_class_name("forms_input_button").send_keys(Keys.RETURN)
driver.find_element_by_id("hsSelectRadio_1").click()

courses_subheading = driver.find_elements_by_tag_name("th.header")

print(courses_subheading[0].text, "     ", courses_subheading[1].text, "     ", courses_subheading[2].text, "     ", courses_subheading[3].text, "     ", courses_subheading[4].text)

I tried this:

for i in courses_subheading:
    courses_subheading.replace("<br>", " ")

but I get an error: AttributeError: 'list' object has no attribute 'replace'

Currently, the output looks like this:

Course
Weight     Title     Notes     Max
Credits       OK
Through       Disability
Course

but I want it like this:

Course Weight     Title     Notes     Max Credits     OK     Through     Disability Course
DebanjanB
J. Doe
  • Hi, have a look here: https://stackoverflow.com/questions/24201926/in-place-replacement-of-all-occurrences-of-an-element-in-a-list-in-python
    Your replace loop was close; you just need to use the loop variable, not the list.
    – RichEdwards Aug 03 '20 at 12:15
  • In your loop, instead of `courses_subheading.replace("<br>", " ")` use `i.replace("<br>", " ")`
    – Tasnuva Aug 03 '20 at 12:21
  • @J.Doe This sounds like an [X-Y problem](http://xyproblem.info/). Instead of asking for help with your solution to the problem, edit your question and ask about the actual problem. What are you trying to do? – DebanjanB Aug 03 '20 at 12:28
  • Thank you for catching that, but I'm still getting an attribute error: `AttributeError: 'WebElement' object has no attribute 'replace'`. Also, I included my attempted solution to give context to the problem. – J. Doe Aug 03 '20 at 12:35
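
Both errors in this thread come from calling replace on something that is not a string: first the list, then a WebElement. The sketch below shows the idea without a browser. The `SimpleNamespace` objects are hypothetical stand-ins for the WebElements (only their `.text` attribute matters here), and the sample texts assume, as the console output in the first answer suggests, that a <br> shows up as `\n` in `.text`.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for Selenium WebElements; a real WebElement also exposes .text.
courses_subheading = [
    SimpleNamespace(text="Course\nWeight"),
    SimpleNamespace(text="Title"),
    SimpleNamespace(text="Max\nCredits"),
]

# replace() is a str method, so call it on each element's .text.
# Strings are immutable, so the result has to be kept; replace() does not modify in place.
cleaned = [el.text.replace("\n", " ") for el in courses_subheading]

print(cleaned)
```

Note that the replacement targets `"\n"` rather than `"<br>"`, because `.text` holds the rendered text, not the HTML markup.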

2 Answers


Instead of removing the <br> tags, you can easily avoid them by reading the text of each header cell. To print the table headers (e.g. Title, Notes, etc.), you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following Locator Strategies:

  • Using css_selector :

    driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")
    Select(WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.ID, "state")))).select_by_visible_text("New Hampshire")
    driver.find_element_by_css_selector("input#city").send_keys("Moultonborough")
    driver.find_element_by_css_selector("input#name").send_keys("Moultonborough Academy")
    driver.find_element_by_css_selector("input[value='Search']").click()
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='hsCode']"))).click()
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table#approvedCourseTable_1 th.header")))])
    
  • Using xpath :

    driver.get("https://web3.ncaa.org/hsportal/exec/hsAction")
    Select(WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.ID, "state")))).select_by_visible_text("New Hampshire")
    driver.find_element_by_xpath("//input[@id='city']").send_keys("Moultonborough")
    driver.find_element_by_xpath("//input[@id='name']").send_keys("Moultonborough Academy")
    driver.find_element_by_xpath("//input[@value='Search']").click()
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//input[@name='hsCode']"))).click()
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@id='approvedCourseTable_1']//th[@class='header']")))])
    
  • Console Output:

    ['Course\nWeight', 'Title', 'Notes', 'Max\nCredits', 'OK\nThrough', 'Disability\nCourse']
    
  • Note : You have to add the following imports :

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
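
If the goal is the single header line from the question, the list printed above can be flattened in plain Python. This is only a sketch: `to_header_row` is a hypothetical helper name, and the five-space separator is copied from the question's print call.

```python
# Header texts exactly as shown in the console output above.
headers = ['Course\nWeight', 'Title', 'Notes', 'Max\nCredits', 'OK\nThrough', 'Disability\nCourse']


def to_header_row(texts, sep="     "):
    """Collapse all whitespace (including newlines) inside each text, then join with sep."""
    # str.split() with no argument splits on any whitespace run and drops empty pieces.
    return sep.join(" ".join(t.split()) for t in texts)


row = to_header_row(headers)
print(row)
```

With real elements, the same helper could be fed `[el.text for el in elements]`.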
    
DebanjanB

To complete the previous answer: if you really want to remove the <br> tags, you can use the following (I've fixed your XPath expression):

import re
courses_subheading = driver.find_elements_by_xpath("(//tr[th[@class='header']])[1]/th")
headers = [re.sub(r'\s+', ' ', el.text) for el in courses_subheading]  # raw string avoids an invalid-escape warning
print(headers)

Output :

['Course Weight', 'Title', 'Notes', 'Max Credits', 'OK Through', 'Disability Course']
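
As a small illustration of why a `\s+` pattern can be preferable to a plain replace: it collapses any run of whitespace, not just the newline. The sample string below is hypothetical (the real header texts may not have spaces around the line break).

```python
import re

# Hypothetical header text with stray spaces around the line break.
raw = "OK \n Through"

# str.replace swaps only the newline, so the surrounding spaces survive.
replaced = raw.replace("\n", " ")

# r"\s+" matches a whole run of whitespace and turns it into one space.
collapsed = re.sub(r"\s+", " ", raw).strip()

print(repr(replaced))
print(repr(collapsed))
```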
E.Wiest