I want to scrape https://wiki.openstreetmap.org/wiki/Key:office, specifically the table containing all the tags, i.e. everything contained within:

<table class="wikitable taginfo-taglist">...</table>

Since everything within:

<div class="taglist" ...> ... </div>

(the parent of the table) is generated by JavaScript, I thought this code could work:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
    
options = Options()
options.add_argument("--headless")
caps = webdriver.DesiredCapabilities().FIREFOX
caps["marionette"] = True
driver = webdriver.Firefox(options=options, capabilities=caps, executable_path='../statics/geckodriver')
    
    
def get_tag_soup(url):
    driver.get(url)
    try:
        table = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME , "wikitable taginfo-taglist")))
        soup = BeautifulSoup(table.get_attribute('innerHTML'), 'lxml') 
    except Exception as e:
        soup = e
    
    return soup 

get_tag_soup('https://wiki.openstreetmap.org/wiki/Key:office')

But when I run this code I just get a selenium.common.exceptions.TimeoutException('', None, None). More frustratingly, if I instead use WebDriverWait for the parent of "wikitable taginfo-taglist" with EC.presence_of_element_located((By.CLASS_NAME, "taglist")), it sometimes works.

barny
Thagor

1 Answer

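To extract the table containing all the tags, induce WebDriverWait for visibility_of_element_located() instead of presence_of_element_located(), and address the table with a locator that matches both of its classes. By.CLASS_NAME accepts only a single class name, so the space-separated value "wikitable taginfo-taglist" will not match. You can use either of the following Locator Strategies: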

  • Using CSS_SELECTOR:

    driver.get("https://wiki.openstreetmap.org/wiki/Key:office")
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table.wikitable.taginfo-taglist"))).text)
    
  • Using XPATH:

    driver.get("https://wiki.openstreetmap.org/wiki/Key:office")
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='wikitable taginfo-taglist']"))).text)
    
  • Console Output:

    Key Value Element Description Map rendering Image Count
    office accountant An office for an accountant.
    6 895
    1 967
    14
    office advertising_agency A service-based business dedicated to creating, planning, and handling advertising.
    3 916
    580
    3
    office architect An office for an architect or group of architects.
    5 715
    1 239
    12
    office association An office of a non-profit organisation, society, e.g. student, sport, consumer, automobile, bike association, etc.
    13 054
    3 286
    50
    office charity An office of a charitable organization
    696
    384
    7
    office company An office of a private company
    129 801
    36 951
    608
    office consulting An office for a consulting firm, providing expert professional advice to other companies or organisations.
    1 341
    162
    4
    office coworking An office where people can go to work (might require a fee); not limited to a single employer
    1 297
    320
    7
    office diplomatic
    6 634
    4 065
    95
    office educational_institution An office for an educational institution.
    14 172
    8 563
    175
    office employment_agency An office for an employment service.
    7 300
    1 771
    43
    office energy_supplier An office for a energy supplier.
    2 237
    1 112
    19
    office engineer An office for an engineer or group of engineers.
    454
    98
    2
    office estate_agent A place where you can rent or buy a house.
    44 813
    8 042
    39
    office financial An office of a company in the financial sector
    4 891
    1 588
    24
    office forestry A forestry office
    523
    741
    9
    office foundation An office of a foundation
    1 757
    542
    10
    office government An office of a (supra)national, regional or local government agency or department
    98 289
    70 569
    2 300
    office guide An office for tour guides, mountain guides, dive guides, etc.
    587
    168
    1
    office insurance An office at which you can take out insurance policies.
    34 693
    6 475
    91
    office it An office for an IT specialist.
    9 486
    2 039
    51
    office lawyer An office for a lawyer.
    22 881
    4 841
    22
    office logistics An office for a forwarder / hauler.
    2 796
    677
    8
    office moving_company An office which offers a relocation service.
    605
    252
    4
    office newspaper An office of a newspaper
    3 511
    1 450
    27
    office ngo An office for a non-profit, non-governmental organisation (NGO).
    12 693
    3 565
    58
    office notary An office for a notary public (common law)
    3 860
    548
    9
    office political_party An office of a political party
    3 354
    1 017
    8
    office property_management Office of a company, which manages a real estate property.
    796
    162
    2
    office quango An office of a quasi-autonomous non-governmental organisation.
    366
    233
    4
    office religion office of a community of faith
    5 807
    2 172
    43
    office research An office for research and development
    3 667
    4 545
    348
    office surveyor An office of a person doing surveys, this can be risk and damage evaluations of properties and equipment, opinion surveys or statistics.
    451
    109
    1
    office tax_advisor An office for a financial expert specially trained in tax law
    5 053
    823
    4
    office telecommunication An office for a telecommunication company
    16 968
    4 335
    77
    office visa An office of an organisation or business which offers visa assistance
    95
    1
    0
    office water_utility The office for a water utility company or water board.
    743
    908
    20
    office yes Generic tag for unspecified office type.
    27 434
    36 155
    420
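
One likely reason the original locator timed out, consistent with the asker's later comment that "wikitable taginfo-taglist" was not the right selector: By.CLASS_NAME expects a single class name, so a value containing a space does not match an element that carries both classes. A minimal sketch of turning a class attribute value into a compound CSS selector (the helper name css_from_class_attr is hypothetical, not part of Selenium):

```python
def css_from_class_attr(tag, class_attr):
    """Build a compound CSS selector from a tag name and a class attribute value.

    "wikitable taginfo-taglist" holds two classes; in CSS each one is
    written with its own leading dot and no whitespace in between.
    """
    classes = class_attr.split()  # split on any run of whitespace
    return tag + "".join("." + name for name in classes)


selector = css_from_class_attr("table", "wikitable taginfo-taglist")
print(selector)  # table.wikitable.taginfo-taglist
```

The resulting string can be passed as (By.CSS_SELECTOR, selector) to EC.visibility_of_element_located, matching the CSS_SELECTOR example above.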
    

Note: Make sure the browser viewport is maximized, as follows:

options.add_argument("start-maximized")
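
Putting the pieces together, here is a sketch under a few assumptions: fetch_table_html and parse_rows are hypothetical helper names, the standard-library html.parser stands in for the BeautifulSoup/lxml parsing used in the question, and driver is an already configured WebDriver like the one in the question. outerHTML is read instead of the question's innerHTML so that the table element itself is included.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so multi-line cells become one line.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_rows(table_html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(table_html)
    parser.close()
    return parser.rows


def fetch_table_html(driver, url, timeout=20):
    """Load url and return the outerHTML of the rendered tag table."""
    # Imported here so parse_rows() also works where Selenium is not installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    table = WebDriverWait(driver, timeout).until(
        EC.visibility_of_element_located(
            (By.CSS_SELECTOR, "table.wikitable.taginfo-taglist")
        )
    )
    return table.get_attribute("outerHTML")


# Static sample so the parsing step can be tried without a browser.
sample = (
    "<table><tr><th>Key</th><th>Value</th></tr>"
    "<tr><td>office</td><td>accountant</td></tr></table>"
)
print(parse_rows(sample))
```

With a live driver this would be called as parse_rows(fetch_table_html(driver, "https://wiki.openstreetmap.org/wiki/Key:office")); the sample string only illustrates the parsing step.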
DebanjanB
  • Thanks for the answer, but both the XPath and the CSS selector produce the same timeout error for me. Maybe the issue is that the driver isn't rendering the JavaScript? – Thagor Feb 05 '21 at 10:53
  • @Thagor Check out the updated answer and let me know the status. – DebanjanB Feb 05 '21 at 11:00
  • Sadly, it does not solve the issue. I tried waiting for 120 seconds, which doesn't help either, and I tried setting a window size, which does nothing as well. – Thagor Feb 05 '21 at 11:06
  • @Thagor Can you just copy and paste my code and retest please? – DebanjanB Feb 05 '21 at 11:18
  • I played a bit more with the code and found that when I do `time.sleep(10)` the table gets rendered, but (By.CSS_SELECTOR, "wikitable taginfo-taglist") only works for `"taglist"`, not for `"table.wikitable.taginfo-taglist"`; there the code times out. – Thagor Feb 05 '21 at 11:19
  • Okay, tried it @DebanjanB and the CSS selector works! `"wikitable taginfo-taglist"` wasn't the right selector. – Thagor Feb 05 '21 at 11:20
  • @Thagor Glad to be able to help you. Please [_accept_](https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work) the _answer_ by clicking on the hollow tick mark beside my _answer_ which is just below the _votedown_ arrow, so the tick mark turns _green_. – DebanjanB Feb 05 '21 at 11:21