0

I am writing a Python Selenium script to extract the URLs of LinkedIn profiles from a Google search, but I am having trouble narrowing my XPath so that it returns only the search result links on Google.

linkedin_urls = driver.find_elements_by_xpath('//div[@class="yuRUbf"]//a[@href]')
for linkedin_url in linkedin_urls:
    url = linkedin_url.get_attribute("href")
    print(url)

    driver.get(url)
    sleep(5)

The results from linkedin_urls give me:

https://uk.linkedin.com/in/roxana-andreea-popescu
https://uk.linkedin.com/in/tunjijabitta
https://www.google.com/search?source=hp&ei=bxjhX4uGC4_ykgXl9pu4Bw&q=site%3Alinkedin.com%2Fin%2F+AND+%22Software+Developer%22+AND+%22London%22&oq=site%3Alinkedin.com%2Fin%2F+AND+%22Software+Developer%22+AND+%22London%22&gs_lcp=CgZwc3ktYWIQDFDMZFjhZmCwZ2gAcAB4AIABLogBsAGSAQE0mAEAoAEBqgEHZ3dzLXdpeg&sclient=psy-ab&ved=0ahUKEwjL-dn4huDtAhUPuaQKHWX7BncQ4dUDCA0#
https://www.google.com/search?q=related:https://uk.linkedin.com/in/tunjijabitta&sa=X&ved=2ahUKEwji3qP_huDtAhWAZxUIHTyfAO4QHzABegQIBhAH
https://uk.linkedin.com/in/janomer
https://uk.linkedin.com/in/josephcoker
https://uk.linkedin.com/in/sebemin
https://uk.linkedin.com/in/vicki-marshall-b7433827
https://www.google.com/search?source=hp&ei=bxjhX4uGC4_ykgXl9pu4Bw&q=site%3Alinkedin.com%2Fin%2F+AND+%22Software+Developer%22+AND+%22London%22&oq=site%3Alinkedin.com%2Fin%2F+AND+%22Software+Developer%22+AND+%22London%22&gs_lcp=CgZwc3ktYWIQDFDMZFjhZmCwZ2gAcAB4AIABLogBsAGSAQE0mAEAoAEBqgEHZ3dzLXdpeg&sclient=psy-ab&ved=0ahUKEwjL-dn4huDtAhUPuaQKHWX7BncQ4dUDCA0#
https://www.google.com/search?q=related:https://uk.linkedin.com/in/vicki-marshall-b7433827&sa=X&ved=2ahUKEwji3qP_huDtAhWAZxUIHTyfAO4QHzAFegQIARAH
https://uk.linkedin.com/in/andreibodnar
https://www.google.com/search?q=related:https://uk.linkedin.com/in/andreibodnar&sa=X&ved=2ahUKEwji3qP_huDtAhWAZxUIHTyfAO4QHzAGegQIBxAH
https://uk.linkedin.com/in/dmrlawson
https://uk.linkedin.com/in/jack-gilbert-541a251b
https://www.google.com/search?source=hp&ei=bxjhX4uGC4_ykgXl9pu4Bw&q=site%3Alinkedin.com%2Fin%2F+AND+%22Software+Developer%22+AND+%22London%22&oq=site%3Alinkedin.com%2Fin%2F+AND+%22Software+Developer%22+AND+%22London%22&gs_lcp=CgZwc3ktYWIQDFDMZFjhZmCwZ2gAcAB4AIABLogBsAGSAQE0mAEAoAEBqgEHZ3dzLXdpeg&sclient=psy-ab&ved=0ahUKEwjL-dn4huDtAhUPuaQKHWX7BncQ4dUDCA0#
https://www.google.com/search?q=related:https://uk.linkedin.com/in/jack-gilbert-541a251b&sa=X&ved=2ahUKEwji3qP_huDtAhWAZxUIHTyfAO4QHzAIegQICxAH
https://uk.linkedin.com/in/eren-batu-999068185

I am trying to find a way to narrow the results to only the LinkedIn profile links.

Emmanuel

3 Answers

0

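If you only want the LinkedIn results, use one of the XPaths below.

Use contains() (note that this also matches Google's related: links, because their href embeds the LinkedIn URL):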

linkedin_urls = driver.find_elements_by_xpath('//div[@class="yuRUbf"]//a[contains(@href,"https://uk.linkedin.com")]')

Or use starts-with(), which only matches hrefs that begin with the LinkedIn address:

linkedin_urls = driver.find_elements_by_xpath('//div[@class="yuRUbf"]//a[starts-with(@href,"https://uk.linkedin.com")]')
KunduK
  • Hey @Kunduk, I tried using `starts-with()` but I get an error after the script gets the first LinkedIn URL. `linkedin_urls = driver.find_elements_by_xpath('//div[@class="yuRUbf"]//a[starts-with(@href,"https://uk.linkedin.com")]') print(linkedin_urls) sleep(0.5) for linkedin_url in linkedin_urls: url = linkedin_url.get_attribute("href") print(url) driver.get(url) sleep(5) sel = Selector(text=driver.page_source) ` – Emmanuel Dec 22 '20 at 16:44
  • The error being: Traceback (most recent call last): File "c:\Users\emman\Documents\Final_Year_Project\LinkedinWebDriver.py", line 50, in url = linkedin_url.get_attribute("href") File "C:\Users\emman\AppData\Local\Programs\Python\Python39\lib\site-packages\selenium\webdriver\remote\webelement.py", line 139, in get_attribute attributeValue = self.parent.execute_script( File "C:\Users\emman\AppData\Local\Programs\Python\Python39\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 634, in execute_script r – Emmanuel Dec 22 '20 at 16:46
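
One plausible reading of the truncated traceback above is a stale element reference: once driver.get(url) loads a new page, the remaining elements in linkedin_urls still point at the old page. That diagnosis is an assumption, since the error message itself is cut off. A minimal sketch of the usual workaround is to copy every href into a plain list of strings before navigating anywhere. The helper names collect_hrefs and visit_all are hypothetical, and the code relies only on get_attribute and get, so it works with any objects that expose those methods.

```python
from time import sleep


def collect_hrefs(elements):
    """Return the href of each element as a plain string, skipping empty ones."""
    hrefs = []
    for element in elements:
        href = element.get_attribute("href")
        if href:
            hrefs.append(href)
    return hrefs


def visit_all(driver, urls, pause=5):
    """Open each URL in turn. Only strings are used here, so no element can go stale."""
    visited = []
    for url in urls:
        driver.get(url)
        sleep(pause)  # crude fixed wait, as in the question
        visited.append(url)
    return visited
```

With a real driver, this might be used as urls = collect_hrefs(linkedin_urls) right after the find_elements_by_xpath call, followed by visit_all(driver, urls). Any scraping of the profile page would then go inside the loop in visit_all.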
0

You want to check each URL string (the href value stored in url, not the WebElement itself) to see if it mentions LinkedIn.

    if 'linkedin' in url:
        print('linkedin')

Basically, put the driver code that you want to be executed on LinkedIn addresses under the if statement.
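
A plain substring test would also match the Google related: links from the question's output, because those URLs contain linkedin.com in their query string. A hedged sketch that checks the hostname and path instead, using the standard library's urllib.parse, is shown here. The helper name is_linkedin_profile is hypothetical, and treating /in/ as the profile path prefix is an assumption based on the URLs shown in the question.

```python
from urllib.parse import urlparse


def is_linkedin_profile(url):
    """True if the URL's host is linkedin.com (or a subdomain) and the path starts with /in/."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    on_linkedin = host == "linkedin.com" or host.endswith(".linkedin.com")
    return on_linkedin and parts.path.startswith("/in/")


# Sample URLs taken from the question's output (the second one shortened).
urls = [
    "https://uk.linkedin.com/in/janomer",
    "https://www.google.com/search?q=related:https://uk.linkedin.com/in/tunjijabitta&sa=X",
]
profile_urls = [u for u in urls if is_linkedin_profile(u)]
print(profile_urls)
```

Filtering on the parsed hostname keeps the check independent of the country subdomain (uk., www., and so on), which a hard-coded prefix would not be.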

0

To restrict the search to only the LinkedIn results, you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following Locator Strategies:

  • Using CSS_SELECTOR:

    print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.yuRUbf a[href^='https://uk.linkedin.com/in']")))])
    
  • Using XPATH:

    print([my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='yuRUbf']//a[starts-with(@href, 'https://uk.linkedin.com/in')]")))])
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
DebanjanB