
I am trying to get the text content of a <p> containing particular text, using Selenium for Python.

My code works for most pages where I deploy it, but not for this particular page and some others I have encountered.

On other pages the code returns the text content of the matched <p>; here it does find the element, but returns what seems to be an empty string.

What could be causing this?

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://advisors.vanguard.com/VGApp/iip/site/advisor/investments/productoverview?fundId=4415")

match_string = "seeks to track the"

# Match any <p>, <span> or <div> whose direct text contains match_string
elmnt = driver.find_element_by_xpath(
    "//*[self::p or self::span or self::div]"
    "[text()[contains(., '%s')]]" % match_string
).text

print "Result: " + elmnt

Part of the page HTML, where I want to get the text in the <p>:

<div style="margin:0px;">   
    <h2 style="margin-bottom:8px" class="option1"><!--PPE:Content-188-->Summary of this fund<!--End PPE--></h2>
    <p>Vanguard International Dividend Appreciation ETF seeks to track the performance of a benchmark index that measures the investment return of non-U.S. companies that have a history of increasing dividends.</p>
</div>
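As a sanity check (a sketch of my own, assuming lxml is available), the same XPath can be run against just this fragment to confirm the expression matches the <p> and nothing else:

```python
# Sketch (assumes lxml is installed): run the question's XPath against the
# quoted fragment to confirm the expression itself matches the <p>.
from lxml import html

fragment = """
<div style="margin:0px;">
    <h2 style="margin-bottom:8px" class="option1"><!--PPE:Content-188-->Summary of this fund<!--End PPE--></h2>
    <p>Vanguard International Dividend Appreciation ETF seeks to track the performance of a benchmark index that measures the investment return of non-U.S. companies that have a history of increasing dividends.</p>
</div>
"""

match_string = "seeks to track the"
nodes = html.fromstring(fragment).xpath(
    "//*[self::p or self::span or self::div]"
    "[text()[contains(., '%s')]]" % match_string
)
print(len(nodes))    # 1 -- only the <p> has a matching text() child
print(nodes[0].tag)  # p
```

So the XPath itself is sound against well-formed markup; the <div> does not match because its direct text() children are only whitespace.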
P A N
  • Please include the HTML you are working with – Mo H. Jun 27 '16 at 15:21
  • @MoH. It's linked in the question and in the code. – P A N Jun 27 '16 at 15:22
  • That's nice, but I'm not clicking a link to an external site; please include all relevant code in your question – Mo H. Jun 27 '16 at 15:23
  • @MoH. Added the part of the code. – P A N Jun 27 '16 at 15:30
  • What is the logic behind the xpath? – Padraic Cunningham Jun 27 '16 at 19:54
  • @PadraicCunningham It looks for any node of the types `p`, `span` or `div` and then returns an element if it contains the text in `match_string`. – P A N Jun 27 '16 at 19:59
  • So any/all of the 3 that contain match_string? – Padraic Cunningham Jun 27 '16 at 20:00
  • @PadraicCunningham Yes, for the particular webpage it will be a `p` node. – P A N Jun 27 '16 at 20:00
  • But only one will match? – Padraic Cunningham Jun 27 '16 at 20:01
  • @PadraicCunningham Yes on this webpage only one will match, because the `match_string` is very particular. But the other nodes are there because the code is deployed on other pages (if you were looking to redact the nodes). I don't know if they are causing the problem, but I don't think so, because the code works on other pages. – P A N Jun 27 '16 at 20:04
  • Using `//*[self::p or self::span or self::div][text()[contains(., 'seeks to track the')]]/text()`, albeit with requests and lxml, I get `'Vanguard International Dividend Appreciation ETF seeks to track the performance of a benchmark index that measures the investment return of non-U.S. companies that have a history of increasing dividends.'` twice – Padraic Cunningham Jun 27 '16 at 20:11
  • It is strange: the tag_name is p, the parent is `div`, and `elmnt.find_element_by_xpath("./preceding-sibling::h2").get_attribute("class")` shows `option1`, so you are definitely finding the correct tag, but for some reason Selenium is giving you an empty string – Padraic Cunningham Jun 27 '16 at 20:34
  • OK, I saved both the source from requests and what driver.page_source returned. The HTML is broken with Chrome and Firefox, while `driver = webdriver.PhantomJS()` has no issue. Save it yourself and view it in a good editor with HTML support and you will see red everywhere; view it in the browser and you see encoding issues. lxml does indeed handle it fine, but obviously Selenium not so much – Padraic Cunningham Jun 27 '16 at 20:51
  • @PadraicCunningham Thanks for taking a deeper look at this. I'm a bit surprised the error seems to be prevalent, because I've had this error on multiple pages, while the code works on most pages I've tried to scrape. The problem may be more Selenium-related than anything. I won't be able to switch to `PhantomJS` for this project, but maybe for the next :) – P A N Jun 27 '16 at 21:12
  • No worries, it has to be a parsing issue as you do actually get to the tag but for some reason the text that should be present does not seem to be there. You could try another version of firefox, I have an older binary here I will test out and see. – Padraic Cunningham Jun 27 '16 at 21:14
  • Unfortunately the same issue, could you incorporate lxml into your code? Or indeed bs4? – Padraic Cunningham Jun 27 '16 at 21:18
  • @PadraicCunningham Might be able to parse via `bs4`. Haven't tried `lxml` before. If I find a solution I'll try to post something here. – P A N Jun 27 '16 at 21:20
  • lxml works for me parsing the source from both Firefox and Chrome, so it might be a more reliable option – Padraic Cunningham Jun 27 '16 at 21:25
  • @Winterflags, can you share another working and another broken site? – Padraic Cunningham Jun 27 '16 at 21:38
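The lxml fallback suggested in the comments might look like the sketch below. It uses a stand-in string where the real script would pass `driver.page_source` after `driver.get(...)`:

```python
# Sketch of the workaround from the comments: let Selenium fetch the page,
# but hand the (possibly broken) markup to lxml, which parses it more
# leniently than the text the browser drivers expose via .text.
from lxml import html

match_string = "seeks to track the"

# Stand-in for driver.page_source; in the real script this string would
# come from Selenium after driver.get(...).
page_source = """
<html><body>
<div style="margin:0px;">
    <p>Vanguard International Dividend Appreciation ETF seeks to track the performance of a benchmark index that measures the investment return of non-U.S. companies that have a history of increasing dividends.</p>
</div>
</body></html>
"""

tree = html.fromstring(page_source)
texts = tree.xpath(
    "//*[self::p or self::span or self::div]"
    "[text()[contains(., '%s')]]/text()" % match_string
)
result = texts[0].strip()
print("Result: " + result)
```

This keeps the same XPath as the question; only the extraction step moves from Selenium's `.text` to lxml's `text()` nodes.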

0 Answers