I am using org.jdom2.xpath to evaluate XPath queries on HTML documents. Attempting to retrieve a script's text from the head element, I tried this query:

/html/head/script[contains(text(), 'expression1') and contains(text(), 'expression2')]/text()

This query returns a single result in both XPath Helper and the Chrome console ($x queries) but returns an empty result set using org.jdom2.xpath.

Trying the simpler (but heavier) query:

//script[contains(text(), 'expression1') and contains(text(), 'expression2')]/text()

produces the same results.

Code sample:

String xpath = "/html/head/script[contains(text(), 'expression1') and contains(text(), 'expression2')]/text()";
List<Text> tokeScriptResults = (List<Text>) xpathFactory.compile(xpath).evaluate(document);

Afterthought: looking at the Document object, I see that because the script text is very long, jdom2 split it into an array of Text nodes instead of one long Text. Could this be the issue?

eladidan

1 Answer

Short answer - use . instead of text(), i.e. contains(., 'expression1')

Longer answer - text() is a path step that selects the set of all text nodes that are immediate children of the context node. The contains function expects its arguments to be strings, not node sets, and the rule to convert a node set to a string in XPath 1.0 is to take the string value of the first node in the set in document order and ignore the other nodes completely. Therefore the test contains(text(), 'expression1') only looks in the first text node child.

If instead you do contains(., 'expression1') then the first argument is a set containing a single node (the script element), and the string value of an element node is the concatenation of all its descendant text nodes in document order. So this will look in all the text under the script tag, not just the first text node child.
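The difference between the two predicates can be reproduced with a small sketch. Note the assumptions: it uses the JDK's built-in javax.xml.xpath engine rather than JDOM2, the class name SplitTextDemo and the sample markup are made up for illustration, and a comment is placed inside the script element to force two separate text node children (standing in for the split Text nodes described in the question).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class SplitTextDemo {
    // Hypothetical sample document: the comment splits the script content
    // into two text node children of the script element.
    static final String XML =
        "<html><head><script>var a = 'expression1';<!-- split -->"
        + "var b = 'expression2';</script></head></html>";

    // Returns true if a script element in the head satisfies the given predicate.
    public static boolean matches(String predicate) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
        String xpath = "boolean(/html/head/script[" + predicate + "])";
        return (Boolean) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.BOOLEAN);
    }

    public static void main(String[] args) throws Exception {
        // text() converts to the string value of the FIRST text node only,
        // so 'expression2' (in the second text node) is never seen.
        System.out.println(matches(
            "contains(text(), 'expression1') and contains(text(), 'expression2')"));
        // . converts to the string value of the script element, which is the
        // concatenation of all its descendant text nodes.
        System.out.println(matches(
            "contains(., 'expression1') and contains(., 'expression2')"));
    }
}
```

Under these assumptions the first predicate should fail to match and the second should match. The same reasoning would apply to the original JDOM2 query if the script content is indeed held as several Text nodes.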

In general you should very rarely need to use text() in XPath. It is only required when you absolutely must handle each separate text node individually. In predicates I find testing the string value of an element node almost always captures the intention better.

Ian Roberts