
I would like to search an XML file for multiple tags.

I can evaluate these commands separately:

tree.findall('.//title')
tree.findall('.//p')

but how can I evaluate them together in a single call? I am looking for a syntax like .//title or .//p

I tried this command from an SO post

tree.findall('.//(p|title)')

but I get this error: SyntaxError: invalid descendant

876868587

2 Answers


Instead of traversing the tree twice and joining the node sets, it is better to make a single pass that matches any element with the * wildcard and checks the tag name via self:: (reference):

tree.xpath("//*[self::p or self::title]") 

Demo:

In [1]: from lxml.html import fromstring

In [2]: html = """
    ...: <body>
    ...:     <p>Paragraph 1</p>
    ...:     <div>Other info</div>
    ...:     <title>Title 1</title>
    ...:     <span>
    ...:         <p>Paragraph 2</p>
    ...:     </span>
    ...:     <title>Title 2</title>
    ...: </body>
    ...: """

In [3]: root = fromstring(html)

In [4]: [elm.text_content() for elm in root.xpath("//*[self::p or self::title]")] 
Out[4]: ['Paragraph 1', 'Title 1', 'Paragraph 2', 'Title 2']
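
The same single-pass idea can be sketched without xpath(). This is only a minimal illustration, assuming the standard library's xml.etree.ElementTree (which has no xpath() method) and reusing the sample markup from the demo above; the names `wanted`, `matches`, and `texts` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Same sample markup as in the demo above, parsed as XML this time.
XML = """
<body>
    <p>Paragraph 1</p>
    <div>Other info</div>
    <title>Title 1</title>
    <span>
        <p>Paragraph 2</p>
    </span>
    <title>Title 2</title>
</body>
""".strip()

root = ET.fromstring(XML)

# Tag names to keep (hypothetical helper set for this sketch).
wanted = {"p", "title"}

# iter() walks the whole tree once, in document order;
# the tag check plays the role of "self::p or self::title".
matches = [elm for elm in root.iter() if elm.tag in wanted]
texts = [elm.text for elm in matches]

print(texts)
```

Filtering on `elm.tag` keeps the traversal to one pass, at the cost of doing the name test in Python rather than inside the XPath engine.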
alecxe

Try

tree.xpath('.//p | .//title')

The result is the union of both node sets.
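
Where only findall() is available, a union can be approximated by hand. The following is a rough sketch, assuming the standard library's xml.etree.ElementTree and a small made-up document; `combined`, `position`, and `ordered` are illustrative names, not part of any library API.

```python
import xml.etree.ElementTree as ET

# Small hypothetical document for the sketch.
root = ET.fromstring(
    "<body><p>A</p><title>T</title><span><p>B</p></span></body>"
)

# Two separate queries, concatenated: every <p> first, then every <title>.
combined = root.findall(".//p") + root.findall(".//title")
print([elm.text for elm in combined])

# Map each element to its position in a document-order walk,
# then sort the combined list by that position.
position = {elm: i for i, elm in enumerate(root.iter())}
ordered = sorted(combined, key=position.__getitem__)
print([elm.text for elm in ordered])
```

The plain concatenation groups the results by tag, so the extra sort is only needed when the elements should come back in the order they appear in the document.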

zx485
  • I tried this and I got `ValueError: Empty tag name`. Any thoughts on where this error is coming from? – 876868587 Jul 15 '18 at 15:58
  • The union operator (`|`) is not supported by `findall()`, but it works with `xpath()`. See https://lxml.de/FAQ.html#what-are-the-findall-and-xpath-methods-on-element-tree. – mzjn Jul 17 '18 at 13:51
  • Thanks for mentioning this bug. I corrected the answer accordingly. – zx485 Feb 03 '20 at 17:41