Finding multiple types of tags with lxml findall() with xpath?

Question

I would like to search an XML file for multiple tags.

I can evaluate these commands separately:

tree.findall('.//title')
tree.findall('.//p')

But how can I evaluate them together in a single call? I am looking for a syntax like .//title or .//p

I tried this expression from an SO post:

tree.findall('.//(p|title)')

but I get this error in the traceback: SyntaxError: invalid descendant

Please can you add a sample of the XML you are parsing? – PRMoureu Jul 15 '18 at 16:00

score 4 · Accepted Answer · answered Jul 16 '18 at 03:08

Instead of traversing the tree twice and joining the node sets, it is better to make a single pass that matches every element with the * wildcard and tests the element name via the self:: axis (reference):

tree.xpath("//*[self::p or self::title]")

Demo:

In [1]: from lxml.html import fromstring

In [2]: html = """
    ...: <body>
    ...:     <p>Paragraph 1</p>
    ...:     <div>Other info</div>
    ...:     <title>Title 1</title>
    ...:     <span>
    ...:         <p>Paragraph 2</p>
    ...:     </span>
    ...:     <title>Title 2</title>
    ...: </body>
    ...: """

In [3]: root = fromstring(html)

In [4]: [elm.text_content() for elm in root.xpath("//*[self::p or self::title]")] 
Out[4]: ['Paragraph 1', 'Title 1', 'Paragraph 2', 'Title 2']

score 0 · Answer 2 · zx485 · 2020-02-03T17:38:59.240

Try

tree.xpath('.//p | .//title')

The result is the union of both node sets.
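As a rough check that the union form and the self:: predicate form select the same elements, here is a minimal sketch. The sample XML and the variable names are made up for illustration; only `etree.fromstring()` and `xpath()` come from lxml. The comparison sorts the texts, so it makes no claim about the order in which either query returns its nodes.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
xml = """
<doc>
  <title>Title 1</title>
  <p>Paragraph 1</p>
  <section>
    <p>Paragraph 2</p>
    <title>Title 2</title>
  </section>
</doc>
"""
root = etree.fromstring(xml)

# Union of two location paths (works with xpath(), not with findall()).
union = root.xpath(".//p | .//title")

# Single path with a predicate on the element name.
predicate = root.xpath(".//*[self::p or self::title]")

# Compare the selected texts independent of ordering.
union_texts = sorted(el.text for el in union)
predicate_texts = sorted(el.text for el in predicate)
print(union_texts == predicate_texts)
```

Both queries should pick up the two p elements and the two title elements, including the ones nested inside section.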

edited Feb 03 '20 at 17:38

answered Jul 15 '18 at 15:54

zx485



I tried this and I got `ValueError: Empty tag name`. Any thoughts on where this error is coming from? – 876868587 Jul 15 '18 at 15:58
The union operator (`|`) is not supported by `findall()`, but it works with `xpath()`. See https://lxml.de/FAQ.html#what-are-the-findall-and-xpath-methods-on-element-tree. – mzjn Jul 17 '18 at 13:51
Thanks for mentioning this bug. I corrected the answer accordingly. – zx485 Feb 03 '20 at 17:41
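Since the comments point out that `findall()` does not accept the `|` operator, here is a hedged sketch of two ways to stay within the ElementTree-style API instead of `xpath()`. The sample XML and variable names are hypothetical. Note that `iter()` also considers the element it is called on, which `.//` does not, so the two are not strictly equivalent when the starting element itself has one of the wanted tags.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
root = etree.fromstring(
    "<doc><title>T1</title><p>P1</p><div><p>P2</p></div><title>T2</title></doc>"
)

# Option 1: two findall() passes, concatenated.
# The result is grouped by tag, not in document order.
grouped = root.findall(".//title") + root.findall(".//p")
print([el.text for el in grouped])   # ['T1', 'T2', 'P1', 'P2']

# Option 2: lxml's iter() accepts several tag names and walks the
# tree once, yielding matches in document order.
in_order = list(root.iter("title", "p"))
print([el.text for el in in_order])  # ['T1', 'P1', 'P2', 'T2']
```

The second option keeps document order with a single traversal, which may matter if the titles and paragraphs need to stay interleaved as they appear in the file.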
