Extract text from XML documents in Python

Question

This is the sample XML document:

<bookstore>
    <book category="COOKING">
        <title lang="english">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>300.00</price>
    </book>

    <book category="CHILDREN">
        <title lang="english">Harry Potter</title>
        <author>J K. Rowling </author>
        <year>2005</year>
        <price>625.00</price>
    </book>
</bookstore>

I want to extract the text without specifying the element names. How can I do this? I have 10 such documents, and they all have different structures. The user enters a word that I don't know in advance, and it has to be searched for in the text portions of all 10 XML documents. For that to work, I need to find where the text is without knowing anything about the elements.

Please Help!!

score 2 · Answer 1 · answered Jul 01 '12 at 04:57

Using the lxml library with an xpath query is possible:

xml="""<bookstore>
    <book category="COOKING">
        <title lang="english">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>300.00</price>
    </book>

    <book category="CHILDREN">
        <title lang="english">Harry Potter</title>
        <author>J K. Rowling </author>
        <year>2005</year>
        <price>625.00</price>
    </book>
</bookstore>
"""
from lxml import etree
root = etree.fromstring(xml)  # fromstring() already returns the root element
root.xpath('/bookstore/book/*/text()')
# ['Everyday Italian', 'Giada De Laurentiis', '2005', '300.00', 'Harry Potter', 'J K. Rowling ', '2005', '625.00']

Although you don't get the category, because it is an attribute rather than element text.
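If the attribute values such as the category are wanted as well, one possible approach is to walk the tree and collect both. The sketch below swaps in the standard library's xml.etree.ElementTree for lxml so that nothing has to be installed; the names `sample`, `_collect` and `text_and_attributes` are made up for illustration, and the sample is a shortened copy of the document from the question.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document, for demonstration only.
sample = """<bookstore>
    <book category="COOKING">
        <title lang="english">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
    </book>
</bookstore>"""


def _collect(element, found):
    """Append attribute values and non-blank text of `element` and its descendants."""
    found.extend(element.attrib.values())
    if element.text and element.text.strip():
        found.append(element.text.strip())
    for child in element:
        _collect(child, found)
        # Text that follows a child's closing tag is stored on the child as .tail.
        if child.tail and child.tail.strip():
            found.append(child.tail.strip())


def text_and_attributes(xml_string):
    """Return attribute values and text from an XML string, without naming any tag."""
    found = []
    _collect(ET.fromstring(xml_string), found)
    return found


print(text_and_attributes(sample))
# ['COOKING', 'english', 'Everyday Italian', 'Giada De Laurentiis']
```

No element or attribute name appears in the function, so the same call should work for documents with a different structure.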

answered Jul 01 '12 at 04:57

Jon Clements


What if I need to apply this parsing method to an XML document (a file)? The answer you gave uses an XML string. Please reply. – POOJA GUPTA Jul 01 '12 at 05:02

Okay... my reply is that your comment doesn't make terrific sense? – Jon Clements Jul 01 '12 at 05:05

Your **input** is XML. Thus copying/pasting as a string makes sense for demonstration purposes. – Jon Clements Jul 01 '12 at 05:22

score 0 · Answer 2 · edited May 23 '17 at 12:20

If you want to call grep from inside python, see the discussion here, especially this post.

If you want to search through all the files in a directory you could try something like this using the glob module:

import glob
import re

# Capture the text between a '>' and the next '<' on the same line.
pattern = re.compile(r'>([^<>\n]+)<')

for path in glob.glob("*.xml"):
    with open(path) as f:
        content = f.read()
    # findall() returns only the captured group, i.e. the text between tags.
    texts = pattern.findall(content)
    print(texts)

This iterates through all the XML files in the current directory, opens each one, and extracts the text matching the regexp.

Output:

['Everyday Italian', 'Giada De Laurentiis', '2005', '300.00', 'Harry Potter', 'J K. Rowling ', '2005', '625.00']

EDIT: Updated code to extract only the text elements from the xml.
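For the original goal of finding which documents contain a word the user typed, a parser-based variant of the loop above might look like the sketch below. It uses xml.etree.ElementTree from the standard library instead of a regexp; the function name `files_containing` and its `pattern` parameter are hypothetical, and the case-insensitive matching is an assumption about what the search should do.

```python
import glob
import xml.etree.ElementTree as ET


def files_containing(word, pattern="*.xml"):
    """Return the files matching `pattern` whose text content contains `word`."""
    needle = word.lower()
    matches = []
    for path in sorted(glob.glob(pattern)):
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        # itertext() yields every piece of text under root, whatever the tags are called.
        # Joining and re-splitting collapses indentation and newlines to single spaces.
        text = " ".join(" ".join(root.itertext()).split())
        if needle in text.lower():
            matches.append(path)
    return matches
```

Calling `files_containing("Harry Potter")` in the directory that holds the documents would return the list of file names in which the phrase occurs. Attribute values are not searched in this sketch.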

Burhan Khalid · Accepted Answer · 2012-07-01T05:08:57.573

-1

You could simply strip out any tags:

>>> import re
>>> txt = """<bookstore>
...     <book category="COOKING">
...         <title lang="english">Everyday Italian</title>
...         <author>Giada De Laurentiis</author>
...         <year>2005</year>
...         <price>300.00</price>
...     </book>
...
...     <book category="CHILDREN">
...         <title lang="english">Harry Potter</title>
...         <author>J K. Rowling </author>
...         <year>2005</year>
...         <price>625.00</price>
...     </book>
... </bookstore>"""
>>> exp = re.compile(r'<.*?>')
>>> text_only = exp.sub('',txt).strip()
>>> text_only
'Everyday Italian\n        Giada De Laurentiis\n        2005\n        300.00\n    \n\n    \n        Harry Potter\n        J K. Rowling \n        2005\n        625.00'
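The stripped string still carries the indentation and newlines of the source, which gets in the way when searching for a phrase. A minimal sketch that also collapses the whitespace is shown here; the helper name `strip_tags` is made up, and the `re.DOTALL` flag is an addition so that a tag broken across lines is removed too. Like any regexp approach, it leaves entities such as `&amp;` encoded and can be confused by comments or CDATA sections.

```python
import re

# Same pattern as above; DOTALL lets '.' cross newlines inside a tag.
TAG = re.compile(r'<.*?>', re.DOTALL)


def strip_tags(markup):
    """Remove tags and collapse every run of whitespace to a single space."""
    # Replace each tag with a space so neighbouring words are not glued together.
    without_tags = TAG.sub(' ', markup)
    return ' '.join(without_tags.split())


print(strip_tags('<book>\n  <title lang="en">Harry Potter</title>\n  <year>2005</year>\n</book>'))
# Harry Potter 2005
```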

But if you just want to search files for some text in Linux, you can use grep:

burhan@sandbox:~$ grep "Harry Potter" file.xml
        <title lang="english">Harry Potter</title>
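grep can also be driven from a Python script through the standard subprocess module. The following is only a sketch, assuming a `grep` executable is available on the PATH; the function name `grep_files` is hypothetical. Note that grep searches the raw file, so it can also match tag or attribute names.

```python
import subprocess


def grep_files(word, paths):
    """Return the subset of `paths` in which grep finds `word` as literal text."""
    if not paths:
        return []
    # -l prints only the names of matching files, -F treats the word as a
    # fixed string, and "--" ends option parsing so a leading "-" is harmless.
    result = subprocess.run(
        ["grep", "-l", "-F", "--", word, *paths],
        capture_output=True,
        text=True,
    )
    # grep exits with 1 when nothing matched and with 2 or more on a real error.
    if result.returncode > 1:
        raise RuntimeError(result.stderr.strip())
    return result.stdout.splitlines()
```

For example, `grep_files("Harry Potter", ["file.xml"])` would return `["file.xml"]` if the phrase occurs in that file and an empty list otherwise.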

If you want to search in a file, use the grep command above, or open the file and search for it in Python:

>>> import re
>>> exp = re.compile(r'<.*?>')
>>> with open('file.xml') as f:
...     text_only = exp.sub('', f.read()).strip()
...
>>> if 'Harry Potter' in text_only:
...     print('It exists')
... else:
...     print('It does not')
...
It exists

edited Jul 01 '12 at 05:08

answered Jul 01 '12 at 04:36

Burhan Khalid


Is there a way to use grep from inside another file? I know it's a command to be typed in a terminal. Just to know in general. – POOJA GUPTA Jul 01 '12 at 04:51
Hey, it's not working, because the example you gave uses an XML string. What if I need to do this with an XML file, since I have to extract from an XML file and not from an XML string? – POOJA GUPTA Jul 01 '12 at 04:58
@POOJAGUPTA No this looks for "Harry Potter" inside the file that's called 'file.xml'... The XML string is the output from grep... – Jon Clements Jul 01 '12 at 05:01
@POOJAGUPTA, check out my answer if you want to parse through all your xml files or if you want to invoke grep from within python. – Bharat Jul 01 '12 at 05:03
@Jon: I don't want the output from grep. And aren't you passing the argument "xml" to fromstring()? You are parsing a section of an XML document, if I'm not wrong? Please explain; I'm new to XML parsing, so please tell me if I'm wrong somewhere. – POOJA GUPTA Jul 01 '12 at 05:13
@RBK: It's not working. I wrote this piece of code: exp = re.compile(r '<.>') txt = "books.xml" text_only = exp.sub('',txt).strip() print text_only – POOJA GUPTA Jul 01 '12 at 05:17
