Working xml data - Extract part of string in Python

Question

I am working on the text data in a string format. I'd like to know how to extract part of the string as below:

data = '<?xml version_history="1.0" encoding="utf-8"?><feed xml:base="https://dummydomain.facebook.com/" xmlns="http://www.w3.org/2005/Atom" xmlns:d="http://schemas.microsoft.com/ado/2008/09/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2008/09/dataservices/metadata" xmlns:georss="http://www.georss.org/georss" xmlns:gml="http://www.opengis.net/gml"><id>aad232-c2cc-42ca-ac1e-e1d1b4dd55de</id><title<d:VersionType>3.0</d:VersionLabel><d:Name>XYZ Company</d:Title><d:New_x005f_h84_x005f_dsad_x005f_x003f_x005f_ m:null="true" /><d:Action_x005f_x0020_x005f_Status>Active<d:Stage>3. Contract Signed<d:ComplianceAssetId m:null="true" /><d:ID m:type="Edm.Int32">408</d:ID><d:Modified m:type="Edm.DateTime">2020-07-30T12:15:04Z</d:Modified><d:Author m:type="SP.FieldUserValue"><d:LookupId m:type="Edm.Int32">13</d:LookupId><d:LookupValue> Amy, Jackson</d:LookupValue><d:Email>Amy.Jackson@doe.com</d:Email><id>af62fe09-fds2-42ca-a12c1e-e1d1b4dd55de</id><title<d:VersionType>2.0</d:VersionLabel><d:Name>XYZ Company</d:Title><d:New_x005f_x0342fs_x005f_dsad_x005f_x003f_x005f_ m:null="true" /><d:Action_x005f_x0020_x005f_Status>Active<d:Stage>2. Active Discussion<d:ComplianceAssetId m:null="true" /><d:ID m:type="Edm.Int32">408</d:ID><d:Modified m:type="Edm.DateTime">2020-02-15T18:15:60Z</d:Modified><d:Author m:type="SP.FieldUserValue"><d:LookupId m:type="Edm.Int32">13</d:LookupId><d:LookupValue> Sam, Joseph</d:LookupValue><d:Email>Sam. Joseph @doe.com</d:Email></d:LookupValue><d:Email>Amy.Jackson@doe.com</d:Email><id>af62fe09-fds2-42ca-a12c1e-e1d1b4dd55de</id><title<d:VersionType>1.0</d:VersionLabel><d:Name>XYZ Company</d:Title><d:New_x005f_x0342fs_x005f_dsad_x005f_x003f_x005f_ m:null="true" /><d:Action_x005f_x0020_x005f_Status>Active<d:Stage>1. Exploratory<d:ComplianceAssetId m:null="true" /><d:ID m:type="Edm.Int32">408</d:ID><d:Modified m:type="Edm.DateTime">2019-07-15T10:20:04Z</d:Modified><d:Author m:type="SP.FieldUserValue"><d:LookupId m:type="Edm.Int32">13</d:LookupId><d:LookupValue> Sam, Joseph</d:LookupValue><d:Email>Sam. Joseph @doe.com</d:Email>'

I want to extract all <d:VersionType>,<d:Name>,<d:Stage>,and <d:Modified m:type="Edm.DateTime">

Expected outputs:

d:VersionType    d:Name        d:Stage         d:Modified m:type="Edm.DateTime"
3.0              XYZ Company   3. Contract     2020-07-30T12:15:04Z
2.0              XYZ Company   2. Contract     2020-02-15T18:15:60Z
1.0              XYZ Company   1. Exploratory  2019-07-15T10:20:04Z

Thanks in advance for your help!

Looks like you are looking to create a regex, but do not know where to get started. Please check [Reference - What does this regex mean](https://stackoverflow.com/questions/22937618) resource, it has plenty of hints. Also, refer to [Learning Regular Expressions](https://stackoverflow.com/questions/4736) post for some basic regex info. Once you get some expression ready and still have issues with the solution, please edit the question with the latest details and we'll be glad to help you fix the problem. — Wiktor Stribiżew, Jul 10 '20 at 14:49

score 0 · Accepted Answer · answered Jul 10 '20 at 14:52

0

Try using beautiful soup as it lets you parse xml, html and other documents. Such files are already in a specific structure, and you don't have to build a regex expression from scratch, making your job a lot easier.

from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'xml')

version_type = [item.text for item in soup.findAll('d:VersionType')] # gives ['3.0', '2.0', '1.0']

Replace d:VersionType with other elements you want (d:Name, d:Stage, ..) to extract their contents as well.

answered Jul 10 '20 at 14:52

Suraj Subramanian

1,790
2
9
28

1

Thank you, Suraj! Glad to learn bs4 can also do the work, and it's easy to implement. – Bangbangbang Jul 10 '20 at 14:57
1

Yes, it helps you make use of the structure in `xml, html` documents. Glad it was of help to you :) – Suraj Subramanian Jul 10 '20 at 14:57

Working xml data - Extract part of string in Python

1 Answers1