I have a file with about 10k XML blocks of this type:
<!-- http://purl.obolibrary.org/obo/HP_0100516 -->
<owl:Class rdf:about="http://purl.obolibrary.org/obo/HP_0100516">
<obo:IAO_0000115 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">The presence of a neoplasm of the ureter.</obo:IAO_0000115>
<oboInOwl:created_by rdf:datatype="http://www.w3.org/2001/XMLSchema#string">doelkens</oboInOwl:created_by>
<oboInOwl:creation_date rdf:datatype="http://www.w3.org/2001/XMLSchema#string">2010-12-20T10:35:11Z</oboInOwl:creation_date>
<oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">UMLS:C0041955</oboInOwl:hasDbXref>
<oboInOwl:hasRelatedSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Neoplasia of the ureters</oboInOwl:hasRelatedSynonym>
<oboInOwl:hasRelatedSynonym>ureter, cancer of</oboInOwl:hasRelatedSynonym>
<oboInOwl:id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">HP:0100516</oboInOwl:id>
<rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Neoplasm of the ureter</rdfs:label>
</owl:Class>
<owl:Axiom>
<owl:annotatedSource rdf:resource="http://purl.obolibrary.org/obo/HP_0100516"/>
<owl:annotatedProperty rdf:resource="http://purl.obolibrary.org/obo/IAO_0000115"/>
<owl:annotatedTarget rdf:datatype="http://www.w3.org/2001/XMLSchema#string">The presence of a neoplasm of the ureter.</owl:annotatedTarget>
<oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">HPO:probinson</oboInOwl:hasDbXref>
</owl:Axiom>
Using awk, I want to convert it to a tab-delimited text file containing only two of the XML elements:
Neoplasm of the ureter	The presence of a neoplasm of the ureter
The text I need to extract is within these tags:
<obo:IAO_0000115 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">The presence of a neoplasm of the ureter.</obo:IAO_0000115>
and
<rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Neoplasm of the ureter</rdfs:label>
This is the awk script I plan to use:
BEGIN{RS="//"}
{
match($0, regex1 , a)
match($0, regex2, b)
print a[1], "\t", b[1]
}
What's the best way to use a regex to obtain the text inside the XML elements?
NOTE: the awk approach worked well, and it shows that awk can extract text from complex XML/RDF structures.
The final awk script I used, thanks to @RavinderSingh13:
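awk '
# Definition line: grab everything from the first ">" to the last "<".
# (The 3-argument match() is a GNU awk extension.)
/obo:IAO_0000115 rdf:datatype/ && match($0,/>.*</,a){
  gsub(/^>|<$/,"",a[0])   # strip the enclosing ">" and "<"
}
# Label line: extract its text the same way, then print label<TAB>definition.
/rdfs:label rdf:datatype/ && match($0,/>.*</,b){
  gsub(/^>|<$/,"",b[0])
  print b[0]"\t"a[0]
}
' file.xml > output.txt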
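
A possible variation on the script above, offered as a sketch rather than a drop-in replacement. Two things motivate it: the three-argument match() only exists in GNU awk, and the script reuses the last definition it saw, so a class with no obo:IAO_0000115 line would print the previous class's definition. The version below assumes, as in the sample, that every element sits on a single line. It splits each line on "<" and ">" so that $2 is the opening tag and $3 is its text, and it prints one row when it reaches </owl:Class>. The function name class_rows is made up for illustration.

```shell
# class_rows FILE: print "label<TAB>definition" for every owl:Class block.
# Hypothetical helper; assumes one XML element per line.
class_rows() {
  awk -F'[<>]' '
    # New class: forget values left over from the previous one.
    $2 ~ /^owl:Class /       { label = def = "" }
    # Definition and label: $3 is the text between the tags.
    $2 ~ /^obo:IAO_0000115 / { def = $3 }
    $2 ~ /^rdfs:label /      { label = $3 }
    # End of class: emit the row, skipping classes that have no label.
    $2 == "/owl:Class"       { if (label != "") print label "\t" def }
  ' "$1"
}

# Usage: class_rows file.xml > output.txt
```

Because the definition inside owl:Axiom is carried by owl:annotatedTarget rather than obo:IAO_0000115, those lines are ignored and each class yields a single row.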