extract text from xml elements using awk

Question

I have a file with about 10k XML blocks of this type:

<!-- http://purl.obolibrary.org/obo/HP_0100516 -->

<owl:Class rdf:about="http://purl.obolibrary.org/obo/HP_0100516">
    <obo:IAO_0000115 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">The presence of a neoplasm of the ureter.</obo:IAO_0000115>
    <oboInOwl:created_by rdf:datatype="http://www.w3.org/2001/XMLSchema#string">doelkens</oboInOwl:created_by>
    <oboInOwl:creation_date rdf:datatype="http://www.w3.org/2001/XMLSchema#string">2010-12-20T10:35:11Z</oboInOwl:creation_date>
    <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">UMLS:C0041955</oboInOwl:hasDbXref>
    <oboInOwl:hasRelatedSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Neoplasia of the ureters</oboInOwl:hasRelatedSynonym>
    <oboInOwl:hasRelatedSynonym>ureter, cancer of</oboInOwl:hasRelatedSynonym>
    <oboInOwl:id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">HP:0100516</oboInOwl:id>
    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Neoplasm of the ureter</rdfs:label>
</owl:Class>
<owl:Axiom>
    <owl:annotatedSource rdf:resource="http://purl.obolibrary.org/obo/HP_0100516"/>
    <owl:annotatedProperty rdf:resource="http://purl.obolibrary.org/obo/IAO_0000115"/>
    <owl:annotatedTarget rdf:datatype="http://www.w3.org/2001/XMLSchema#string">The presence of a neoplasm of the ureter.</owl:annotatedTarget>
    <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">HPO:probinson</oboInOwl:hasDbXref>
</owl:Axiom>

and I want to convert it to a tab-delimited text file containing only two of the XML elements:

Neoplasm of the ureter  The presence of a neoplasm of the ureter

I would like to do this with awk.

The text I need to extract is within these tags:

<obo:IAO_0000115 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">The presence of a neoplasm of the ureter.</obo:IAO_0000115>

and

<rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Neoplasm of the ureter</rdfs:label>

and the awk script I plan to use:

BEGIN{RS="//"}
{
  match($0, regex1 , a)
  match($0, regex2, b)
  print a[1], "\t", b[1]
}

What's the best way to use regex to obtain the text inside the xml elements?

NOTE: this approach has been very useful and demonstrates that awk can be used to extract xml text from complex xml/rdf structures

The final awk script I used, thanks to @RavinderSingh13:

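awk '
# Requires GNU awk: the three-argument match() is a gawk extension.
/obo:IAO_0000115 rdf:datatype/ && match($0,/>.*</,a){
  gsub(/^>|<$/,"",a[0])    # a[0] now holds the definition text
}
/rdfs:label rdf:datatype/ && match($0,/>.*</,b){
  gsub(/^>|<$/,"",b[0])    # b[0] now holds the label text
  print b[0]"\t"a[0]       # label, tab, most recently seen definition
}
'  file.xml > output.txt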

The *best* way is to use XSLT or XML-aware tools instead of trying to hack up something with regular expressions. — Shawn, Aug 26 '20 at 09:24
Could you please let me know if you want values only for the tags `obo:IAO_0000115` and `rdfs:label`? BTW, experts recommend XML-aware tools such as xmlstarlet, but if you can't install them we can go ahead with an `awk` solution; kindly confirm. — RavinderSingh13, Aug 26 '20 at 09:27

RavinderSingh13 · Accepted Answer · 2020-08-26T11:53:51.000

Could you please try the following, which is based only on the samples shown. Note that awk is not an ideal tool for XML parsing; I am using it here only because the OP specifically said no other tools can be used.

awk '
(/obo:IAO_0000115 rdf:datatype/ || /rdfs:label rdf:datatype/) && match($0,/>.*</){
  print substr($0,RSTART+1,RLENGTH-2)
}
'  Input_file

Explanation: a detailed walkthrough of the above.

awk '                                         ####Start of the awk program.
(/obo:IAO_0000115 rdf:datatype/ || /rdfs:label rdf:datatype/) && match($0,/>.*</){    ####If the line contains obo:IAO_0000115 rdf:datatype OR rdfs:label rdf:datatype, AND match() finds text running from the first > to the last < on the line.
  print substr($0,RSTART+1,RLENGTH-2)         ####Print the matched text without its leading > and trailing <: start one character after RSTART and take RLENGTH-2 characters. match() sets RSTART and RLENGTH whenever the regex matches.
}
'  Input_file                                 ####Name of the input file.

From man awk:

RSTART   The index of the first character matched by match(); 0 if no match. (This implies that character indices start at one.)
RLENGTH  The length of the string matched by match(); -1 if no match.
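
To make RSTART and RLENGTH concrete, here is a small sketch run on a shortened, hypothetical line (the rdf:datatype attribute is dropped so the positions are easy to count by hand). It uses only the two-argument match(), so it should not need GNU awk.

```shell
# Hypothetical shortened sample line (no rdf:datatype attribute).
line='<rdfs:label>Neoplasm of the ureter</rdfs:label>'

# Print RSTART, RLENGTH and the text between the first > and the last <.
result=$(printf '%s\n' "$line" | awk '
match($0, />.*</) {
  print RSTART, RLENGTH, substr($0, RSTART + 1, RLENGTH - 2)
}')

printf '%s\n' "$result"
# → 12 24 Neoplasm of the ureter
```

The > that closes the opening tag is character 12 and the < that opens the closing tag is character 35, so the match is 24 characters long. Skipping one character and taking RLENGTH-2 characters leaves just the 22-character label.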

EDIT: Adding one more solution, per the OP's comment, for anyone who wants to capture the two searches in two separate arrays. Written and tested in GNU awk.

awk '
/obo:IAO_0000115 rdf:datatype/ && match($0,/>.*</,a){
  gsub(/^>|<$/,"",a[0])
  print a[0]
}
/rdfs:label rdf:datatype/ && match($0,/>.*</,b){
  gsub(/^>|<$/,"",b[0])
  print b[0]
}
'  Input_file
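
If the goal is the tab-delimited label/definition pair from the question, the same idea can be sketched without the GNU awk array form of match(). This is only a sketch under assumptions: each element sits on a single line, the definition comes before the label inside its owl:Class block, and the second record in the sample below is made up purely to show the definition being reset.

```shell
# Hypothetical two-record sample; the second class is invented and has no definition.
sample='<owl:Class rdf:about="http://purl.obolibrary.org/obo/HP_0100516">
    <obo:IAO_0000115 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">The presence of a neoplasm of the ureter.</obo:IAO_0000115>
    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Neoplasm of the ureter</rdfs:label>
</owl:Class>
<owl:Class rdf:about="http://example.org/NO_DEFINITION">
    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Label without definition</rdfs:label>
</owl:Class>'

result=$(printf '%s\n' "$sample" | awk '
# A new class starts: forget the definition of the previous one.
/<owl:Class / { def = "" }
# Remember the definition text until the label of the same class is seen.
/obo:IAO_0000115 rdf:datatype/ && match($0, />.*</) {
  def = substr($0, RSTART + 1, RLENGTH - 2)
}
# Print label, tab, definition (empty if the class had none).
/rdfs:label rdf:datatype/ && match($0, />.*</) {
  print substr($0, RSTART + 1, RLENGTH - 2) "\t" def
}')

printf '%s\n' "$result"
```

Resetting def at each opening owl:Class tag keeps a class without a definition from inheriting the previous class's text, which the array-based scripts above would do because a[0] is never cleared.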
