I have English sentences in which some words are wrapped in XML tags, for example:

<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.

Exactly three tags can occur, as the sentence shows (<XXX>, <YYY>, <ZZZ>). Any number of words can appear inside one of those tags.

I need to split each sentence at whitespace while ignoring whitespace inside those XML tags. The code looks like this:

String mySentence = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.";
String[] mySentenceSplit = mySentence.split("someUnknownRegex");
for (int i = 0; i < mySentenceSplit.length; i++) {
    System.out.println(mySentenceSplit[i]);
}

For the example above, the output should be:

mySentenceSplit[0] = <XXX>word1</XXX>
mySentenceSplit[1] = word2
mySentenceSplit[2] = word3
mySentenceSplit[3] = <YYY>word4 word5 word6</YYY>
mySentenceSplit[4] = word7
mySentenceSplit[5] = word8
mySentenceSplit[6] = word9
mySentenceSplit[7] = word10
mySentenceSplit[8] = <ZZZ>word11 word12</ZZZ>.

What do I have to put in place of "someUnknownRegex" to achieve this?

Pshemo
kiltek
  • When your regex gets as complicated as the answers (admittedly not too bad), I wonder if it might be better to just roll your own parser. – ArtOfWarfare Feb 08 '14 at 15:04
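As a rough illustration of the "roll your own parser" idea from the comment above, here is a minimal hand-written scanner. It is only a sketch: the class name `HandRolledSplitter` is made up for this example, and it assumes tags are never nested, every opening tag has a closing tag, and tokens are separated by plain spaces.

```java
import java.util.ArrayList;
import java.util.List;

class HandRolledSplitter {

    // Splits on spaces, except between an opening tag and its closing tag.
    // Assumption: no nested tags, and every opening tag is eventually closed.
    static List<String> split(String sentence) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideElement = false;

        for (int i = 0; i < sentence.length(); i++) {
            char c = sentence.charAt(i);
            if (c == '<') {
                // "</" ends the protected region; any other "<" starts one.
                insideElement = i + 1 < sentence.length() && sentence.charAt(i + 1) != '/';
            }
            if (c == ' ' && !insideElement) {
                // A space outside an element ends the current token.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String s = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> "
                 + "word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.";
        for (String token : split(s)) {
            System.out.println(token);
        }
    }
}
```

The single boolean flag is enough only because of the no-nesting assumption; nested tags would need a depth counter instead.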

3 Answers

Using a capturing group and a backreference:

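// Needs: import java.util.regex.Matcher; import java.util.regex.Pattern;
String sentence = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.";
// First alternative: a whole element (\1 repeats the tag name) plus an optional trailing period.
// Second alternative: any other run of non-whitespace characters.
Pattern pattern = Pattern.compile("<(\\w+)[^>]*>.*?</\\1>\\.?|\\S+");
Matcher matcher = pattern.matcher(sentence);

// Each match is one token; no split call is needed.
while (matcher.find()) {
    System.out.println(matcher.group());
}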

output:

<XXX>word1</XXX>
word2
word3
<YYY>word4 word5 word6</YYY>
word7
word8
word9
word10
<ZZZ>word11 word12</ZZZ>.
falsetru
  • You beat me to the post. Worth noting that it would become an extremely complicated expression with a `split` call - it would need to use a complex negative look-behind and look-ahead. And it works for toy-XML-like constructs like this but if you want to support full XML, you need to use an XML parser. – Erwin Bolwidt Feb 08 '14 at 14:49
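To make the "use an XML parser" remark concrete, here is a hedged sketch using the JDK's built-in DOM parser. The class name `DomSplitDemo` and the wrapping `<root>` element are assumptions added for this example, since the sentence on its own is not a well-formed document. Note that this approach behaves differently from the regex: a trailing period becomes its own token, and only the text content of each element is kept (attributes and nested markup are dropped).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomSplitDemo {

    static List<String> split(String sentence) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Wrap the fragment in a hypothetical <root> element so it parses as a document.
            Document doc = builder.parse(
                    new InputSource(new StringReader("<root>" + sentence + "</root>")));

            List<String> tokens = new ArrayList<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node node = children.item(i);
                if (node.getNodeType() == Node.ELEMENT_NODE) {
                    // Rebuild the element from its name and text content.
                    String name = node.getNodeName();
                    tokens.add("<" + name + ">" + node.getTextContent() + "</" + name + ">");
                } else if (node.getNodeType() == Node.TEXT_NODE) {
                    // Plain text between elements is split on whitespace.
                    for (String word : node.getNodeValue().trim().split("\\s+")) {
                        if (!word.isEmpty()) {
                            tokens.add(word);
                        }
                    }
                }
            }
            return tokens;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String s = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> "
                 + "word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.";
        for (String token : split(s)) {
            System.out.println(token);
        }
    }
}
```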

Here's the split regex you want:

// Split on spaces only when the next '<' (if any) opens a tag rather than closing one.
String[] split = str.split(" +(?=[^<]*(<[^/]|$))");
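For anyone who wants to try a lookahead split of this shape, here is a small self-contained sketch with the parentheses balanced. The class name `LookaheadSplitDemo` is made up for the example. The lookahead accepts a run of spaces only if, scanning forward over non-`<` characters, the next thing found is either an opening tag (`<` not followed by `/`) or the end of the string; a space inside an element fails because the next `<` it reaches belongs to a closing tag.

```java
import java.util.Arrays;
import java.util.List;

class LookaheadSplitDemo {

    // Assumes tags are not nested; inside an element the next '<' is always "</".
    static List<String> split(String str) {
        return Arrays.asList(str.split(" +(?=[^<]*(<[^/]|$))"));
    }

    public static void main(String[] args) {
        String s = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> "
                 + "word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.";
        for (String token : split(s)) {
            System.out.println(token);
        }
    }
}
```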
Bohemian

kiltek, resurrecting this question because it had a simple regex solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)

With all the usual disclaimers about using regex to parse XML, here is a simple regex to do it:

<.*?</[^>]*>|( )

The left side of the alternation matches complete XML elements, from the opening tag through the closing tag. We will ignore these matches. The right side matches and captures spaces to Group 1, and we know they are the right spaces because they were not matched by the expression on the left.

Here is working code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Program {
    public static void main(String[] args) {
        String subject = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>";
        // Left alternative consumes whole elements; right alternative captures a splitting space in group 1.
        Pattern regex = Pattern.compile("<.*?</[^>]*>|( )");
        Matcher m = regex.matcher(subject);
        StringBuffer b = new StringBuffer();
        while (m.find()) {
            if (m.group(1) != null) {
                // A space outside any element: mark it as a split point.
                m.appendReplacement(b, "SplitHere");
            } else {
                // An element: copy it back unchanged. quoteReplacement keeps '$' and '\' literal.
                m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));
            }
        }
        m.appendTail(b);
        String replaced = b.toString();
        // Assumes the marker "SplitHere" never occurs in the input itself.
        String[] splits = replaced.split("SplitHere");
        for (String split : splits) {
            System.out.println(split);
        }
    } // end main
} // end Program
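A possible variant of the same alternation trick, sketched here as an assumption rather than as part of the original answer, skips the placeholder string and cuts the input directly at the positions where Group 1 matched. The class name `GroupSplitDemo` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupSplitDemo {

    // Same pattern as above: whole elements on the left, a captured space on the right.
    private static final Pattern SPLIT_POINTS = Pattern.compile("<.*?</[^>]*>|( )");

    static List<String> split(String subject) {
        List<String> parts = new ArrayList<>();
        Matcher m = SPLIT_POINTS.matcher(subject);
        int start = 0;
        while (m.find()) {
            // Only matches that filled group 1 are spaces outside an element.
            if (m.group(1) != null) {
                parts.add(subject.substring(start, m.start()));
                start = m.end();
            }
        }
        // Whatever follows the last split point is the final token.
        parts.add(subject.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        String s = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> "
                 + "word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.";
        for (String part : split(s)) {
            System.out.println(part);
        }
    }
}
```

Because nothing is substituted into the text, there is no marker that could collide with the input. Consecutive spaces would produce empty strings in the result, so filter those out if the data can contain them.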

Reference

  1. How to match pattern except in situations s1, s2, s3
  2. How to match a pattern unless...
zx81