I have English sentences in which some words are wrapped in XML tags, for example:
<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.
There are exactly three possible XML tags, as the sentence shows (<XXX>, <YYY>, <ZZZ>). Any number of words can appear inside any of those tags.
I need to split each sentence on whitespace while ignoring whitespace inside those XML tags. The code looks like:
String mySentence = "<XXX>word1</XXX> word2 word3 <YYY>word4 word5 word6</YYY> word7 word8 word9 word10 <ZZZ>word11 word12</ZZZ>.";
String[] mySentenceSplit = mySentence.split("someUnknownRegex");
for (int i = 0; i < mySentenceSplit.length; i++) {
    System.out.println(mySentenceSplit[i]);
}
For the example above, the output should be:
mySentenceSplit[0] = <XXX>word1</XXX>
mySentenceSplit[1] = word2
mySentenceSplit[2] = word3
mySentenceSplit[3] = <YYY>word4 word5 word6</YYY>
mySentenceSplit[4] = word7
mySentenceSplit[5] = word8
mySentenceSplit[6] = word9
mySentenceSplit[7] = word10
mySentenceSplit[8] = <ZZZ>word11 word12</ZZZ>.
What do I have to put in place of "someUnknownRegex" to achieve this?
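
One possible approach, offered as a sketch rather than a definitive answer: split on whitespace only when the next angle bracket further along the string does not begin a closing tag. A negative lookahead can express this. The sketch assumes the tags are never nested and that the plain text contains no stray "<" or ">" characters; the class name TagAwareSplit and the method splitOutsideTags are hypothetical names chosen for illustration.

```java
class TagAwareSplit {

    // Split on runs of whitespace, except where the text ahead reaches "</"
    // before any other angle bracket (meaning we are inside a tagged span).
    // Assumption: tags are not nested and plain text has no '<' or '>'.
    static String[] splitOutsideTags(String sentence) {
        return sentence.split("\\s+(?![^<>]*</)");
    }

    public static void main(String[] args) {
        String mySentence = "<XXX>word1</XXX> word2 word3 "
                + "<YYY>word4 word5 word6</YYY> word7 word8 word9 word10 "
                + "<ZZZ>word11 word12</ZZZ>.";
        String[] mySentenceSplit = splitOutsideTags(mySentence);
        for (int i = 0; i < mySentenceSplit.length; i++) {
            System.out.println("mySentenceSplit[" + i + "] = " + mySentenceSplit[i]);
        }
    }
}
```

For the example sentence this yields the nine elements listed above. If the input can contain nested tags or literal angle brackets, a regex split is probably the wrong tool and an XML parser or a small hand-written tokenizer would be safer.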