
I have files, typically XML files, in which I want to replace all occurrences of 'x.y' with 'p.q'. During this replacement, I want to ignore occurrences of x.y inside comments (<!-- ... -->).

I was trying to use String.replaceAll() to perform this task.

For example:

<?xml version="1.0" encoding="UTF-8"?>
<name>This occurrence of x.y should be replaced</name>
<!-- This occurrence of x.y should not be replaced -->

I tried using String.replaceAll("x[\\.]y", "p.q"), but the occurrences in comments are replaced as well.

As an alternative, I could read the file line by line and skip the lines that start a comment, but I am interested in using replaceAll().

Please provide a way by which this can be achieved.

Rudi Kershaw

2 Answers


Although this isn't strictly the answer you are looking for, I have a recommendation.

I'd recommend using a proper XML parser, such as the Java DOM API, to check and replace the text in your nodes rather than treating your XML as a raw String. Something like this should replace the text in each node unless that node is a comment.

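import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Parse the file into a DOM tree (these calls throw checked exceptions).
File f = new File("your.xml");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(f);

// Visit the children of every element; "*" matches all tag names.
NodeList eList = doc.getElementsByTagName("*");
for (int e = 0; e < eList.getLength(); e++) {
    Node element = eList.item(e);
    NodeList nList = element.getChildNodes();
    for (int n = 0; n < nList.getLength(); n++) {
        Node node = nList.item(n);
        // Only text nodes are rewritten; comment nodes (Node.COMMENT_NODE) are skipped.
        if (node.getNodeType() == Node.TEXT_NODE) {
            node.setNodeValue(node.getNodeValue().replace("x.y", "p.q"));
        }
    }
}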

If memory or efficiency is an issue (for example, when your.xml is huge), you would be better off using SAX. It is generally faster, though it takes a little more code, and it streams the XML instead of holding the whole document in memory.

Once your Document has been edited, you'll probably want to use a Transformer to write it back out. (Official guide linked in the comments, courtesy of Boris the Spider.)
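To make the last two steps concrete, here is a minimal sketch that runs the same text-node replacement and then serializes the result with a Transformer. The class name DomReplaceDemo and the helper replaceOutsideComments are made up for illustration, and the XML is read from a String only to keep the example self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the original answer.
public class DomReplaceDemo {

    // Replaces target in text nodes only, then serializes the DOM back to a String.
    static String replaceOutsideComments(String xml, String target, String replacement) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            NodeList elements = doc.getElementsByTagName("*");
            for (int e = 0; e < elements.getLength(); e++) {
                NodeList children = elements.item(e).getChildNodes();
                for (int n = 0; n < children.getLength(); n++) {
                    Node node = children.item(n);
                    // Comment nodes have type COMMENT_NODE, so they are left alone.
                    if (node.getNodeType() == Node.TEXT_NODE) {
                        node.setNodeValue(node.getNodeValue().replace(target, replacement));
                    }
                }
            }

            // A Transformer created without a stylesheet copies the source to the result.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception ex) {
            // Collapsed for brevity; real code would handle the checked exceptions separately.
            throw new IllegalStateException("Could not process XML", ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><name>x.y here</name><!-- x.y in a comment --></root>";
        System.out.println(replaceOutsideComments(xml, "x.y", "p.q"));
    }
}
```

To write straight to a file instead, pass `new StreamResult(new File("out.xml"))` to `transform`.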

Hope this helps.

Rudi Kershaw
  • Totally agree, +1. Although I have to say that converting a `Document` to a `String` to save it is wrong. A [`Transformer`](http://docs.oracle.com/javase/7/docs/api/javax/xml/transform/Transformer.html) is specifically designed for the task. There is a [tutorial here](http://docs.oracle.com/javase/tutorial/jaxp/xslt/writingDom.html). – Boris the Spider Aug 23 '14 at 11:09
  • @BoristheSpider - Thanks, good call, I will replace that part of the answer. – Rudi Kershaw Aug 23 '14 at 11:12
  • @JaqenH'ghar - Also a good point, it does beg the question of whether the `getElementXXX()` methods or `getChildNodes()` will also find comments. I don't think they would but I haven't tested it. – Rudi Kershaw Aug 23 '14 at 11:18
  • I think one can simply do `if (!(node instanceof Comment))` as Comment extends Node – Jaqen H'ghar Aug 23 '14 at 11:27
  • @JaqenH'ghar - Thanks for the help. I've amended the code : ) – Rudi Kershaw Aug 23 '14 at 11:34
  • Thanks for the suggestions, I am trying to explore the DOM parser for parsing and replacing strings. – Appana Sandeep Aug 23 '14 at 14:05
  • There is no need to use regex here! [`String.replace`](http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#replace(java.lang.CharSequence,%20java.lang.CharSequence)) is **much** faster and doesn't need the ugly escapes. – Boris the Spider Aug 23 '14 at 15:26

If you want to use regex, one option is a negative lookahead that only lets x.y match outside comments, that is, where no --> follows before the next <!--:

(?s)x\.y(?!(?:(?!<!--).)+-->)

As a Java string:

"(?s)x\\.y(?!(?:(?!<!--).)+-->)"

The (?s) DOTALL modifier makes the . also match newlines, so the check works for comments that span several lines.

Test at regexplanet (click on Java)
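As a rough illustration, the pattern above can be passed to String.replaceAll() and run against the sample from the question. The class name RegexReplaceDemo and the helper method are made up for this sketch.

```java
// Hypothetical demo class; the pattern is the one given in the answer.
public class RegexReplaceDemo {

    // x.y is matched only when no "-->" follows before the next "<!--".
    static final String OUTSIDE_COMMENTS = "(?s)x\\.y(?!(?:(?!<!--).)+-->)";

    static String replaceOutsideComments(String xml) {
        return xml.replaceAll(OUTSIDE_COMMENTS, "p.q");
    }

    public static void main(String[] args) {
        String xml = String.join("\n",
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
                "<name>This occurrence of x.y should be replaced</name>",
                "<!-- This occurrence of x.y should not be replaced -->");
        System.out.println(replaceOutsideComments(xml));
    }
}
```

Note that the `+` in the lookahead requires at least one character between x.y and the closing `-->`, so a comment ending directly in `x.y-->` would not be protected by this pattern.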

Jonny 5
  • This works fine for XML comments. I was trying to apply the same pattern to exclude '#' comments in a .properties file, using x\\.y(?!(?:(?!#).)+), but it is not working: the text in the # line is also matched. Is there anything I am missing here? – Appana Sandeep Aug 23 '14 at 14:02
  • @AppanaSandeep That's a different task. If you know e.g. one line can't be longer such as `1024` try this in `(?m)` *multiline*-mode: `(?m)(? – Jonny 5 Aug 23 '14 at 17:37
  • This works great. Could you explain how it works? Is there a link where I can learn more about regex? – Appana Sandeep Aug 23 '14 at 17:53
  • @AppanaSandeep It matches in `(?m)` multi-line mode: `^` and `$` match start and end of each line. `(? – Jonny 5 Aug 23 '14 at 18:08
  • @AppanaSandeep To learn more about regex see: [SO Regex FAQ](http://stackoverflow.com/a/22944075/3110638), [RexEgg](http://www.rexegg.com/), [regular-expressions.info](http://www.regular-expressions.info/tutorial.html), test on [regex101](http://regex101.com/) and read explanations, read [Jeffrey Friedl's book](http://regex.info/) :) – Jonny 5 Aug 23 '14 at 18:10
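
For the .properties case raised in the comments, one possible approach is sketched below. It is not the pattern from the truncated comment above but a different technique: match each whole non-comment line in (?m) multiline mode and do a plain String.replace inside it. It assumes Java 9+ (for Matcher.replaceAll with a function) and treats lines starting with '#' or '!' as comments; the class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; an alternative to the commenter's pattern, not a copy of it.
public class PropertiesReplaceDemo {

    // A whole line that does not begin, after optional blanks, with '#' or '!'.
    private static final Pattern NON_COMMENT_LINE =
            Pattern.compile("(?m)^(?![ \\t]*[#!]).*$");

    static String replaceOutsideComments(String text, String target, String replacement) {
        // quoteReplacement keeps any '$' or '\' in the line from being read as group references.
        return NON_COMMENT_LINE.matcher(text).replaceAll(
                m -> Matcher.quoteReplacement(m.group().replace(target, replacement)));
    }

    public static void main(String[] args) {
        String props = "key=x.y\n# x.y stays\nother=x.y and x.y";
        System.out.println(replaceOutsideComments(props, "x.y", "p.q"));
    }
}
```

Because a comment in a .properties file always occupies a whole line, deciding per line avoids the lookahead that works for `<!-- ... -->` pairs but has no closing delimiter to look for here.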