12

I have to write some sort of parser that gets a String and replaces certain sets of characters with others. The code looks like this:

noHTMLString = noHTMLString.replaceAll("</p>", "\n");
noHTMLString = noHTMLString.replaceAll("<br/>", "\n\n");
noHTMLString = noHTMLString.replaceAll("<br />", "\n\n");
//here goes A LOT of lines like these ones

The function is very long and performs a lot of string replacements. The issue is that it takes a lot of time because the method is called many times, slowing down the application.

I have read some threads here about using StringBuilder as an alternative, but it lacks a replaceAll method. And as noted in "Does string.replaceAll() performance suffer from string immutability?", the replaceAll method in the String class works with Pattern and Matcher, and Matcher.replaceAll() uses a StringBuilder to store the eventually returned value, so I don't know if switching to StringBuilder will really reduce the time needed to perform the substitutions.

Do you know a fast way to do a lot of String replacements? Do you have any advice for this problem?

Thanks.

EDIT: I have to create a report that has a few fields with HTML text. For each row I call the method that replaces all the HTML tags and special characters inside these strings. With a full report it takes more than 3 minutes to parse all the text. The problem is that I have to invoke the method very often.

Averroes
  • What slows you down: the length of your noHTMLString text, or do you invoke these three statements very often? – Ralph Nov 26 '10 at 16:42
  • I have to create a report that has a few fields with HTML text. For each row I call the method that replaces all the HTML tags and special characters inside these strings. With a full report it takes more than 3 minutes to parse all the text. So the problem is that I have to invoke the method very often. – Averroes Nov 26 '10 at 21:48
  • See also: http://stackoverflow.com/a/1765616/59087 – Dave Jarvis Nov 26 '16 at 23:47

4 Answers

13

I found that org.apache.commons.lang.StringUtils is the fastest if you don't want to bother with the StringBuffer.

You can use it like this:
noHTMLString = StringUtils.replace(noHTMLString, "</p>", "\n");

I did performance testing, and it was faster than my custom StringBuffer solution, which was similar to the one @extraneon proposed.
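
For readers who cannot add the library, here is a minimal sketch of the general idea behind a literal (non-regex) replace: search with indexOf and copy into a StringBuilder. The class name LiteralReplace is hypothetical, and this is not the Commons Lang source, only an illustration of why no Pattern needs to be compiled.

```java
// Sketch only: literal replacement via indexOf + StringBuilder.
// LiteralReplace is a hypothetical helper, not part of any library.
final class LiteralReplace {

    static String replace(String text, String search, String replacement) {
        if (text == null || search == null || search.isEmpty() || replacement == null) {
            return text;
        }
        int start = 0;
        int end = text.indexOf(search, start);
        if (end == -1) {
            return text; // nothing to replace, so nothing is allocated
        }
        StringBuilder sb = new StringBuilder(text.length());
        while (end != -1) {
            // copy the text before the match, then the replacement
            sb.append(text, start, end).append(replacement);
            start = end + search.length();
            end = text.indexOf(search, start);
        }
        sb.append(text, start, text.length()); // copy the tail
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "one</p>two<br/>three";
        s = replace(s, "</p>", "\n");
        s = replace(s, "<br/>", "\n\n");
        System.out.println(s);
    }
}
```

Each call still scans the input once per search string, so with many replacements a single-pass approach may be worth measuring as well.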

MatBanik
  • That was indeed faster than the replaceAll from String.class. Thanks. – Averroes Nov 30 '10 at 16:48
  • See [Commons Lang StringUtils.replace performance vs String.replace](http://stackoverflow.com/questions/16228992/commons-lang-stringutils-replace-performance-vs-string-replace) with benchmark. – Vadzim Nov 26 '14 at 13:17
  • For multiple strings, it's probably faster to use [StringUtils.replaceEach](https://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/StringUtils.html#replaceEach(java.lang.String,%20java.lang.String[],%20java.lang.String[])), not that [parsing HTML](http://stackoverflow.com/a/1732454/59087) this way is a good idea. – Dave Jarvis Nov 26 '16 at 04:49
7

It looks like you're parsing HTML there. Have you thought about using a 3rd party library instead of reinventing the wheel?

Martijn Verburg
4

I agree with Martijn about using a ready-built solution instead of parsing it yourself; there's loads of stuff built into Java in the javax.xml package. A neat solution would be to use an XSLT transformation to do the replacing, as this looks like an ideal use case for it. However, it is complicated.

To answer the question, have you considered using the regular expression libraries? It looks like you have many different things you want to match and replace with the same thing (\n or an empty string). Using regular expressions, you could build an expression like "<br>|<br/>|<br />", or something more clever like "<br.*?>", to create a Matcher object on which you can call replaceAll.
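
A minimal sketch of that idea, assuming every variant of the tag maps to the same replacement. The class and method names (BreakTags, replaceBreaks) are hypothetical, and the pattern here is deliberately tighter than "<br.*?>", which would also match tags such as <brown>.

```java
import java.util.regex.Pattern;

// Sketch only: one precompiled pattern covering <br>, <br/> and <br />.
// BreakTags and replaceBreaks are hypothetical names.
final class BreakTags {

    // Compiled once and reused for every call.
    private static final Pattern BR =
            Pattern.compile("<br\\s*/?>", Pattern.CASE_INSENSITIVE);

    static String replaceBreaks(String html) {
        return BR.matcher(html).replaceAll("\n\n");
    }

    public static void main(String[] args) {
        System.out.println(replaceBreaks("first<br>second<br/>third<br />fourth"));
    }
}
```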

Allanrbo
  • You cannot parse HTML with regular expressions: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Adriaan Koster Nov 26 '10 at 12:36
  • Adriaan, you are right, HTML is a context free language, not a regular language. But you can do text-replacements with regular expressions, and that was what was asked about. – Allanrbo Nov 26 '10 at 12:59
3

I fully agree with Martijn here. Pick the right tool for the job.

If, however, your file is not HTML but only contains some HTML tokens, there are a few ways you can speed things up.

First, if some amount of the input does not contain replaceable elements, consider starting with something like:

if (input.indexOf('<') == -1) { // no '<' at all, so nothing to replace
    return input;
}

Second, consider a regex:

Pattern p = Pattern.compile( your_regex );

Don't make a pattern for every single replaceAll line, but try to combine them (regex has an OR operator) and let Pattern optimize the regex. Use the compiled pattern and don't recompile it on every call; compilation is fairly expensive.
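
A minimal sketch of the combined approach, assuming the three tokens from the question and a lookup table for their replacements. The names TagReplacer, REPLACEMENTS, TOKENS and strip are hypothetical. The pattern is compiled once in a static field, and the input is scanned in a single pass.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: one alternation pattern plus a map from token to replacement.
final class TagReplacer {

    private static final Map<String, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put("</p>", "\n");
        REPLACEMENTS.put("<br/>", "\n\n");
        REPLACEMENTS.put("<br />", "\n\n");
    }

    // Compiled once; every alternative is a key of REPLACEMENTS.
    private static final Pattern TOKENS = Pattern.compile("</p>|<br/>|<br />");

    static String strip(String input) {
        Matcher m = TOKENS.matcher(input);
        StringBuffer out = new StringBuffer(input.length());
        while (m.find()) {
            // quoteReplacement guards against '$' or '\' in a replacement
            m.appendReplacement(out, Matcher.quoteReplacement(REPLACEMENTS.get(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("one</p>two<br/>three<br />four"));
    }
}
```

StringBuffer is used because Matcher.appendReplacement accepts it on all Java versions; whether this beats repeated literal replaces depends on the input and should be measured.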

If regexes are a bit too complex, you can also implement a faster (but potentially less readable) replacement engine yourself:

StringBuilder result = new StringBuilder(input.length());
for (int i = 0; i < input.length(); i++) {
  char c = input.charAt(i);

  if (c != '<') {
    result.append(c); // ordinary character: copy it through
    continue;
  }

  int closePos = input.indexOf('>', i);
  if (closePos == -1) { // no closing '>': copy the rest unchanged
    result.append(input, i, input.length());
    return result.toString();
  }
  // text between '<' and '>', e.g. "/p" for "</p>"
  String token = input.substring(i + 1, closePos);
  i = closePos; // continue after the '>'
  if (token.equals("/p")) {
    result.append("\n");
  } else if (token.equals(...)) {
  } else if (...) {
  }
}
return result.toString();

This may have some errors :)

The advantage is you have to iterate through the input only once. The big disadvantage is that it is not all that easy to understand. You could also write a state machine, analyzing per character what the new state should be, and that would probably be faster and even more work.

extraneon
  • You cannot parse HTML with regular expressions: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Adriaan Koster Nov 26 '10 at 12:37
  • @Adriaan Koster : That's not what I said. I said, if you have HTML use an HTML parser. If it's plain text with HTML tags in it (which isn't parseable by an HTML parser) try it the hard way. – extraneon Nov 26 '10 at 15:47
  • @Adriaan: **WRONG!** [Yes you *can* parse HTML with regex](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491). However, you [probably don’t want to](http://stackoverflow.com/questions/4284176/doubt-in-parsing-data-in-perl-where-am-i-going-wrong/4286326#4286326) unless you have constrained and limited HTML to work with, such as you yourself have generated. Otherwise **although it is entirely possible to parse HTML with regexes**, you really and truly do not want to. – tchrist Nov 26 '10 at 16:19
  • A late nitpick: you cannot parse arbitrary HTML with a *single* regex, because regexes cannot recognize arbitrary depth recursive nesting. You can certainly perform lexical analysis (i.e. tokenize) of arbitrary HTML with one or more regexes, just as you may be able to recognize interesting parts of an HTML file. – Nicola Musatti Sep 18 '13 at 10:23