I am looking for a Java solution that replaces line breaks with <br/> tags in a given HTML string, but only in multi-line text that is not enclosed in any tags (that is, text that is a direct child of an imaginary root).

The source data is HTML-formatted text created via a front-end HTML editor (like TinyMCE), so it's an arbitrary HTML fragment: part of a non-existent <body>.

The following:

text11
text 21<p>tagged text1
tagged text2</p>
text 2 

Should become:

text11<br/>text 21<p>tagged text1
tagged text2</p><br/>text 2 

The following, however, should not be impacted at all:

<div>text11
text 21<p>tagged text1
tagged text2</p>
text 2</div> 

I was thinking about something like this (not working):

private static String ReplaceLfWithBr(String source) {
    // text - combination of words and line breaks 
    // should not be preceded by <tag> or followed by <\tag>
    final String regex = "((?!<.+>)[\\w(\\r?\\n)]+(?!<\\s*/.+>))";
    Pattern patern = Pattern.compile(regex, Pattern.MULTILINE);
    Matcher matcher = patern.matcher(source);
    StringBuffer sb = new StringBuffer(source.length());
    while(matcher.find()){
        matcher.appendReplacement(sb, "<br/>");
    }
    matcher.appendTail(sb);
    return sb.toString();
}
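
One likely reason the pattern above misbehaves can be checked with plain java.util.regex: inside square brackets, `(`, `?` and `)` are literal characters rather than a group, so the class matches word characters, line breaks and those three symbols alike. The loop then replaces each whole match, words included, with <br/>. The sketch below isolates only the character class (the lookaheads are left out), and the class name and sample strings are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharClassCheck {
    // Only the character class from the question's pattern, without the lookaheads
    static final Pattern CLASS_ONLY = Pattern.compile("[\\w(\\r?\\n)]+");

    /** True when the whole input consists of characters from the class. */
    static boolean fullyMatches(String s) {
        return CLASS_ONLY.matcher(s).matches();
    }

    /** Mirrors the question's loop: every whole match is replaced by <br/>. */
    static String replaceMatches(String s) {
        Matcher m = CLASS_ONLY.matcher(s);
        StringBuffer sb = new StringBuffer(s.length());
        while (m.find()) {
            m.appendReplacement(sb, "<br/>");
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Parentheses and the question mark are ordinary members of the class
        System.out.println(fullyMatches("(?)"));
        // The words are swallowed together with the line break
        System.out.println(replaceMatches("ab\ncd ef"));
    }
}
```

Matching only the line break itself, and deciding separately whether it sits at the top level, avoids swallowing the surrounding text.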
PM 77-1
  • I think you want to replace all \n and \r with <br/> – StackFlowed Oct 08 '15 at 15:37
  • You are going to want to have at least two regex statements. You need to take the string, remove everything between
    and
    . Then apply your regex pattern replace to
    – Hard Tacos Oct 08 '15 at 15:38
  • Would surrounding the text with `` tags, then using jsoup, getting the root element, getting `ownText()` on it, and doing a `\n -> <br/>` replace on that work?
    – gla3dr Oct 08 '15 at 15:41
  • @gla3dr this might actually work for OP's question. I forgot about jsoup - reference: http://jsoup.org/ – Hard Tacos Oct 08 '15 at 15:42
  • @HardTacos [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/q/1732348/1065197) – Luiggi Mendoza Oct 08 '15 at 15:44
  • @gla3dr - I thought about `JSoup`. In particular using [`parseBodyFragment()`](http://jsoup.org/cookbook/input/parse-body-fragment) to fill in the gaps. – PM 77-1 Oct 08 '15 at 16:02
  • @LuiggiMendoza - Just because RegEx can not be used as a universal parser for HTML, it does not mean that RegEx is incapable of doing anything with it. Especially so, if your HTML code is not completely arbitrary. – PM 77-1 Oct 08 '15 at 18:01
  • Unless you want/need to treat the string containing HTML as a simple string and do some simple work like replacing á with a, yes. But if you want to parse the HTML in the string, as in your case, then use an HTML parser. That's the lesson here. – Luiggi Mendoza Oct 09 '15 at 00:00

2 Answers

So it's a little more complicated than what I said in my comment, but I think something like this might work:

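import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class BrDemo {
    public static void main(String[] args) {
        String text = "text11\n"
            + "text 21<p>tagged text1\n"
            + "tagged text2</p>\n"
            + "text 2";

        StringBuilder sb = new StringBuilder("<body>");
        sb.append(text);
        sb.append("</body>");
        Document doc = Jsoup.parseBodyFragment(sb.toString());
        // body() returns the single <body> Element; select() would return Elements
        Element body = doc.body();
        List<Node> children = body.childNodes();
        StringBuilder sb2 = new StringBuilder();
        for (Node n : children) {
            // Only top-level text nodes are touched; text inside <p> is left alone
            if (n instanceof TextNode) {
                TextNode tn = (TextNode) n;
                // text() stores plain text, so jsoup escapes the markup
                // on output (&lt;br/&gt;) unless it is unescaped afterwards
                tn.text(tn.getWholeText().replace("\n", "<br/>"));
            }
            sb2.append(n.toString());
        }
        System.out.println(sb2.toString());
    }
}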

Basically get all the Nodes, do a replace on the TextNodes, and put them back together. I'm not 100% sure this will work as-is, since I am not able to test it at the moment. But hopefully it gets the idea across.

What I said in my comment doesn't work because you have to be able to put the child elements back in place between the text. You can't do that if you just use ownText().

I haven't used Jsoup much myself, so improvements are welcome if anyone has any.

gla3dr
  • Maybe just using `` instead of `` would be better. I think it'll work like this using `parseBodyFragment()`. – gla3dr Oct 08 '15 at 16:51
  • No, [`replace()` is correct](http://stackoverflow.com/questions/9849015/java-regex-using-strings-replaceall-method-to-replace-newlines). That is a common misconception. [`replace()`](http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#replace(java.lang.CharSequence,%20java.lang.CharSequence)) will replace all occurrences of `\n`. [`replaceAll()`](http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#replaceAll(java.lang.String,%20java.lang.String)) is for replacing with regex. – gla3dr Oct 08 '15 at 16:51
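
The difference between the two methods mentioned in the last comment can be sketched with the standard library alone. The class and method names below are hypothetical, and the `\r?\n` pattern is an assumption about how Windows line endings might be handled:

```java
final class ReplaceVsReplaceAll {
    /** Literal replacement: every occurrence of "\n" becomes "<br/>". */
    static String withReplace(String s) {
        return s.replace("\n", "<br/>");
    }

    /** Regex replacement: the first argument is compiled as a pattern. */
    static String withReplaceAll(String s) {
        return s.replaceAll("\\r?\\n", "<br/>");
    }

    public static void main(String[] args) {
        System.out.println(withReplace("a\nb\nc"));        // a<br/>b<br/>c
        System.out.println(withReplaceAll("a\r\nb\nc"));   // a<br/>b<br/>c
        // A regex metacharacter makes the difference visible
        System.out.println("a.b.c".replace(".", "-"));     // a-b-c
        System.out.println("a.b.c".replaceAll(".", "-"));  // -----
    }
}
```

Both methods replace every occurrence; only `replaceAll()` interprets its first argument as a regular expression.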

This is how I got it to work (very close to the accepted answer):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;


public class HtmlText {

    public static void main(String[] args) {

        String test = "text1\ntext2<tag>tagged text \n tagged continue</tag> \ntext3";

        System.out.println("-----=============----------");
        System.out.println(test);
        System.out.println("-----=============----------");
        System.out.println(ReplaceWithSoup(test));
    }

    private static String ReplaceWithSoup(String source) {
        StringBuilder sbResult = new StringBuilder();
        Document doc = Jsoup.parseBodyFragment(source);
        Element body = doc.body();
        for(Node node: body.childNodes()) {
            if(node instanceof TextNode) {
                TextNode tn = (TextNode) node;
                tn.text(tn.getWholeText().replace("\n","<br/>"));
            }

            sbResult.append(Parser.unescapeEntities(node.toString(), true));
        }

        return sbResult.toString();
    }
}
PM 77-1