
I saw the following regex online and wanted to use it in my Java application (using java.util.regex).

(?<=(<Anhang>))(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| )+?(?=(<\/Anhang>))

This is supposed to match anything enclosed between '<Anhang>' and '</Anhang>'.

It works fine in a JavaScript engine but I can't get it to work in Java.

Here I tested it with a JavaScript engine on regex101 against this text:

BLALBLA BLA BLA <Anhang> 
gonegone gone gone ,os .psd
</Anhang> ajdajadw

Which produced the following result:

[Screenshot: regex101 match result]

So I went ahead and tried it in "Java Regular Expression Tester", but it either didn't match the text or produced a syntax error. I know that I have to escape certain characters, but I couldn't get it to work. Here is what I tried:

(?<=(<Anhang>))(\\w|\\d|\\n|[().,\-:;@#$%^&*\[\\]\"'+–/"/®°⁰!?{}|`~]| )+?(?=(<\"Anhang>))

(?<=(<Anhang>))(\\w|\\d|\\n|[().,\-:;@#$%^&*\[\\]\"'+–/"/®°⁰!?\{\}|`~]| )+?(?=(<\"Anhang>))

(?<=(<Anhang>))(\\w|\\d|\\n|[().,\\\\-:;@#$%^&*\[\\]\"'+–/"/®°⁰!?\{\}|`~]| )+?(?=(<\"Anhang>))
Oblivial
  • Have you tried `(?<=())(\\w|\\d|\\n|[().,\\-:;@#$%^&*\\[\\]\"\'+–/\/®°⁰!?{}|\`~]| )+?(?=())`? – dan1st Jan 23 '20 at 14:35
  • I did, "Unable to execute regular expression. java.util.regex.PatternSyntaxException: Illegal character range near index 36" – Oblivial Jan 23 '20 at 14:37
  • There is no formal, standardized regex language. Rather, it's a loosely grouped class of languages that vary from one engine to another. Even within a single parent language, like Java, your regex will depend on the actual library that you're using. For example, some regex engines have no support for [lookarounds](https://www.regular-expressions.info/lookaround.html). Others require a lookaround to be fixed-width, while others have no such restriction. .NET has [balanced groups](https://stackoverflow.com/questions/17003799), which few other engines have. Etc. – JDB still remembers Monica Jan 23 '20 at 14:40
  • Although in this case I'd guess that you didn't sufficiently escape the slashes: `\[\\]\"` should probably be `\\[\\]\"` – JDB still remembers Monica Jan 23 '20 at 14:43
  • Also, `(?=())` should probably be `(?=())` – JDB still remembers Monica Jan 23 '20 at 14:44
  • Thanks for the Clarification, I added information regarding the implemented Java Library – Oblivial Jan 23 '20 at 15:11
  • You should probably take a look at [Using regular expressions to parse HTML: why not?](https://stackoverflow.com/q/590747) and [Can you provide some examples of why it is hard to parse XML and HTML with a regex?](https://stackoverflow.com/q/701166). Also see this mandatory link: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Pshemo Jan 23 '20 at 15:12

1 Answer


Your regex is overcomplicated, and appears to be malformed as well. It looks like you just want the text between the <Anhang> tags, so maybe try something simpler, like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern regex = Pattern.compile(".*<Anhang>(.+?)</Anhang>.*", Pattern.DOTALL);

String s = "BLALBLA BLA BLA <Anhang> \n" +
           "gonegone gone gone ,os .psd\n" +
           "</Anhang> ajdajadw";

Matcher m = regex.matcher(s);

if (m.matches()) {
    // group(1) is the text inside the tags; group() would return the entire match
    String capturedGroup = m.group(1);
}

Creating a Pattern and specifying Pattern.DOTALL instead of using String.matches() is important, as it allows the . to match newline characters.
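To make the effect of the flag concrete, here is a minimal sketch that runs the same pattern with and without Pattern.DOTALL. The class name DotallDemo and the helper extract are made up for illustration; only the pattern and the sample text come from the snippet above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallDemo {
    // Hypothetical helper: returns the text between the tags,
    // or null if the pattern does not match the entire input.
    static String extract(String input, int flags) {
        Pattern p = Pattern.compile(".*<Anhang>(.+?)</Anhang>.*", flags);
        Matcher m = p.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "BLALBLA BLA BLA <Anhang> \n"
                 + "gonegone gone gone ,os .psd\n"
                 + "</Anhang> ajdajadw";

        // Without DOTALL, '.' stops at line breaks, so matches() fails: prints null
        System.out.println(extract(s, 0));

        // With DOTALL, '.' also matches '\n', so group 1 holds the tag content
        System.out.println(extract(s, Pattern.DOTALL));
    }
}
```

If you would rather not pass a flag, the inline form (?s) at the start of the pattern should have the same effect as Pattern.DOTALL.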

However, I think it's worth mentioning that regex is generally the wrong tool for parsing XML or HTML. There are dedicated parsing libraries for that, which I suggest you look into. Using one avoids the risk of a "works in 99% of the cases" regex causing bugs in your code.
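As a rough illustration of the parser route, the sketch below uses the DOM parser that ships with the JDK. It assumes the input is a well-formed XML document, which the sample text in the question is not on its own, so the wrapping root element is an assumption added here. The class name XmlDemo and the helper firstAnhang are likewise made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XmlDemo {
    // Hypothetical helper: returns the text content of the first <Anhang>
    // element, or null if the document contains none.
    static String firstAnhang(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("Anhang");
            return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
        } catch (Exception e) {
            // Malformed input ends up here instead of silently mis-matching
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumption: the text is wrapped in a root element to make it well-formed
        String xml = "<root>BLALBLA BLA BLA <Anhang> \n"
                   + "gonegone gone gone ,os .psd\n"
                   + "</Anhang> ajdajadw</root>";
        System.out.println(firstAnhang(xml));
    }
}
```

Unlike the regex, this fails loudly on malformed input, which is usually what you want when the data is supposed to be XML.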

Jordan
  • OR don't use regex at all. If you have HTML use HTML parser instead. BTW why `.*` at start and end of regex? Simply use `find()` instead of `matches()` and regex will not be forced to match entire text. – Pshemo Jan 23 '20 at 14:56
  • @Pshemo The `.*` is necessary because of the additional text before and after the tags. But I agree that using an actual XML/HTML parser is usually a better approach. It might be overkill if this is just a simple one-off exercise, but it's definitely worth bringing up the fact that HTML cannot actually be fully validated via regex. – Jordan Jan 23 '20 at 15:03
  • `.*` is only necessary when we use `.matches()`, which requires the regex to match the *entire* text. If we use `.find()` instead, the regex engine checks whether the pattern can be matched in any *part* of the text, which eliminates the need to surround the regex with `.*`. – Pshemo Jan 23 '20 at 15:09
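
The `find()` approach from the comments could look roughly like the sketch below. The class name FindDemo and the helper extractAll are made up for illustration; the pattern is the answer's pattern with the surrounding `.*` removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindDemo {
    // Hypothetical helper: collects the content of every <Anhang>...</Anhang> pair.
    static List<String> extractAll(String input) {
        // No leading or trailing ".*" needed, because find() searches within the text
        Pattern p = Pattern.compile("<Anhang>(.+?)</Anhang>", Pattern.DOTALL);
        Matcher m = p.matcher(input);
        List<String> results = new ArrayList<>();
        while (m.find()) {
            results.add(m.group(1)); // group 1 is the text between the tags
        }
        return results;
    }

    public static void main(String[] args) {
        String s = "BLALBLA BLA BLA <Anhang> \n"
                 + "gonegone gone gone ,os .psd\n"
                 + "</Anhang> ajdajadw";
        System.out.println(extractAll(s));
    }
}
```

A side benefit of looping over `find()` is that it picks up every tag pair in the text, whereas the `matches()` version captures only one.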