
I am attempting to write a spoiler identification system so that any spoilers in a string are replaced with a specified spoiler character.

I want to match a string surrounded by square brackets, such that the contents within the square brackets are capture group 1 and the whole string, including the surrounding brackets, is the match.

I am currently using \[(.*?]*)\], a slight modification of the expression found in this answer here, as I also want nested square brackets to be a part of capture group 1.

That expression works for simple cases and matches the following:

  • Jim ate a [sandwich] matches [sandwich] with sandwich as group 1
  • Jim ate a [sandwich with [pickles and onions]] matches [sandwich with [pickles and onions]] with sandwich with [pickles and onions] as group 1
  • [[[[] matches [[[[] with [[[ as group 1
  • []]]] matches []]]] with ]]] as group 1

However, if I want to match the following, it does not work as expected:

  • Jim ate a [sandwich with [pickles] and [onions]] matches both:
    • [sandwich with [pickles] with sandwich with [pickles as group 1
    • [onions]] with onions] as group 1
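The behaviour described above can be reproduced with a few lines of Java. This is only a sketch for checking the expression against test inputs; the class name `LazyBracketDemo` and the helper `describeMatches` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyBracketDemo {
    // The expression from the question, escaped for a Java string literal.
    static final Pattern LAZY = Pattern.compile("\\[(.*?]*)\\]");

    // Returns every match formatted as "whole match -> group 1".
    static List<String> describeMatches(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = LAZY.matcher(input);
        while (m.find()) {
            out.add(m.group() + " -> " + m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "Jim ate a [sandwich with [pickles] and [onions]]";
        for (String line : describeMatches(input)) {
            System.out.println(line);
        }
    }
}
```

The lazy `.*?` stops at the first `]` it can, so the match ends right after `[pickles]` and a second, separate match starts at `[onions]]`.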

What expression should I use such that it matches [sandwich with [pickles] and [onions]] with sandwich with [pickles] and [onions] as group 1?

EDIT:

As it seems impossible to achieve this in Java using regex, is there an alternative solution?

EDIT 2:

I also want to be able to split the string by each match found, so an alternative to regular expressions would be harder to implement due to String.split(regex) being convenient. Here's an example:

  • Jim ate a [sandwich] with [pickles] and [dried [onions]] matches all:
    • [sandwich] with sandwich as group 1
    • [pickles] with pickles as group 1
    • [dried [onions]] with dried [onions] as group 1

And the split sentence should look like:

Jim ate a
with
and
driima
  • It isn't possible with java or javascript regex. – Casimir et Hippolyte Oct 17 '15 at 15:09
  • Can I have an explanation as to why it isn't possible? And the question was for regex, not Java or Javascript regex. Are you telling me that it might be possible in other languages? – driima Oct 17 '15 at 15:10
  • Yes, regex engines are different between languages, that's why it doesn't make sense to ask a regex question without the used language or application. To match an unknown level of nested brackets, you need the recursion feature (available in PCRE, Perl) or the balancing group feature (available in .net). Java and javascript don't have one of these features. – Casimir et Hippolyte Oct 17 '15 at 15:15
  • Thank you. I will seek alternative solutions. – driima Oct 17 '15 at 15:16
  • I don't understand how the last example matches two things as group 1. Did you mean "as group 2" the second time? And what is the desired output of this: `[one] between [two]`? – Al.G. Oct 17 '15 at 15:18
  • The alternative is simple: build your own parser that walks character by character, use a "stack" variable, when the char is an opening bracket, increment it, when the char is a closing bracket decrement it. When the stack is zero, the brackets are balanced. – Casimir et Hippolyte Oct 17 '15 at 15:19
  • @Al.G. No. It matches two strings, each with one capturing group. Not one string with two capturing groups. – driima Oct 17 '15 at 15:20
  • @Al.G. The desired output of `[one] between [two]` is two matches. **Match 1** would be `[one]` with group 1 being `one`. **Match 2** would be `[two]` with group 1 being `two`. – driima Oct 17 '15 at 15:26
  • If your goal is to replace each bracketed part with something else (as in a template system), another solution is to repeatedly replace the innermost brackets with `\[([^\]\[]*)\]` until there is nothing left to replace. – Casimir et Hippolyte Oct 17 '15 at 15:33
  • My goal is to create a spoiler system such that anything inside `[]` is to be hidden. – driima Oct 17 '15 at 15:41
  • Try this `\[(\[.*])\]` – james jelo4kul Oct 17 '15 at 16:47
  • That won't match as intended. – driima Oct 17 '15 at 16:49
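The innermost-first replacement suggested in the comments could be sketched roughly as follows. This is an illustration rather than anyone's posted code: the class name `InnermostReplacer`, the method `maskInnermostFirst`, and the choice to replace each group with one mask character per hidden character are all assumptions.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InnermostReplacer {
    // A bracket pair whose content contains no other brackets.
    static final Pattern INNERMOST = Pattern.compile("\\[([^\\]\\[]*)\\]");

    // Masks one innermost group at a time until no balanced pair remains.
    // Each pass removes two bracket characters, so the loop terminates.
    static String maskInnermostFirst(String input, char mask) {
        String current = input;
        Matcher m = INNERMOST.matcher(current);
        while (m.find()) {
            char[] fill = new char[m.group(1).length()];
            Arrays.fill(fill, mask);
            current = current.substring(0, m.start())
                    + new String(fill)
                    + current.substring(m.end());
            m = INNERMOST.matcher(current); // rescan: an outer pair may now be innermost
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(maskInnermostFirst("Jim ate a [dried [onions]]", '*'));
    }
}
```

Because the brackets themselves are dropped on each pass, a nested group ends up shorter than the original text, and a stray closing bracket with no partner is left untouched.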

1 Answer


More direct solution

This solution returns the text between balanced bracket groups and omits empty or whitespace-only substrings.

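// Requires: import java.util.ArrayList; import java.util.List;
public static List<String> getStrsBetweenBalancedSubstrings(String s, Character markStart, Character markEnd) {
    List<String> subTreeList = new ArrayList<String>();
    int level = 0;             // current nesting depth
    int lastCloseBracket = 0;  // index just past the most recent matched closing marker
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == markStart) {
            level++;
            // A top-level opening marker ends the text run that started at lastCloseBracket.
            if (level == 1 && i != 0 && i != lastCloseBracket &&
                    !s.substring(lastCloseBracket, i).trim().isEmpty()) {
                subTreeList.add(s.substring(lastCloseBracket, i).trim());
            }
        } else if (c == markEnd) {
            // A closing marker with no open partner (level == 0) is left as plain text.
            if (level > 0) {
                level--;
                lastCloseBracket = i + 1;
            }
        }
    }
    // Add whatever follows the last matched closing marker.
    if (lastCloseBracket != s.length() && !s.substring(lastCloseBracket).trim().isEmpty()) {
        subTreeList.add(s.substring(lastCloseBracket).trim());
    }
    return subTreeList;
}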

Then, use it as

String input = "Jim ate a [sandwich][ooh] with [pickles] and [dried [onions]] and ] [an[other] match] and more here";
List<String> between_balanced =  getStrsBetweenBalancedSubstrings(input, '[', ']');
System.out.println("Result: " + between_balanced);
// => Result: [Jim ate a, with, and, and ], and more here]
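The same depth counter can also do the replacement the question is ultimately after, without splitting first. The following is a minimal sketch under assumptions: the class name `SpoilerMasker`, the method `mask`, and the decision to drop the outer brackets and emit one mask character per hidden character are illustrative choices, not part of the answer above.

```java
class SpoilerMasker {
    // Replaces each top-level balanced group with mask characters, one per
    // character of its content (nested brackets count as content).
    // Stray closing markers and an unclosed opening marker are kept as text.
    static String mask(String s, char open, char close, char maskChar) {
        StringBuilder out = new StringBuilder();
        int level = 0;        // current nesting depth
        int groupStart = -1;  // index of the opening marker of the current top-level group
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == open) {
                if (level == 0) groupStart = i;
                level++;
            } else if (c == close && level > 0) {
                level--;
                if (level == 0) {
                    // The group is balanced: hide everything between its outer markers.
                    for (int k = groupStart + 1; k < i; k++) out.append(maskChar);
                    groupStart = -1;
                }
            } else if (level == 0) {
                out.append(c); // plain text outside any group
            }
        }
        // An opening marker that was never closed: emit the remainder unchanged.
        if (level > 0) out.append(s, groupStart, s.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(mask("Jim ate a [sandwich] with [dried [onions]]", '[', ']', '*'));
    }
}
```

Text inside a group is buffered implicitly by not appending it, so nothing is hidden until the matching outer marker is actually seen.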

Original answer (more complex, shows a way to extract nested parentheses)

You can also extract all substrings inside balanced brackets, then build an alternation regex from them and split on it:

String input = "Jim ate a [sandwich] with [pickles] and [dried [onions]] and ] [an[other] match]";
List<String> balanced = getBalancedSubstrings(input, '[', ']', true);
System.out.println("Balanced ones: " + balanced);
List<String> rx_split = new ArrayList<String>();
for (String item : balanced) {
    rx_split.add("\\s*" + Pattern.quote(item) + "\\s*");
}
String rx = String.join("|", rx_split);
System.out.println("In-betweens: " + Arrays.toString(input.split(rx)));

And this function will find all []-balanced substrings:

public static List<String> getBalancedSubstrings(String s, Character markStart, 
                                     Character markEnd, Boolean includeMarkers) {
    List<String> subTreeList = new ArrayList<String>();
    int level = 0;
    int lastOpenBracket = -1;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == markStart) {
            level++;
            if (level == 1) {
                lastOpenBracket = (includeMarkers ? i : i + 1);
            }
        }
        else if (c == markEnd) {
            if (level == 1) {
                subTreeList.add(s.substring(lastOpenBracket, (includeMarkers ? i + 1 : i)));
            }
            if (level > 0) level--;
        }
    }
    return subTreeList;
}

See IDEONE demo

Result of the code execution:

Balanced ones: [[sandwich], [pickles], [dried [onions]], [an[other] match]]
In-betweens: [Jim ate a, with, and, and ]]

Credits: getBalancedSubstrings is based on peter.murray.rust's answer to the post "How to split this “Tree-like” string in Java regex?".

Wiktor Stribiżew