10

I have a String that contains 2 or 3 company names, each enclosed in parentheses. Each company name can itself contain words in parentheses. I need to separate the names using regular expressions but couldn't figure out how.

My inputStr:

(Motor (Sport) (racing) Ltd.) (Motorsport racing (Ltd.)) (Motorsport racing Ltd.)
or 
(Motor (Sport) (racing) Ltd.) (Motorsport racing (Ltd.))

The expected result is:

str1 = Motor (Sport) (racing) Ltd.
str2 = Motorsport racing (Ltd.)
str3 = Motorsport racing Ltd.

My code:

String str1, str2, str3;
Pattern p = Pattern.compile("\\((.*?)\\)");
Matcher m = p.matcher(inputStr);
int index = 0;
while(m.find()) {

    String text = m.group(1);
    text = text != null && StringUtils.countMatches(text, "(") != StringUtils.countMatches(text, ")") ? text + ")" : text;

    if (index == 0) {
        str1= text;
    } else if (index == 1) {
        str2 = text;
    } else if (index == 2) {
        str3 = text;
    }

    index++;
}

This works great for str2 and str3 but not for str1.

Current result:

str1 = Motor (Sport)
str2 = Motorsport racing (Ltd.)
str3 = Motorsport racing Ltd.
xingbin
Eqr444
  • can you tell us more about the input? for example I can see that the company information ends with `(Ltd.)` or `Ltd.` is that always set there or it can be changed? – YCF_L May 08 '18 at 10:09
  • Try `\(((?:[^()]+|\([^\)]*\))*)\)`. Live demo (matches at right): https://regex101.com/r/ppnfjy/1 – revo May 08 '18 at 10:10
  • You shouldn’t use regexes for nested structures. But if you really must, look here: https://stackoverflow.com/questions/47162098/is-it-possible-to-match-nested-brackets-with-regex-without-using-recursion-or-ba – Erwin Bolwidt May 08 '18 at 10:14
  • @ErwinBolwidt Do you see a need for matching nested parentheses in question? – revo May 08 '18 at 10:33
  • @revo do you **not**? – Kevin Anderson May 08 '18 at 10:38
  • @KevinAnderson No, leaving outermost parentheses alone, you don't see any nested parentheses. Matching outermost parentheses doesn't fall into recursions it's a linear match. In Regular Expressions world nesting means more than one level which brings a need for recursive matches. – revo May 08 '18 at 10:40
  • @revo obviously – Erwin Bolwidt May 08 '18 at 10:44
  • @ErwinBolwidt What you are referring to as nested structure that makes it hard for Regular Expressions to deal with is not above patterns. Those in OP are trivial for an engine to match. Even ancient POSIX BRE can do it. Read the comment above yours. – revo May 08 '18 at 10:47
  • @revo do you have a crystal ball for the op’s requirements? It doesn’t specify a limit to the nesting level. – Erwin Bolwidt May 08 '18 at 10:55
  • @ErwinBolwidt I see what is there. You're the one who makes assumptions. – revo May 08 '18 at 11:05

4 Answers

9

You can solve this problem without regex; refer to this question about how to find the outermost parentheses.

Here is an example:

import java.util.Stack;

public class Main {

    public static void main(String[] args) {
        String input = "(Motor (Sport) (racing) Ltd.) (Motorsport racing (Ltd.)) (Motorsport racing Ltd.)";
        for (int index = 0; index < input.length(); ) {
            if (input.charAt(index) == '(') {
                int close = findClose(input, index);  // index of the matching closing parenthesis
                if (close < 0) {
                    throw new IllegalArgumentException("Unbalanced parentheses at index " + index);
                }
                System.out.println(input.substring(index + 1, close));
                index = close + 1;  // skip the content, including nested parentheses
            } else {
                index++;
            }
        }
    }

    // Returns the index of the ')' that closes the '(' at start, or -1 if there is none.
    private static int findClose(String input, int start) {
        Stack<Integer> stack = new Stack<>();
        for (int index = start; index < input.length(); index++) {
            if (input.charAt(index) == '(') {
                stack.push(index);
            } else if (input.charAt(index) == ')') {
                stack.pop();
                if (stack.isEmpty()) {
                    return index;
                }
            }
        }
        // only reached if the parentheses are unbalanced
        return -1;
    }

}

Output:

Motor (Sport) (racing) Ltd.
Motorsport racing (Ltd.)
Motorsport racing Ltd.
xingbin
  • -1 This should be at most a comment, looking at it as a suggestion for a different approach. Answering a question 'I need help with method A' with 'use method B' is not helpful in the sense of error removal – ifloop May 08 '18 at 10:25
  • @ifloop Even if method B is more efficient than method A? – xingbin May 08 '18 at 10:30
  • Yes. You keep confusing a suggestion with an answer. If the question was _What is the best/most efficient way to do xyz_ or if OP added _or is there a better/simpler/more performant way of doing it_, then your contribution would classify as an answer. – ifloop May 08 '18 at 10:37
  • @ifloop On a grammatical level, you're right. On the problem-solving approach, I disagree. Alternative approaches are useful. You might want to check this meta post on SO being unwelcoming: https://meta.stackoverflow.com/questions/366692/how-do-you-know-stack-overflow-feels-unwelcoming – Tamas Rev May 08 '18 at 13:29
  • I am aware of that, but I am also aware of Stack Overflow itself being a meme of _Need help with something? Go to SO, they cannot help you with your problem, but show you other ways that you didn't ask for._ When asked if it makes a difference filling the gas tank of your car full or only half in terms of efficiency and weight, you can bet your a** SO tells you that electric cars are better, or to car pool and let the owner fill the tank, or to go by train. All nice suggestions but totally NOT answers to the original question – ifloop May 08 '18 at 15:12
  • @ifloop You're right, except that the OP can say so if he doesn't want a different approach. – Napstablook May 09 '18 at 11:40
  • Different approaches to a problem broaden your perspective and help you understand the issue better. If that alternative approach doesn't solve OP's problem then he can easily mention that as a response. – Napstablook May 09 '18 at 11:41
7

If we assume that the parentheses nest at most two levels deep, we can do this without too much magic. I would go with this code:

List<String> matches = new ArrayList<>();
Pattern p = Pattern.compile("\\([^()]*(?:\\([^()]*\\)[^()]*)*\\)");
Matcher m = p.matcher(inputStr);
while (m.find()) {
    String fullMatch = m.group();
    matches.add(fullMatch.substring(1, fullMatch.length() - 1));
}

Explanation:

  • First we match a parenthesis: \\(
  • Then we match some non-parenthesis characters: [^()]*
  • Then, zero or more times, (?:...)*, we will see some stuff within parentheses, and then some non-parentheses again: \\([^()]*\\)[^()]*. It's important that we don't allow any more parentheses within the inside parentheses.
  • And then the closing parenthesis comes: \\)
  • m.group(); returns the actual full match.
  • fullMatch.substring(1, fullMatch.length() - 1) removes the parentheses from the start and the end. You could do it with another group too. I just didn't want to make the regex uglier.
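
The "another group" variant mentioned in the last bullet could look roughly like this. It is only a sketch under the same assumption of at most two nesting levels; the class name `TwoLevelGroups` and the helper `split` are made up for illustration and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoLevelGroups {

    // Same pattern as above, plus a capturing group around everything
    // between the outermost parentheses.
    private static final Pattern OUTER =
            Pattern.compile("\\(([^()]*(?:\\([^()]*\\)[^()]*)*)\\)");

    // Hypothetical helper: returns the content of each top-level group.
    static List<String> split(String input) {
        List<String> names = new ArrayList<>();
        Matcher m = OUTER.matcher(input);
        while (m.find()) {
            names.add(m.group(1)); // group 1 excludes the outer parentheses
        }
        return names;
    }

    public static void main(String[] args) {
        String input = "(Motor (Sport) (racing) Ltd.) (Motorsport racing (Ltd.)) (Motorsport racing Ltd.)";
        for (String name : split(input)) {
            System.out.println(name);
        }
    }
}
```

With the capturing group, `m.group(1)` replaces the `substring` call, at the cost of one more pair of parentheses in the pattern.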
Boann
Tamas Rev
6

Why not just solve it using a stack? It takes only O(n) time.

  1. Parse the string character by character. Every time you come across a '(', push it onto the stack, and every time you come across a ')', pop from the stack. Any other character inside a company name goes into a buffer.
  2. If the stack is not empty when pushing a '(', then that parenthesis is inside a company name, so also put it in the buffer.
  3. Similarly, if the stack isn't empty after popping, then put the ')' in the buffer, as it is part of the company name.
  4. If the stack is empty after popping, the current company name has ended: the buffer holds the name of the company, so store it and clear the buffer.

    String string = "(Motor (Sport) (racing) Ltd.) (Motorsport racing (Ltd.)) (Motorsport racing Ltd.)";
    List<String> result = new ArrayList<>();
    StringBuilder buffer = new StringBuilder();

    Stack<Character> stack = new Stack<>();
    for (int j = 0; j < string.length(); j++) {
        char c = string.charAt(j);
        if (c == '(') {
            if (!stack.empty()) {
                buffer.append('(');  // a nested '(' is part of the company name
            }
            stack.push('(');
        } else if (c == ')') {
            stack.pop();
            if (stack.empty()) {
                // outermost ')' reached: the buffer holds one complete name
                result.add(buffer.toString());
                buffer = new StringBuilder();
            } else {
                buffer.append(')');  // a nested ')' is part of the company name
            }
        } else if (!stack.empty()) {
            // ignore characters between top-level groups, such as the separating spaces
            buffer.append(c);
        }
    }

    for (String name : result) {
        System.out.println(name);
    }
    
Napstablook
  • -1 This should be a comment, looking at it as a suggestion for a different approach. Answering a question 'I need help with method A' with 'use method B' is not helpful in the sense of error removal (see comments of neng_liu's answer) – ifloop May 08 '18 at 10:43
  • @ifloop it's cool that you explained your -1. However, comments and -1s like this make SO an unwelcoming place. I think it's okay to post answers that go outside of the box. Sometimes those are the popular answer, e.g. when OP wants to parse XML with regex. – Tamas Rev May 08 '18 at 13:25
  • Since the only thing you ever push to the stack is `'('`, you don't need a real stack, just an `int depth` to keep track of the stack depth, i.e. the number of unclosed parentheses you have. – Boann May 09 '18 at 02:14
  • @Boann You're right. I didn't think of that at the time. – Napstablook May 09 '18 at 11:44
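
Boann's suggestion of an `int depth` instead of a real stack could be sketched like this. It is an illustration only, not code from either commenter: the class name `DepthSplitter`, the method `split`, and the choice to throw on unbalanced input are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

final class DepthSplitter {

    // Hypothetical helper: returns the content of each top-level "( ... )" group.
    static List<String> split(String input) {
        List<String> names = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        int depth = 0; // number of currently unclosed '('
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '(') {
                if (depth > 0) {
                    buffer.append(c); // nested '(' belongs to the name
                }
                depth++;
            } else if (c == ')') {
                if (depth == 0) {
                    throw new IllegalArgumentException("Unmatched ')' at index " + i);
                }
                depth--;
                if (depth == 0) {
                    names.add(buffer.toString()); // outermost group closed
                    buffer.setLength(0);
                } else {
                    buffer.append(c); // nested ')' belongs to the name
                }
            } else if (depth > 0) {
                buffer.append(c); // characters outside any group are skipped
            }
        }
        if (depth != 0) {
            throw new IllegalArgumentException("Unclosed '('");
        }
        return names;
    }

    public static void main(String[] args) {
        String input = "(Motor (Sport) (racing) Ltd.) (Motorsport racing (Ltd.)) (Motorsport racing Ltd.)";
        split(input).forEach(System.out::println);
    }
}
```

Unlike the regex answers, this version places no limit on the nesting depth.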
4

Every opening parenthesis has a closing counterpart, and apart from the outermost pair I don't see any possibility of nested parentheses. Balanced parentheses with no further nesting lead to this regex:

\(((?:[^()]*|\([^)]*\))*)\)

You only need to access the first capturing group.

Live demo

Breakdown:

  • \( Match an opening parenthesis
    • ( Start of capturing group 1
      • (?: Start of non-capturing group 1
        • [^()]* Match zero or more characters that are neither ( nor )
        • | Or
        • \([^\)]*\) Match group of parentheses
      • )* As much as possible, end of non-capturing group 1
    • ) End of capturing group 1
  • \) Match a closing parenthesis
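
In Java, the pattern above needs its backslashes doubled inside a string literal. A minimal sketch of using it follows; the class name `FirstGroupDemo` and the helper `split` are hypothetical and only serve the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstGroupDemo {

    // The regex from this answer, escaped for a Java string literal.
    private static final Pattern COMPANY =
            Pattern.compile("\\(((?:[^()]*|\\([^)]*\\))*)\\)");

    // Hypothetical helper: collects capturing group 1 of every match.
    static List<String> split(String input) {
        List<String> names = new ArrayList<>();
        Matcher m = COMPANY.matcher(input);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String input = "(Motor (Sport) (racing) Ltd.) (Motorsport racing (Ltd.)) (Motorsport racing Ltd.)";
        List<String> names = split(input);
        for (int i = 0; i < names.size(); i++) {
            System.out.println("str" + (i + 1) + " = " + names.get(i));
        }
    }
}
```

Collecting the matches in a list also covers the case where the input holds only two company names.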
revo