4

Take an example.

 public static FieldsConfig getFieldsConfig(){
    if(xxx) {
      sssss;
    }
   return;
}

I wrote a regex: "\\s*public\\s*static.*getFieldsConfig\\(.*\\)\\s*\\{"

It matches only the first line. How can I match all the way to the closing "}" of the method?

Help me. Thanks.

Edit: The content of the method body {} is not specified, but the pattern is always like this:

  public static xxx theKnownMethodName(xxxx) {
    xxxxxxx
  }
Victor Choy
  • You can't parse Java with a regex. – Andy Turner Mar 10 '16 at 09:54
  • See http://stackoverflow.com/questions/546433/regular-expression-to-match-outer-brackets - this is about brackets, but it's no different for braces. – Andy Turner Mar 10 '16 at 09:56
  • do you want to match the if statement too? – Pooya Mar 10 '16 at 09:57
  • I wanna match a complete method implementation. – Victor Choy Mar 10 '16 at 10:00
  • You can't really parse *anything* beyond a regular language with a regex. But you can sure scan it. – user207421 Mar 10 '16 at 10:01
  • @VictorChoy What people are telling you: regular expressions are not suitable for programming languages. You need a **parser**. Unless you can guarantee to 100% that all the methods your code will be dealing with are of the same pattern. In other words: maybe you want to add some information where the source you want to "match" on is coming from; and what you want to do with it. – GhostCat Mar 10 '16 at 10:25
  • @Jägermeister Thanks. I just want to know if it is possible in a simple regex way. If impossible, I will read the file to parse by myself. – Victor Choy Mar 10 '16 at 10:53

7 Answers

4

I decided to take it one step further ;)

Here's a regex that'll give you the modifiers, type, name and body of a function in different capture groups:

((?:(?:public|private|protected|static|final|abstract|synchronized|volatile)\s+)*)
\s*(\w+)\s*(\w+)\(.*?\)\s*({(?:{[^{}]*}|.)*?})

It handles nested braces (@callOfCode it is (semi-)possible with regex ;) and a fixed set of modifiers.

It doesn't handle complications such as braces inside comments or string literals, but it'll work for the simplest cases.

Regards

Regex101 sample here

Edit: And to answer your question ;), what you're interested in is capture group 4.

Edit 2: As I said - simple ones. But you could make it more complicated to handle more complicated methods. Here's an updated version that handles one more level of nesting.

((?:(?:public|private|protected|static|final|abstract|synchronized|volatile)\s+)*)
\s*(\w+)\s*(\w+)\(.*?\)\s*({(?:{[^{}]*(?:{[^{}]*}|.)*?[^{}]*}|.)*?})

And you could add another level... and another... But as someone commented, this shouldn't be done with regex. It does, however, handle simple methods.
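A minimal Java sketch of how the first pattern above might be applied, assuming the two lines are joined into a single regex and DOTALL is enabled so that . also matches line breaks. Java's regex engine rejects a bare {, so the literal braces are escaped here. The class name MethodGroupsDemo and the sample source are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MethodGroupsDemo {
    // The first pattern from this answer on one line, with literal braces escaped for Java.
    static final Pattern METHOD = Pattern.compile(
        "((?:(?:public|private|protected|static|final|abstract|synchronized|volatile)\\s+)*)"
        + "\\s*(\\w+)\\s*(\\w+)\\(.*?\\)\\s*(\\{(?:\\{[^{}]*\\}|.)*?\\})",
        Pattern.DOTALL);

    // Sample input modelled on the method in the question.
    static final String SAMPLE =
        "public static FieldsConfig getFieldsConfig(){\n"
        + "    if(xxx) {\n"
        + "      sssss;\n"
        + "    }\n"
        + "   return;\n"
        + "}\n";

    public static void main(String[] args) {
        Matcher m = METHOD.matcher(SAMPLE);
        if (m.find()) {
            System.out.println("modifiers: " + m.group(1).trim());
            System.out.println("type:      " + m.group(2));
            System.out.println("name:      " + m.group(3));
            System.out.println("body:      " + m.group(4)); // group 4 is the {...} body
        } else {
            System.out.println("No match found");
        }
    }
}
```

With only one level of nested braces supported, a body containing a block inside a block would be cut short, which is what Edit 2 addresses.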

SamWhan
  • @Tim007 Check Edit 2. You could add as many levels you want, but the regex will get increasingly complex. – SamWhan Mar 10 '16 at 11:44
  • @ClasG I've added some details in my new answer to explain OP why regex is very limited in real world scenarios. – callOfCode Mar 13 '16 at 02:02
2

Regex is definitely not the best tool for that, but if you want regex, and your code is well indented, you can try with:

^(?<indent>\s*)(?<mod1>\w+)\s(?<mod2>\w+)?\s*(?<mod3>\w+)?\s*(?<return>\b\w+)\s(?<name>\w+)\((?<arg>.*?)\)\s*\{(?<body>.+?)^\k<indent>\}

DEMO

It has additional named groups, which you can delete. It uses the indentation level to find the last }.
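A possible way to try the pattern above from Java, as a sketch: it assumes MULTILINE (so ^ matches at every line start) and DOTALL (so . crosses line breaks), and a sample whose closing } has exactly the same indentation as the method's first line. The class name and the sample are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IndentMatchDemo {
    // Pattern from this answer; the flags are an assumption about how the demo was run.
    static final Pattern METHOD = Pattern.compile(
        "^(?<indent>\\s*)(?<mod1>\\w+)\\s(?<mod2>\\w+)?\\s*(?<mod3>\\w+)?\\s*"
        + "(?<return>\\b\\w+)\\s(?<name>\\w+)\\((?<arg>.*?)\\)\\s*\\{(?<body>.+?)^\\k<indent>\\}",
        Pattern.MULTILINE | Pattern.DOTALL);

    // Consistently indented sample: the method and its closing brace share four spaces.
    static final String SAMPLE =
        "class Config {\n"
        + "    public static FieldsConfig getFieldsConfig() {\n"
        + "        if (xxx) {\n"
        + "            sssss();\n"
        + "        }\n"
        + "        return null;\n"
        + "    }\n"
        + "}\n";

    public static void main(String[] args) {
        Matcher m = METHOD.matcher(SAMPLE);
        if (m.find()) {
            System.out.println("name: " + m.group("name"));
            System.out.println(m.group()); // whole method, including its indentation
        } else {
            System.out.println("No match found");
        }
    }
}
```

Note that the method exactly as posted in the question would not match: its first line is indented by one space while its closing } is not indented at all.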

m.cekiera
1

You need to enable DOTALL mode. Then the dot will also match newline characters. Just include (?s) at the beginning of your regex.

 String s = "   public static FieldsConfig getFieldsConfig(){\n"
             + "   if(xxx) {\n"
             + "              sssss;\n"
             + "   }\n"
             + "      return;\n"
             +"}";
 Matcher m = Pattern.compile("(?s)\\s*public\\s+static\\s+\\w+?\\sgetFieldsConfig\\(\\s*\\).*").matcher(s);
 m.find();
 System.out.println(m.group());

The output is the whole method body, as you wanted. Without (?s) it matches only the first line. But you cannot parse Java code with regex; others have already said that. This regex will match everything from the beginning of the method signature to the end of the file. How do you make it stop when the end of the method body is reached? A method can contain many {....} blocks as well as many return; statements. Regex is not a magic wand.

callOfCode
1

Try this

((?<space>\h+)public\s+static\s+[^(]+\([^)]*?\)\s*\{.*?\k<space>\})|(public\s+static\s+[^(]+\([^)]*?\)\s*\{.*?\n\})

Explanation:
The regex captures a method block from the keyword public to the closing }. public and } must be preceded by the same whitespace, so your code must be well formatted : ) https://en.wikipedia.org/wiki/Indent_style

\h: matches whitespace but not newlines
(?<space>\h+): captures all whitespace before public in a group named space
public\s+static\s+: matches public static
[^(]: any character except (
[^)]: any character except )
\k<space>\}: the same whitespace as captured in space, followed by the closing }
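The explanation above can be tried out in Java (\h is supported since Java 8). This is only a sketch: it assumes DOTALL, and it assumes the indented method is indented with a single tab, which is a guess about the demo input rather than something the answer states. The names IndentBackrefDemo, findMethods and SAMPLE are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IndentBackrefDemo {
    // Pattern from this answer; DOTALL lets .*? run across line breaks.
    static final Pattern METHOD = Pattern.compile(
        "((?<space>\\h+)public\\s+static\\s+[^(]+\\([^)]*?\\)\\s*\\{.*?\\k<space>\\})"
        + "|(public\\s+static\\s+[^(]+\\([^)]*?\\)\\s*\\{.*?\\n\\})",
        Pattern.DOTALL);

    // One unindented method and one method indented with a tab (assumed input).
    static final String SAMPLE =
        "public static FieldsConfig getFieldsConfig(){\n"
        + "    if(xxx) {\n      sssss;\n    }\n   return;\n}\n"
        + "\nNO CAPTURE\n\n"
        + "\tpublic static FieldsConfig getFieldsConfig3(){\n"
        + "\t    if(xxx) {\n\t      sssss;\n\t    }\n\t   return;\n\t}\n";

    /** Collects every complete match found in the source text. */
    static List<String> findMethods(String source) {
        List<String> found = new ArrayList<>();
        Matcher m = METHOD.matcher(source);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        for (String method : findMethods(SAMPLE)) {
            System.out.println("MATCH:\n" + method + "\n");
        }
    }
}
```

One caveat worth checking against your own files: \k<space>\} is not anchored to the start of a line, so with space-only indentation (say four spaces for the method and eight for an inner block) the inner } is also preceded by the captured whitespace and the match can end there.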

Demo

Input:

public static FieldsConfig getFieldsConfig(){
    if(xxx) {
      sssss;
    }
   return;
}

NO CAPTURE

public static FieldsConfig getFieldsConfig2(){
    if(xxx) {
      sssss;
    }
   return;
}

NO CAPTURE

    public static FieldsConfig getFieldsConfig3(){
        if(xxx) {
          sssss;
        }
       return;
    }

NO CAPTURE

        public static FieldsConfig getFieldsConfig4(){
            if(xxx) {
              sssss;
            }
           return;
        }

Output:

MATCH 1
3.  [0-91]  `public static FieldsConfig getFieldsConfig(){
    if(xxx) {
      sssss;
    }
   return;
}`

MATCH 2
3.  [105-197]   `public static FieldsConfig getFieldsConfig2(){
    if(xxx) {
      sssss;
    }
   return;
}`

MATCH 3
1.  [211-309]   `   public static FieldsConfig getFieldsConfig3(){
        if(xxx) {
          sssss;
        }
       return;
    }`

MATCH 4
1.  [324-428]   `       public static FieldsConfig getFieldsConfig4(){
            if(xxx) {
              sssss;
            }
           return;
        }`
Tim007
1

Thank you all. After some consideration, I worked out an approach that is reasonably reliable in my situation. Here it is.

String regex = "\\s*public\\s+static\\s+[\\w\\.\\<\\>,\\s]+\\s+getFieldsConfig\\(.*?\\)\\s*\\{.*?\\}(?=\\s*(public|private|protected|static))";

String regex2 = "\\s*public\\s+static\\s+[\\w\\.\\<\\>,\\s]+\\s+getFieldsConfig\\(.*?\\)\\s*\\{.*?\\}(?=(\\s*}\\s*$))";

regex = "(" + regex +")|("+ regex2 + "){1}?";

Pattern pattern = Pattern.compile(regex, Pattern.DOTALL);

It matches my method body well.

PS: Yes, regex may not be a suitable way to parse a method strictly. Generally speaking, though, a regex takes less effort than writing a parser and works well in specific situations. Adjust it and it should work for you.
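For anyone who wants to try this, here is a self-contained sketch of the same idea with every backslash doubled for a Java string literal. The class name LookaheadDemo, the helper extract and the two sample sources are assumptions added for illustration; one sample has another method after getFieldsConfig, the other has it as the last member of the class.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaheadDemo {
    // Shared part: signature, then a lazy body that stops at a } accepted by the lookahead.
    static final String HEAD =
        "\\s*public\\s+static\\s+[\\w\\.\\<\\>,\\s]+\\s+getFieldsConfig\\(.*?\\)\\s*\\{.*?\\}";

    static final Pattern METHOD = Pattern.compile(
        "(" + HEAD + "(?=\\s*(public|private|protected|static)))"   // another member follows
        + "|(" + HEAD + "(?=(\\s*}\\s*$))){1}?",                     // only the class's } follows
        Pattern.DOTALL);

    static final String METHOD_TEXT =
        "    public static FieldsConfig getFieldsConfig() {\n"
        + "        if (xxx) {\n"
        + "            sssss();\n"
        + "        }\n"
        + "        return null;\n"
        + "    }\n";

    static final String FOLLOWED =
        "public class Config {\n" + METHOD_TEXT
        + "\n    public static void other() {\n    }\n}\n";

    static final String LAST = "public class Config {\n" + METHOD_TEXT + "}\n";

    /** Returns the matched method without surrounding whitespace, or null if nothing matched. */
    static String extract(String source) {
        Matcher m = METHOD.matcher(source);
        return m.find() ? m.group().trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(extract(FOLLOWED));
        System.out.println(extract(LAST));
    }
}
```

The lookahead is what keeps the lazy .*? from stopping at the } of the inner if block: that brace is followed by return, not by a modifier or the end of the class.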

Victor Choy
1

Victor, you asked me to look at your answer, so I decided to take the time to write a full review of it and give some hints. I'm not a regex professional, nor do I like regex very much. I'm currently working on a project that uses regex heavily, so I've seen and written enough of it to answer your question fairly reliably, and also to get sick of regexes. So let's start analysing your regex:

String regex ="\\s*public\\s*static.*getFieldsConfig\\(.*?\\)\\s*\\{.*\\}(?=\\s*(public|private|protected|static))";

String regex2 = "\\s*public\\s*static.*getFieldsConfig\\(.*?\\)\\s*\\{.*\\}(?=(\\s*}\\s*$))";

regex = "(" + regex +")|("+ regex2 + "){1}?";

I see you've split it into three parts for readability. That's a good idea. I'll start with the first part:

  • \\s*public\\s*static.*getFieldsConfig You allow any number of whitespace characters, including zero, between public and static, so it could match publicstatic. Always use \\s+ between words that must be separated by whitespace.
  • \\(.*?\\)\\s*\\{.*\\} You allow anything to appear between the first parentheses; it matches any symbol until ). Now we reach the part that makes your regex work differently from what you wanted. \\{.*\\} is a major mistake. Because .* is greedy, it will match everything up to the last } in the file that is followed by one of public, private, protected or static. I pasted your getFieldsConfig method into a Java file and tested it. Using only the first part of your regex ("\\s*public\\s*static.*getFieldsConfig\\(.*?\\)\\s*\\{.*\\}(?=\\s*(public|private|protected|static))") matched everything from your method up to the last method in the file.

There is no point in analysing the other parts step by step, because \\{.*\\} ruins everything. The second part (regex2) matches everything from your method to the last } in the file. Have you tried printing what your regex matches? Try it:

package com.tryRegex;

import java.io.File;
import java.io.IOException;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TryRegex{

    public static void main(String[] args) throws IOException{
        File yourFile = new File("tryFile.java");
        Scanner scanner = new Scanner(yourFile, "UTF-8");
        String text = scanner.useDelimiter("\\A").next();  // `\\A` matches only the beginning of the input, so the delimiter never occurs again and next() returns the whole file.

        String regex ="\\s*public\\s*static.*getFieldsConfig\\(.*?\\)\\s*\\{.*\\}(?=\\s*(public|private|protected|static))";
        String regex2 = "\\s*public\\s*static.*getFieldsConfig\\(.*?\\)\\s*\\{.*\\}(?=(\\s*}\\s*$))";
        regex = "(?s)(" + regex +")|("+ regex2 + "){1}?";     // I've included (?s) because the text read from the file contains newline chars. Without (?s) the regex would match nothing unless your method is written on a single line.

        Matcher m = Pattern.compile(regex).matcher(text);

        System.out.println(m.find() ? m.group() : "No Match found");
    }
}

A short and simple piece of code to show how your regex works. Handle the exception if you want. Just put tryFile.java in your project folder and run it.

Now I will show you how messy regexes actually are:

String methodSignature = "(\\s*((public|private|protected|static|final|abstract|synchronized|volatile)\\s+)*[\\w<>\\[\\]\\.]+\\s+\\w+\\s*\\((\\s*[\\w<>\\[\\]\\.]*\\s+\\w+\\s*,?)*\\s*\\))";
String regex = "(?s)" + methodSignature + ".*?(?="+ methodSignature + ")";

Basically this regex matches every method. But it also has flaws. I will go through it as well as its flaws.

  • \\s*((public|private|protected|static|final|abstract|synchronized|volatile)\\s+)* Matches any of the specified modifiers (each followed by at least one whitespace character) any number of times, including zero, since a method could have no modifiers. (I've left the number of modifiers unlimited for the sake of simplicity. In a real parser I wouldn't allow this, nor would I use regex for such a task.)
  • [\\w<>\\[\\]\\.]+ This is the method's return type. It can contain word characters, <> for generic types, [] for arrays, and . for nested class notation.
  • \\s+\\w+\\s* Name of the method.
  • \\((\\s*[\\w<>\\[\\]\\.]*\\s+\\w+\\s*,?)*\\s*\\)) An especially tricky part: the method parameters. At first you might think that this part could easily be replaced with a catch-all such as \\(.*?\\). I thought so too. But then I noticed that it would match not only methods but also anonymous classes, such as new Anonymous(someVariable){....}. The simplest and most efficient way to avoid this is to specify the structure of the method parameters. [\\w<>\\[\\]\\.] lists the symbols that a parameter type can be made of. \\s+\\w+\\s*,? The parameter type is followed by at least one whitespace character and the parameter name. The parameter name may be followed by , if the method has more than one parameter.
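To see the breakdown above in action, the following sketch runs the signature regex over a two-method sample. The class name SignatureDemo, the helper findMethods and the sample text are illustrative assumptions. Because each match must be followed by another signature (the lookahead), the last method in the input is not matched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SignatureDemo {
    // methodSignature exactly as given in this answer.
    static final String METHOD_SIGNATURE =
        "(\\s*((public|private|protected|static|final|abstract|synchronized|volatile)\\s+)*"
        + "[\\w<>\\[\\]\\.]+\\s+\\w+\\s*\\((\\s*[\\w<>\\[\\]\\.]*\\s+\\w+\\s*,?)*\\s*\\))";

    // A method is "signature, then everything up to the next signature".
    static final Pattern METHOD = Pattern.compile(
        "(?s)" + METHOD_SIGNATURE + ".*?(?=" + METHOD_SIGNATURE + ")");

    static final String SAMPLE =
        "public void first() {\n"
        + "    int a = 1;\n"
        + "}\n"
        + "\n"
        + "public void second(int x) {\n"
        + "    x++;\n"
        + "}\n";

    /** Collects every match; each one runs from a signature up to the next signature. */
    static List<String> findMethods(String source) {
        List<String> found = new ArrayList<>();
        Matcher m = METHOD.matcher(source);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        for (String method : findMethods(SAMPLE)) {
            System.out.println("MATCH:\n" + method + "\n");
        }
    }
}
```

The sample body is deliberately free of calls such as return helper();, since a return followed by a zero-argument call has the same shape as "type name()" and would be taken for a signature.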

So what about the flaws? The major flaw is classes that are defined inside methods. A method can contain class definitions. Consider this situation:

public void regexIsAGoodThing(){
  //some code
  new RegexIsNotSoGoodActually(){
    void dissapontingMethod(){
       // The effort put into writing this regex was pointless because of this disappointing method.
    }
  };
}

This explains very well why regex is not a proper tool for such a job. It is not possible to reliably extract a method from a Java file with regex, because a method may be a nested structure. A method may contain class definitions, these classes can contain methods that contain further class definitions, and so on. A regex cannot follow nesting of unbounded depth, so it fails.
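For contrast, here is a rough sketch of the hand-written alternative: use a regex only to find the signature, then count braces to find the matching }. This is not taken from any answer here; the class BraceCountingExtractor, the method extractMethod and the simplified public static signature pattern are assumptions. It copes with any nesting depth, but it is still not a parser: braces inside comments, string literals or char literals will throw the count off.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BraceCountingExtractor {
    /**
     * Returns the text from the "public static ... name(...) {" signature through the
     * matching closing brace, or null if the method is missing or the braces are unbalanced.
     */
    static String extractMethod(String source, String methodName) {
        // Simplified signature: public static <type> <name>(<params>) {
        Pattern signature = Pattern.compile(
            "public\\s+static\\s+[\\w.<>\\[\\],\\s]+?\\s+"
            + Pattern.quote(methodName) + "\\s*\\([^)]*\\)\\s*\\{");
        Matcher m = signature.matcher(source);
        if (!m.find()) {
            return null;
        }
        int depth = 1;          // the signature's own opening brace
        int i = m.end();
        while (i < source.length() && depth > 0) {
            char c = source.charAt(i++);
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
            }
        }
        return depth == 0 ? source.substring(m.start(), i) : null;
    }

    public static void main(String[] args) {
        String source =
            "class A {\n"
            + "  public static FieldsConfig getFieldsConfig() {\n"
            + "    if (a) { while (b) { if (c) { d(); } } }\n"
            + "    return null;\n"
            + "  }\n"
            + "  public static void other() {}\n"
            + "}\n";
        System.out.println(extractMethod(source, "getFieldsConfig"));
    }
}
```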

Another case where regex would fail is comments. In comments you can type anything.

void happyRegexing(){
     return;
     // public void happyRegexingIsOver(){....}
}

One more thing that we cannot forget is annotations. What if the next method is annotated? The regex will match almost fine, except that it will match the annotation too. This can be avoided, but then the regex will be even larger.

public void goodDay(){

}

@Zzzzz // This annotation can be handled by making our regex even larger
public void goodNight(){

}

Another case would be initializer blocks. What if a static or instance initializer block sits between two methods?

public void iWillNotDoThisAnyMore(){

}

static{
    //some code
}

public void iWillNotParseCodeWithRegex(){
    //end of story
}

P.S. It has another flaw: it matches new SomeClass() and everything until the next method signature. You can work around this, but again, that would be a workaround rather than elegant code. And I haven't included end-of-file matching. Maybe I will add an edit tomorrow if you're interested. Going to sleep now, it's close to morning in Europe. As you can see, regex is an almost good tool for most tasks. But we programmers hate the word almost. We do not even have it in our vocabularies. Do we?

callOfCode
  • Fabulous! Great admiration for you about your explicit analysis. Yes, the regex maybe not the suitable way to parse a method very strictly. Here for me, I just generate a new java file according some config and java class that has some specific structure such as all static and public method. Generally speaking, regex is less effort than programming and work right in specific situation. That's why I try to choose it. Anyway, thanks a lot ! ~~ – Victor Choy Mar 13 '16 at 08:22
0

I had to modify this answer for my own needs. I wanted capture groups for the entire method as well as the name of each method in the file; I only need these two capture groups. This requires the single-line (s) flag in PCRE. The global (g) flag would also be needed in other regex engines to capture the full file and not just one match. I nested the bracket capture @SamWhan showed to allow five levels of nesting. This should get the job done, as deeper nesting goes against most recommended standards. It also makes this regex really expensive, so be warned.

(?:public|private|protected|static|final|abstract|synchronized|volatile)\s*(?:(?:(?:\w*\s)?(\w+))|)\(.*?\)\s*(?:\{(?:\{[^{}]*(?:\{[^{}]*(?:\{[^{}]*(?:\{[^{}]*(?:\{[^{}]*(?:\{[^{}]*}|.)*?[^{}]*}|.)*?[^{}]*}|.)*?[^{}]*}|.)*?[^{}]*}|.)*?[^{}]*}|.)*?})
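Writing out each nesting level by hand gets unwieldy, so one option is to generate the pattern. The sketch below is not the regex above: it swaps in a simpler construction where a block is { followed by any mix of non-brace characters and blocks one level shallower, then }. The names NestedBraceRegex, block and method are hypothetical, and the same caveats about braces in comments and strings apply.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NestedBraceRegex {
    /** Regex source for one {...} block containing at most `depth` further levels of braces. */
    static String block(int depth) {
        String inner = depth == 0
            ? "[^{}]"
            : "(?:[^{}]|" + block(depth - 1) + ")";
        return "\\{" + inner + "*\\}";
    }

    /** Pattern for a named method: name, parameter list, then a body nested up to `depth` levels. */
    static Pattern method(String name, int depth) {
        return Pattern.compile(Pattern.quote(name) + "\\s*\\([^)]*\\)\\s*" + block(depth));
    }

    public static void main(String[] args) {
        String source = "void run() { if (a) { while (b) { c(); } } }\nvoid next() { }";
        Matcher m = method("run", 5).matcher(source);
        System.out.println(m.find() ? m.group() : "No match found");
    }
}
```

No DOTALL flag is needed here because a negated character class such as [^{}] already matches line breaks. A body nested more deeply than the chosen depth simply fails to match.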
Tyler Miles