Victor, you asked me to review your answer, so I decided to take the time to write a full review of it and give some hints. I'm not a regex professional, nor do I like regex very much. I'm currently working on a project that uses regex heavily, so I've seen and written enough of it to answer your question fairly reliably, as well as to get sick of regexes.
So let's start analyzing your regex:
String regex ="\\s*public\\s*static.*getFieldsConfig\\(.*?\\)\\s*\\{.*\\}(?=\\s*(public|private|protected|static))";
String regex2 = "\\s*public\\s*static.*getFieldsConfig\\(.*?\\)\\s*\\{.*\\}(?=(\\s*}\\s*$))";
regex = "(" + regex +")|("+ regex2 + "){1}?";
I see you've split it into three parts for readability. That's a good idea. I'll start with the first part:
\\s*public\\s*static.*getFieldsConfig
You allow any number of whitespace characters, including zero, between public and static, so it could match publicstatic. Always use \\s+ between words that must be separated by at least one whitespace character.
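To see the difference for yourself, here is a minimal sketch. The class and method names are made up for this illustration; only the two quantifiers come from the discussion above.

```java
import java.util.regex.Pattern;

class WhitespaceQuantifierDemo {
    // \s* allows zero whitespace characters between the two words.
    static final Pattern LOOSE = Pattern.compile("public\\s*static");
    // \s+ requires at least one whitespace character between them.
    static final Pattern STRICT = Pattern.compile("public\\s+static");

    static boolean looseMatches(String s) {
        return LOOSE.matcher(s).matches();
    }

    static boolean strictMatches(String s) {
        return STRICT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looseMatches("publicstatic"));     // true
        System.out.println(strictMatches("publicstatic"));    // false
        System.out.println(strictMatches("public   static")); // true
    }
}
```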
\\(.*?\\)\\s*\\{.*\\}
You allow anything to appear between the first pair of parentheses: it matches any characters up to the first ). Now we reach the part that makes your regex behave differently from what you wanted. \\{.*\\} is a major mistake. Because .* is greedy, it matches everything up to the last } in the file that is followed by any of public, private, protected or static. I pasted your getFieldsConfig method into a Java file and tested it. Using only the first part of your regex ("\\s*public\\s*static.*getFieldsConfig\\(.*?\\)\\s*\\{.*\\}(?=\\s*(public|private|protected|static))") matched everything from your method up to the last method in the file.
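Here is a small sketch of that effect. The three-method source string and all class and method names are hypothetical, made up for this illustration; the regex pieces are taken from the first part of your pattern. The lazy variant (.*?) is included only for contrast, not as a recommended fix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyBraceDemo {
    // Hypothetical three-method file, made up for this illustration.
    static final String SOURCE =
            "class A {\n"
          + "  public static void getFieldsConfig() { int a = 1; }\n"
          + "  public void second() { int b = 2; }\n"
          + "  private void third() { int c = 3; }\n"
          + "}\n";

    // Everything up to and including the opening brace of the method body.
    static final String HEAD =
            "(?s)\\s*public\\s*static.*getFieldsConfig\\(.*?\\)\\s*\\{";
    // A closing brace that is followed by one of the modifiers.
    static final String TAIL =
            "\\}(?=\\s*(public|private|protected|static))";

    // Returns the first match of the regex in SOURCE, or null if there is none.
    static String firstMatch(String regex) {
        Matcher m = Pattern.compile(regex).matcher(SOURCE);
        return m.find() ? m.group() : null;
    }

    // Greedy body: runs to the LAST brace that satisfies the lookahead.
    static String greedy() {
        return firstMatch(HEAD + ".*" + TAIL);
    }

    // Lazy body: stops at the FIRST brace that satisfies the lookahead.
    static String lazy() {
        return firstMatch(HEAD + ".*?" + TAIL);
    }

    public static void main(String[] args) {
        System.out.println("--- greedy ---" + greedy());
        System.out.println("--- lazy ---" + lazy());
    }
}
```

On this sample the greedy version swallows second() as well, while the lazy one stops after getFieldsConfig. The lazy version still cannot tell which } really closes the method, so treat it as a demonstration of the quantifier, nothing more.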
There is no point in analyzing the other parts step by step, because \\{.*\\} ruins everything. In the second part (regex2) you match everything from your method to the last } in the file. Have you tried printing what your regex matches? Try it:
package com.tryRegex;

import java.io.File;
import java.io.IOException;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TryRegex {
    public static void main(String[] args) throws IOException {
        File yourFile = new File("tryFile.java");
        String text;
        // try-with-resources closes the scanner (and the file) when done
        try (Scanner scanner = new Scanner(yourFile, "UTF-8")) {
            // \\A marks the beginning of the input. Since the input has only one beginning,
            // the whole file is returned as a single token, from start to end.
            text = scanner.useDelimiter("\\A").next();
        }
        String regex = "\\s*public\\s*static.*getFieldsConfig\\(.*?\\)\\s*\\{.*\\}(?=\\s*(public|private|protected|static))";
        String regex2 = "\\s*public\\s*static.*getFieldsConfig\\(.*?\\)\\s*\\{.*\\}(?=(\\s*}\\s*$))";
        // (?s) lets . match newline characters too, which we need because the text comes from a file.
        // Without (?s) nothing would match unless your method is written on a single line.
        regex = "(?s)(" + regex + ")|(" + regex2 + "){1}?";
        Matcher m = Pattern.compile(regex).matcher(text);
        System.out.println(m.find() ? m.group() : "No match found");
    }
}
This is a short and simple piece of code to show how your regex works. Handle the exception if you want. Just put tryFile.java in your project folder and run it.
Now I will show you how messy regexes actually are:
String methodSignature = "(\\s*((public|private|protected|static|final|abstract|synchronized|volatile)\\s+)*[\\w<>\\[\\]\\.]+\\s+\\w+\\s*\\((\\s*[\\w<>\\[\\]\\.]*\\s+\\w+\\s*,?)*\\s*\\))";
String regex = "(?s)" + methodSignature + ".*?(?="+ methodSignature + ")";
Basically, this regex matches every method. But it also has flaws. I will go through it as well as its flaws.
\\s*((public|private|protected|static|final|abstract|synchronized|volatile)\\s+)*
Matches any of the specified modifiers (each followed by at least one whitespace character) any number of times, including zero, since a method may have no modifiers. (I've left the number of modifiers unlimited for the sake of simplicity. In a real parser I wouldn't allow this, nor would I use regex for such a task.)
[\\w<>\\[\\]\\.]+
This is the method's return type. It can contain word characters, <> for generic types, [] for arrays, and . for nested class notation.
\\s+\\w+\\s*
Name of the method.
\\((\\s*[\\w<>\\[\\]\\.]*\\s+\\w+\\s*,?)*\\s*\\))
An especially tricky part: the method parameters. At first you might think that this part could easily be replaced with (, and I thought so too. But then I noticed that it would match not only methods but anonymous classes too, such as new Anonymous(someVariable){....}. The simplest and most efficient way to avoid this is to specify the structure of the method parameters. [\\w<>\\[\\]\\.] is the set of characters a parameter type can be made of. \\s+\\w+\\s*,? means that the parameter type is followed by at least one whitespace character and the parameter name. The parameter name may be followed by , if the method has more than one parameter.
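A quick way to check the signature part on its own is a sketch like the one below. The class name, the helper looksLikeSignature and the sample strings (apart from the anonymous class example) are hypothetical and exist only for this illustration; the pattern string is the methodSignature from above.

```java
import java.util.regex.Pattern;

class MethodSignatureDemo {
    // The methodSignature regex from the answer, unchanged.
    static final String METHOD_SIGNATURE =
            "(\\s*((public|private|protected|static|final|abstract|synchronized|volatile)\\s+)*"
          + "[\\w<>\\[\\]\\.]+\\s+\\w+\\s*"
          + "\\((\\s*[\\w<>\\[\\]\\.]*\\s+\\w+\\s*,?)*\\s*\\))";

    static final Pattern SIGNATURE = Pattern.compile(METHOD_SIGNATURE);

    // True if the text contains something the regex accepts as a method signature.
    static boolean looksLikeSignature(String text) {
        return SIGNATURE.matcher(text).find();
    }

    public static void main(String[] args) {
        // A declaration with two "type name" parameters is accepted.
        System.out.println(looksLikeSignature("public static int add(int a, int b)"));
        // A declaration without modifiers and without parameters is accepted too.
        System.out.println(looksLikeSignature("void run()"));
        // The argument here is a bare name, not "type name", so it is rejected.
        System.out.println(looksLikeSignature("new Anonymous(someVariable){....}"));
    }
}
```

The third case is rejected only because someVariable is not preceded by a type and whitespace, which is exactly what the parameter structure is there for.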
So what about the flaws? The major flaw is classes that are defined inside methods. A method can contain class definitions. Consider this situation:
public void regexIsAGoodThing(){
    //some code
    new RegexIsNotSoGoodActually(){
        void disappointingMethod(){
            //The effort put into writing this regex was pointless because of this disappointing method.
        }
    };
}
This explains very well why regex is not the proper tool for such a job. It is not possible to reliably parse a method out of a Java file with regex, because a method may be a nested structure. A method may contain class definitions, those classes can contain methods that contain further class definitions, and so on. A regex cannot keep track of nesting that is arbitrarily deep, so it fails.
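The sketch below runs the full regex over a tiny source string to show the effect. It is only an illustration under a few assumptions: the source uses a named local class (Local) instead of an anonymous one to keep the sample small, and the class name NestedMethodDemo, the helper findMethods and the method names outer and next are all made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedMethodDemo {
    // The methodSignature regex from the answer, unchanged.
    static final String METHOD_SIGNATURE =
            "(\\s*((public|private|protected|static|final|abstract|synchronized|volatile)\\s+)*"
          + "[\\w<>\\[\\]\\.]+\\s+\\w+\\s*"
          + "\\((\\s*[\\w<>\\[\\]\\.]*\\s+\\w+\\s*,?)*\\s*\\))";

    // A signature, then as little as possible up to the next signature.
    static final Pattern METHOD =
            Pattern.compile("(?s)" + METHOD_SIGNATURE + ".*?(?=" + METHOD_SIGNATURE + ")");

    // Hypothetical source: a method that declares a local class with its own method.
    static final String SOURCE =
            "public void outer(){\n"
          + "    class Local {\n"
          + "        void disappointingMethod(){\n"
          + "        }\n"
          + "    }\n"
          + "}\n"
          + "public void next(){\n"
          + "}\n";

    // Collects every chunk the regex considers to be one method.
    static List<String> findMethods(String source) {
        List<String> result = new ArrayList<>();
        Matcher m = METHOD.matcher(source);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String method : findMethods(SOURCE)) {
            System.out.println("----- match -----");
            System.out.println(method);
        }
    }
}
```

On this sample the first match is cut off right after class Local {, because the nested signature looks like the start of the next method. The second match then begins at the nested method and runs past the closing brace of outer.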
Another case where regex would fail is comments. In comments you can type anything.
void happyRegexing(){
    return;
    // public void happyRegexingIsOver(){....}
}
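A minimal sketch of why this hurts: the signature regex has no notion of comments, so it accepts the commented-out line just as happily as the real one. The class name and the helper below are hypothetical; the pattern string is the methodSignature from above.

```java
import java.util.regex.Pattern;

class CommentFalsePositiveDemo {
    // The methodSignature regex from the answer, unchanged.
    static final String METHOD_SIGNATURE =
            "(\\s*((public|private|protected|static|final|abstract|synchronized|volatile)\\s+)*"
          + "[\\w<>\\[\\]\\.]+\\s+\\w+\\s*"
          + "\\((\\s*[\\w<>\\[\\]\\.]*\\s+\\w+\\s*,?)*\\s*\\))";

    static final Pattern SIGNATURE = Pattern.compile(METHOD_SIGNATURE);

    // True if the line contains something the regex accepts as a method signature.
    static boolean containsSignature(String line) {
        return SIGNATURE.matcher(line).find();
    }

    public static void main(String[] args) {
        // The real declaration is found, as intended.
        System.out.println(containsSignature("void happyRegexing(){"));
        // The commented-out declaration is found as well, which is the problem.
        System.out.println(containsSignature("// public void happyRegexingIsOver(){....}"));
    }
}
```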
One more thing we cannot forget is annotations. What if the next method is annotated? The regex will match almost correctly, except that the match will include the annotation too. This can be avoided, but then the regex will be even larger.
public void goodDay(){
}
@Zzzzz //This annotation can be handled by making our regex even larger
public void goodNight(){
}
Yet another case is initializer blocks. What if a static or instance initializer block sits between two methods?
public void iWillNotDoThisAnyMore(){
}
static{
//some code
}
public void iWillNotParseCodeWithRegex(){
//end of story
}
P.S. It has another flaw: it matches new SomeClass() and everything up to the next method signature. You can work around this, but again, it would be a workaround rather than elegant code. And I haven't included end-of-file matching. Maybe I will add an edit tomorrow if you're interested. Going to sleep now, it's close to morning in Europe.
As you can see, regex is almost a good tool for most tasks. But we programmers hate the word almost. We don't even have it in our vocabularies. Do we?