Regex to match and limit character classes

Question

I'm not sure if this is possible using Regex but I'd like to be able to limit the number of underscores allowed based on a different character. This is to limit crazy wildcard queries to a search engine written in Java.

The starting characters would be alphanumeric. But I basically want a match if there are more underscores than preceding characters. So

BA_ would be fine but BA___ would match the regex and would get kicked out of the query parser.

Is that possible using Regex?

I'd personally just match letters in one group, underscores in another, then assert that the length of the second group is one less than the first. — roippi, May 21 '14 at 18:23
@roippi I'm passing the regex into another tool so it would need to be in a single expression. — Robby Pond, May 21 '14 at 18:29

Casimir et Hippolyte · Accepted Answer · 2014-06-02T23:11:09.563

Yes, you can do it. This pattern will succeed only if there are fewer underscores than letters (you can adapt it to the characters you want):

^(?:[A-Z](?=[A-Z]*(\\1?+_)))*+[A-Z]+\\1?$

(As Pshemo notes, anchors are not needed if you use the matches() method. I wrote them to illustrate that this pattern must be bounded by some means, with lookarounds for example.)

negated version:

^(?:[A-Z](?=[A-Z]*(\\1?+_)))*\\1?_*$

The idea is to repeat a capture group that contains a backreference to itself plus an underscore. At each repetition, the capture group grows. ^(?:[A-Z](?=[A-Z]*+(\\1?+_)))*+ will match all letters that have a corresponding underscore. You only need to add [A-Z]+ to be sure that there are more letters, and to finish your pattern with \\1?, which contains all the underscores (I make it optional in case there is no underscore at all).

Note that if you replace [A-Z]+ with [A-Z]{n} in the first pattern, you can set the exact difference between the number of letters and the number of underscores.
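As a rough illustration, here is how the two patterns above might be called from Java. The class and method names are invented for this sketch and are not part of the answer. Since matches() is used, the anchors are redundant but harmless.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; all names here are illustrative assumptions.
class UnderscoreLimit {
    // Succeeds only when the leading letters outnumber the trailing underscores.
    static final Pattern FEWER_UNDERSCORES =
            Pattern.compile("^(?:[A-Z](?=[A-Z]*(\\1?+_)))*+[A-Z]+\\1?$");

    // Negated version: succeeds when the letters do not outnumber the underscores.
    static final Pattern TOO_MANY_UNDERSCORES =
            Pattern.compile("^(?:[A-Z](?=[A-Z]*(\\1?+_)))*\\1?_*$");

    static boolean isAcceptable(String s) {
        return FEWER_UNDERSCORES.matcher(s).matches();
    }

    static boolean isRejected(String s) {
        return TOO_MANY_UNDERSCORES.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Inputs taken from the question, plus one without underscores.
        for (String s : new String[] {"BA_", "BA___", "ABC"}) {
            System.out.println(s + " acceptable=" + isAcceptable(s)
                    + " rejected=" + isRejected(s));
        }
    }
}
```

With the inputs from the question, BA_ should come out as acceptable and BA___ as rejected.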

To give a better idea, I will describe step by step how it works with the string ABC-- (since it's impossible to put underscores in bold, I use hyphens instead):

 In the non-capturing group, the first letter is found 
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 Let's enter the lookahead (keep in mind that everything in the lookahead is only
 a check and not part of the match result).
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$

 The first capturing group is encountered for the first time and its content is not
 defined. This is why an optional quantifier is used: to keep the lookahead from
 failing. Consequence: \1?+ doesn't match anything new.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$

 The first hyphen is matched. Once the capture group is closed, the first capture
 group is defined and contains one hyphen.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$

 The lookahead succeeds, let's repeat the non-capturing group.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$

 The second letter is found
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 We enter the lookahead
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$

 But now things are different. The capture group was defined before and
 contains a hyphen, which is why \1?+ will match the first hyphen.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 The literal hyphen matches the second hyphen in the string. Capture group 1
 now contains the two hyphens. The lookahead succeeds.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$

 We repeat one more time the non capturing group.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 In the lookahead. There are no more letters, which is not a problem since
 the * quantifier is used.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 \1?+ now matches two hyphens.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$

 But there is no hyphen left in the string for the literal hyphen, and the regex
 engine cannot backtrack since \1?+ has a possessive quantifier.
 The lookahead fails, and so does the third repetition of the non-capturing group.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 ensure that there is at least one more letter.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$

 We match the end of the string with the backreference to capture group 1 that
 contains the two hyphens. Note that the fact that this backreference is optional
 allows the string to not have hyphens at all. 
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 This is the end of the string. The pattern succeeds.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$

Note: The possessive quantifier on the non-capturing group is needed to avoid false results. (This is where you can observe a strange behavior that can be useful.)

Example: ABC--- and the pattern ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$ (without the possessive quantifier)

 The non-capturing group is repeated three times and `ABC` are matched:
ABC---     ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$
 Note that at this step the first capturing group contains ---
 But after the non-capturing group, there are no more letters for [A-Z]+ to match
 and the regex engine must backtrack.
ABC---     ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$

Question: How many hyphens are in the capture group now?
Answer: Still three!

When the repeated non-capturing group gives a letter back, the capture group still contains three hyphens (its value from the last time the regex engine set it). This is counter-intuitive, but logical.

 Then the letter C is found:
ABC---     ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$
 And the three hyphens
ABC---     ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$
 The pattern succeeds
ABC---     ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$
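To see the difference for yourself, a small Java sketch can run both variants against ABC---. The class name is invented for illustration. Based on the walkthrough above, the possessive pattern is expected to reject the string and the non-possessive one to accept it by mistake.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing the two variants on the hyphen example.
class PossessiveDemo {
    // Variant with the possessive quantifier on the non-capturing group.
    static final Pattern POSSESSIVE =
            Pattern.compile("^(?:[A-Z](?=[A-Z]*(\\1?+-)))*+[A-Z]+\\1?$");

    // Variant without it: the group may give a letter back on backtracking.
    static final Pattern GREEDY =
            Pattern.compile("^(?:[A-Z](?=[A-Z]*(\\1?+-)))*[A-Z]+\\1?$");

    public static void main(String[] args) {
        String input = "ABC---"; // three letters, three hyphens
        System.out.println("possessive: " + POSSESSIVE.matcher(input).matches());
        System.out.println("greedy:     " + GREEDY.matcher(input).matches());
    }
}
```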

Robby Pond asked me in the comments how to find strings that have more underscores than letters (a letter being anything that is not an underscore), wherever they appear. The best way is obviously to count the underscores and compare that number with the string length. A pure regex solution is not possible in Java, since the pattern needs the recursion feature. For example, you can do it with PHP:

$pattern = <<<'EOD'
~
 (?(DEFINE)
     (?<neutral> (?: _ \g<neutral>?+ [A-Z] | [A-Z] \g<neutral>?+ _ )+ )
 )

 \A (?: \g<neutral> | _ )+ \z
~x
EOD;

var_dump(preg_match($pattern, '____ABC_DEF___'));
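For Java, the counting approach mentioned above could look roughly like this. It is a minimal sketch with invented names, and it assumes that every character other than an underscore counts as a letter.

```java
// Hypothetical helper; it counts characters instead of using a regex.
class UnderscoreCounter {
    // True when the string contains more underscores than other characters,
    // regardless of where the underscores appear.
    static boolean hasMoreUnderscoresThanOthers(String s) {
        long underscores = s.chars().filter(c -> c == '_').count();
        return underscores > s.length() - underscores;
    }

    public static void main(String[] args) {
        // Sample inputs from the comments and from the PHP example.
        for (String s : new String[] {"ars____", "_____ars", "a_____rs", "ars__", "____ABC_DEF___"}) {
            System.out.println(s + " -> " + hasMoreUnderscoresThanOthers(s));
        }
    }
}
```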

Which flavor of regex is this? I find this interesting. Could you explain a little more? I find it hard to understand the recursive part. Regex101 doesn't really help. It doesn't even match there. Can you provide a website where the above regex can be tested? — Farhad Alizadeh Noori, May 21 '14 at 19:26
@FarhadAliNoo: it is a version for Java, but the same technique can be used with PCRE too. Take a look at this post: http://stackoverflow.com/questions/23001137/capturing-quantifiers-and-quantifier-arithmetic/23002044#23002044 — Casimir et Hippolyte, May 21 '14 at 19:29
Great explanation! So if I want to find words with equal number of letters and `_` like `XX__` all I need is `(?:[^_](?=[^_]*(\\1?+_)))*\\1`. What I can't find in your explanation is why we need possessive quantifier (`+`) in `\1?+` (BTW it seems that we don't need it near `[A-Z]*+`, `[A-Z]*` seems to be enough). — Pshemo, May 21 '14 at 20:51
@Pshemo: correct, this possessive quantifier wasn't strictly needed. I have highlighted the part where I talk about `\1?+` for you. — Casimir et Hippolyte, May 21 '14 at 21:00
I think I am starting to see what is going in here :) As I said earlier great explanation. — Pshemo, May 21 '14 at 21:05
@Pshemo: about finding strings with equal number of chars and `_`, you only need: `^(?:[^_](?=[^_]*(\\1?+_)))*\\1$` (don't forget to use anchors.) — Casimir et Hippolyte, May 21 '14 at 21:07
`\b` would be probably better if I want to use `find`, in case of `matches` I don't need `^` or `$` since it by defaults validates entire string over regex. — Pshemo, May 21 '14 at 21:09
@CasimiretHippolyte Thank you very much for the explanation. I have a question though. At the end of the pattern after we exit the non-capturing group. Is there a need for a `[A-Z]+\1?` Don't we come out of the capturing group always a character before the hyphens? wouldn't `[A-Z]\1?` suffice? — Farhad Alizadeh Noori, May 21 '14 at 21:10
@CasimiretHippolyte Oh I see now. Question retracted. Thanks. — Farhad Alizadeh Noori, May 21 '14 at 21:16
@FarhadAliNoo: Indeed, you have the choice (depending on what you want to do) between one letter or at least one letter. — Casimir et Hippolyte, May 21 '14 at 21:39
@Pshemo: depending on the Java method you choose, you may or may not need explicit anchors. In this post `^` and `$` are only symbolic and can be replaced with lookarounds, word boundaries, or `\G`, depending on which characters you choose and what you are trying to do. — Casimir et Hippolyte, May 21 '14 at 21:59
@CasimiretHippolyte Awesome answer. Also is it possible to modify so that the characters do not have to precede the _ meaning _b__ or __b would match? — Robby Pond, May 30 '14 at 16:17
@RobbyPond: I'm not sure I understand what you are looking for. — Casimir et Hippolyte, May 30 '14 at 18:06
Ok. Thanks. what I meant was the solution works great when its characters first like ars__ but I'd like the characters and underscore rule to apply to any position of the characters. So it would match with ars____ and also _____ars or a_____rs. In other words match anytime there are more underscores than letters/numbers anywhere in the string if possible. Thanks. — Robby Pond, Jun 02 '14 at 17:21
@RobbyPond: In this case, the most efficient way is to count the number of underscores in the string and to compare it with the string length. However, for the challenge, I will try to build a pattern for that. — Casimir et Hippolyte, Jun 02 '14 at 17:31
@RobbyPond: unfortunately it isn't possible in pure regex with Java, since you need the recursion feature to do it! In particular for this kind of string: `AAA___BBB_C_________` — Casimir et Hippolyte, Jun 02 '14 at 18:37

score 0 · Answer 2 · answered May 21 '14 at 18:37

It's not possible in a single regular expression.

i) Logic needs to be implemented to get the number of characters before the underscores (a regular expression should be written to capture the characters before the underscores).

ii) Then validate the result: (number of characters - 1) = number of underscores that follow (a regular expression that returns the run of underscores following the characters).
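The two steps could be sketched in Java along these lines. The names are invented, the character class [A-Za-z0-9] is a guess at "alphanumeric", and the threshold follows the question (reject when the underscores outnumber the preceding characters) rather than the exact formula above, so adjust the comparison as needed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical two-step check: capture both runs, then compare their lengths.
class TwoGroupCheck {
    // Group 1: the leading alphanumerics; group 2: the trailing underscores.
    static final Pattern SHAPE = Pattern.compile("([A-Za-z0-9]+)(_*)");

    // True when the query has the expected shape and the underscores
    // outnumber the characters before them (assumed rejection rule).
    static boolean shouldReject(String query) {
        Matcher m = SHAPE.matcher(query);
        return m.matches() && m.group(2).length() > m.group(1).length();
    }

    public static void main(String[] args) {
        System.out.println("BA_   -> " + shouldReject("BA_"));
        System.out.println("BA___ -> " + shouldReject("BA___"));
    }
}
```

This only works when the comparison can be done in code, which is not the case if the regex has to be handed to another tool as a single expression.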

score 0 · Answer 3 · edited May 23 '17 at 12:06

Edit: Dang! I just noticed that you need this for Java. Anyway, I'll leave it here in case someone from the .NET world stumbles upon this post.

You can use balancing groups if you are using .NET:

^(?:(?<letter>[^_])|(?<-letter>_))*(?(letter)(?=)|(?!))$

The .NET regex engine can keep every captured substring for each capture group. In other flavors a capture group only ever contains the last substring it matched, but in .NET all previous matches are kept in a capture collection for your use. The .NET engine can also push to and pop from the stack of a capture group using the (?<group-name>...) and (?<-group-name>...) constructs. These two handy constructs can be used to match pairs of parentheses, etc.

In the above regex, the engine starts from the start of the string and tries to match anything other than "_". This can of course be changed to whatever works for you (e.g. [A-Za-z]). The alternation basically means: match either [^\_] or [\_], and in doing so either push to or pop from the capture group.

The latter part of the regex is a conditional, (?(group-name)true|false). It basically says: if the group still exists (more pushes than pops), use the true branch, and if not, use the false branch. The easiest way to make the pattern match is an empty positive lookahead, (?=), and the easiest way to make it fail is an empty negative lookahead, (?!).
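Java has no balancing groups, but the push/pop logic can be imitated with a plain counter. This is only a sketch of how I read the pattern above, with invented names, and not a translation that any tool would accept as a regex.

```java
// Hypothetical plain-Java imitation of the balancing-group pattern.
class BalanceSketch {
    static boolean lettersOutnumberUnderscores(String s) {
        int depth = 0; // plays the role of the "letter" capture stack
        for (char c : s.toCharArray()) {
            if (c != '_') {
                depth++;          // push: a non-underscore was matched
            } else if (depth == 0) {
                return false;     // pop on an empty stack: the match fails
            } else {
                depth--;          // pop: an underscore cancels one letter
            }
        }
        return depth > 0;         // the conditional: captures must remain
    }

    public static void main(String[] args) {
        for (String s : new String[] {"ars__", "ars___", "_ars"}) {
            System.out.println(s + " -> " + lettersOutnumberUnderscores(s));
        }
    }
}
```

Note that, like the regex, this rejects a string as soon as an underscore appears with no unmatched letter before it.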
