44

Is it possible to skip a couple of characters in a capture group in regular expressions? I am using .NET regexes but that shouldn't matter.

Basically, what I am looking for is:

[random text]AB-123[random text]

and I need to capture 'AB123', without the hyphen.

I know that AB is 2 or 3 uppercase characters and 123 is 2 or 3 digits, but that's not the hard part. The hard part (at least for me) is skipping the hyphen.

I guess I could capture both separately and then concatenate them in code, but I wish I had a more elegant, regex-only solution.

Any suggestions?

Tamas Czinege
  • 1
    in javascript you could: /(AB)\-(123)/.exec("[random text]AB-123[random text]"); it now returns an array with the two parts at [1] and [2] ^^ – hanshenrik Mar 26 '15 at 11:01
  • What about using positive lookahead (?=) and positive lookbehind (?<=)? Basically this: (?<=\')([A-Z]{2}-[0-9]{3})(?=\') should work. – It's me ... Alex Jun 01 '15 at 07:26

6 Answers

49

In short: You can't. A match is always consecutive. Even when the pattern contains things such as zero-width assertions, there is no way around matching the next character if you want to get to the one after it.
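
A minimal sketch of that point, in Python rather than .NET (the flavor should not matter here). The pattern and variable names are only illustrative assumptions; the check shows that a capture group is always one contiguous slice of the input, so the hyphen comes along with it:

```python
import re

text = "[random text]AB-123[random text]"
m = re.search(r"([A-Z]{2,3}-[0-9]{2,3})", text)

# A group is always one contiguous slice of the input string,
# so anything between the letters and the digits is part of it.
start, end = m.span(1)
captured = m.group(1)
is_contiguous_slice = captured == text[start:end]
```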

Tomalak
20

There really isn't a way to create an expression such that the matched text is different than what is found in the source text. You will need to remove the hyphen in a separate step either by matching the first and second parts individually and concatenating the two groups:

// Capture the letters and the digits separately, then join them without the hyphen.
Match match = Regex.Match( text, "([A-Z]{2,3})-([0-9]{2,3})" );
string matchedText = string.Format( "{0}{1}",
    match.Groups[1].Value,
    match.Groups[2].Value );

Or by removing the hyphen in a step separate from the matching process:

// Match the whole token, then strip the hyphen from the matched text.
Match match = Regex.Match( text, "[A-Z]{2,3}-[0-9]{2,3}" );
string matchedText = match.Value.Replace( "-", "" );
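
For comparison, here is a rough Python sketch of the same two approaches. The function names are hypothetical and the pattern assumes the 2-3 uppercase letters and 2-3 digits described in the question:

```python
import re

# Assumed shape from the question: 2-3 uppercase letters, a hyphen, 2-3 digits.
PATTERN = re.compile(r"([A-Z]{2,3})-([0-9]{2,3})")

def extract_by_groups(text):
    """Capture the letters and digits separately, then concatenate them."""
    m = PATTERN.search(text)
    return m.group(1) + m.group(2) if m else None

def extract_by_replace(text):
    """Match the whole token, then strip the hyphen afterwards."""
    m = PATTERN.search(text)
    return m.group(0).replace("-", "") if m else None
```

Both return None when nothing matches, which is worth checking for before using the result.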
Jeff Hillman
4

Your assertion that it's not possible without sub-grouping and concatenating is correct.

You could also do as Jeff Hillman suggests and simply strip out the unwanted character(s) after the fact.

It is important to note here, though, that you "don't use regex for everything".

Regex is designed to give less complicated solutions to non-trivial problems. You shouldn't reach for "oh, we'll use a regex" every time, and you shouldn't get into the habit of thinking you can solve every problem with a one-step regex.

When there is a viable trivial method that works, by all means, use it.

An alternative idea, if you need to handle multiple matches in one go, is to look for your language's callback-based regex replace, which passes each match to a function that can do in-line substitution (especially handy for regex replaces).

Not sure how it would work in .NET, but in PHP you would do something like this (not exact code):

  function strip_reverse( $matches )
  {
     // The callback receives an array; $matches[0] is the full match.
     $a = str_replace( "-", "", $matches[0] );
     return strrev( $a );
  }
  $b = preg_replace_callback( "/(AB[-]?cde)/", 'strip_reverse', "Hello World AB-cde" );
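
The same callback idea, sketched in Python with re.sub, which accepts a function as the replacement. The function names and the word-boundary pattern are assumptions for illustration, and this version only strips the hyphen (it does not reverse anything):

```python
import re

def strip_hyphen(match):
    # The callback receives the match object and returns the replacement text.
    return match.group(1) + match.group(2)

def normalize_codes(text):
    # Rewrite every AB-123 style token in the text in a single pass.
    return re.sub(r"\b([A-Z]{2,3})-([0-9]{2,3})\b", strip_hyphen, text)
```

This fits best when every occurrence in a longer text has to be rewritten in place, rather than when a single value has to be extracted.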
Kent Fredric
  • 1
    It is a common misunderstanding that regex is for "less complicated situations" only. Regex is immensely powerful and can solve really complex stuff. Regex is just not the right tool for things that are not regular. It's simple: There are things that work with regex, and there are those that don't. – Tomalak Nov 10 '08 at 11:13
  • 1
    yes, but there's a prolific /overuse/ of regex in situations where the solution amounts to using a firearm to hole-punch paper. It'll work, but there are complications that don't exist in the simpler solution. The key is knowing when *not* to use regex ;) – Kent Fredric Nov 10 '08 at 11:34
  • Knowing when to use which tool is always the key. I would probably avoid using regex in a long loop when there was another way (say, "indexOf" plus a little math). – Tomalak Nov 10 '08 at 12:20
  • For those cases there is the "study regex" optimisation which makes a memory tree to boost regex matching ;) – Kent Fredric Nov 10 '08 at 12:35
4

You can use nested capture groups, like this:

((AB)-(123))

The first capture group is AB-123, the second is AB, and the third is 123. Then all you have to do is join the second and third groups directly, with nothing between them, to get AB123.
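
A small Python sketch of the nested groups, assuming the 2-3 letter and 2-3 digit shape from the question. Groups are numbered by their opening parenthesis, so the outer group is 1 and the inner ones are 2 and 3; joining 2 and 3 directly gives the AB123 the question asks for:

```python
import re

m = re.search(r"(([A-Z]{2,3})-([0-9]{2,3}))", "[random text]AB-123[random text]")

# Group 1 is the outer group; groups 2 and 3 are the nested parts.
whole, letters, digits = m.group(1), m.group(2), m.group(3)
joined = letters + digits
```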

Alan Moore
Steve
0

Kind of late, but I think I figured this one out. At least one way to do it.

I used positive lookahead to stop at the # sign in my text. I didn't want the space or the # sign, so I had to figure out a way to "skip" over them. Since I was forced to match them anyway, I dumped them into a garbage group that I didn't plan on using (i.e., a bit bucket), which in the code is <garb1>. Now my place pointer is one character position beyond the # sign (where I want to be, having skipped the space and the # sign). From there I just match up to the . at the end of the file name and ignore the file extension.

(?i)English\\(?<Series>[^ ]+) - (?<Title>.+(?= #))(?<garb1>..)(?<Number>[^.]+)(?-i)

The filename this was used on is:

F:\Downloads\Downloads\500 Comics CCC CBR English\Isukani - Great Girl #01.cbr
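
A rough Python translation of the pattern above, run against that filename. Python spells named groups as (?P<name>...), and I have assumed the inline (?i) can be replaced by the re.IGNORECASE flag; everything else is kept as in the .NET pattern:

```python
import re

path = r"F:\Downloads\Downloads\500 Comics CCC CBR English\Isukani - Great Girl #01.cbr"

pattern = re.compile(
    r"English\\(?P<Series>[^ ]+) - (?P<Title>.+(?= #))(?P<garb1>..)(?P<Number>[^.]+)",
    re.IGNORECASE,
)
m = pattern.search(path)

# garb1 swallows the space and the # sign; it is matched but never used.
series, title, number = m.group("Series"), m.group("Title"), m.group("Number")
```

Note that the skipped characters are still matched, they just land in a group that is ignored afterwards.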
LW001
Logan9773
0

I am kind of new to this, but you could use the vertical bar symbol |, which acts as an OR.

This could work for .NET:

((?<=[A-Z]{2}-)\d\d\d)|([A-Z]{2}(?=-\d\d\d))

This works for me in a VIM syntax file:

\(\([A-Z]\{2}-\)\@<=\d\d\d\)\|\([A-Z]\{2}\(-\d\d\d\)\@=\)
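
A quick Python check of the alternation idea, with the outer capture groups dropped so that findall returns the matched text. As far as I can tell the pattern yields two separate matches (the letters, then the digits) rather than a single one, so the pieces still have to be joined afterwards:

```python
import re

# Either three digits preceded by "XX-", or two letters followed by "-ddd".
pattern = re.compile(r"(?<=[A-Z]{2}-)\d\d\d|[A-Z]{2}(?=-\d\d\d)")

pieces = pattern.findall("[random text]AB-123[random text]")
joined = "".join(pieces)
```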
rky