
I've got a regular expression that I'm trying to match against the following types of data, with each token separated by an unknown number of spaces.

Update: "Text" can be almost any character, which is why I had .* initially. Importantly, it can also include spaces.

  1. Text
  2. Text 01
  3. Text 01 of 03
  4. Text 01 (of 03)
  5. Text 01-03

I'd like to capture "Text", "01", and "03" as separate groups, and all except "Text" are optional. The best I've been able to do so far is:

\s*(.*)\s+(\d+)\s*(?:\s*\(?\s*(?:of|-)\s*(\d+)\s*\)?\s*)

This matches #3-#5, and puts them in the proper capture groups. I can't figure out, though, why when I add an additional ? to the end to make the part of the expression after 01 optional, my capture groups get all funky.

\s*(.*)\s+(\d+)\s*(?:\s*\(?\s*(?:of|-)\s*(\d+)\s*\)?\s*)?

The RegEx above matches #2-#5, but the capture groups are correct only for #2 and #5.
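The result described above can be reproduced outside Regexr. Below is a minimal JavaScript sketch (the helper name `groups` is made up for illustration, and JavaScript is only an assumption about the regex flavor). Because `(.*)` is greedy, it first swallows the whole string and then backtracks only as far as the last whitespace-plus-digits it can find. Once the trailing group is optional, nothing forces it to back up any further, so for #3 and #4 the "03" lands in the second group.

```javascript
// The second pattern from the question, with the trailing group made optional.
const optionalTail = /\s*(.*)\s+(\d+)\s*(?:\s*\(?\s*(?:of|-)\s*(\d+)\s*\)?\s*)?/;

// Hypothetical helper: returns the three capture groups,
// or null when the pattern does not match at all.
function groups(re, input) {
  const m = input.match(re);
  return m ? [m[1], m[2], m[3]] : null;
}

for (const name of ["Text 01", "Text 01 of 03", "Text 01 (of 03)", "Text 01-03"]) {
  console.log(JSON.stringify(name), "->", groups(optionalTail, name));
}
```

For "Text 01 of 03" the first group comes out as "Text 01 of" and the second as "03", while "Text 01-03" still splits correctly because there is no whitespace between "01" and "-03" for `\s+` to stop at.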

This seems like a straightforward regular expression, so I don't know why I'm having so much trouble with it.

This is a link to an online RegEx evaluator I'm using to help me debug this: http://regexr.com?2tb64. The link already has the first RegEx and the test data filled in.

Dov

3 Answers


You didn't say which regex tool you are using, so I am assuming the least common denominator, i.e. JavaScript. Here is one that works:

var re = /^\s*(.+?)(?:\s+(\d+)(?:(?:\s+\(?of\s+|-)(\d+)\)?)?)?$/i;

To make this work in your Regexr tool, be sure to turn on the "multi-line option".
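As a quick check, here is a small sketch that runs the pattern against the five sample names plus one name containing a space. The helper `parseName` and the property names `text`, `first`, and `second` are made up for illustration. Each name is tested as its own string here, so the multi-line flag is not needed in this setup.

```javascript
// Pattern from above. The lazy (.+?) grows one character at a time,
// so the optional number groups get the first chance to match.
const re = /^\s*(.+?)(?:\s+(\d+)(?:(?:\s+\(?of\s+|-)(\d+)\)?)?)?$/i;

// Hypothetical helper that gives the three capture groups names.
// Groups that did not participate in the match are undefined.
function parseName(input) {
  const m = re.exec(input);
  if (m === null) return null;
  return { text: m[1], first: m[2], second: m[3] };
}

const samples = [
  "Text",
  "Text 01",
  "Text 01 of 03",
  "Text 01 (of 03)",
  "Text 01-03",
  "Some Text 01 of 03",
];
for (const s of samples) {
  console.log(JSON.stringify(s), "->", parseName(s));
}
```

The anchors matter: without `$`, the lazy `(.+?)` would stop after a single character and the optional groups would simply be skipped.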

Here is the same thing in PHP syntax (with lots of juicy comments!):

$re = '/ # Always write non-trivial regex in free-space mode!
    ^                  # Anchor to start of string.
    \s*                # Optional leading whitespace is OK.
    (.+?)              # Text can be pretty much anything.
    (?:                # Group to allow applying ? quantifier
      \s+              # WS separates "Text" from first number.
      (\d+)            # First number.
      (?:              # Group to allow applying ? quantifier
        (?:            # Second number prefix alternatives
          \s+\(?of\s+  # Either " of 03" or " (of 03)",
        | -            # or just a dash for the "-03" case.
        )              # End second number prefix alternatives
        (\d+)          # Second number
        \)?            # Match ")" for " (of 03)" case.
      )?               # Second number is optional.
    )?                 # First number is optional.
    $                  # Anchor to end of string.
    /ix';
ridgerunner
  • Thanks, that's great. I'm using RegexKitLite in Obj-C, which uses Perl syntax. The only change I had to make was allowing whitespace more liberally. My final expression is: ^\s*(.+?)(?:\s+(\d+)(?:\s*(?:\(?\s*of|-)\s*(\d+)\s*\)?)?)?\s*$ – Dov Mar 19 '11 at 10:01
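A short sketch of what the more liberal whitespace handling in that final expression buys, written in JavaScript rather than RegexKitLite (an assumption made purely for illustration; the helper name `parseLoose` is hypothetical):

```javascript
// Final expression quoted in the comment above.
const finalRe = /^\s*(.+?)(?:\s+(\d+)(?:\s*(?:\(?\s*of|-)\s*(\d+)\s*\)?)?)?\s*$/;

// Hypothetical helper: returns [text, first, second], or null when there is no match.
function parseLoose(input) {
  const m = finalRe.exec(input);
  return m ? [m[1], m[2], m[3]] : null;
}

// Spaces around the dash, inside the parentheses, and at either end are all tolerated.
for (const name of ["Text 01 - 03", "Text 01 ( of 03 )", "  Text 01  "]) {
  console.log(JSON.stringify(name), "->", parseLoose(name));
}
```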

Your second one is close.

So I reworked it (regexr); it now matches everything in the correct groups.

\s*(\w*)\s+(?:\s*(\d+)\s*(?:\s*\(?\s*(?:of|-)\s*(\d+)\s*\)?)?)?
stema
  • i've put your regex in regexr.com and gotten garbled matches for the first 4 cases... plus the first group contains the whole text. http://regexr.com?2tb6a – Joe Mar 18 '11 at 23:21
  • @Joe now I fixed it, was too quick on the first try. – stema Mar 18 '11 at 23:37
  • that's almost perfect. I need to figure out something different for the first \w, since it can actually be almost anything (including spaces), not just word characters. Sorry I didn't mention it earlier, but I updated the question now. For the examples I gave, that's perfect. – Dov Mar 19 '11 at 00:17
  • @Dov, is everything you want to match in the same row, i.e. can we use anchors `^ $`? – stema Mar 19 '11 at 08:23
  • @stema, yeah, it's a file name, so no line breaks. – Dov Mar 19 '11 at 10:06

Try this:
http://regexr.com?2tb67

Regex looks something like:

(\w+?)\s+(\d*)[^\d]*(\d+)

It matches a run of word characters, followed by one or more whitespace characters, then an optional run of digits, then any run of non-digit characters, and finally the remaining digits.

Note that the second result probably isn't ideal for you because 01 comes in the third group match. But it matches all your cases.

Joe