How to match different groups in regex

Question

I have the following string:

"Josua de Grave* (1643-1712)"

Everything before the * is the person's name, the first date, 1643, is his birth date, and 1712 is the date of his death.

Following this logic, I'd like to have 3 match groups, one for each item. I tried

([a-zA-Z|\s]*)\* (\d{3,4})-(\d{3,4})

"Josua de Grave* (1643-1712)".match(/([a-zA-Z|\s]*)\* (\d{3,4})-(\d{3,4})/)

but that returns nil.

Why is my logic wrong, and what should I do to get the 3 intended match groups?

Pavneet_Singh · Accepted Answer · 2020-02-17T09:16:59.173

1

Your pattern is missing the literal parentheses ( ) around the dates 1643-1712, so add them to the regex:

([a-zA-Z\s]*)\* \((\d{3,4})-(\d{3,4})\)
#               ^^                   ^^

Unescaped parentheses define a capture group, so escape them with \ to match them as literal characters.
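As a rough check of the fix, the sketch below runs both the original attempt and the corrected pattern against the sample string. The helper name parse_person is hypothetical and only used for illustration; it is not part of the question or the answer.

```ruby
# Hypothetical helper (not from the answer): returns [name, birth, death],
# or nil when the string does not match.
def parse_person(str)
  m = str.match(/([a-zA-Z\s]*)\* \((\d{3,4})-(\d{3,4})\)/)
  m && m.captures
end

# The original attempt has no \( and \), so after "* " it expects a digit
# but finds "(" and the match fails.
original = /([a-zA-Z|\s]*)\* (\d{3,4})-(\d{3,4})/
p "Josua de Grave* (1643-1712)".match(original)  # prints nil

name, birth, death = parse_person("Josua de Grave* (1643-1712)")
p [name, birth, death]  # prints ["Josua de Grave", "1643", "1712"]
```

Because match returns nil on failure, checking the result before reading the captures avoids a NoMethodError on input that does not follow the expected format.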

edited Feb 17 '20 at 09:16

answered Feb 15 '20 at 17:46

Pavneet_Singh


Be careful using `\s` as it matches more than a space: `/[ \t\r\n\f\v]/`. Also, it's not necessary to use the alternation `|` inside `[]`. – the Tin Man Feb 16 '20 at 01:37
@theTinMan Matching whitespace with `\s` is the requirement here, though I did miss the `|` in the pattern. It will match a literal `|` character, which is unlikely to appear, but there is no reason to keep it. Thanks! – Pavneet_Singh Feb 17 '20 at 09:18
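The two points raised in these comments can be checked with a small sketch. The constant names and test strings below are made up for illustration, and the \A and \z anchors are added only so that each test covers the whole string.

```ruby
# Inside [...] the | is a literal pipe character, not alternation.
WITH_PIPE    = /\A[a-zA-Z|\s]*\z/
WITHOUT_PIPE = /\A[a-zA-Z\s]*\z/

p WITH_PIPE.match?("Josua|de")     # prints true
p WITHOUT_PIPE.match?("Josua|de")  # prints false

# \s accepts tabs and newlines as well as plain spaces.
SPACE_ONLY = /\A[a-zA-Z ]*\z/

p WITHOUT_PIPE.match?("Josua\tde\nGrave")  # prints true
p SPACE_ONLY.match?("Josua\tde\nGrave")    # prints false
p SPACE_ONLY.match?("Josua de Grave")      # prints true
```

Whether tabs and newlines should count as part of a name depends on the input, so a literal space in the character class is the stricter choice.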

score 1 · Answer 2 · answered Feb 16 '20 at 01:56

While you can use a pattern, the problem of splitting this into its parts can also be easily done using other Ruby methods:

Using split:

s = "Josua de Grave* (1643-1712)"
name, dates = s.split('*') # => ["Josua de Grave", " (1643-1712)"]
birth, death = dates[2..-2].split('-') # => ["1643", "1712"]

Or, using scan:

*name, birth, death = s.scan(/[[:alnum:]]+/) # => ["Josua", "de", "Grave", "1643", "1712"]
name.join(' ')  # => "Josua de Grave"
birth # => "1643"
death # => "1712"

If I were using a pattern, I'd use this:

name, birth, death = /^([^*]+).+?(\d+)-(\d+)/.match(s)[1..3] # => ["Josua de Grave", "1643", "1712"]
name # => "Josua de Grave"
birth # => "1643"
death # => "1712"

/^([^*]+).+?(\d+)-(\d+)/ means:

^ start at the beginning of a line (in Ruby, \A anchors to the start of the string)
([^*]+) capture one or more characters that are not *, stopping at the first *
.+? skip as few characters as possible until...
(\d+) the birth year is matched and captured
- match a literal hyphen but don't capture it
(\d+) the death year is matched and captured
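One possible variation on the pattern broken down above uses named groups, so each capture is read by name rather than by position. This is a sketch, not the answer's own code: the group names, the PERSON constant, and the switch to \A are assumptions made for illustration.

```ruby
# Same structure as the pattern above, with named groups and \A
# (start of string) in place of ^ (start of line).
PERSON = /\A(?<name>[^*]+).+?(?<birth>\d+)-(?<death>\d+)/

m = PERSON.match("Josua de Grave* (1643-1712)")

# match returns nil when the pattern does not apply, so guard first.
if m
  p m[:name]   # prints "Josua de Grave"
  p m[:birth]  # prints "1643"
  p m[:death]  # prints "1712"
end
```

The guard matters because calling [1..3] or [:name] directly on a failed match raises NoMethodError on nil.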

Regexper helps explain it as does Rubular.

`[^*]` will match anything up to `*`, which OP might not want; likely only letters and spaces should match. The same applies to `.+?`, which accepts a much wider range of input. – Pavneet_Singh Feb 17 '20 at 09:28

score 1 · Answer 3 · answered Feb 16 '20 at 03:02

r = /\*\s+\(|(?<=\d)\s*-\s*|\)/

"Josua de Grave* (1643-1712)".split r
  #=> ["Josua de Grave", "1643", "1712"] 

"Sir Winston Leonard Spencer-Churchill* (1874 - 1965)".split r
  #=> ["Sir Winston Leonard Spencer-Churchill", "1874", "1965"]

The regular expression can be made self-documenting by writing it in free-spacing mode:

r = /
    \*\s+\(  # match '*' then >= 1 whitespaces then '('
    |        # or
    (?<=\d)  # match is preceded by a digit (positive lookbehind)
    \s*-\s*  # match >= 0 whitespaces then '-' then >= 0 whitespaces 
    |        # or
    \)       # match ')'
    /x       # free-spacing regex definition mode

The positive lookbehind is needed to avoid splitting hyphenated names on hyphens. (The positive lookahead (?=\d), placed after \s*-\s*, could be used instead.)
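To see why a lookaround is needed, the sketch below compares three variants on the hyphenated-name example: no lookaround, the lookbehind from the answer, and the lookahead alternative mentioned above. The constant names are made up for illustration.

```ruby
# No lookaround: every hyphen is a split point.
PLAIN      = /\*\s+\(|\s*-\s*|\)/
# Lookbehind, as in the answer: the hyphen must follow a digit.
LOOKBEHIND = /\*\s+\(|(?<=\d)\s*-\s*|\)/
# Lookahead alternative: the hyphen must be followed by a digit.
LOOKAHEAD  = /\*\s+\(|\s*-\s*(?=\d)|\)/

s = "Sir Winston Leonard Spencer-Churchill* (1874 - 1965)"

p s.split(PLAIN)
# prints ["Sir Winston Leonard Spencer", "Churchill", "1874", "1965"]
p s.split(LOOKBEHIND)
# prints ["Sir Winston Leonard Spencer-Churchill", "1874", "1965"]
p s.split(LOOKAHEAD)
# prints ["Sir Winston Leonard Spencer-Churchill", "1874", "1965"]
```

Both lookaround variants leave the name intact here; they would only differ on input where a hyphen touches a digit on one side but not the other.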
