I have the following string:

"Josua de Grave* (1643-1712)"

Everything before the * is the person's name, the first date (1643) is his birth date, and 1712 is the date of his death.

Following this logic, I'd like to have 3 match groups, one for each of these items. I tried

([a-zA-Z|\s]*)\* (\d{3,4})-(\d{3,4})
"Josua de Grave* (1643-1712)".match(/([a-zA-Z|\s]*)\* (\d{3,4})-(\d{3,4})/)

but that returns nil.

Why is my logic wrong, and what should I do to get the 3 intended match groups?

the Tin Man
David Geismar

3 Answers

The parentheses ( ) around the 1643-1712 dates are missing from your regex pattern, so use

([a-zA-Z\s]*)\* \((\d{3,4})-(\d{3,4})\)
//               ^^                   ^^

Unescaped parentheses denote a capture group, so escape them with \ to match them as literal characters.
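
A minimal sketch of the difference, using the string from the question. The variable names (`unescaped`, `escaped`, `no_match`, `captures`) are only illustrative:

```ruby
s = "Josua de Grave* (1643-1712)"

# Without \( the pattern expects a digit right after "* ",
# but the string has "(" there, so nothing matches.
unescaped = /([a-zA-Z\s]*)\* (\d{3,4})-(\d{3,4})/

# With \( and \) the literal parentheses are consumed
# and the three capture groups line up with the data.
escaped = /([a-zA-Z\s]*)\* \((\d{3,4})-(\d{3,4})\)/

no_match = s.match(unescaped)          # => nil
captures = s.match(escaped).captures   # => ["Josua de Grave", "1643", "1712"]
```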

Pavneet_Singh
  • Be careful using `\s` as it matches more than a space: `/[ \t\r\n\f\v]/`. Also, it's not necessary to use the alternation `|` inside `[]`. – the Tin Man Feb 16 '20 at 01:37
  • @theTinMan `\s` is needed here to match the spaces. I also missed the `|` in the pattern; it would match a literal `|` character, which is unlikely to appear, but there is no reason to keep it. Thanks! – Pavneet_Singh Feb 17 '20 at 09:18
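
The two points raised in these comments can be checked with a short sketch. The test strings and variable names are made up for illustration, and `match?` assumes Ruby 2.4 or later:

```ruby
# Inside [...] the | is a literal pipe character, not alternation.
with_pipe    = /\A[a-zA-Z|\s]*\z/
without_pipe = /\A[a-zA-Z\s]*\z/
pipe_in_class     = "de|Grave".match?(with_pipe)     # => true
pipe_not_in_class = "de|Grave".match?(without_pipe)  # => false

# \s also accepts tabs and newlines; a literal space in the class does not.
tab_with_s     = "de\tGrave".match?(/\A[a-zA-Z\s]*\z/)  # => true
tab_with_space = "de\tGrave".match?(/\A[a-zA-Z ]*\z/)   # => false
```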
While you can use a pattern, splitting this string into its parts can also be done easily with other Ruby methods:

Using split:

s = "Josua de Grave* (1643-1712)"
name, dates = s.split('*') # => ["Josua de Grave", " (1643-1712)"]
birth, death = dates[2..-2].split('-') # => ["1643", "1712"]

Or, using scan:

*name, birth, death = s.scan(/[[:alnum:]]+/) # => ["Josua", "de", "Grave", "1643", "1712"]
name.join(' ')  # => "Josua de Grave"
birth # => "1643"
death # => "1712"

If I was using a pattern, I'd use this:

name, birth, death = /^([^*]+).+?(\d+)-(\d+)/.match(s)[1..3] # => ["Josua de Grave", "1643", "1712"]
name # => "Josua de Grave"
birth # => "1643"
death # => "1712"

/^([^*]+).+?(\d+)-(\d+)/ means:

  • ^ start at the beginning of a line (in Ruby, \A anchors to the beginning of the string)
  • ([^*]+) capture everything up to, but not including, the first *
  • .+? skip the minimum number of characters until...
  • (\d+) the birth year is matched and captured
  • - match the hyphen but don't capture it
  • (\d+) the death year is matched and captured
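
As a variation on the same structure, the groups can be given names so that the results are read by name instead of by position. This is only a sketch: the group names are illustrative and not part of the pattern above, and it uses \A instead of ^ so the match is anchored to the start of the string.

```ruby
s = "Josua de Grave* (1643-1712)"

# Same three groups as the pattern above, but named (names are illustrative).
# \A anchors to the start of the string, whereas ^ matches at the start of any line.
pattern = /\A(?<name>[^*]+).+?(?<birth>\d+)-(?<death>\d+)/

m = pattern.match(s)
m[:name]   # => "Josua de Grave"
m[:birth]  # => "1643"
m[:death]  # => "1712"
```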

Regexper helps explain it as does Rubular.

the Tin Man
r = /\*\s+\(|(?<=\d)\s*-\s*|\)/

"Josua de Grave* (1643-1712)".split r
  #=> ["Josua de Grave", "1643", "1712"] 

"Sir Winston Leonard Spencer-Churchill* (1874 - 1965)".split r
  #=> ["Sir Winston Leonard Spencer-Churchill", "1874", "1965"]

The regular expression can be made self-documenting by writing it in free-spacing mode:

r = /
    \*\s+\(  # match '*' then >= 1 whitespaces then '('
    |        # or
    (?<=\d)  # match is preceded by a digit (positive lookbehind)
    \s*-\s*  # match >= 0 whitespaces then '-' then >= 0 whitespaces 
    |        # or
    \)       # match ')'
    /x       # free-spacing regex definition mode

The positive lookbehind is needed to avoid splitting hyphenated names on hyphens. (The positive lookahead (?=\d), placed after \s*-\s*, could be used instead.)
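
A sketch of that lookahead variant, applied to the same two strings. The names `r_ahead`, `grave`, and `churchill` are only illustrative:

```ruby
# Lookahead variant: a hyphen only splits when a digit follows it.
r_ahead = /\*\s+\(|\s*-\s*(?=\d)|\)/

grave = "Josua de Grave* (1643-1712)".split(r_ahead)
  #=> ["Josua de Grave", "1643", "1712"]

churchill = "Sir Winston Leonard Spencer-Churchill* (1874 - 1965)".split(r_ahead)
  #=> ["Sir Winston Leonard Spencer-Churchill", "1874", "1965"]
```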

Cary Swoveland