70

I'm porting a library from Ruby to Go, and have just discovered that Ruby's regular expressions are not compatible with Go's (which uses Google's RE2 syntax). As I understand it, Ruby and Java (plus other languages) use PCRE-style (Perl-compatible) regular expressions, which support named capturing groups, so I need to rewrite my expressions so that they compile in Go.

For example, I have the following regex:

`(?<Year>\d{4})-(?<Month>\d{2})-(?<Day>\d{2})`

This should accept input such as:

2001-01-20

The capturing groups allow the year, month and day to be captured into variables. Getting the value of each group is easy: you index into the returned match data with the group name and you get the value back. For example, to get the year, something like this pseudocode:

m=expression.Match("2001-01-20")
year = m["Year"]

This is a pattern I use a lot in my expressions, so I have a lot of re-writing to do.

So, is there a way to get this kind of functionality in Go regexp; how should I re-write these expressions?

Flimzy
Plastikfan

8 Answers

93

how should I re-write these expressions?

Add some Ps, as defined here:

(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})

Cross reference capture group names with re.SubexpNames().

And use as follows:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    r := regexp.MustCompile(`(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})`)
    fmt.Printf("%#v\n", r.FindStringSubmatch(`2015-05-27`))
    fmt.Printf("%#v\n", r.SubexpNames())
}
spoulson
thwd
  • Ok great, that looks encouraging, but how would I get access to the individual values: year, month and day? – Plastikfan May 27 '15 at 13:31
  • Forget that last comment, I just found the answer. It's all in the ?P, as you say :) – Plastikfan May 27 '15 at 13:40
  • I'm still confused by this; I'm not sure they are addressable by Year, Month, etc. I get back an array with four values and can index into it, but that's it. – Kevin Burke Feb 09 '16 at 23:21
  • @KevinBurke see the example in [`Regexp.SubexpNames`](https://golang.org/pkg/regexp/#example_Regexp_SubexpNames) – thwd Feb 10 '16 at 10:40
  • @thwd Now, this is begging a question: what should happen if you named two groups in the same way? This is not a well-defined behavior, but the regex compiler doesn't complain about it. Your code throws away the first match for example, but I can imagine situations where it would make sense to throw all but first, or maybe collect all of them... Designing a language has lots of subtleties... – wvxvw Jul 18 '16 at 15:55
  • @wvxvw The (?P<name>group) syntax was first introduced by the Python re module. It is not Go-specific syntax. [Read more](http://www.regular-expressions.info/named.html) – Vladimir Bauer Nov 29 '16 at 14:11
  • @VladimirBauer I'm not sure of what you are getting at. I know it's not specific to Go, I'm arguing that specifically in Go, the built-in library implementation of this feature is bad because it duplicates another simpler feature of this library, but with an additional meaningless syntactical element. – wvxvw Nov 30 '16 at 08:39
  • @wvxvw this is well-defined behavior now: https://golang.org/pkg/regexp/#Regexp.SubexpIndex but still without the additional possibilities you mentioned. – Eric Lindsey Sep 08 '20 at 07:36
28

I created a function for handling URL expressions, but it suits your needs too. You can check this snippet, but it simply works like this:

// getParams matches url against the regular expression regEx and returns
// the values of the groups defined in the expression, keyed by group name.
func getParams(regEx, url string) (paramsMap map[string]string) {

    var compRegEx = regexp.MustCompile(regEx)
    match := compRegEx.FindStringSubmatch(url)

    paramsMap = make(map[string]string)
    for i, name := range compRegEx.SubexpNames() {
        // Skip index 0 (the whole match). match is nil when url does not
        // match, in which case the map stays empty.
        if i > 0 && i < len(match) {
            paramsMap[name] = match[i]
        }
    }
    return paramsMap
}

You can use this function like:

params := getParams(`(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})`, `2015-05-27`)
fmt.Println(params)

and the output will be:

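map[Day:27 Month:05 Year:2015]

(Since Go 1.12, fmt prints map keys in sorted order; older versions printed them in an unspecified order.)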
eluleci
16

To reduce RAM and CPU usage, the next example avoids calling anonymous functions inside a loop and avoids copying arrays in memory with the "append" function.

It handles more than one match in multiline text, without concatenating strings with '+' and without nesting a for loop inside another for loop (unlike other examples posted here).

txt := `2001-01-20
2009-03-22
2018-02-25
2018-06-07`

// Keep the *regexp.Regexp pointer as returned; there is no need to dereference it.
regex := regexp.MustCompile(`(?s)(\d{4})-(\d{2})-(\d{2})`)
res := regex.FindAllStringSubmatch(txt, -1)
for i := range res {
    // like Java: match.group(1), match.group(2), etc.
    fmt.Printf("year: %s, month: %s, day: %s\n", res[i][1], res[i][2], res[i][3])
}

Output:

year: 2001, month: 01, day: 20
year: 2009, month: 03, day: 22
year: 2018, month: 02, day: 25
year: 2018, month: 06, day: 07

Note: res[i][0] corresponds to match.group(0) in Java, i.e. the whole match.

If you want to store this information use a struct type:

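type date struct {
    y, m, d int
}
...
func main() {
    ...
    // Allocate len(res) elements up front so that indexing is valid.
    dates := make([]date, len(res))
    for i := range res {
        // The submatches are strings, so convert them (needs "strconv").
        // The pattern only matches digits, so conversion errors are ignored here.
        y, _ := strconv.Atoi(res[i][1])
        m, _ := strconv.Atoi(res[i][2])
        d, _ := strconv.Atoi(res[i][3])
        dates[i] = date{y: y, m: m, d: d}
    }
}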

When positional access is all you need, I prefer unnamed groups: they avoid the extra lookup from group name to index.

Using the "ReplaceAllGroupFunc" posted on GitHub is a bad idea because:

  1. it uses a loop inside a loop
  2. it calls an anonymous function inside a loop
  3. it has a lot of code
  4. it uses the "append" function inside a loop without preallocating. Whenever "append" runs out of capacity, it allocates a larger backing array and copies the existing elements to the new memory location
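
On the last point, it may help to know that append only copies the backing array when the slice's capacity is exhausted, so the copying can be avoided by allocating the capacity up front. A small sketch of that idea follows; the helper name `collectGroups` is made up for illustration, and the index pairs are assumed to have the layout returned by FindStringSubmatchIndex.

```go
package main

import "fmt"

// collectGroups is a hypothetical helper. loc holds start/end index pairs
// into str (the layout returned by Regexp.FindStringSubmatchIndex); the
// result holds the corresponding substrings.
func collectGroups(str string, loc []int) []string {
    // The number of groups is known, so reserve the capacity once.
    groups := make([]string, 0, len(loc)/2)
    for i := 0; i+1 < len(loc); i += 2 {
        if loc[i] < 0 {
            // A negative index means the group did not participate.
            groups = append(groups, "")
            continue
        }
        // No reallocation happens here: capacity was reserved above.
        groups = append(groups, str[loc[i]:loc[i+1]])
    }
    return groups
}

func main() {
    // Index pairs for the whole match "foo:bar" and its two groups.
    groups := collectGroups("foo:bar", []int{0, 7, 0, 3, 4, 7})
    fmt.Println(groups, len(groups), cap(groups)) // prints "[foo:bar foo bar] 3 3"
}
```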
VasileM
  • Yes, there is a better and worse solution if you consider wasted clock cycles, wasted RAM, etc. With modesty you would let a farmer publish code in production. – VasileM Aug 18 '19 at 16:48
5

A simple way to determine group names, based on @VasileM's answer.

Disclaimer: it's not about memory/cpu/time optimization

package main

import (
    "fmt"
    "regexp"
)

func main() {
    r := regexp.MustCompile(`^(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})$`)

    res := r.FindStringSubmatch(`2015-05-27`)
    names := r.SubexpNames()
    for i := range res {
        if i != 0 {
            fmt.Println(names[i], res[i])
        }
    }
}

https://play.golang.org/p/Y9cIVhMa2pU

spiil
2

If you need to replace based on a function while capturing groups you can use this:

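import (
    "regexp"
    "strings"
)

// ReplaceAllGroupFunc replaces every match of re in str with the result of
// repl, which receives the whole match followed by each capture group.
func ReplaceAllGroupFunc(re *regexp.Regexp, str string, repl func([]string) string) string {
    var result strings.Builder
    lastIndex := 0

    for _, v := range re.FindAllStringSubmatchIndex(str, -1) {
        groups := make([]string, 0, len(v)/2)
        for i := 0; i < len(v); i += 2 {
            if v[i] < 0 {
                // The group did not participate in the match; slicing with
                // a negative index would panic.
                groups = append(groups, "")
                continue
            }
            groups = append(groups, str[v[i]:v[i+1]])
        }

        // Copy the text before the match, then the replacement.
        result.WriteString(str[lastIndex:v[0]])
        result.WriteString(repl(groups))
        lastIndex = v[1]
    }

    result.WriteString(str[lastIndex:])
    return result.String()
}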

Example:

str := "abc foo:bar def baz:qux ghi"
re := regexp.MustCompile("([a-z]+):([a-z]+)")
result := ReplaceAllGroupFunc(re, str, func(groups []string) string {
    return groups[1] + "." + groups[2]
})
fmt.Printf("'%s'\n", result)

https://gist.github.com/elliotchance/d419395aa776d632d897
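
When the replacement is only a rearrangement of the captured groups, as in the example above, the standard library's ReplaceAllString with a `${n}` template may already be enough; a callback is needed only when the replacement has to be computed (ReplaceAllStringFunc passes just the matched text, not the groups). A minimal sketch, where the function name `joinPairs` is made up for illustration:

```go
package main

import (
    "fmt"
    "regexp"
)

// joinPairs is a hypothetical example function. It rewrites every
// "key:value" pair in s as "key.value" using a replacement template,
// where ${1} and ${2} refer to the first and second capture groups.
func joinPairs(s string) string {
    re := regexp.MustCompile(`([a-z]+):([a-z]+)`)
    return re.ReplaceAllString(s, "${1}.${2}")
}

func main() {
    fmt.Printf("'%s'\n", joinPairs("abc foo:bar def baz:qux ghi"))
    // prints 'abc foo.bar def baz.qux ghi'
}
```

Named groups can be referenced the same way, for example `${Year}` for a group declared as `(?P<Year>...)`.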

Elliot Chance
2

As of Go 1.15, you can simplify the process by using Regexp.SubexpIndex. You can check the release notes at https://golang.org/doc/go1.15#regexp.

Based on your example, you'd have something like the following:

re := regexp.MustCompile(`(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})`)
matches := re.FindStringSubmatch("Some random date: 2001-01-20")
yearIndex := re.SubexpIndex("Year")
fmt.Println(matches[yearIndex])

You can check and execute this example at https://play.golang.org/p/ImJ7i_ZQ3Hu.
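
Two edge cases are worth guarding against: FindStringSubmatch returns nil when the input does not match, and SubexpIndex returns -1 when the expression has no group with the given name. Either would make the indexing above panic. A minimal sketch of a guarded lookup, assuming Go 1.15 or later; the names `dateRe` and `groupByName` are made up for illustration:

```go
package main

import (
    "fmt"
    "regexp"
)

// dateRe is the expression from the question, rewritten with ?P.
var dateRe = regexp.MustCompile(`(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})`)

// groupByName is a hypothetical helper (not in the standard library).
// It returns the text captured by the named group in the first match of
// re in s. The second result is false if s does not match or re has no
// group called name.
func groupByName(re *regexp.Regexp, s, name string) (string, bool) {
    matches := re.FindStringSubmatch(s)
    idx := re.SubexpIndex(name) // -1 if there is no such group
    if matches == nil || idx < 0 {
        return "", false
    }
    return matches[idx], true
}

func main() {
    year, ok := groupByName(dateRe, "Some random date: 2001-01-20", "Year")
    fmt.Println(year, ok) // prints "2001 true"

    _, ok = groupByName(dateRe, "no date in here", "Year")
    fmt.Println(ok) // prints "false"
}
```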

0

You can use the regroup library for that: https://github.com/oriser/regroup

Example:

package main

import (
    "fmt"
    "github.com/oriser/regroup"
)

func main() {
    r := regroup.MustCompile(`(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})`)
    matches, err := r.Groups("2015-05-27")
    if err != nil {
        panic(err)
    }
    fmt.Printf("%+v\n", matches)
}

Will print: map[Day:27 Month:05 Year:2015]

Alternatively, you can use it like this:

package main

import (
    "fmt"
    "github.com/oriser/regroup"
)

type Date struct {
    Year   int `regroup:"Year"`
    Month  int `regroup:"Month"`
    Day    int `regroup:"Day"`
}

func main() {
    date := &Date{}
    r := regroup.MustCompile(`(?P<Year>\d{4})-(?P<Month>\d{2})-(?P<Day>\d{2})`)
    if err := r.MatchToTarget("2015-05-27", date); err != nil {
        panic(err)
    }
    fmt.Printf("%+v\n", date)
}

Will print: &{Year:2015 Month:5 Day:27}

Ori Seri
0

A function for getting regexp parameters, with nil pointer checking. It returns a nil map (printed as map[]) if rx is nil or str does not match.

// GetRxParams returns the non-empty named groups captured by rx in str.
func GetRxParams(rx *regexp.Regexp, str string) (pm map[string]string) {
    if rx == nil {
        return nil
    }
    // A single call both tests for a match and captures the groups.
    p := rx.FindStringSubmatch(str)
    if p == nil {
        return nil
    }
    n := rx.SubexpNames()
    pm = map[string]string{}
    for i := range n {
        if i == 0 {
            continue
        }

        // Skip unnamed groups and groups that captured nothing.
        if n[i] != "" && p[i] != "" {
            pm[n[i]] = p[i]
        }
    }
    return
}
derv-dice