I have a list of files that is provided to me by a third party. I am trying to extract the age group name from each filename. Unfortunately, the third party has a poor and inconsistent naming convention for their files and I'm writing a larger piece of code that consumes these files. This age group string that I'm trying to extract always appears before the ".xls" file extension and follows either an underscore or a space. I have tried a number of different regular expressions to do this in R, but I can't seem to figure this out (I'm not great with regex obviously).

age_group <- c("abc_July2018_Dec2018__state_1864.xls",
               "def_July2018_Dec2018__state_65.xls",
               "ghi July2018 Dec2018 state overall.xls")

The output I'm expecting is a vector containing: "1864", "65", "overall".

Can someone help me with an R regular expression to extract these groups?

StatsStudent
  • 2
    Looks like you are looking to create a regex, but do not know where to get started. Please check [Reference - What does this regex mean](https://stackoverflow.com/questions/22937618) resource, it has plenty of hints. Also, refer to [Learning Regular Expressions](https://stackoverflow.com/questions/4736) post for some basic regex info. Once you get some expression ready and still have issues with the solution, please edit the question with the latest details and we'll be glad to help you fix the problem. – Wiktor Stribiżew Sep 23 '20 at 15:21
  • 1
    Questions that ask ["Give me a regex that does X"](https://meta.stackoverflow.com/q/285733) with no attempt are off topic on Stack Overflow. Also, see [Why is “Can someone help me?” not an actual question?](https://meta.stackoverflow.com/questions/284236) – Wiktor Stribiżew Sep 23 '20 at 15:21
  • 3
    I disagree and voted to reopen. You can't answer "learn regex" to every regex question. The answers here are complicated, learning this level of regex from a general tutorial for a user focused on R, not regex, will take hours. On the other hand the question is well formulated with good keywords and users with a similar issue are likely to benefit from it. – Moody_Mudskipper Sep 25 '20 at 21:03
  • Moreover, while pure regex is a good way, probably the most efficient, to solve this issue, there might be other solutions in R, maybe an existing dedicated function in a package. I believe redirecting to a generic regex post is wrong. – Moody_Mudskipper Sep 25 '20 at 21:09

2 Answers

Or use str_extract from the package stringr:

library(stringr)
str_extract(age_group, "(?<=_| )[^_ ]+(?=\\.xls)")
[1] "1864"    "65"      "overall"

This uses a positive lookbehind, (?<=_| ), which can be glossed as "match if you see ... on the left", namely either _ or a space, and a positive lookahead, (?=\\.xls), glossable as "match if you see ... on the right", namely . followed by xls. Between these two restrictions, [^_ ]+ matches one or more characters that are neither _ nor a space.
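
If adding a package is not an option, the same lookaround idea can be sketched in base R. This is only an illustration, not part of the original answer: the name age_labels is hypothetical, [_ ] is a character-class spelling of (_| ), and the trailing $ is an added assumption that ".xls" always ends the file name.

```r
# Same file names as in the question.
age_group <- c("abc_July2018_Dec2018__state_1864.xls",
               "def_July2018_Dec2018__state_65.xls",
               "ghi July2018 Dec2018 state overall.xls")

# perl = TRUE selects the PCRE engine, which supports lookbehind and lookahead.
# The $ anchor is an assumption: ".xls" must be the very end of the name.
pattern <- "(?<=[_ ])[^_ ]+(?=\\.xls$)"

# regexpr() locates the first match in each string; regmatches() extracts it.
# Caveat: regmatches() drops elements that do not match instead of returning NA.
age_labels <- regmatches(age_group, regexpr(pattern, age_group, perl = TRUE))
print(age_labels)
```

Unlike str_extract, which returns NA for a non-matching name, this version returns a shorter vector when a file name does not fit the pattern, so it may be worth comparing length(age_labels) with length(age_group) before relying on the result.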

Chris Ruehlemann

Using gsub.

gsub(".*(_|\\s)(.*)\\.xls$", "\\2", age_group)
# [1] "1864"    "65"      "overall"
jay.sf