Extract string with regex in R before period and following an underscore or space

Question

I have a list of files that is provided to me by a third party. I am trying to extract the age group name from each filename. Unfortunately, the third party has a poor and inconsistent naming convention for their files and I'm writing a larger piece of code that consumes these files. This age group string that I'm trying to extract always appears before the ".xls" file extension and follows either an underscore or a space. I have tried a number of different regular expressions to do this in R, but I can't seem to figure this out (I'm not great with regex obviously).

age_group <- c("abc_July2018_Dec2018__state_1864.xls",
               "def_July2018_Dec2018__state_65.xls",
               "ghi July2018 Dec2018 state overall.xls")

The output I'm expecting is a vector containing: "1864", "65", "overall".

Can someone help me with the R regular expression to extract these groups?

Looks like you are looking to create a regex, but do not know where to get started. Please check [Reference - What does this regex mean](https://stackoverflow.com/questions/22937618) resource, it has plenty of hints. Also, refer to [Learning Regular Expressions](https://stackoverflow.com/questions/4736) post for some basic regex info. Once you get some expression ready and still have issues with the solution, please edit the question with the latest details and we'll be glad to help you fix the problem. — Wiktor Stribiżew, Sep 23 '20 at 15:21
Questions that ask ["Give me a regex that does X"](https://meta.stackoverflow.com/q/285733) with no attempt are off topic on Stack Overflow. Also, see [Why is “Can someone help me?” not an actual question?](https://meta.stackoverflow.com/questions/284236) — Wiktor Stribiżew, Sep 23 '20 at 15:21
I disagree and voted to reopen. You can't answer "learn regex" to every regex question. The answers here are complicated, learning this level of regex from a general tutorial for a user focused on R, not regex, will take hours. On the other hand the question is well formulated with good keywords and users with a similar issue are likely to benefit from it. — Moody_Mudskipper, Sep 25 '20 at 21:03
Moreover, while pure regex is a good way, probably the most efficient, to solve this issue, there might be other solutions in R, maybe an existing dedicated function in a package. I believe redirecting to a generic regex post is wrong. — Moody_Mudskipper, Sep 25 '20 at 21:09

Chris Ruehlemann · Answer 1 · 2020-10-09T15:03:51.657

1

Or use str_extract from the package stringr:

library(stringr)
str_extract(age_group, "(?<=_| )[^_ ]+(?=\\.xls)")
# [1] "1864"    "65"      "overall"

This makes use of a positive lookbehind, (?<=_| ), which can be glossed as "match if you see ... on the left", namely either _ or a space, and of a positive lookahead, (?=\\.xls), glossable as "match if you see ... on the right", namely a literal . followed by xls. Between these two restrictions, [^_ ]+ matches one or more characters that are neither _ nor a space. Because lookarounds are zero-width, neither the delimiter nor the extension ends up in the result.
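To see what each part contributes, the pattern can be assembled from named pieces. This is only a sketch: it uses base R's regexpr() with perl = TRUE instead of stringr so that it runs without extra packages, and the variable names (left_of_match, token, right_of_match) are illustrative, not part of the original answer.

```r
age_group <- c("abc_July2018_Dec2018__state_1864.xls",
               "def_July2018_Dec2018__state_65.xls",
               "ghi July2018 Dec2018 state overall.xls")

# Illustrative names for the three parts of the pattern
left_of_match  <- "(?<=_| )"    # lookbehind: "_" or " " must sit just before the match
token          <- "[^_ ]+"      # one or more characters that are neither "_" nor " "
right_of_match <- "(?=\\.xls)"  # lookahead: a literal ".xls" must follow the match

pattern <- paste0(left_of_match, token, right_of_match)

# perl = TRUE is needed because lookarounds are a PCRE feature
m <- regexpr(pattern, age_group, perl = TRUE)
regmatches(age_group, m)
# [1] "1864"    "65"      "overall"
```

Earlier tokens such as July2018 also satisfy the lookbehind, but they are followed by a delimiter rather than .xls, so the lookahead rejects them.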

edited Oct 09 '20 at 15:03

answered Sep 23 '20 at 15:14

Chris Ruehlemann


The only problem with this solution is that I can't count on the word "state" always appearing where you see it. Sometimes this word is replaced with other geographical areas of aggregation (e.g. "national" or "msa"). – StatsStudent Sep 23 '20 at 15:24
Well, you can use alternatives: `str_extract(age_group, "(?<=(state|national|msa)(_| ))\\w+(?=\\.xls)")` – Chris Ruehlemann Sep 23 '20 at 15:27
Thanks for your help. The only issue is that new aggregation levels could be created that I may not know of in advance. – StatsStudent Sep 23 '20 at 15:30
I've made a slight adjustment to the pattern. – Chris Ruehlemann Sep 23 '20 at 15:42
Ah. Thanks. I like this pattern much better. This works. Thanks so much, Chris. – StatsStudent Sep 23 '20 at 15:52
@StatsStudent You may accept this answer instead of mine, the regex is better. If you can't use `stringr` you may do `regmatches(age_group, regexpr("(?<=_| )[^_ ]+(?=\\.xls)", age_group, perl = TRUE))` in base R. – jay.sf Oct 02 '20 at 22:25
@jay.sf Thanks for this endorsement! Greatly appreciated. – Chris Ruehlemann Oct 03 '20 at 08:50
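One caveat with the base R variant from the comments: regmatches() combined with regexpr() returns only the substrings that matched, so a filename that does not fit the pattern silently shortens the result. A minimal sketch of one way to keep the output aligned with the input, assuming a hypothetical non-matching file "notes.txt" that is not in the original question:

```r
files <- c("abc_July2018_Dec2018__state_1864.xls",
           "ghi July2018 Dec2018 state overall.xls",
           "notes.txt")  # hypothetical file that does not fit the pattern

pattern <- "(?<=_| )[^_ ]+(?=\\.xls)"
m <- regexpr(pattern, files, perl = TRUE)  # -1 marks elements without a match

# Start from all-NA and fill in only the positions that matched,
# so the result has the same length and order as `files`
age <- rep(NA_character_, length(files))
age[m != -1L] <- regmatches(files, m)
age
# [1] "1864"    "overall" NA
```

stringr's str_extract() already returns NA for non-matching elements, so this extra step is only needed when staying in base R.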

score -1 · Accepted Answer · answered Sep 23 '20 at 15:12

Using gsub.

gsub(".*(_|\\s)(.*)\\.xls$", "\\2", age_group)
# [1] "1864"    "65"      "overall"
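A note on the dot before xls in a substitution like this: an unescaped . matches any character, while \\. matches only a literal period. The sketch below compares the two forms on one filename from the question and on a hypothetical name without a real extension; the names loose and strict are illustrative.

```r
x <- c("def_July2018_Dec2018__state_65.xls",
       "def_state_65_xls")  # hypothetical name with no ".xls" extension

# Unescaped dot: any character may stand in front of "xls"
loose  <- gsub(".*(_|\\s)(.*).xls", "\\2", x)

# Escaped dot plus end anchor: only a trailing literal ".xls" qualifies
strict <- gsub(".*(_|\\s)(.*)\\.xls$", "\\2", x)

loose
# [1] "65" "65"
strict
# [1] "65"               "def_state_65_xls"
```

When the pattern does not match, gsub() returns the input unchanged, so the strict form leaves unexpected filenames visible instead of quietly extracting something from them.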

jay.sf

