How to extract part of a string into a new variable in my data frame?

Question

I need a way to extract part of a string from one variable to create a new variable.

I have, e.g.:

df <- c(" 3 Rue d Argentine 16th arr 75116 Paris France",
        "5 Passage Ruelle 18th arr 75018 Paris France",
        " 1 Avenue Carnot 17th arr 75017 Paris France",
        "Bis Rue De Vaugirard 6th arr 75006 Paris France",
        "6 Impasse Marteau 18th arr 75018 Paris France",
        " 1 Place De La Sorbonne 5th arr 75005 Paris France",
        "1 Place Vend me 1st arr 75001 Paris France")

What I want is a new variable that holds the extracted arrondissements, so that my new data frame becomes:

address: " 3 Rue d Argentine 16th arr 75116 Paris France",
"5 Passage Ruelle 18th arr 75018 Paris France", " 1 Avenue Carnot 17th arr 75017 Paris France", "Bis Rue De Vaugirard 6th arr 75006 Paris France", "6 Impasse Marteau 18th arr 75018 Paris France", " 1 Place De La Sorbonne 5th arr 75005 Paris France", "1 Place Vend me 1st arr 75001 Paris France"

arr: "16th", "18th", "17th", "6th", "18th", "5th", "1st"

etc. Can anybody help me with how to do this in R?

@WiktorStribiżew, while I don't doubt that there are applicable dupes for this, *that* dupe is about removing the `st|nd|rd|th` ordinal from a pattern, not for extracting the whole number+ordinal. While you or I could likely adapt those removal regexes to preserve it as well ... I don't get the idea that this OP knows enough regex to be able to adapt that. — r2evans, Nov 24 '20 at 14:26

r2evans · Answer 1 · 2020-11-24T14:21:26.103

A base R method could be:

unlist(regmatches(df, gregexpr("\\b(\\S+)(?=\\sarr)", df, perl=TRUE)))
# [1] "16th" "18th" "17th" "6th"  "18th" "5th"  "1st"

Using gsub might be a mistake here, because if arr is not found then it will return the whole string.
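To make the point about substitution concrete, here is a minimal sketch. The vector `x` and its second element (which has no `arr` token) are hypothetical, added only to show the failure mode; the `sub` pattern is one plausible way someone might attempt the substitution approach, not code from the question.

```r
# Hypothetical input: the second element contains no "arr" token.
x <- c("3 Rue d Argentine 16th arr 75116 Paris France",
       "10 Downing Street London")

# Substitution approach: replace the whole string with the captured token.
# When the pattern does not match, sub() returns the input unchanged.
by_sub <- sub(".*\\b(\\S+)\\sarr.*", "\\1", x, perl = TRUE)
by_sub
# [1] "16th"                     "10 Downing Street London"

# Extraction approach: a non-matching element yields character(0),
# so no unrelated text leaks into the result.
by_match <- regmatches(x, gregexpr("\\b(\\S+)(?=\\sarr)", x, perl = TRUE))
lengths(by_match)
# [1] 1 0
```

Note that `unlist()` on `by_match` would drop the empty element, so the result would be shorter than the input whenever some strings have no match.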

If you'd rather use stringr, then

stringr::str_extract(df, "\\b(\\S+)(?=\\sarr)")
# [1] "16th" "18th" "17th" "6th"  "18th" "5th"  "1st"

Both regexes utilize "lookahead". The pattern broken down:

\\b word boundary; this is zero-width (it matches no characters) and asserts a transition between a word character and a non-word character, so the match starts at the beginning of a token
(\\S+) one or more (+) non-whitespace characters (\\S)
(?=\\sarr) is a lookahead that ensures the enclosed text (\\s is a whitespace character, followed by the literal arr) is found after the desired pattern, but it is not "consumed"; in base R, using this requires perl=TRUE

With the base R version, gregexpr returns a list of match positions and lengths, one list element per element of the input (df), and can be used for extraction (as in here) or even replacement (`regmatches<-`).
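If the result has to line up with the rows of a data frame, one option is `regexpr` (first match only) combined with an `NA` placeholder for elements that do not match. This is a sketch under that assumption; the vector `x`, its non-matching middle element, and the name `arr` are illustrative.

```r
# Illustrative input; the middle element has no "arr" token.
x <- c("5 Passage Ruelle 18th arr 75018 Paris France",
       "no district here",
       "1 Place Vend me 1st arr 75001 Paris France")

# regexpr() returns one start position per element (-1 means "no match"),
# with the match lengths stored in the "match.length" attribute.
m <- regexpr("\\b\\S+(?=\\sarr)", x, perl = TRUE)

# regmatches() returns only the matched substrings, so fill them into
# a vector of NAs to keep the same length as the input.
arr <- rep(NA_character_, length(x))
arr[m != -1] <- regmatches(x, m)
arr
# [1] "18th" NA     "1st"
```

This mirrors what `stringr::str_extract` does by default, which also returns `NA` for elements without a match.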

score 0 · Answer 2 · answered Nov 24 '20 at 13:41

Does this work, using a positive lookahead? It assumes `df` is a data frame with an `address` column:

library(dplyr)
library(stringr)
df %>% mutate(arr = str_extract(address, '\\d+..(?=\\sarr)'))
                                             address  arr
1      3 Rue d Argentine 16th arr 75116 Paris France 16th
2       5 Passage Ruelle 18th arr 75018 Paris France 18th
3        1 Avenue Carnot 17th arr 75017 Paris France 17th
4    Bis Rue De Vaugirard 6th arr 75006 Paris France  6th
5      6 Impasse Marteau 18th arr 75018 Paris France 18th
6  1 Place De La Sorbonne 5th arr 75005 Paris France  5th
7         1 Place Vend me 1st arr 75001 Paris France  1st
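Since the question defines `df` as a plain character vector, the pipeline above needs a data frame first. A minimal sketch of that step follows; the names `addresses` and `result` are hypothetical, and trimming the leading spaces with `trimws` is an added assumption rather than part of the answer.

```r
library(dplyr)
library(stringr)

# The question's character vector, shortened to three elements here.
addresses <- c(" 3 Rue d Argentine 16th arr 75116 Paris France",
               "Bis Rue De Vaugirard 6th arr 75006 Paris France",
               "1 Place Vend me 1st arr 75001 Paris France")

# Wrap the vector in a data frame so mutate() has a column to work on.
df <- data.frame(address = trimws(addresses))

# "\\d+.." is one or more digits followed by any two characters;
# the lookahead requires whitespace and "arr" right after them.
result <- df %>%
  mutate(arr = str_extract(address, "\\d+..(?=\\sarr)"))

result$arr
# [1] "16th" "6th"  "1st"
```

The two dots accept any two characters, so a stricter variant could spell out the suffixes, e.g. `"\\d+(?:st|nd|rd|th)(?=\\sarr)"`.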

score 0 · Answer 3 · answered Nov 24 '20 at 13:52


Here is a good option:

library(stringr)
# "rd" is included so that "3rd" is matched as well
str_extract(df, "[0-9]+(st|nd|rd|th)")

# [1] "16th" "18th" "17th" "6th"  "18th" "5th"  "1st"
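One caveat with a pattern that is not tied to `arr`: `str_extract` returns the first ordinal anywhere in the string. The address below is made up to show this, and the names `x`, `first_ordinal`, and `before_arr` are illustrative; none of the addresses in the question trigger the problem.

```r
library(stringr)

# Made-up address whose street name also contains an ordinal.
x <- "8 Rue du 4th Septembre 2nd arr 75002 Paris France"

# Without an anchor, the first ordinal in the string wins.
first_ordinal <- str_extract(x, "[0-9]+(st|nd|rd|th)")
first_ordinal
# [1] "4th"

# Adding the lookahead keeps only the ordinal that precedes "arr".
before_arr <- str_extract(x, "[0-9]+(?:st|nd|rd|th)(?=\\sarr)")
before_arr
# [1] "2nd"
```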


Marcos Pérez

