I need a way to extract part of a string from one variable to create a new variable.

For example, I have:

df <- c(" 3 Rue d Argentine 16th arr 75116 Paris France",
        "5 Passage Ruelle 18th arr 75018 Paris France",
        " 1 Avenue Carnot 17th arr 75017 Paris France",
        "Bis Rue De Vaugirard 6th arr 75006 Paris France",
        "6 Impasse Marteau 18th arr 75018 Paris France",
        " 1 Place De La Sorbonne 5th arr 75005 Paris France",
        "1 Place Vend me 1st arr 75001 Paris France")

What I want is a new variable that holds the extracted arrondissement, so that my new data frame becomes:

address: " 3 Rue d Argentine 16th arr 75116 Paris France", 
"5 Passage Ruelle 18th arr 75018 Paris France", " 1 Avenue Carnot 17th arr 75017 Paris France", "Bis Rue De Vaugirard 6th arr 75006 Paris France", "6 Impasse Marteau 18th arr 75018 Paris France" ," 1 Place De La Sorbonne 5th arr 75005 Paris France", "1 Place Vend me 1st arr 75001 Paris France"

arr: "16th", "18th", "17th", "6th", "18th", "5th", "1st"

Can anybody help me with how to do this in R?

r2evans
Cafi
    @WiktorStribiżew, while I don't doubt that there are applicable dupes for this, *that* dupe is about removing the `st|nd|rd|th` ordinal from a pattern, not for extracting the whole number+ordinal. While you or I could likely adapt those removal regexes to preserve it as well ... I don't get the idea that this OP knows enough regex to be able to adapt that. – r2evans Nov 24 '20 at 14:26

3 Answers

A base R method could be:

unlist(regmatches(df, gregexpr("\\b(\\S+)(?=\\sarr)", df, perl=TRUE)))
# [1] "16th" "18th" "17th" "6th"  "18th" "5th"  "1st" 

Using gsub might be a mistake here, because if arr is not found then it will return the whole string.
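
To make that caveat concrete, here is a minimal sketch. The vector x, its second element (an address with no arr in it) and the sub pattern are made up for illustration; they are not from the question.

```r
# Hypothetical input: the second element has no " arr" marker
x <- c("5 Passage Ruelle 18th arr 75018 Paris France",
       "10 Downing Street London")

# sub() leaves a non-matching string untouched, so the whole address comes back
via_sub <- sub(".*\\b(\\S+)\\sarr.*", "\\1", x, perl = TRUE)

# regexpr() + regmatches() returns only the actual matches
via_regmatches <- regmatches(x, regexpr("\\b\\S+(?=\\sarr)", x, perl = TRUE))

print(via_sub)         # second element is still the full address
print(via_regmatches)  # only the one real match
```

With the substitution approach a missing arr goes unnoticed, whereas the extraction approach simply yields no value for that element.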

If you'd rather use stringr, then

stringr::str_extract(df, "\\b(\\S+)(?=\\sarr)")
# [1] "16th" "18th" "17th" "6th"  "18th" "5th"  "1st" 

Both regexes utilize "lookahead". The pattern broken down:

  • \\b is a word boundary; it matches no characters, only the position between a word character and a non-word character (or the start or end of the string)
  • (\\S+) one or more (+) non-whitespace characters (\\S)
  • (?=\\sarr) is a lookahead that ensures the enclosed text (\\s is a whitespace character, followed by the literal arr) is found after the desired pattern, but it is not "consumed"; in base R this requires perl=TRUE
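
As a small sketch of what "not consumed" means, the same pattern can be run with and without the lookahead on the first address from the question (the variable names here are only illustrative):

```r
x <- " 3 Rue d Argentine 16th arr 75116 Paris France"

# Without a lookahead, " arr" is part of the returned match
with_arr <- regmatches(x, regexpr("\\b\\S+\\sarr", x, perl = TRUE))

# With the lookahead, " arr" must follow but is not returned
without_arr <- regmatches(x, regexpr("\\b\\S+(?=\\sarr)", x, perl = TRUE))

print(with_arr)
print(without_arr)
```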

With the base R version, gregexpr returns a list holding the positions (and lengths) of the matches within each element of the input (df); regmatches then uses these for extraction (as here) or even replacement (`regmatches<-`).
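
One thing to watch with the unlist() approach, sketched below with a hypothetical input whose middle element has no arr: that element contributes an empty match, so the flattened result is shorter than the input and no longer lines up with it. Taking the first match per element, or NA when there is none, is one possible way to keep the result aligned.

```r
# Hypothetical input: the middle element has no " arr" marker
x <- c("5 Passage Ruelle 18th arr 75018 Paris France",
       "10 Downing Street London",
       "1 Avenue Carnot 17th arr 75017 Paris France")

m <- regmatches(x, gregexpr("\\b\\S+(?=\\sarr)", x, perl = TRUE))

# unlist() silently drops the element that had no match
flat <- unlist(m)

# One value per input element: the first match, or NA if there was none
arr <- vapply(m, function(hit) if (length(hit)) hit[1] else NA_character_,
              character(1))

print(length(flat))
print(arr)
```

The aligned version is the safer choice when the result is meant to become a column next to the original addresses.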

r2evans

Does this work? It uses a positive lookahead:

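library(dplyr)
library(stringr)
# df in the question is a character vector, so wrap it in a data frame
# with an address column first (the name addresses is just illustrative)
addresses <- data.frame(address = df)
addresses %>% mutate(arr = str_extract(address, '\\d+..(?=\\sarr)'))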
                                             address  arr
1      3 Rue d Argentine 16th arr 75116 Paris France 16th
2       5 Passage Ruelle 18th arr 75018 Paris France 18th
3        1 Avenue Carnot 17th arr 75017 Paris France 17th
4    Bis Rue De Vaugirard 6th arr 75006 Paris France  6th
5      6 Impasse Marteau 18th arr 75018 Paris France 18th
6  1 Place De La Sorbonne 5th arr 75005 Paris France  5th
7         1 Place Vend me 1st arr 75001 Paris France  1st
Karthik S

Here is a good option:

library(stringr)
# "rd" is included so that an ordinal such as "3rd" would also match
str_extract(df, "[0-9]+(st|nd|rd|th)")

# [1] "16th" "18th" "17th" "6th"  "18th" "5th"  "1st"
Marcos Pérez