
I have a tibble with a list of words for each row. I want to create a new variable from a function that searches for a keyword and, if it finds the keyword, creates a string composed of the keyword plus-and-minus 3 words.

The code below is close, but, rather than grabbing all three words before and after my keyword, it grabs the single word 3 ahead/behind.

library(dplyr)

df <- tibble(words = c("it", "was", "the", "best", "of", "times", 
                       "it", "was", "the", "worst", "of", "times"))
df <- df %>% mutate(chunks = ifelse(words=="times", 
                                    paste(lag(words, 3), 
                                          words, 
                                          lead(words, 3), sep = " "),
                                    NA))

The most intuitive solution would be if lag and lead accepted a vector of offsets, something like lead(words, 1:3), but that doesn't work.

Obviously I could do this by hand fairly quickly (paste(lag(words,3), lag(words,2), lag(words,1),...lead(words,3))), but I will eventually want to grab the keyword plus-and-minus 50 words, which is too much to hand-code.

A tidyverse solution would be ideal, but any approach would be helpful.

wscampbell

4 Answers

One option would be sapply:

library(dplyr)

df %>%
  mutate(
    chunks = ifelse(
      words == "times",
      sapply(
        1:nrow(.),
        function(x) paste(words[pmax(1, x - 3):pmin(x + 3, nrow(.))], collapse = " ")
        ),
      NA
      )
  )

Output:

# A tibble: 12 x 2
   words chunks                      
   <chr> <chr>                       
 1 it    NA                          
 2 was   NA                          
 3 the   NA                          
 4 best  NA                          
 5 of    NA                          
 6 times the best of times it was the
 7 it    NA                          
 8 was   NA                          
 9 the   NA                          
10 worst NA                          
11 of    NA                          
12 times the worst of times   

Although sapply is not an explicit lead or lag function, indexing a window of rows around each position often serves the same purpose.
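Since the question mentions eventually needing plus-and-minus 50 words, the same windowing idea can be wrapped in a function with an adjustable width. This is only a sketch in base R: the name context_window and its arguments are made up for illustration and are not part of dplyr or any other package.

```r
# Hypothetical helper: for every position where `words` equals `keyword`,
# paste together the words from n positions before to n positions after.
# The window is clipped at the start and end of the vector.
context_window <- function(words, keyword, n = 3) {
  len <- length(words)
  out <- rep(NA_character_, len)
  for (i in which(words == keyword)) {
    idx <- max(1, i - n):min(len, i + n)
    out[i] <- paste(words[idx], collapse = " ")
  }
  out
}

words <- c("it", "was", "the", "best", "of", "times",
           "it", "was", "the", "worst", "of", "times")

chunks <- context_window(words, "times", n = 3)
chunks[c(6, 12)]
# "the best of times it was the" "the worst of times"
```

Because the function takes and returns a plain vector, it should also work inside mutate, for example mutate(chunks = context_window(words, "times", n = 50)).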

arg0naut91

Similar to @arg0naut91's answer, but in base R without dplyr:

r  = 1:nrow(df)
w  = which(df$words == "times")
wm = lapply(w, function(wi) intersect(r, seq(wi-3L, wi+3L)))

df$chunks <- NA_character_
df$chunks[w] <- tapply(df$words[unlist(wm)], rep(w, lengths(wm)), FUN = paste, collapse=" ")

# A tibble: 12 x 2
   words chunks                      
   <chr> <chr>                       
 1 it    <NA>                        
 2 was   <NA>                        
 3 the   <NA>                        
 4 best  <NA>                        
 5 of    <NA>                        
 6 times the best of times it was the
 7 it    <NA>                        
 8 was   <NA>                        
 9 the   <NA>                        
10 worst <NA>                        
11 of    <NA>                        
12 times the worst of times      

The data.table translation:

library(data.table)
DT = data.table(df)

w = DT["times", on="words", which=TRUE]
wm = lapply(w, function(wi) intersect(r, seq(wi-3L, wi+3L)))

DT[w, chunks := DT[unlist(wm), paste(words, collapse=" "), by=rep(w, lengths(wm))]$V1]

Frank

data.table::shift accepts a vector for the n (lag) argument and outputs a list, so you can use that and do.call(paste, ...) to paste the list elements together. However, unless you're on data.table version >= 1.12, I don't think it will let you mix negative and positive n values (as below).
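To make the list-then-paste step concrete, here is a small sketch on a three-element vector. It assumes a data.table version that accepts negative n, as noted above; the vector x is toy data chosen for illustration.

```r
library(data.table)  # assumption: a version that allows negative n in shift()

x <- c("a", "b", "c")

# One list element per offset: lag by 1, no shift, lead by 1.
# Positions shifted past the ends are filled with "".
shifted <- shift(x, n = 1:-1, fill = "")

# paste() is called with the list elements as its arguments,
# so the i-th entries of each shifted vector are joined with spaces.
pasted <- do.call(paste, shifted)

# The "" fills leave stray spaces at the edges; trimws() removes them.
trimws(pasted)
# "a b" "a b c" "b c"
```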

With data.table:

library(data.table)
setDT(df)

df[, chunks := trimws(ifelse(words != "times", NA, do.call(paste, shift(words, 3:-3, ''))))]

#     words                       chunks
#  1:    it                         <NA>
#  2:   was                         <NA>
#  3:   the                         <NA>
#  4:  best                         <NA>
#  5:    of                         <NA>
#  6: times the best of times it was the
#  7:    it                         <NA>
#  8:   was                         <NA>
#  9:   the                         <NA>
# 10: worst                         <NA>
# 11:    of                         <NA>
# 12: times           the worst of times

With dplyr and only using data.table for the shift function:

library(dplyr)

df %>% 
  mutate(chunks = do.call(paste, data.table::shift(words, 3:-3, fill = '')),
         chunks = trimws(ifelse(words != "times", NA, chunks)))

# # A tibble: 12 x 2
#    words chunks                      
#    <chr> <chr>                       
#  1 it    NA                          
#  2 was   NA                          
#  3 the   NA                          
#  4 best  NA                          
#  5 of    NA                          
#  6 times the best of times it was the
#  7 it    NA                          
#  8 was   NA                          
#  9 the   NA                          
# 10 worst NA                          
# 11 of    NA                          
# 12 times the worst of times         

IceCreamToucan

Here is another tidyverse solution using lag and lead:

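library(dplyr)
library(tidyr)  # for unite()

# Build one expression string per offset, e.g. lag(., 3, default = ''), named "lag 3".
# Note: funs_() is deprecated in recent dplyr releases.
laglead_f <- function(what, range)
    setNames(paste(what, "(., ", range, ", default = '')"), paste(what, range))

df %>%
    mutate_at(vars(words), funs_(c(laglead_f("lag", 3:0), laglead_f("lead", 1:3)))) %>%
    unite(chunks, -words, sep = " ") %>%
    mutate(chunks = ifelse(words == "times", trimws(chunks), NA))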
## A tibble: 12 x 2
#   words chunks
#   <chr> <chr>
# 1 it    NA
# 2 was   NA
# 3 the   NA
# 4 best  NA
# 5 of    NA
# 6 times the best of times it was the
# 7 it    NA
# 8 was   NA
# 9 the   NA
#10 worst NA
#11 of    NA
#12 times the worst of times

The idea is to store the lagged and leading vectors (lags 3 to 0, leads 1 to 3) in new columns with mutate_at and a named list of functions, unite those columns into one string, and then set chunks to NA for every row where words is not "times".
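The same idea can be sketched without building function strings, by looping over the offsets with lapply and calling dplyr's lag and lead directly. This is an illustrative variant rather than the code above: the helper shift_by and the variable width are hypothetical names introduced here.

```r
library(dplyr)

df <- tibble(words = c("it", "was", "the", "best", "of", "times",
                       "it", "was", "the", "worst", "of", "times"))

width <- 3  # number of words to take on each side of the keyword

# Hypothetical helper: positive k lags by k, negative k leads by -k, 0 returns x.
shift_by <- function(x, k) {
  if (k > 0) lag(x, k, default = "") else lead(x, -k, default = "")
}

result <- df %>%
  mutate(
    # One shifted copy of `words` per offset, pasted together row by row.
    chunks = do.call(paste, lapply(width:-width, function(k) shift_by(words, k))),
    # Keep the chunk only on keyword rows; trim the padding left by "" fills.
    chunks = if_else(words == "times", trimws(chunks), NA_character_)
  )

result$chunks[c(6, 12)]
# "the best of times it was the" "the worst of times"
```

Changing width to 50 should give the wider window the question asks about, at the cost of creating 101 shifted copies of the column.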

Maurits Evers