I am trying to solve a small problem with a data frame in R. I have the following data frame DF (the dput() version is at the end of the question):

Key M1 M2 M3 M4 M5 M6 M7
001  1 NA NA  1 NA NA  1
002 NA NA  1 NA NA  1 NA
003 NA NA NA  1  1 NA  1
004 NA NA  1 NA NA NA  1
005 NA NA NA  1 NA NA  1
006  1 NA NA NA NA NA NA
007 NA NA  1 NA NA NA  1

This data frame has an ID column (Key) and several columns containing NA and 1. I want to fill the NAs in each row according to the following pattern: when two consecutive NAs have a 1 immediately before them and a 1 immediately after them, both NAs should be filled with 1. If the pattern is not found, the row must keep its original form. For example, the first row contains the pattern 1 NA NA 1 twice (M1 to M4 and M4 to M7, sharing the 1 in M4), so all of its NAs must be filled with 1. I would like to get a result like this:

Key M1 M2 M3 M4 M5 M6 M7
001  1  1  1  1  1  1  1
002 NA NA  1  1  1  1 NA
003 NA NA NA  1  1 NA  1
004 NA NA  1 NA NA NA  1
005 NA NA NA  1  1  1  1
006  1 NA NA NA NA NA NA
007 NA NA  1 NA NA NA  1

Here the matched patterns have been filled with 1. I have tried using na.locf() from the zoo package, but it also changes the other NAs in DF. The dput() version of DF is as follows:

structure(list(Key = c("001", "002", "003", "004", "005", "006", 
"007"), M1 = c(1, NA, NA, NA, NA, 1, NA), M2 = c(NA, NA, NA, 
NA, NA, NA, NA), M3 = c(NA, 1, NA, 1, NA, NA, 1), M4 = c(1, NA, 
1, NA, 1, NA, NA), M5 = c(NA, NA, 1, NA, NA, NA, NA), M6 = c(NA, 
1, NA, NA, NA, NA, NA), M7 = c(1, NA, 1, 1, 1, NA, 1)), .Names = c("Key", 
"M1", "M2", "M3", "M4", "M5", "M6", "M7"), row.names = c(NA, 
-7L), class = "data.frame")

Many thanks for your help!

Duck

3 Answers

I would use run length encoding to find two NAs in a row, for each row:

i <- t(apply(is.na(df[-1]), 1, function(x){
    r <- rle(x)

    # Flag runs of exactly two NAs (run length is 2 and is.na is TRUE)
    r$values <- r$lengths == 2 & r$values

    # don't modify first / last run of the series
    r$values[1] <- r$values[length(r$lengths)] <- FALSE 

    inverse.rle(r)
}))

df[-1][i] <- 1
df


#   Key M1 M2 M3 M4 M5 M6 M7
# 1 001  1  1  1  1  1  1  1
# 2 002 NA NA  1  1  1  1 NA
# 3 003 NA NA NA  1  1 NA  1
# 4 004 NA NA  1 NA NA NA  1
# 5 005 NA NA NA  1  1  1  1
# 6 006  1 NA NA NA NA NA NA
# 7 007 NA NA  1 NA NA NA  1

This way, you don't need to convert to a character and back.
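
To make the run-length step concrete, here is a small sketch that walks through a single row (row 002 from the question) outside of apply(). The variable names x, r and fill are just for illustration and are not part of the code above.

```r
# Row 002 of the example data: NA NA 1 NA NA 1 NA
x <- is.na(c(NA, NA, 1, NA, NA, 1, NA))

r <- rle(x)
r$lengths  # 2 1 2 1 1
r$values   # TRUE FALSE TRUE FALSE TRUE

# Keep only NA runs of length exactly 2
r$values <- r$lengths == 2 & r$values

# A run at either end of the row has no 1 on one side, so never fill it
r$values[1] <- r$values[length(r$lengths)] <- FALSE

# Expand back to one flag per column: TRUE marks a cell to fill with 1
fill <- inverse.rle(r)
fill  # FALSE FALSE FALSE TRUE TRUE FALSE FALSE
```

Only M4 and M5 are flagged, which matches the expected output for row 002. Because the data contain only 1 and NA, any interior run of NAs is necessarily surrounded by 1s, so no explicit neighbour check is needed.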

Neal Fultz
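library(stringr)  # provides str_locate_all()

mat <- as.matrix(df[-1])
mat[is.na(mat)] <- 0L  # encode NA as 0 so each row can be pasted into a string
strings <- apply(mat, 1, function(x) paste(x, collapse=""))
# The lookahead finds overlapping "1001" matches
pattern_matches <- str_locate_all(strings, "(?=(1001))")
# lapply() always returns a list, even if every row has the same number of matches
strloc <- lapply(pattern_matches, function(x) c(x[,"start"]))
# Fill the two positions after each match start
for(i in seq(nrow(mat))) mat[i,strloc[[i]]+rep(1:2,each=length(strloc[[i]]))] <- 1
is.na(mat) <- mat==0  # turn the remaining 0s back into NA
as.data.frame(cbind(df[1],mat))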
#   Key M1 M2 M3 M4 M5 M6 M7
# 1 001  1  1  1  1  1  1  1
# 2 002 NA NA  1  1  1  1 NA
# 3 003 NA NA NA  1  1 NA  1
# 4 004 NA NA  1 NA NA NA  1
# 5 005 NA NA NA  1  1  1  1
# 6 006  1 NA NA NA NA NA NA
# 7 007 NA NA  1 NA NA NA  1

I took a similar approach to @ColonelBeauvel, but used the matrix structure instead. First, assign 0 to all NA values, then paste each row together into a single string. Match the pattern "1001" with str_locate_all from the stringr package. For anyone wondering, the advantage of the full pattern "(?=(1001))" is that the lookahead matches without consuming any characters, so the trailing 1 of one match can also serve as the leading 1 of the next match. This is sometimes called a "non-consuming regular expression".

We then locate the start of each match and assign 1 to the two zeroes that follow it.
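
To see why the lookahead matters, here is a minimal comparison on the string built from row 001. It is only a sketch and uses base R's gregexpr() with perl = TRUE instead of stringr, so it runs without extra packages; the names s, plain and lookahead are illustrative.

```r
s <- "1001001"  # row 001 after replacing NA with 0

# A plain pattern consumes the characters it matches, so the second,
# overlapping "1001" (starting at position 4) is never found
plain <- as.integer(gregexpr("1001", s)[[1]])
plain      # 1

# A lookahead is zero-width: nothing is consumed, so both starts are found
lookahead <- as.integer(gregexpr("(?=1001)", s, perl = TRUE)[[1]])
lookahead  # 1 4
```

With the plain pattern only M2 and M3 would be filled in row 001, while the lookahead version also fills M5 and M6.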

Pierre L
  • Really useful new trick for me that non consuming regular expression. – AntoniosK Sep 02 '15 at 18:21
  • Very useful. great explanation here http://stackoverflow.com/questions/2973436/regex-lookahead-lookbehind-and-atomic-groups – Pierre L Sep 02 '15 at 18:29
  • I'd definitely have used a second pattern matching attempt if there was a pattern match the first time, in order to capture 1s that belong to two consecutive matches. Two times is enough as the column number is 7 and the pattern length is 4. But with this trick no need for that. – AntoniosK Sep 02 '15 at 19:28

Here is an approach that does the job. It could surely be improved by avoiding the conversion of NA to 0, or by using gsub with lookaround (but I am not a regex killer!).

m is your original data.frame:

x = do.call(paste, c(data.frame(1*!is.na(m[-1])), sep=''))

y = gregexpr('(?=1001)', x, perl=TRUE)

func = function(u,v)
{
    if(any(u!=-1))
        sapply(u, function(i) substr(v, start=i, stop=i+2) <<- '111')

    as.numeric(strsplit(v,'')[[1]])
}

df = cbind(list('key'=m[,1]), data.frame(do.call(rbind, Map(func, y, x))))
df[df==0] = NA

#  key X1 X2 X3 X4 X5 X6 X7
#1 001  1  1  1  1  1  1  1
#2 002 NA NA  1  1  1  1 NA
#3 003 NA NA NA  1  1 NA  1
#4 004 NA NA  1 NA NA NA  1
#5 005 NA NA NA  1  1  1  1
#6 006  1 NA NA NA NA NA NA
#7 007 NA NA  1 NA NA NA  1
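
For what it's worth, the gsub-with-lookaround idea mentioned at the top might look roughly like the sketch below. This is not the code used above, just one possible variant: it still encodes NA as 0, the pattern "(?<=1)00(?=1)" is my assumption of what such a regex could be, and the strings in x stand in for rows 001, 002, 003 and 004 of the example.

```r
# Rows 001, 002, 003 and 004 encoded as strings (1 = value present, 0 = NA)
x <- c("1001001", "0010010", "0001101", "0010001")

# Replace "00" only when a 1 sits directly before and directly after it.
# Both lookarounds are zero-width, so the shared 1 in "1001001" is reused.
filled <- gsub("(?<=1)00(?=1)", "11", x, perl = TRUE)
filled  # "1111111" "0011110" "0001101" "0010001"

# Back to a numeric matrix, with 0 turned into NA again
mat <- do.call(rbind, lapply(strsplit(filled, ""), as.numeric))
mat[mat == 0] <- NA
```

A run of three zeroes (as in row 004) is left alone, because the character after the first two zeroes is not a 1.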
Colonel Beauvel