2

I'm trying to do a somewhat complicated task in R.

I have a data frame with (for simplicity's sake) three columns.

Column 1 is a string.
Column 2 is an integer.
Column 3 is an integer.

I want to take all the observations which contain a certain substring in Column 1 AND that have an exact value for Column 2, and replace the third Column with the number 1.

That is, I have the following dataframe:

x <- data.frame(x1 = c("bob","jane","bob","bobby","bob","jane","bobby","bob","jane","bob"),
                x2 = c(1,1,1,1,1,2,2,2,2,2),
                x3 = c(13,22,3,34,10,23,53,42,13,35))

And, I want to select observations where Column 1 contains bob and Column 2==1, and change the third column to 1, so that I end up with:

y1 <- c("bob","jane","bob","bobby","bob","jane","bobby","bob","jane","bob")
y2 <- c(1,1,1,1,1,2,2,2,2,2)
y3 <- c(1,22,1,1,1,23,53,42,13,35)
y <- data.frame(y1,y2,y3)

I want to do this across a really, really big dataset. It is not feasible to split up the dataset and put it back together. I have tried using grep, but it's not working when I try to do both matches at once. Also, I have tried subsetting, but then I'd have to split apart the dataframe and put it back together. Thanks very much in advance.

oguz ismail
  • 34,491
  • 11
  • 33
  • 56
ejn
  • 303
  • 4
  • 12
  • `grepl` is often more useful because it supports combinations of logical tests with vectorized `&` and `|`. – IRTFM Nov 10 '15 at 05:06

5 Answers5

5

With R's capacity for logical indexing using the [<- function, this is really quite easy:

> x$x3[ grepl("bob", x$x1) & x$x2 == 1] <- 1
> x
      x1 x2 x3
1    bob  1  1
2   jane  1 22
3    bob  1  1
4  bobby  1  1
5    bob  1  1
6   jane  2 23
7  bobby  2 53
8    bob  2 42
9   jane  2 13
10   bob  2 35

To read the code you should see it as: "for every line of x where column 'x1' has "bob' and column 'x2' is equal to 1 ,... you assign the value 1 to column 'x3'." If you wanted to have a new object with that value, you could make a copy of x with y <- x and working on that instead.

thelatemail
  • 81,120
  • 12
  • 111
  • 172
IRTFM
  • 240,863
  • 19
  • 328
  • 451
  • Something similar with `ifelse` : `ifelse((x$x2 == 1 & grepl("bob", x$x1)), 1, x$x3)` – Ronak Shah Nov 10 '15 at 03:49
  • @RonakShah `ifelse` isn't so great as a general strategy for replacement. Try `ifelse(c(TRUE,FALSE), Sys.Date()+1, Sys.Date()) # doesn't return dates` – Frank Nov 10 '15 at 04:27
  • @Frank Ohh..In which case should we avoid to use `ifelse` then? and what about the run time? is it the same or `ifelse` is slow? – Ronak Shah Nov 10 '15 at 04:35
  • @RonakShah Yeah, it's generally slow http://stackoverflow.com/q/16275149/1191259 but nonetheless sometimes more convenient than the alternatives. I try to avoid it. – Frank Nov 10 '15 at 04:37
  • @Frank Thank you so much. That is helpful! – Ronak Shah Nov 10 '15 at 04:40
1
x1   <- c("bob","jane","bob","bobby","bob","jane","bobby","bob","jane","bob")
x2   <- c(1,1,1,1,1,2,2,2,2,2)
x3   <- c(13,22,3,34,10,23,53,42,13,35)
x    <- data.frame(x1,x2,x3)

rows <- grepl("bob", x[,1]) & x[,2] == 1 # Logical Selector Of Rows

x[rows, 3] <- 1

except for notation, the main diff between my answer and @42-'s answer is that my logical-selector-of-rows phrase operates only on the dataframe, whereas @42- references one of the original vectors (likely by oversight).

      x1 x2 x3
1    bob  1  1
2   jane  1 22
3    bob  1  1
4  bobby  1  1
5    bob  1  1
6   jane  2 23
7  bobby  2 53
8    bob  2 42
9   jane  2 13
10   bob  2 35
  • There really is no material difference between `x[rows, 3]` and `x$x3[rows]` . – IRTFM Nov 10 '15 at 05:00
  • .... just style/readability. However, I was referring to the diff between x2 and x$x2 (which in the meantime you have corrected in your answer) –  Nov 10 '15 at 05:20
  • Yeah. Somebody already fixed that oversight. And I edited the question to remove the unnecessary vectors. – IRTFM Nov 10 '15 at 05:28
1

There is a nice answer from user akrun using the dplyr package to a similar problem here and a faster variant from user docendo discimus here. In your case, the code would be :

x %>% mutate(x3 = replace(x3, x1 == 'bob' & x2 == 1, 1))

or

x %>% mutate(x3 = replace(x3, which(x1 == 'bob' & x2 == 1), 1))

If you want to update x directly, you could combine with the %<>% operator from the magrittr package as:

x %<>% mutate(x3 = replace(x3, x1 == 'bob' & x2 == 1, 1))

Community
  • 1
  • 1
meriops
  • 774
  • 5
  • 6
  • How would you do it for strings that *include* 'bob', such as 'bobby' in the example? Thanks. – ejn Nov 11 '15 at 07:23
  • Then, I'd call `grepl` (which returns a logical) to the rescue: x %<>% mutate(x3 = replace(x3, grepl('bob', x1) & x2 == 1, 1)) – meriops Nov 11 '15 at 13:16
1

You can also use data.table package for high performance:

library(data.table)
setDT(x)[grepl('bob', x1) & x2==1, x3:=1][]

#       x1 x2 x3
# 1:   bob  1  1
# 2:  jane  1 22
# 3:   bob  1  1
# 4: bobby  1  1
# 5:   bob  1  1
# 6:  jane  2 23
# 7: bobby  2 53
# 8:   bob  2 42
# 9:  jane  2 13
#10:   bob  2 35
Colonel Beauvel
  • 28,120
  • 9
  • 39
  • 75
0

In case you are looking out for the sub-string grepcan be used in that case along with the 'ifelse' function. With x as the data frame and wanting x$x3 to be changed do the following,

1: Find those with the required sub-string

req_sub<-grep("bob",x$x1)

This will return all the positions of x3 where there is a match in the sub-string

2: Make the changes in column x3

x$x3[grep("bob",x$x1)]<-ifelse(x[req_sub,]$x2==1,1,x[req_sub,]$x3)

The following will be the output

      x1 x2 x3
1    bob  1  1
2   jane  1 22
3    bob  1  1
4  bobby  1  1
5    bob  1  1
6   jane  2 23
7  bobby  2 53
8    bob  2 42
9   jane  2 13
10   bob  2 35
Murugesan
  • 131
  • 4