
I'm trying to write code that will exclude certain values from a set of data.

I have written the following:

x <- c("1407741214DAG359", "2211682828DAG359", "1304410201DAG359", "0908700465DAG36", "0909700565G379")

y <- c("1407741214DAG359", "2211682828DAG359", "1304410201DAG359", "0","0")

Here I wish to exclude the values that contain DAG36 or G379.

I tried writing the following:

newdata.x <- x[ x != "DAG36", "G379" ]

However, the code only seems to exclude values that are exactly DAG36 or G379, not values that merely contain either DAG36 or G379.

Would any of you be able to help me?

h3rm4n

1 Answer


What you are looking for is grep() or grepl(). Both functions search for a pattern in a string or, as in your case, in a vector of strings.

The pattern you are looking for is DAG36 or G379. You can express this as the regular expression DAG36|G379, where | means "or".

grep("DAG36|G379", x)
# [1] 4 5 

grepl("DAG36|G379", x)
# [1] FALSE FALSE FALSE TRUE TRUE

As you can see, both functions identify the same elements: grep() returns the positions of the matches, while grepl() returns a logical vector. Either result can be used for indexing. For example, to replace the relevant strings with a zero:

x[ grepl("DAG36|G379", x) ] <- 0

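To remove the relevant strings from the original x instead of replacing them, negate the match:

x <- x[ !grepl("DAG36|G379", x) ]                         # Easier version of removing relevant strings
x <- grep("DAG36|G379", x, invert = TRUE, value = TRUE)   # More direct version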
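As a minimal, self-contained sketch of both approaches (replacing versus removing), assuming the vector from the question. The names pattern, hit, y_new, kept and no_hit are illustrative only and do not appear in the original post:

```r
# Vector from the question
x <- c("1407741214DAG359", "2211682828DAG359", "1304410201DAG359",
       "0908700465DAG36", "0909700565G379")

pattern <- "DAG36|G379"

# Logical mask: TRUE where the pattern occurs anywhere in the string
hit <- grepl(pattern, x)

# Approach 1: keep the vector length and overwrite the matches with "0"
y_new <- x
y_new[hit] <- "0"

# Approach 2: drop the matching elements entirely
kept <- x[!hit]

# Caveat: negative indexing with grep() breaks when nothing matches,
# because -integer(0) is still integer(0) and selects nothing.
no_hit <- grep("NOMATCH", x)            # integer(0)
dropped_everything <- x[-no_hit]        # character(0)
kept_everything <- x[!grepl("NOMATCH", x)]  # all five elements
```

Here y_new reproduces the y vector from the question. The logical mask from grepl() is the safer choice for exclusion, since it behaves sensibly even when no element matches.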
KenHBS
    More robust to use `TRUE` and `FALSE` rather than `F` and `T`. There is nothing stopping a person from writing `T – lmo Sep 03 '17 at 13:58
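One way to illustrate the concern in this comment, as a hedged sketch: T and F are ordinary variables in R that default to TRUE and FALSE, so they can be overwritten, whereas TRUE and FALSE are reserved words. The reassignment below is a hypothetical accident, and the names with_T and with_TRUE are illustrative only:

```r
x <- c("1407741214DAG359", "0908700465DAG36", "0909700565G379")

# Hypothetical accidental reassignment somewhere earlier in a script
T <- FALSE

# invert = T and value = T now both evaluate to FALSE,
# so grep() returns the positions of the matches instead
with_T <- grep("DAG36|G379", x, invert = T, value = T)

# The reserved words are unaffected by the reassignment
with_TRUE <- grep("DAG36|G379", x, invert = TRUE, value = TRUE)

# Remove the shadowing variable so T refers to base R's TRUE again
rm(T)
```

With the shadowed T, the call silently returns match positions rather than the non-matching strings, which is why spelling out TRUE and FALSE is the more defensive habit.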