
I'm looking for a way to quickly subset my data frames in place in R.

Say I have n data frames of varying lengths, each containing 3 columns (chr, start, end) that together describe genomic regions.

I created a vector of the legitimate 'chr' labels, let's say ["chr1", "chr2", ... , "chr12"].

Some (or all) of my data frames have rows that contain invalid chromosome names (e.g. "chr4_UG5432_random" or "chrX"; the names don't really matter, only that they don't appear in my vector of "valid" labels), and I want to efficiently filter out the rows with these invalid labels.

The best solution I have found so far is to put them all in a list and use lapply on them with

subset(df, chr %in% paste0("chr", 1:12))

and I understand that afterwards I can use the function list2env to retrieve the variables "holding" the modified data frames.
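For readers who want to see the whole round trip described above, here is a minimal base-R sketch. The data frames `df_a` and `df_b` and their contents are made up for illustration; only `subset`, `lapply` and `list2env` come from the question, and `mget` is one possible way to build the named list.

```r
# Hypothetical example data; df_a and df_b stand in for the real data frames.
df_a <- data.frame(chr   = c("chr1", "chrX", "chr12"),
                   start = c(1, 50, 200),
                   end   = c(10, 90, 300))
df_b <- data.frame(chr   = c("chr4_UG5432_random", "chr2"),
                   start = c(5, 100),
                   end   = c(25, 150))

# Vector of valid labels, as in the question.
valid_chr <- paste0("chr", 1:12)

# Collect the data frames into a list named after the variables.
dfs <- mget(c("df_a", "df_b"), envir = environment())

# Keep only the rows whose chr label is in the vector of valid labels.
dfs <- lapply(dfs, function(df) subset(df, chr %in% valid_chr))

# Write the filtered data frames back over the original variables.
# list2env needs a named list; invisible() suppresses printing the environment.
invisible(list2env(dfs, envir = environment()))
```

Because the list is named after the variables, `list2env` overwrites `df_a` and `df_b` with their filtered versions, which is about as close to "in place" as this approach gets.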

But I'm sure there is a much simpler way to perform this filtering in place for each data frame, without having to put them in a list. Any help is appreciated!

Tom Gome
  • Please provide a complete minimal reproducible example including code, input and expected output as requested at the top of the [tag:r] tag page. – G. Grothendieck May 23 '21 at 14:27
  • Can `chr4_UG5432_random` be in any of the columns or in a certain column? And is `chr4_UG5432_random` part of a string or is a value in the column? – TarJae May 23 '21 at 14:37
  • Actually, per this [canonical answer](https://stackoverflow.com/a/24376207/1422451): *best practice is to avoid having a bunch of [similarly structured] data.frames not in a list*! – Parfait May 23 '21 at 14:39
  • Replying to you @TarJae: `chr4_UG5432_random` is just a made-up example of the rows I currently wish to avoid, and such labels may appear *only* in the 'chr' column. The other two (start & end) are numerical. – Tom Gome May 23 '21 at 14:47

1 Answer


Update: You can store all the strings you want to filter out in a data frame, for example:

df_to_avoid <- data.frame(chr = c("chr4_XXX", "chr4_YYY", "chr3ZZZ"))

then use:

dplyr::filter(df1, !chr %in% df_to_avoid$chr)

This way you can filter by multiple strings.

New data:

library(tibble)  # provides tribble()

df1 <- tribble(
  ~chr, ~start, ~end,
  "chr1", 10, 20,
  "chr1", 1, 10,
  "chr2", 20, 30,
  "chr3ZZZ", 4, 16,
  "chr4_XXX", 324, 343)

New output:

  chr   start   end
  <chr> <dbl> <dbl>
1 chr1     10    20
2 chr1      1    10
3 chr2     20    30
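Since the question starts from a vector of valid labels rather than a list of invalid ones, the same `%in%` test can also be used without the negation. The sketch below uses base R so that it runs without extra packages; `valid_chr`, `df_to_avoid` and the extra "chrX" row are assumptions added for illustration. It shows that a list of labels to avoid only removes labels it already knows about, while a list of valid labels also drops ones nobody anticipated.

```r
# Same example data as above, plus a hypothetical label ("chrX")
# that is not listed in df_to_avoid.
df1 <- data.frame(
  chr   = c("chr1", "chr1", "chr2", "chr3ZZZ", "chr4_XXX", "chrX"),
  start = c(10, 1, 20, 4, 324, 7),
  end   = c(20, 10, 30, 16, 343, 9)
)

df_to_avoid <- data.frame(chr = c("chr4_XXX", "chr4_YYY", "chr3ZZZ"))
valid_chr   <- paste0("chr", 1:12)

# Removing known bad labels: "chrX" survives because it was never listed.
by_avoid <- df1[!df1$chr %in% df_to_avoid$chr, ]

# Keeping known good labels: "chrX" is dropped as well.
by_valid <- df1[df1$chr %in% valid_chr, ]
```

With dplyr, the second filter would be written as `dplyr::filter(df1, chr %in% valid_chr)`.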

First answer

We could use filter with grepl, negated by `!`, to drop the rows whose chr value matches a pattern:

dplyr::filter(df1, !grepl('chr4_XXX', chr))

Output:

  chr   start   end
  <chr> <dbl> <dbl>
1 chr1     10    20
2 chr1      1    10
3 chr2     20    30
4 chr3      4    16

Data:

library(tibble)  # provides tribble()

df1 <- tribble(
  ~chr, ~start, ~end,
  "chr1", 10, 20,
  "chr1", 1, 10,
  "chr2", 20, 30,
  "chr3", 4, 16,
  "chr4_XXX", 324, 343)
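
One thing to keep in mind with the grepl approach above: grepl treats its first argument as a regular expression and matches it anywhere in the string, so a short pattern can hit more labels than intended. The vector below is made up purely to illustrate the difference between a substring match, an anchored match and an exact `%in%` comparison.

```r
# Hypothetical labels, chosen to show partial matches.
chr <- c("chr1", "chr10", "chr4_XXX", "chr4")

# Unanchored pattern: "chr1" is also found inside "chr10".
substring_match <- grepl("chr1", chr)

# Anchoring with ^ and $ restricts the match to the whole string.
anchored_match <- grepl("^chr1$", chr)

# %in% compares whole values and involves no regular expression.
exact_match <- chr %in% "chr1"
```

Here `substring_match` is TRUE for both "chr1" and "chr10", whereas `anchored_match` and `exact_match` are TRUE only for "chr1". For filtering against a fixed set of labels, `%in%` is probably the safer choice.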
TarJae