
I'm new to this, so be kind :)

I use the tidyverse package in R.

I have a list of dataframes. In each dataframe, I want to keep only the rows above the first row that has a certain string (in this case, three asterisks) in its first column. In the attached example, I want to keep all the rows above row 21 (i.e. the first time "***" appears in the first column). How do I do that?

[image: dataframe example]

r2evans
Tony D
  • Please include *sample data*, not an image of sample data. Perhaps `dput(many_files[[1]][c(1:5,20:23),])` would work here. – r2evans Jun 26 '17 at 20:59
  • how do I do that? i.e. export data frame so I can attach it here? – Tony D Jun 26 '17 at 22:09
  • Take a look at [reproducible examples](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and the help page for SO, [minimal examples](https://stackoverflow.com/help/mcve). My first comment specifically used one of the recommendations. (One key point here is that it should be a *minimal* yet *representative* portion of your data. If I have to scroll through pages of raw data, you're likely doing it wrong, and you'll often be ignored as "too much effort".) – r2evans Jun 26 '17 at 22:19
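
A minimal sketch of the `dput` suggestion from the comments above. The data frame `df` and its `Cycle` and `Value` columns are hypothetical stand-ins for one element of the asker's list (e.g. `many_files[[1]]`); only base R is used.

```r
# Hypothetical stand-in for one data frame from the list
df <- data.frame(Cycle = c(1:5, 20, "***", 21, 22),
                 Value = 1:9,
                 stringsAsFactors = FALSE)

# Keep a small but representative slice that still includes the "***" row
small <- df[c(1:2, 6:8), ]

# dput() prints R code that recreates the object;
# paste that printed text into the question instead of a screenshot
dput(small)

# Round trip: evaluating the printed text rebuilds an equivalent data frame
rebuilt <- eval(parse(text = capture.output(dput(small))))
all.equal(rebuilt, small)
```

Anyone reading the question can then assign the pasted `structure(...)` text to a variable and work with the same rows.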

2 Answers

2

I don't know of a single tidyverse function that does exactly this, but base R's `grepl` combined with `cumany` (which comes from dplyr, so dplyr must be loaded) handles it, both with plain bracket indexing and inside a pipe.

Some sample data:

dat <- data.frame(Cycle = c(1:5,20,"***",21,22),
                  Time  = Sys.time() + 1:9,
                  stringsAsFactors = FALSE)
dat
#   Cycle                Time
# 1     1 2017-06-26 14:02:48
# 2     2 2017-06-26 14:02:49
# 3     3 2017-06-26 14:02:50
# 4     4 2017-06-26 14:02:51
# 5     5 2017-06-26 14:02:52
# 6    20 2017-06-26 14:02:53
# 7   *** 2017-06-26 14:02:54
# 8    21 2017-06-26 14:02:55
# 9    22 2017-06-26 14:02:56


library(dplyr)  # cumany() comes from dplyr, not base R
dat[! cumany(grepl("\\*\\*\\*", dat$Cycle)),]
#   Cycle                Time
# 1     1 2017-06-26 14:02:48
# 2     2 2017-06-26 14:02:49
# 3     3 2017-06-26 14:02:50
# 4     4 2017-06-26 14:02:51
# 5     5 2017-06-26 14:02:52
# 6    20 2017-06-26 14:02:53

Using `fixed = TRUE` avoids escaping the asterisks, which makes the pattern easier to read:

dat[! cumany(grepl("***", dat$Cycle, fixed = TRUE)),]

The same expression drops readily into a `%>%` pipeline:

library(dplyr)
dat %>%
  filter(! cumany(grepl("***", Cycle, fixed = TRUE)))

With the data you've shown, this should suffice. If other values in `$Cycle` could also contain `***`, you should probably use a stricter pattern for the cutoff, such as one that has to match the whole cell.
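
As a rough sketch of what a stricter cutoff could look like: the data frame `dat2` below is made up for illustration (it is not the asker's data), and it assumes the marker row contains nothing but three asterisks, possibly padded with whitespace.

```r
library(dplyr)

# Hypothetical data: a legitimate value that merely contains "***"
# appears before the real marker row
dat2 <- data.frame(Cycle = c("1", "2", "a***b", "3", "***", "4"),
                   stringsAsFactors = FALSE)

# Substring match: cuts too early, at "a***b"
loose <- dat2 %>%
  filter(!cumany(grepl("***", Cycle, fixed = TRUE)))

# Exact match on the whole cell: cuts at the real marker row
strict <- dat2 %>%
  filter(!cumany(Cycle == "***"))

# Anchored regex variant that also tolerates surrounding whitespace
strict_re <- dat2 %>%
  filter(!cumany(grepl("^[[:space:]]*\\*{3}[[:space:]]*$", Cycle)))
```

Here `loose` keeps only the first two rows, while `strict` and `strict_re` keep the four rows above the real marker. Note that `Cycle == "***"` returns `NA` for missing values, so the `grepl` variant may be the safer choice if the column can contain `NA`.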

r2evans
  • thanks. so how do I apply that to all the data frames in my list? – Tony D Jun 26 '17 at 22:15
  • Search SO for `[r] function list dataframes`, there are many questions that discuss manipulating `data.frame`s within `list`s. [Here's one example.](https://stackoverflow.com/a/24376207/3358272) – r2evans Jun 26 '17 at 22:22
  • ok, had a look but can't really find the answer I'm after. basically the problem i'm facing is that 'filter' can't be used for a list. maybe I should use 'map'? but can't figure out the right syntax to make it work. – Tony D Jun 26 '17 at 22:34
  • How about `many_files2 – r2evans Jun 26 '17 at 22:40
  • ok let's try an example like you provide dat – Tony D Jun 26 '17 at 22:56
  • Reread my suggestion and try again. (You are missing the `x, `.) – r2evans Jun 26 '17 at 23:09
  • thanks. a friend suggested: many_files – Tony D Jun 27 '17 at 00:17
  • That'll work well too. I should have included that, since you did reference the `tidyverse`, but using `map` outside of a `mutate` isn't yet a quick habit for me. Thanks for that suggestion. – r2evans Jun 27 '17 at 02:12
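
For readers following the comment thread above, here is one way the filter might be applied to every data frame in a list. This is a sketch, not the exact code from the comments (which is cut off here): the list contents and the helper name `keep_above_marker` are hypothetical, and it assumes the first column is called `Cycle` as in the answer's sample data.

```r
library(dplyr)
library(purrr)

# Hypothetical list standing in for the asker's many_files
many_files <- list(
  a = data.frame(Cycle = c("1", "2", "***", "3"), stringsAsFactors = FALSE),
  b = data.frame(Cycle = c("1", "***", "2", "***"), stringsAsFactors = FALSE)
)

# Hypothetical helper: keep the rows above the first "***" in Cycle
keep_above_marker <- function(x) {
  filter(x, !cumany(grepl("***", Cycle, fixed = TRUE)))
}

# Base R: apply the helper to each element of the list
trimmed_base <- lapply(many_files, keep_above_marker)

# purrr equivalent
trimmed_purrr <- map(many_files, keep_above_marker)
```

`filter` works on a single data frame, which is why it has to be wrapped in a function and applied element by element. Both calls return a list with the same names as the input.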
0

Here's one way to do that with `filter` from dplyr. `grepl` looks for "***" in the `cycle` column and returns a logical vector; in my example, FALSE, FALSE, FALSE, TRUE, TRUE. The `cumsum` of that vector stays at 0 until it meets the first TRUE and is 1 or more from then on. Filtering on `cumsum(...) < 1` therefore keeps only the rows above the first match.

library(dplyr)  # provides filter() and %>%

df <- data.frame(cycle = c(1:3, "***", "***"), value = 1:5, stringsAsFactors = FALSE)
df %>%
  filter(cumsum(grepl("***", cycle, fixed = TRUE)) < 1)

  cycle value
1     1     1
2     2     2
3     3     3
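
To make the intermediate steps described above concrete, the same logic can be walked through on a plain vector in base R. The variable names `hits`, `running` and `keep` are only illustrative.

```r
cycle <- c("1", "2", "3", "***", "***")

# Step 1: which elements contain the marker?
hits <- grepl("***", cycle, fixed = TRUE)
hits
# [1] FALSE FALSE FALSE  TRUE  TRUE

# Step 2: running count of matches seen so far
running <- cumsum(hits)
running
# [1] 0 0 0 1 2

# Step 3: keep only positions before the first match
keep <- running < 1
cycle[keep]
# [1] "1" "2" "3"
```

Because the running count never drops back to 0, every row from the first "***" onward is discarded, including later rows that do not contain the marker.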
Pierre Lapointe