1

I apologise in advance if this is a stupid question. Any help is much appreciated.

I have a dataset with multiple sites and sampling years, with a score for every day of the year. For example, SiteA has 40 years of data with a value for every day, and sampling year defined as Sampling.Year. To make it confusing our sampling year is July-June so takes the form of 2016-2017. There are 8 variables all up and a lot of NA's. For example:

SiteName Sampling.Year   End.Date    H1_roll  T1_roll ... ... 
A        2015-2016                   1        NA   
A        2015-2016                   5        NA 
A        2015-2016                   2        NA 
A        2016-2017                   3        5 
A        2016-2017                   12       14
A        2016-2017                   6        7
B        2015-2016                   9        etc
B        2015-2016                   2 
B        2015-2016                   1 
B        2016-2017                   4 
B        2016-2017                   1 
B        2016-2017                   7

I need to do a few things with this dataset. First I need to add a column "Start.Date" that is End.Date-182 days. Then I need to omit (or tell R to ignore) all rows with NA in column T1_roll.

The next step is to select the highest T1_roll score for each Site & sampling year, but I need to make sure the rest of the information is retained with that selection. i.e. from the above example the max T1_roll for Site A in 2016-17 would be 14, and the new dataframe would include all variables from that line:

 SiteName Sampling.Year  Start.Date End.Date    H1_roll  T1_roll ... ...
 A        2016-2017                             12       14

I have saved a sample dataset here: Rolling Score Example.

Thank you

CatN
  • 49
  • 6

1 Answers1

1

We can group by 'SiteName' and slice the row with the maximum value of 'T1_roll'

library(dplyr)
df1 %>%
   group_by(SiteName) %>%
   slice(which.max(T1_roll))

Or with filter

df1 %>%
   group_by(SiteName) %>%
   filter(T1_roll == max(T1_roll, na.rm = TRUE))

In base R, this can be done with ave

i1 <- with(df1, ave(T1_roll, SiteName, FUN = 
        function(x) max(x, na.rm = TRUE)) == T1_roll)
df1[i1, ]
akrun
  • 674,427
  • 24
  • 381
  • 486
  • the filter option worked, thanks @akrun! Another stupid question (never had formal R training, learning as I go). How do I export that to a new csv? – CatN Jan 09 '20 at 02:54
  • @CatN You can use `out % group_by(SiteName) %>% filter(T1_roll == max(T1_roll, na.rm = TRUE))` and then do write.csv il.e. `write.csv(out, "yournewfile.csv", quote = FALSE, row.names = FALSE)` – akrun Jan 09 '20 at 03:01