I am a new R user and this is my first question submission (hopefully in compliance with the protocol).

I have a data frame with two columns.

library(dplyr)

df <- data.frame(v1 = c("A", "A", "B", "B", "B", "B", "C", "D", "D", "E"))
dfc <- df %>% count(v1)                  # occurrences of each value of v1
df$n <- with(dfc, n[match(df$v1, v1)])   # attach that count to every row

   v1 n  
1   A 2
2   A 2
3   B 4
4   B 4
5   B 4
6   B 4
7   C 1
8   D 2
9   D 2
10  E 1

I want to keep at most 3 rows for each value of v1. Once a value has occurred 3 times, any further rows with that value should be deleted; values that occur 3 times or fewer keep all of their rows. In this example I want to delete row 6 and retain all remaining rows in a subset data frame.

The result would include the following values for v1:

  v1
1  A
2  A
3  B
4  B
5  B
6  C
7  D
8  D
9  E

Row 6 would have been deleted because it was the 4th occurrence of "B", but the 3 previous rows for "B" have been retained.
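
One way to picture this requirement is as a running count within each value of v1: every row gets its occurrence number (1st "B", 2nd "B", ...), and rows numbered above 3 are dropped. The following is a minimal base R sketch under that reading; the names occurrence and result are made up for illustration.

```r
df <- data.frame(v1 = c("A", "A", "B", "B", "B", "B", "C", "D", "D", "E"))

# Number each row within its v1 group: 1st occurrence, 2nd occurrence, ...
occurrence <- ave(seq_along(df$v1), df$v1, FUN = seq_along)

# Keep a row only if it is among the first 3 occurrences of its value
result <- df[occurrence <= 3, , drop = FALSE]
```

Because the filter is on each row's position rather than on the group's total size, the first three "B" rows survive and only the fourth is removed.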

I have read multiple posts that demonstrate how to remove ALL rows for a value whose total count is less than or greater than some threshold, such as 4. For example, I have tried:

df1 <- df %>%
  group_by(v1) %>%
  filter(n() < 4)

This approach keeps only the values of v1 that occur fewer than 4 times and drops every "B" row. 6 rows are returned.

df2 <- df %>%
  group_by(v1) %>%
  filter(n() > 3)

This approach keeps only the values of v1 that occur more than 3 times, i.e. only the "B" rows. 4 rows are returned.

df4 <- subset(df, v1 %in% names(table(df$v1))[table(df$v1) <4])

This approach has the same result as the first approach.

None of these methods produce the result I need.
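
A likely reason these attempts behave all-or-nothing is that n() returns the size of the whole group, so it is the same for every row in that group. A per-row position such as row_number() is different for each row. The sketch below only adds both as columns for comparison; group_size, position and check are illustrative names, not part of the original code.

```r
library(dplyr)

df <- data.frame(v1 = c("A", "A", "B", "B", "B", "B", "C", "D", "D", "E"))

check <- df %>%
  group_by(v1) %>%
  mutate(
    group_size = n(),          # constant within a group (4 for every "B" row)
    position   = row_number()  # 1, 2, 3, 4 across the "B" rows
  ) %>%
  ungroup()
```

Filtering on group_size can only keep or drop a whole group, whereas filtering on position can drop just the rows past the third.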

As previously stated, I need to retain the first three rows where v1 == "B" and delete rows only beyond the 3rd occurrence of a value.

Because I am new to R, it's possible I am overlooking a very simple solution. Any suggestions would be greatly appreciated.

Thanks.

danbret

3 Answers

Using dplyr's top_n:

df %>% group_by(v1) %>% top_n(3)
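
One caveat worth checking: top_n() ranks rows by a weighting column (the last column when wt is not given) and keeps ties, so with the constant n column from the question it may return all four "B" rows. A position-based filter avoids this. The following is a hedged sketch rather than the answerer's code, and first_three is an illustrative name.

```r
library(dplyr)

df <- data.frame(v1 = c("A", "A", "B", "B", "B", "B", "C", "D", "D", "E"))

first_three <- df %>%
  group_by(v1) %>%
  filter(row_number() <= 3) %>%  # keep the 1st to 3rd row of each group
  ungroup()
```

In dplyr 1.0.0 and later, slice_head(n = 3) on the grouped data is another way to express the same idea.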
Jacob

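We can use data.table:

library(data.table)
# setDT() converts df to a data.table by reference; within each v1 group,
# keep the first 3 rows if the group has more than 3 rows, else keep them all
setDT(df)[, if (.N > 3) head(.SD, 3) else .SD, by = v1]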
akrun

This seems to do it:

keep <- logical(nrow(df))

for (i in seq_len(nrow(df))) {
  # Count how often this row's value has appeared up to and including row i,
  # and keep the row only if it is among the first 3 occurrences
  keep[i] <- sum(df$v1[1:i] == df$v1[i]) <= 3
}

df[keep, ]
   v1 n
1   A 2
2   A 2
3   B 4
4   B 4
5   B 4
7   C 1
8   D 2
9   D 2
10  E 1
William