I am a new R user and this is my first question submission (hopefully in compliance with the protocol).
I have a data frame with two columns: v1, and n, which holds the number of times each v1 value occurs.
library(dplyr)
df <- data.frame(v1 = c("A", "A", "B", "B", "B", "B", "C", "D", "D", "E"))
# Count the occurrences of each v1 value, then attach that count to every row
dfc <- df %>% count(v1)
df$n <- with(dfc, n[match(df$v1, v1)])
   v1 n
1   A 2
2   A 2
3   B 4
4   B 4
5   B 4
6   B 4
7   C 1
8   D 2
9   D 2
10  E 1
I want to keep at most 3 rows for each value of v1: once a value has appeared 3 times, any further rows with that value should be deleted. Values that occur 3 times or fewer keep all of their rows. In this example, only row 6 (the 4th "B") should be deleted, and all remaining rows should be retained in a new data frame.
The result would include the following values for v1:
  v1
1  A
2  A
3  B
4  B
5  B
6  C
7  D
8  D
9  E
Row 6 would have been deleted because it was the 4th occurrence of "B", but the 3 previous rows for "B" have been retained.
I have read multiple posts that demonstrate how to remove ALL rows for a value whose total count is below or above a threshold, such as 4. For example, I have tried:
df1 <- df %>%
  group_by(v1) %>%
  filter(n() < 4)
This approach keeps only the rows whose v1 value occurs fewer than 4 times in total, so all four "B" rows are dropped and 6 rows remain.
df2 <- df %>%
  group_by(v1) %>%
  filter(n() > 3)
This approach keeps only the rows whose v1 value occurs more than 3 times in total, so only the four "B" rows remain.
df4 <- subset(df, v1 %in% names(table(df$v1))[table(df$v1) < 4])
This approach has the same result as the first approach.
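My understanding of why these attempts behave this way, as a rough sketch in base R: each one tests the total size of a group, and that size is identical for every row of the group, so a group is either kept whole or dropped whole. The names group_size and keep are made up for illustration.

```r
# Sample data from the question
df <- data.frame(v1 = c("A", "A", "B", "B", "B", "B", "C", "D", "D", "E"))

# Total number of rows in each row's v1 group, repeated on every row of that group
group_size <- ave(seq_along(df$v1), df$v1, FUN = length)

# The test gives the same answer for every row of a group,
# so "B" (size 4) is dropped as a whole rather than trimmed to 3 rows
keep <- group_size < 4

# Show the group size and the keep/drop decision next to each value
data.frame(v1 = df$v1, group_size, keep)
```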
None of these methods produce the result I need.
As stated above, I need to retain the first three rows where v1 is "B" and delete only the rows beyond the third occurrence of a value.
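One possible direction, offered as a rough sketch and not a confirmed solution: number the rows within each v1 value and keep only those numbered 3 or lower. The names max_rows, occurrence and capped are made up for illustration, and the sketch uses base R's ave() instead of dplyr.

```r
# Sample data from the question
df <- data.frame(v1 = c("A", "A", "B", "B", "B", "B", "C", "D", "D", "E"))

max_rows <- 3  # hypothetical name for the per-value cap

# Running position of each row within its v1 group (1st "B", 2nd "B", ...)
occurrence <- ave(seq_along(df$v1), df$v1, FUN = seq_along)

# Keep rows whose position is within the cap; drop = FALSE keeps a data frame
capped <- df[occurrence <= max_rows, , drop = FALSE]
rownames(capped) <- NULL  # renumber the remaining rows from 1

capped
```

In dplyr, I believe the same idea would be written as group_by(v1) followed by filter(row_number() <= 3), though I am not sure that is the idiomatic form.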
Because I am new to R, it's possible I am overlooking a very simple solution. Any suggestions would be greatly appreciated.
Thanks.