How to filter the dataframe based on a particular column or group?

Question

I have a data frame like the one below:

V1            V2          V3      V4    V5   V6   V7    V8
m.Bra004793   Bra004793  887  887.00 21.74 0.45 0.29 16.40
m.Bra004793.1 Bra004793  907  907.00 20.52 0.42 0.27 15.11
m.Bra004793.2 Bra004793 1006 1006.00 16.39 0.30 0.19 10.81
m.Bra004793.3 Bra004793  988  988.00 56.56 1.05 0.67 38.02
m.Bra004793.4 Bra004793 1097 1097.00 32.69 0.54 0.35 19.67

For each unique ID in V2 (such as Bra004793), I want to select the best V1 by taking the row with the maximum V8. In this example, I want to get the following row:

m.Bra004793.3 Bra004793  988  988.00 56.56 1.05 0.67 38.02

Unfortunately, my attempt with the dplyr package is not working. This is what I have tried so far:

        test <- read.table("test_PASA_isoform.csv", sep = ",", h = T)
        head(test)
        data.filetered <- as.data.frame(test %.% group_by(V2) %.% summarise(V8 = max(V8)))
        head(data.filetered)
         V2          V1    V8
1 Bra004793 m.Bra004793 38.02

As you can see, although I get the correct maximum V8, I do not get the matching V1 ID. Can anybody point out what I am doing wrong?

Thanks, Upendra

Maybe: `test %.% group_by(V2) %.% filter( V8 == max(V8) )` ? — thelatemail, Apr 10 '14 at 01:27
Great it worked. I was not aware of the filter function within dplyr package. Thanks..... — upendra, Apr 10 '14 at 05:20
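
For readers on a current dplyr release, the suggestion in the comment above can be sketched as follows. The `%.%` operator from the question has since been superseded by `%>%`, so the sketch uses `%>%`. The inline data frame is only a stand-in for `test_PASA_isoform.csv`, which is assumed to have the columns shown in the question. The key difference is that `summarise()` collapses each group to a single row containing the grouping column and the computed summary, so the V1 of the winning row is not carried along, whereas `filter()` keeps whole rows.

```r
library(dplyr)

# Stand-in for the CSV from the question (assumed to have these columns).
test <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
V1            V2          V3      V4    V5   V6   V7    V8
m.Bra004793   Bra004793  887  887.00 21.74 0.45 0.29 16.40
m.Bra004793.1 Bra004793  907  907.00 20.52 0.42 0.27 15.11
m.Bra004793.2 Bra004793 1006 1006.00 16.39 0.30 0.19 10.81
m.Bra004793.3 Bra004793  988  988.00 56.56 1.05 0.67 38.02
m.Bra004793.4 Bra004793 1097 1097.00 32.69 0.54 0.35 19.67
")

# Within each V2 group, keep the full row(s) whose V8 equals the group maximum.
best <- test %>%
  group_by(V2) %>%
  filter(V8 == max(V8)) %>%
  ungroup()

print(as.data.frame(best))
```

If two rows in a group tie for the maximum V8, this keeps both. In dplyr 1.0.0 and later, `slice_max(V8, n = 1, with_ties = FALSE)` after `group_by(V2)` would keep exactly one row per group instead.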

score 1 · Answer 1 · answered Apr 10 '14 at 00:10

This is not a dplyr solution, but the task is easy with the shell command awk.

I used two IDs for the demo.

cat file
         V1        V2   V3      V4    V5   V6   V7    V8
   m.Bra004793 Bra004793  887  887.00 21.74 0.45 0.29 16.40
 m.Bra004793.1 Bra004794  907  907.00 20.52 0.42 0.27 15.11
 m.Bra004793.2 Bra004793 1006 1006.00 16.39 0.30 0.19 10.81
 m.Bra004793.3 Bra004794  988  988.00 56.56 1.05 0.67 38.02
 m.Bra004793.4 Bra004793 1097 1097.00 32.69 0.54 0.35 19.67

Here is the awk command:

awk '{if (max[$2]<$8){max[$2]=$8;l[$2]=$0}}END{for (i in max) print l[i]}' file

         V1        V2   V3      V4    V5   V6   V7    V8
 m.Bra004793.4 Bra004793 1097 1097.00 32.69 0.54 0.35 19.67
 m.Bra004793.3 Bra004794  988  988.00 56.56 1.05 0.67 38.02

It worked great but would still like to know why the dplyr command here is not working. Thanks anyway for the help. — upendra, Apr 10 '14 at 01:13

score 0 · Answer 2 · answered Apr 10 '14 at 06:54

I guess V2 is your unique ID, since otherwise there would be no maximum to choose from [every row in V1 is unique]. In that case, a data.table solution is:

library(data.table)
df = data.table(read.table(header = T, text = "
V1          V2   V3      V4    V5   V6   V7    V8
m.Bra004793   Bra004793  887  887.00 21.74 0.45 0.29 16.40
m.Bra004793.1 Bra004793  907  907.00 20.52 0.42 0.27 15.11
m.Bra004793.2 Bra004793 1006 1006.00 16.39 0.30 0.19 10.81
m.Bra004793.3 Bra004793  988  988.00 56.56 1.05 0.67 38.02
m.Bra004793.4 Bra004793 1097 1097.00 32.69 0.54 0.35 19.67
"))

df[,best := max(V8), by = V2]
df[V8 == best,]

score -1 · Answer 3 · edited Apr 10 '14 at 06:51

Maybe you could use something like below:

test[test$V8==max(test$V8),]
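
Note that the expression above compares every row against the single overall maximum of V8, so it matches the per-ID answer only when the data contain one ID. A minimal base R sketch of a per-group variant, assuming the same column layout and reusing the two-ID demo data from the awk answer, could look like this:

```r
# Two-ID demo data (same rows as in the awk answer above).
test <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
V1            V2          V3      V4    V5   V6   V7    V8
m.Bra004793   Bra004793  887  887.00 21.74 0.45 0.29 16.40
m.Bra004793.1 Bra004794  907  907.00 20.52 0.42 0.27 15.11
m.Bra004793.2 Bra004793 1006 1006.00 16.39 0.30 0.19 10.81
m.Bra004793.3 Bra004794  988  988.00 56.56 1.05 0.67 38.02
m.Bra004793.4 Bra004793 1097 1097.00 32.69 0.54 0.35 19.67
")

# ave() returns, for every row, the maximum V8 within that row's V2 group.
group_max <- ave(test$V8, test$V2, FUN = max)

# Keep the rows whose V8 equals their own group's maximum.
best <- test[test$V8 == group_max, ]
print(best)
```

Because `ave()` returns a vector as long as the input, the comparison stays row-wise and no merge step is needed.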

edited Apr 10 '14 at 06:51 by Rigel Networks · answered Apr 10 '14 at 06:29 by mherradora
