8

I'd like to remove all rows of a data.table that contain Inf in any of its columns. So far, I've been using this approach:

DT <- data.table(col1 = c(1,2,3), col2 = c(4,Inf,5))
DT[,drop := apply(.SD, 1, function(x) any(is.infinite(x))), by = 1:nrow(DT)]
DT <- DT[(!drop)][,drop:=NULL]

which comes from this Stackoverflow question. However, this approach is not well scalable to large amounts of data. Is there a better way to remove the rows with Inf?

Community
  • 1
  • 1
AnjaM
  • 2,685
  • 5
  • 33
  • 54
  • 2
    May be `DT[DT[, Reduce('&', lapply(.SD, is.finite))]]` – akrun Mar 05 '15 at 12:38
  • Interestingly (?), the example given by the OP doesn't work with version 1.9.5, when @akrun solution works. –  Mar 05 '15 at 12:43
  • 3
    Why not use `rowSums`: `DT[is.finite(rowSums(DT))]` – shadow Mar 05 '15 at 12:46
  • Related: deleting rows by reference might some day be doable in `data.table` http://stackoverflow.com/a/10791729/1191259 – Frank Mar 05 '15 at 17:27

1 Answers1

18

You can use rowSums to check if any element of a row is not finite.

DT[is.finite(rowSums(DT))]

OR you can use the fact that Inf * 0 is NA and use complete.cases

DT[complete.cases(DT*0)]

Some benchmarking shows that the rowSums is fastest for smaller datasets and complete.cases is the fastest solution for larger datasets.

require(microbenchmark)
microbenchmark(
  DT[is.finite(rowSums(DT))]
  ,
  DT[complete.cases(DT*0)]
  ,
  DT[DT[, Reduce('&', lapply(.SD, is.finite))]]
)
##
## nrow(DT) = 3000
## Unit: microseconds                                          
##                                          expr       min       lq      mean    median       uq      max neval cld
##                     DT[is.finite(rowSums(DT))]  786.797  839.235  864.0215  852.8465  884.756 1021.988   100 a  
##                     DT[complete.cases(DT * 0)] 1265.658 1326.575 1363.3985 1350.0055 1386.377 1898.040   100   c
##  DT[DT[, Reduce("&", lapply(.SD, is.finite))]] 1220.137 1275.030 1319.6226 1308.0555 1348.443 1624.023   100  b 
##
## nrow(DT) = 300000
## Unit: milliseconds
##                                           expr       min        lq      mean   median       uq      max neval cld
##                     DT[is.finite(rowSums(DT))] 21.617935 22.687452 26.698070 25.75765 26.07942 87.56290   100   c
##                     DT[complete.cases(DT * 0)]  7.209252  7.567393  9.908503 10.17569 10.37473 71.31375   100 a  
##  DT[DT[, Reduce("&", lapply(.SD, is.finite))]] 11.786773 12.647652 14.128624 14.78512 15.05089 15.39542   100  b 
shadow
  • 20,147
  • 4
  • 49
  • 71