
I am working with a regular data.frame that appears to be too big for the glm function, so I decided to build a sparse representation of the model matrix that I could pass to the glmnet function. But sparse.model.matrix seems to drop some rows from the original matrix. Any idea why that happens, and how to avoid it? Code below:

> mm <- sparse.model.matrix(~clicks01+kl_tomek*bc1+hours+plec+1,
+   data = daneOst)
> dim(mm)
[1] 1253223     292
> dim(daneOst)
[1] 1258836       6
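
For reference, the same behaviour is easy to reproduce on a small scale. A minimal sketch with made-up toy data (the real daneOst isn't available here):

library(Matrix)

# Toy data: one NA in a numeric column
df <- data.frame(y = c(1, 0, 1, 0),
                 x = c(2.5, NA, 1.0, 3.2),
                 g = factor(c("a", "b", "a", "b")))

mm <- sparse.model.matrix(~ y + x + g, data = df)
dim(mm)   # 3 x 4: the row containing the NA is silently dropped
nrow(df)  # 4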
Marcin Kosiński

3 Answers


I've had some success with changing na.action to na.pass; this keeps all the rows in my matrix:

options(na.action='na.pass')

Just note that this is a global option, so you probably want to set it back to its original value afterwards so it doesn't affect the rest of your code:

previous_na_action <- options('na.action')
options(na.action='na.pass')
# Do your stuff...

options(na.action=previous_na_action$na.action)

Solution from this answer.
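
Applied to the question's setup, the whole pattern might look like the sketch below (toy data standing in for daneOst; the na.pass behaviour is what this answer relies on):

library(Matrix)

# Same toy data as in the question's sketch above
df <- data.frame(y = c(1, 0, 1, 0),
                 x = c(2.5, NA, 1.0, 3.2),
                 g = factor(c("a", "b", "a", "b")))

previous_na_action <- options('na.action')
options(na.action = 'na.pass')

# All 4 rows are kept now; note the NA itself remains in the matrix
# and must still be handled before modelling (glmnet rejects NAs)
mm <- sparse.model.matrix(~ y + x + g, data = df)
dim(mm)   # 4 x 4

options(na.action = previous_na_action$na.action)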

Bar

It's due to the NAs!

sparse.model.matrix builds a model frame first, and the default na.action, na.omit, silently drops every row that contains an NA. Run sum(complete.cases(daneOst)); I bet it also gives you 1253223.

So replace the NAs in your data frame with a placeholder value (e.g. 'IMPUTED_NA' for factors or -99999 for numeric columns), and then try again.
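
A minimal sketch of that fix on toy data (column names are made up): numeric columns can take a sentinel value directly, while a factor needs the placeholder added as a level first.

dane <- data.frame(x = c(2.5, NA, 1.0),
                   g = factor(c("a", NA, "b")))

# Numeric column: a sentinel value that cannot occur in the data
dane$x[is.na(dane$x)] <- -99999

# Factor column: add the placeholder level before assigning it
levels(dane$g) <- c(levels(dane$g), "IMPUTED_NA")
dane$g[is.na(dane$g)] <- "IMPUTED_NA"

sum(complete.cases(dane)) == nrow(dane)   # TRUE: no rows get dropped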

WillemM

@WillemM is correct. The presence of NAs will trip up sparse.model.matrix. With big data sets, the best approach is to read your file into a data frame with stringsAsFactors=FALSE and then choose whatever imputation method you want. If you use tree-based learning methods, it's easier to impute these NAs with a value not present in the data set. Multiple imputation on big data sets takes an insanely long time, and you may also lose your R session.
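
A rough sketch of that workflow, with an inline CSV standing in for the real file (column names are borrowed from the question, the values are invented):

# text= stands in for a file path in read.csv
csv <- "clicks01,hours,plec\n1,10,m\n0,NA,NA\n1,3,f"
dane <- read.csv(text = csv, stringsAsFactors = FALSE)

# Impute NAs with values that do not occur in the data set
dane$hours[is.na(dane$hours)] <- -99999
dane$plec[is.na(dane$plec)]   <- "IMPUTED_NA"

# Character columns can be converted to factors after imputation,
# so the placeholder simply becomes one more level
dane$plec <- factor(dane$plec)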

Kingz