
I am working on Windows 8 with 8 GB of RAM. I have a data.frame of 1.8 million rows x 270 columns on which I have to fit a GLM (logit or any other classification model).

I've tried using the ff and biglm packages to handle the data.

But I still run into the error "Error: cannot allocate vector of size 81.5 Gb". So I decreased the number of rows to 10 and tried the steps for bigglm on an object of class ffdf, but the error persists.

Can anyone suggest a solution for building a classification model with this many rows and columns?

**EDITS**:

I am not running any other program while the code runs. The RAM on the system is 60% free before I run the code, and that usage is due to the R session itself: when I terminate R, 80% of the RAM is free.

As suggested by the commenters, I am adding some of the columns I am working with so the problem can be reproduced. OPEN_FLG is the dependent variable (DV) and the others are independent variables (IDVs).

str(x[1:10,])
'data.frame':   10 obs. of  270 variables:
 $ OPEN_FLG                   : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1    
 $ new_list_id                : Factor w/ 9 levels "0","3","5","6",..: 1 1 1 1 1 1 1 1 1 1    
 $ new_mailing_id             : Factor w/ 85 levels "1398","1407",..: 1 1 1 1 1 1 1 1 1 1    
 $ NUM_OF_ADULTS_IN_HHLD      : num  3 2 6 3 3 3 3 6 4 4    
 $ NUMBER_OF_CHLDRN_18_OR_LESS: Factor w/ 9 levels "","0","1","2",..: 2 2 4 7 3 5 3 4 2 5    
 $ OCCUP_DETAIL               : Factor w/ 49 levels "","00","01","02",..: 2 2 2 2 2 2 2 21 2 2    
 $ OCCUP_MIX_PCT              : num  0 0 0 0 0 0 0 0 0 0    
 $ PCT_CHLDRN                 : int  28 37 32 23 36 18 40 22 45 21   
 $ PCT_DEROG_TRADES           : num  41.9 38 62.8 2.9 16.9 ...    
 $ PCT_HOUSEHOLDS_BLACK       : int  6 71 2 1 0 4 3 61 0 13    
 $ PCT_OWNER_OCCUPIED         : int  91 66 63 38 86 16 79 19 93 22    
 $ PCT_RENTER_OCCUPIED        : int  8 34 36 61 14 83 20 80 7 77    
 $ PCT_TRADES_NOT_DEROG       : num  53.7 55 22.2 92.3 75.9 ...    
 $ PCT_WHITE                  : int  69 28 94 84 96 79 91 29 97 79    
 $ POSTAL_CD                  : Factor w/ 104568 levels "010011203","010011630",..: 23789 45173 32818 6260 88326 29954 28846 28998 52062 47577    
 $ PRES_OF_CHLDRN_0_3         : Factor w/ 4 levels "","N","U","Y": 2 2 3 4 2 4 2 4 2 4    
 $ PRES_OF_CHLDRN_10_12       : Factor w/ 4 levels "","N","U","Y": 2 2 4 3 3 2 3 2 2 3    
 [list output truncated]

And this is an example of the code I am using.

require(biglm)
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = x)

require(ff)
require(ffbase)  ## ffseq_len and expand.ffgrid are exported by ffbase, not ff
x$id <- ffseq_len(nrow(x))  ## add a row identifier
xex <- expand.ffgrid(x$id, ff(1:100))  ## cartesian product: each row repeated 100 times
colnames(xex) <- c("id","explosion.nr")
xex <- merge(xex, x, by.x="id", by.y="id", all.x=TRUE, all.y=FALSE)  ## left join the original columns back on
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = xex)

The problem is that both times I get the same error: "Error: cannot allocate vector of size 81.5 Gb".


Please let me know if this is enough or whether I should include any more details about the problem.

  • You have taken some of the steps that I would recommend. So, to get more specific feedback, you need to create a reproducible example which shows us the problem you have. – Paul Hiemstra Jun 25 '13 at 11:30
  • How is your memory looking? Are any other applications using a lot of memory? Can you remove any unused objects from your R workspace and use `gc()` to clear up space? – Bill Beesley Jun 25 '13 at 11:38
  • Do show the code and some data (str(yourdata[1:10,]) will do) to get good feedback, so that we can see what you are doing and correct any mistakes you inadvertently make. –  Jun 25 '13 at 11:47
  • @all, I edited the question based on the comments you gave. Please check it. Thanks. – Srikanth Gorthy Jun 25 '13 at 12:24
  • For big logistic regressions you can also use [Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki), which [can be called from R](http://cran.r-project.org/web/packages/RVowpalWabbit/index.html). – Vincent Zoonekynd Jun 25 '13 at 13:10
  • ... if you can install Vowpal Wabbit on Windows –  Jun 25 '13 at 13:25
  • To use Vowpal Wabbit on Windows, I would run it inside a (headless) Linux virtual machine but, indeed, that may take *some* time to set up... – Vincent Zoonekynd Jun 25 '13 at 13:49
  • Are you really using all 270 columns in your regression model? If not, then get rid of the unused ones. If so, I'd probably just sample from the rows to start. Get that working first. That might be good enough... Run a few cross validated samples; see if the fit weights match up. – Clayton Stanley Jun 26 '13 at 08:26
  • @ClaytonStanley: I want to remove columns, but I don't know any statistical way to remove columns. Can you please help me in this context of removing columns, like any functions/packages which can be used? – Srikanth Gorthy Jun 26 '13 at 11:13
  • Not sure what you mean by 'statistical'. If you just want to programmatically remove columns in a dataframe in R: http://stackoverflow.com/questions/4605206/drop-columns-r-data-frame . If you want to statistically figure out which columns (i.e., IVs) are not highly predictive of your model, well then you'd have to run the full model first and look at the weights for each IV. If that's the case, I'd still recommend sampling rows and cross validating; see the sketch after this comment thread. – Clayton Stanley Jun 26 '13 at 20:12
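
Following up on the sampling suggestion in the comments: below is a minimal sketch of that idea, assuming x is the in-memory data.frame from the question. The 100,000-row sample size and the five repetitions are arbitrary choices for illustration.

set.seed(42)
vars <- c("OPEN_FLG", "new_list_id", "NUM_OF_ADULTS_IN_HHLD", "OCCUP_MIX_PCT")
coefs <- replicate(5, {
  idx <- sample(nrow(x), 1e5)  ## draw a random subset of rows
  fit <- glm(OPEN_FLG ~ new_list_id + NUM_OF_ADULTS_IN_HHLD + OCCUP_MIX_PCT,
             data = x[idx, vars], family = binomial())
  coef(fit)
})
coefs  ## one column per repetition; stable coefficients across columns suggest the sample is large enough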

1 Answer


I have the impression you are not using ffbase::bigglm.ffdf, but you want to. Namely, the following will put all your data in RAM and will use biglm's in-memory bigglm.data.frame method, which is not what you want:

require(biglm)
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = x)

You need to use ffbase::bigglm.ffdf, which works chunkwise on an ffdf. So load the package ffbase, which exports bigglm.ffdf. With ffbase loaded, you can do the following:

require(ffbase)
## keep only the columns used in the model, so nothing else is pulled into RAM
mymodeldataset <- xex[c("OPEN_FLG","new_list_id","NUM_OF_ADULTS_IN_HHLD","OCCUP_MIX_PCT")]
## recode the factor response as logical so it suits the binomial family
mymodeldataset$OPEN_FLG <- with(mymodeldataset["OPEN_FLG"], ifelse(OPEN_FLG == "Y", TRUE, FALSE))
## bigglm now dispatches to bigglm.ffdf and fits the logistic regression chunkwise
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = mymodeldataset, family=binomial())

Explanation: because you don't limit yourself to the columns you use in the model, all the columns of your xex ffdf will be brought into RAM, which is not needed. You were fitting a gaussian model on a factor response, which is odd; I believe you were trying to do a logistic regression, so use the appropriate family argument. And the corrected call will use ffbase::bigglm.ffdf instead of the in-memory biglm method.
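
Once the model has been fitted, you can inspect it the usual way; a small usage sketch, assuming the fit above succeeded:

summary(mymodel)  ## coefficients, standard errors and p-values from the chunkwise fit
coef(mymodel)     ## just the coefficient vector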

If that does not work, which I doubt, it is because you have other things in RAM which you are not aware of. In that case do the following:

require(ffbase)
mymodeldataset <- xex[c("OPEN_FLG","new_list_id","NUM_OF_ADULTS_IN_HHLD","OCCUP_MIX_PCT")]
mymodeldataset$OPEN_FLG <- with(mymodeldataset["OPEN_FLG"], ifelse(OPEN_FLG == "Y", TRUE, FALSE))
ffsave(mymodeldataset, file = "mymodeldataset")  ## store the ffdf on disk

## Open R again in a fresh session
require(ffbase)
require(biglm)
ffload("mymodeldataset")  ## reload; the data stays on disk, only metadata comes into RAM
mymodel <- bigglm(OPEN_FLG ~ new_list_id+NUM_OF_ADULTS_IN_HHLD+OCCUP_MIX_PCT, data = mymodeldataset, family=binomial())

And off you go.

  • Thank you for the answer. But when I run the code, ffsave gives the error below: {Error in system(cmd, input = filelist, intern = TRUE) : 'zip' not found}. Can you please help me with this error? I searched for it but wasn't able to find any good solutions. – Srikanth Gorthy Jun 26 '13 at 06:27
  • Either make sure zip is installed on your system and is in your PATH (see http://stackoverflow.com/questions/14971977/r-ff-package-ffsave-zip-not-found), or replace ffsave and ffload by, respectively, save.ffdf(mymodeldataset, dir = "mymodeldataset") and load.ffdf(dir = "mymodeldataset"), which do not zip the ff files to store them. –  Jun 26 '13 at 06:50
  • I used the steps you suggested, but I am facing the problem again. I am getting the error "Error: cannot allocate vector of size 81.5 Gb" even with 30 rows, 190 columns and a chunksize of 2 in bigglm with an ffdf object. Kindly share insights on any other improvements to this approach. Also, I was wondering if there is any approach other than bigglm on an ffdf. – Srikanth Gorthy Jun 26 '13 at 11:01
  • You need to provide more details on your 190 columns. bigglm will use model.matrix at a certain point, which will expand your factors into numeric matrices. How many factors do you have in these 190 columns, and how many levels does each factor have? (A sketch for auditing this follows below.) –  Jun 26 '13 at 16:03
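
As a follow-up to that last comment: a small sketch for auditing factor levels, assuming the original data.frame x from the question is at hand. A factor like POSTAL_CD, with 104568 levels, would be expanded by model.matrix into roughly that many dummy columns, which by itself can trigger an enormous allocation.

## count the levels of every factor column and list the worst offenders
lev <- sapply(x, function(col) if (is.factor(col)) nlevels(col) else NA_integer_)
head(sort(lev[!is.na(lev)], decreasing = TRUE), 10)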