1

So i have a large CSV file with about 280 columns and 1 billion data having a file size of about 20GB. A sample of this file(with about 7 columns and 4 rows) is provided below:

SL No.,Roll No.,J_Date,F_Date,S1,S2,S3
1,00123456789,2004/09/11,2009/08/20,43,67,56
2,987654321,2010/04/01,2015/02/20,82,98,76
3,0123459876,2000/06/25,2005/10/02,72,84,02
4,000543216789,1990/08/29,1998/05/31,15,64,82

Now given the fact that the file is so large, i would have to read this file in smaller chunks at a time with me being able to specify the chunk size. But as u might have seen from the sample, "Roll No." has to be read as a "character" and not as a "numeric". Also i need to add the columns "S1","S2","S3" and write the sum to a new column "MM".

The output of the above sample has to be something like this:

SL No.,Roll No.,J_Date,F_Date,S1,S2,S3,MM
1,00123456789,2004/09/11,2009/08/20,43,67,56,166
2,987654321,2010/04/01,2015/02/20,82,98,76,256
3,0123459876,2000/06/25,2005/10/02,72,84,02,158
4,000543216789,1990/08,29,1998/05/31,15,64,82,161

I know similar questions has been asked before but i swear i couldn't get 1 answer that worked for me. I referred the following Quetions:

R:Loops to process large dataset(GBs) in chunks?

Trimming a huge (3.5 GB) csv file to read into R

How do i read only lines that fulfil a condition from a csv into R?

Reading numbers as strings

Read numeric input as string R and many more.

This might be a good time to say that i'm a total beginner when it comes to R, so all kinds of help would be very much appreciated. I've been sitting on this for a long while now.

Community
  • 1
  • 1
Zaire
  • 67
  • 7
  • The expected output you had only sum instead of multiplying. you can either do `Reduce('+', df1[5:7])` or `rowSums(df1[5:7])` – akrun Feb 13 '17 at 09:46
  • Your second issue (the class of column 2) is not an issue if you use `fread` from package data.table. It's rather easy to write a loop that processes the file in chunks with `fread`'s `skip` and `nrows` parameters. Since `fwrite` can `append` output to file also shouldn't be a problem. – Roland Feb 13 '17 at 09:52
  • @akrun :Can u please elaborate, i'm fairly new to R. – Zaire Feb 13 '17 at 10:23
  • @Roland: canu please elaborate. I'm fairly new to R and would need a little more help. – Zaire Feb 13 '17 at 10:25
  • If you want to hve the sum of the columns 5:7, then `df1$MM – akrun Feb 13 '17 at 10:25
  • @akrun: if i want to perform this operation, i need to first read the file in chunks, only after that can i do this. Still thanks for your suggestion. – Zaire Feb 13 '17 at 10:29
  • For big datasets, use `rxImport` from `RevoScaleR` if you are able to afford it, which is scalable and don't use up the memory – akrun Feb 13 '17 at 10:30

1 Answers1

1

I can't say I've done this myself before, but I think this should work.

library( data.table )

# set the input and output files
input.file <- "foo.csv"
output.file <- sub( "\\.csv$", "_output\\.csv", input.file )

# get column names by importing the first few lines
column.names <- names( fread( input.file, header = TRUE, nrows = 3L ) )

# write those column names as a line of text (header)
cat( paste( c( column.names, "MM" ), collapse = "," ),
     file = output.file, append = FALSE )
cat( "\n", file = output.file, append = TRUE )

# decide how many rows to read at a time
rows.at.a.time <- 1E4L

# begin looping
start.row <- 1L
while( TRUE ) {

    # read in only the specified lines
    input <- fread( input.file,
                    header = FALSE,
                    skip = start.row,
                    nrows = rows.at.a.time
    )

    # stop looping if no data was read
    if( nrow( input ) == 0L ) break

    # create the "MM" column
    input[ , MM := rowSums( .SD[ , 5:7 ] ) ]

    # append the data to the output file
    fwrite( input,
            file = output.file,
            append = TRUE, col.names = FALSE )

    # bump the `start.row` parameter
    start.row <- start.row + rows.at.a.time

    # stop reading if the end of the file was reached
    if( nrow( input ) < rows.at.a.time ) break

}

UPDATE: to preserve character strings, you can import all columns as character by specifying in the fread call within the loop:

colClasses = rep( "character", 280 )

Then, to perform the row sums (since you now have all character columns), you need to include a conversion there. The following would replace the single line (the one with this same comment above it) in the code:

# create the "MM" column
input[ , MM := .SD[ , 5:7 ] %>%
           lapply( as.numeric ) %>%
           do.call( what = cbind ) %>%
           rowSums()
       ]

Where 5:7 is specified here, you could replace with any vector of column references to be passed to rowSums()

Note if using the above with %>% pipes, you'll need library(magrittr) at the top of your code to load the function.

rosscova
  • 4,810
  • 1
  • 14
  • 32
  • This line `output.file – Zaire Feb 13 '17 at 11:20
  • Also is this solution dealing with reading the 2nd column as a character? – Zaire Feb 13 '17 at 11:22
  • You need to add that to the `fread` call inside the loop. See the `colClasses` parameter I just added. – rosscova Feb 13 '17 at 11:25
  • As i've mentioned in the question, the input file is having about 280 columns, and so manually providing classes for all the 280 columns would be very very difficult. Do u think u can suggest an alternate way to do this? – Zaire Feb 13 '17 at 11:32
  • @Zaire There is no reason to make it character. `fread` will read it as integer64 if it can't be represented as integer. – Roland Feb 13 '17 at 11:36
  • I'm with @Roland, why exactly do you need it to be `character`? – rosscova Feb 13 '17 at 11:38
  • @rosscova: it will not always be the 2nd column. And i need it to be character so that it will read the value `00123456789` as `00123456789` itself rather than `123456789`. – Zaire Feb 13 '17 at 11:43
  • Then how will you identify that column? By name? ie: how do you know which columns to import as numeric/integer (for `rowSums`), and which to import as "character"?... You can import ALL columns as character with something like `colClasses=rep("character",280)` if that's better for you. – rosscova Feb 13 '17 at 11:44
  • @rosscova: i would say identifying by column name would be the best option – Zaire Feb 13 '17 at 11:47
  • @rosscova: how will i read everything as charcter, then how will i add the columns? – Zaire Feb 13 '17 at 11:51
  • See my update in the answer. Do you understand how to specify the `colClasses` parameter when calling `fread`? – rosscova Feb 13 '17 at 11:54
  • by the way, what is this %>% symbol? – Zaire Feb 13 '17 at 12:02
  • That's a `magrittr` pipe. I should have added `library(magrittr)` to the top of the code when I made that addition. Get to know the pipe; it's good. – rosscova Feb 13 '17 at 12:06