I have the output from a data submission in the form of multiple vector list objects stored in rda files. Each list object is in a separate rda file, and I have nearly 2000 files. I want to merge all the objects into a single object in a single rda file as quickly as possible (partly because I may need to repeat this several times). The rda files are fairly small (~10 MB each, though that is the compressed size), but it all adds up given the number of files.

Memory isn't a huge problem, as I am running this on a server with >700 GB of RAM. My first approach was to load the files one by one, concatenate each onto the merged list object, and remove the object that had just been appended. That went badly because of the time it was going to take (something like 40 days at a best guess). My revised approach is below, but I'm wondering whether there is a quicker way to do this, given that I may need to repeat the process:

 load("data_1.rda") 
 load("data_2.rda") 
 load("data_3.rda") ... 
 load("data_2000.rda")
 my.list <- list() 
 my.list <- c(my.list, data.1, data.2, data.3, ... , data.2000) 
 save(my.list, file="my_list.rda")

To add to things, I'm getting an error when doing this:

Error: attempt to set index 18446744071562067968/2877912830 in SET_STRING_ELT

It's not a very helpful error message. All the rda files load as objects into the environment fine; the error only appears when I try to concatenate them, and it seems to happen at a particular point, since it doesn't fail immediately. I wasn't sure whether it is some sort of limit on the number of concatenations you can do or rogue data, but from troubleshooting it appears to be syntax related rather than data related. I have chunked the concatenation into 5 batches, followed by a final concatenation before saving the rda. I have seen other answers for this sort of thing suggesting rbind, mget, and do.call, or the list function. Would any of these make it faster and achieve the same thing? Something like this:

my.list <- do.call(rbind, mget(ls(pattern="^data_")))
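For reference, here is a small sketch of what that one-liner would do on toy data. It assumes the loaded objects are plain lists whose names start with `data_`; the environment `e`, the toy objects, and the name `list.matrix` are made up for illustration. On lists, `do.call(rbind, ...)` appears to build a list-matrix rather than a longer list, so `do.call(c, ...)` looks closer to the `c(my.list, data.1, ...)` call above:

```r
# Toy stand-ins for the loaded objects, kept in their own environment
# (names and contents are hypothetical).
e <- new.env()
assign("data_1", list(a = 1:3, b = letters[1:2]), envir = e)
assign("data_2", list(c = 4:6, d = "x"), envir = e)

# Collect every object whose name starts with "data_".
# ls() sorts names alphabetically, so data_10 would come before data_2.
objs <- mget(ls(e, pattern = "^data_"), envir = e)

# rbind() on two length-2 lists gives a 2 x 2 matrix of mode list,
# not a single longer list.
list.matrix <- do.call(rbind, objs)

# c() concatenates the lists. unname() keeps the object names from being
# prefixed onto the element names (otherwise "data_1.a", "data_1.b", ...).
my.list <- do.call(c, unname(objs))
```

With real data the same pattern would run against the global environment, i.e. `mget(ls(pattern = "^data_"))` as in the line above.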

Thanks

lpulle
  • Revising my last comment: `load` brings all the objects back into your environment, then you want to get all those objects into a list, then use `data.table::rbindlist()` to condense down into one data frame. If you're not already using a similar solution, you can load all your files using `fileList – Mako212 Dec 18 '17 at 19:36
  • The foreach package will allow you to read each data set and ".combine" them. – Harlan Nelson Dec 18 '17 at 20:03
  • I'd suggest looking at [How to make a list of data frames](https://stackoverflow.com/a/24376207/903061). – Gregor Thomas Dec 18 '17 at 20:36
  • Thanks for all the suggestions - i'll probably end up using the filelist – lpulle Dec 19 '17 at 12:32
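The file-list idea from the comments could be sketched roughly as below. This is only a sketch under assumptions: each rda file is assumed to contain exactly one list object, the helper name `load_one` is hypothetical, and two tiny files are written to a temporary directory purely so the example runs on its own. Loading each file into its own environment avoids needing to know the object names in advance, and combining once at the end avoids growing the merged list on every iteration:

```r
# Write two tiny example files to a temporary directory (toy data only).
tmp <- tempfile("rda_demo_")
dir.create(tmp)
data.1 <- list(a = 1:3, b = "x")
data.2 <- list(c = 4:6)
save(data.1, file = file.path(tmp, "data_1.rda"))
save(data.2, file = file.path(tmp, "data_2.rda"))

# Hypothetical helper: load one file into a private environment and
# return the single object it contains.
load_one <- function(path) {
  env <- new.env()
  obj_name <- load(path, envir = env)  # load() returns the names it created
  stopifnot(length(obj_name) == 1L)    # assumption: one object per file
  get(obj_name, envir = env)
}

# Build the file list, then read every file into a list of lists.
files <- list.files(tmp, pattern = "^data_.*\\.rda$", full.names = TRUE)
pieces <- lapply(files, load_one)

# Concatenate all the per-file lists in a single step, then save.
my.list <- do.call(c, pieces)
save(my.list, file = file.path(tmp, "my_list.rda"))
```

For the real files, `tmp` would be replaced by the directory holding the rda files, and the toy `save()` calls at the top would be dropped.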

0 Answers