I’m planning to do some data analysis with R. The datasets are stored in PostgreSQL tables, and some of them contain up to 2 million records. I thought this would not be a big problem for R and that loading the records would be reasonably quick, but things turned out differently.
Doing something like this may take a minute or more, which is not what I was expecting:
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "mydb", user = "me", password = "my_password")
records <- dbGetQuery(con, statement = paste(
  "SELECT *",
  "FROM my_table",
  "ORDER BY id"))  # ~1.5M records
An alternative approach is even slower:
res <- dbSendQuery(con, "SELECT * FROM my_table ORDER BY id")
records2 <- fetch(res, n = -1)  # n = -1 fetches all remaining rows
dbClearResult(res)
I can’t say my hardware is the most advanced in the world, but it’s a rather decent MacBook Pro with 8 GB of RAM and an SSD. When I fetch the same data with, say, QGIS, it completes significantly faster.
What can one do to increase performance in such a case? Alternative libraries? Tricks and hacks? Anything else?
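One workaround I’ve been wondering about is bypassing the driver entirely: dump the table to a CSV file with PostgreSQL’s COPY and read the flat file into R, possibly with data.table’s fread, which I understand copes well with large files. Roughly what I have in mind (the file path is just a placeholder):

# hypothetical workaround: export once on the Postgres side, e.g. in psql:
#   \copy my_table TO '/tmp/my_table.csv' CSV HEADER
# then read the flat file back into R
library(data.table)
records3 <- fread("/tmp/my_table.csv")  # or read.csv() if data.table is not an option

Is that a reasonable route, or does it just move the bottleneck?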