
I’m planning to do some data analysis with R; the datasets are stored in PostgreSQL tables, and some of them contain up to 2 million records. I thought this would not be a big problem for R and that loading the records would be rather quick, but things turned out differently.

Doing something like this may take a minute or more, which is not what I was expecting:

library(RPostgreSQL)

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "mydb", user = "me", password = "my_password")

records <- dbGetQuery(con, statement = paste(
  "SELECT *",
  "FROM my_table",
  "ORDER BY id"))  # 1.5M records

An alternative approach is even slower:

res <- dbSendQuery(con, "SELECT * FROM my_table ORDER BY id")
records2 <- fetch(res, n = -1)  # fetch all rows at once

I can’t say my hardware is the most advanced in the world, but it’s a rather decent MacBook Pro with 8 GB of RAM and an SSD. When I fetch the same data with, say, QGIS, things are done significantly faster.
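To put a number on it, the whole load can be timed with base R’s `system.time()` (a sketch, reusing `con` from above):

# Time the full round trip: query execution plus transfer into R
system.time(
  records <- dbGetQuery(con, "SELECT * FROM my_table ORDER BY id")
)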

What can one do to increase performance in such a case? Alternative libraries? Tricks and hacks? Anything else?

  • Hmm, I've just tested this code on a table with 1M records. It's as fast as a typical query. What kind of data are you using? Also, please show your `sessionInfo()` output. – redmode Feb 04 '14 at 17:08
  • @redmode, what kind of R environment are you using (if that matters)? And what does ‘as fast as a typical query’ mean in your case? – Alexander Kachkaev Feb 04 '14 at 19:32
  • Do you mean the IDE? I'm using RStudio. A query for 1M records took ~2 seconds, roughly the same timing as when run from pgadmin3. – redmode Feb 04 '14 at 19:59
  • Also, it can be a good idea to work with PostgreSQL without pulling all the data into memory, for instance by using the wonderful `dplyr` package. See the `dplyr::databases` vignette for details (a sketch follows this list). – redmode Feb 04 '14 at 21:03
  • You could try using the PostgreSQL `copy` command to offload your data to CSV or TSV and process it in your R code (see the second sketch below). – Konstantin V. Salikhov May 12 '14 at 18:41
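A minimal sketch of the `dplyr` approach from the comments, assuming the dplyr and dbplyr packages are installed and reusing `con` from the question; the `id > 1000` filter is purely illustrative. The verbs are translated to SQL and executed by PostgreSQL, so rows only cross the wire on `collect()`:

library(dplyr)

my_tbl <- tbl(con, "my_table")   # lazy reference: no data transferred yet

records <- my_tbl %>%
  filter(id > 1000) %>%          # translated to SQL, runs in PostgreSQL
  arrange(id) %>%
  collect()                      # only now do rows come into R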
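And a sketch of the `copy` route, with an illustrative file path: export the table from a shell with psql, then bulk-read the CSV with `data.table::fread()`, which is typically much faster than fetching row sets through the driver:

# In a shell, not in R:
#   psql -d mydb -c "\copy my_table TO 'my_table.csv' CSV HEADER"
library(data.table)
records <- fread("my_table.csv")  # fast bulk CSV reader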

2 Answers


You should exclude the ORDER BY - it is not part of loading the data, and it may significantly slow your query.
You can order the data afterwards, once it is in R's memory. If you are looking for fast ordering, check this SO answer.
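A sketch of that idea, reusing `con` and `my_table` from the question; the `data.table` variant is an optional extra for large results:

records <- dbGetQuery(con, "SELECT * FROM my_table")  # no ORDER BY
records <- records[order(records$id), ]               # base R sort on id

# Or sort by reference with data.table, which is faster for large tables:
library(data.table)
setDT(records)          # convert to data.table in place, no copy
setorder(records, id)   # fast radix sort by reference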

– jangorecki

More of a redesign than an answer to the question, but...

You could always plug R directly into PostgreSQL, and run your query without moving the data anywhere. Can't move it any faster than not moving it at all :)

PL/R for PostgreSQL
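A minimal sketch of what that looks like, assuming the `plr` extension is available and that `my_table` has a numeric column named `value` (purely illustrative). The function body is ordinary R, executed inside the database server:

CREATE EXTENSION plr;

CREATE OR REPLACE FUNCTION value_mean() RETURNS float8 AS $$
  # pg.spi.exec() runs SQL via SPI and returns a data frame
  res <- pg.spi.exec("SELECT value FROM my_table")
  mean(res$value)   # the last expression is the return value
$$ LANGUAGE plr;

SELECT value_mean();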

– Kirk Roybal