
I’m trying to improve my workflow, and am hoping the community can provide insights since I am relatively new to “big data”.

I typically download several dataframes from public sources which can be related to one another. After pulling several dataframes, I do various data-processing steps (e.g., filtering, sorting, aggregating, custom calculations) before doing regression analysis and/or other inferential statistics on my final set of data.

Specifically, which strategy would you recommend:

  1. Downloading all of the data as separate files from the web onto my local machine, then using R to process it directly (as I have been doing). The potential problem I see with this is that everything is in R’s working environment, which may slow the process and crash my machine.

or

  2. Downloading all of the data as separate files from the web onto my local machine, creating a database with sqldf on the local machine, and using follow-up queries with sqldf to pull and aggregate information from that database before doing final data analysis in R. The potential problem I see with this is that a database created on my local machine with sqldf, made up of a handful of tables/dataframes, takes more disk space than simply saving several individual .csv files.

I’m pretty familiar with statistical techniques, but I admittedly have several knowledge gaps when it comes to database management and server operations. I’ve become familiar with the nuts and bolts of SQL, as a language, and I know how to use sqldf with dataframes running in the R working environment. However, I frankly do not know what advantage that offers over just learning how to use the base R functions to filter, sort, and aggregate data. Also, I’ve read a few webpages about the hype of pairing SQL Server with R, but I’m unsure if this is a good option for me since I run everything locally.

Any tips for this newbie on how to improve my data processing and analytics via combining R with some implementation of SQL?

Thank you in advance!

Jayden.Cameron

3 Answers


Given that you are looking for workflow best practices, a high premium should be placed on both reproducibility and transparency. Since your goal is data analysis, not data collection or database management, there is no strong reason to build your own database, and a custom database would likely make your workflow and analysis less transparent. In short: if you don't need to build a database, don't.

It sounds like your workflow is as follows:

  1. Download data from public sources (ideally .csv or a similarly nice format)
  2. Clean and process the data
  3. Run analysis on the (potentially linked) cleaned data

I would recommend dividing your workflow into two distinct steps:

1. Download and Clean Data

If your files are all .csv (or other regular delimited files), then you just need the data.table package for this step. You can write a single R script to download, clean, and save the data you need. A minimal example is below:

# Download data
library(data.table)
salary_data <- fread('https://data.phila.gov/api/views/25gh-t2gp/rows.csv')

# Clean data (only looking at City Council salaries)
cleaned_data <- salary_data[Department == 'CITY COUNCIL']

# Saving cleaned data
save(cleaned_data, file = 'my_file_name.rda', compress = TRUE)

Ideally, you would only have to run this file once to generate the dataset you actually perform your statistical analysis on. If you decide to clean or process your data differently, just revisit this file, make the appropriate changes, and rerun it. I would recommend having one script for each file that you are downloading so that it is easy to see how you are processing the raw data straight from the source (transparency). Simply having this file satisfies reproducibility.

2. Statistical Analysis

If you need to combine your datasets, data.table provides a fast and transparent way of doing this. Simply load in your cleaned individual datasets, identify the key that you will use to merge them, and then merge them. Then run your analysis on the merged dataset. Below is an example of this functionality:

# dt1 has salary information for 10 people and dt2 
# has the number of kids for the same 10 people
library(data.table)
dt1 <- data.table(id = 1:10, salary = sample(0:100000, 10))
dt2 <- data.table(id = 1:10, kids = sample(0:5, 10, replace = TRUE))
save(dt1, file = 'dt1.rda', compress = TRUE)
save(dt2, file = 'dt2.rda', compress = TRUE)

# Loading and merging data
load(file = 'dt1.rda')
load(file = 'dt2.rda')
setkey(dt1, id)
setkey(dt2, id)
merged_dt <- merge(dt1, dt2)

# Doing regression analysis on merged data
reg <- lm(salary ~ kids, data = merged_dt)

This makes the merging procedure and subsequent analysis transparent and reproducible.

Summary

This procedure ensures that your data sources, data cleaning/processing, and analysis are well-documented, transparent, and reproducible. Furthermore, this procedure scales with your computer. If you do not need to build a database, then don't.

What if the data is too big for my computer? If you need more space, just run the code you've already written on a dedicated server or an Amazon Web Services machine.

What if the data is too big for a dedicated server? Chances are the data is already stored in an actual database. In that case, the only piece of the workflow that changes is the download step (and potentially some of the processing): it becomes a SQL query against that database, most likely sent with the DBI package, which runs SQL queries from within R. The result should then be small enough to work with locally or on a dedicated server.
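
For concreteness, here is a minimal sketch of that pattern. The driver, connection details, and the salaries table below are placeholders I made up, so substitute whatever your actual source requires (e.g., RPostgres, RMariaDB, or odbc):

# Connect to a remote database and let it do the filtering/aggregation;
# every connection detail below is a placeholder
library(DBI)

con <- dbConnect(
  RPostgres::Postgres(),
  host     = 'db.example.org',
  dbname   = 'public_data',
  user     = 'analyst',
  password = Sys.getenv('DB_PASSWORD')
)

# Only the small aggregated result is pulled into R, not the full table
dept_salaries <- dbGetQuery(con, "
  SELECT department, AVG(annual_salary) AS avg_salary
  FROM salaries
  GROUP BY department
")

dbDisconnect(con)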

What if my data is too big for that? You should probably look into heavier-duty big data frameworks such as Hadoop.

Supplemental Note: If your data is not in a regular delimited format (for example an Excel, SAS, or Stata file), then I would recommend using the download.file() function coupled with the tidyverse's specialized readers, such as readxl and haven, which do a fantastic job of reading these less pleasant but common formats.

library(tidyverse)
library(readxl)  # read_excel() comes from readxl, which is installed with the tidyverse but not attached by library(tidyverse)
download.file(url = 'http://www.nyc.gov/html/tlc/downloads/excel/current_medallion_drivers.xls', destfile = 'taxi_data_excel.xls', mode = 'wb')
taxi_data <- read_excel('taxi_data_excel.xls')

Then do your cleaning as usual.

Mallick Hossain
  • Would put +2 for this answer if I could. Yes, don't use `sqldf` unless you really have to: the powerhouse `data.table` is quite simple to learn when you have `SQL` experience and offers many more possibilities. Writing data.tables to files is a very flexible and efficient alternative. – Jealie Apr 06 '17 at 20:05

First things first: sqldf is not a database; it is a package that lets you manipulate data.frame objects with SQL syntax. To be precise, it uses SQLite as its back-end, but you should not think of the sqldf package as a database.
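
To make that concrete, here is a tiny made-up example (the data frame is invented purely for illustration): sqldf copies the data frame into a temporary SQLite database behind the scenes, runs your query there, and returns the result as an ordinary data frame.

# Querying an in-memory data frame with SQL via sqldf
library(sqldf)

people <- data.frame(id = 1:6,
                     dept = c('A', 'A', 'B', 'B', 'C', 'C'),
                     salary = c(50, 55, 60, 65, 70, 75))

sqldf('SELECT dept, AVG(salary) AS avg_salary FROM people GROUP BY dept')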

sqldf is a good and convenient package. In some cases it can be efficient as well, but efficiency is not its primary goal. I recommend that you consider the data.table package instead: it is designed for efficiency, and its performance might surprise you.
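
For comparison, here is the same aggregation as the sqldf sketch above, written with data.table (again, invented data):

# data.table equivalent of the SQL GROUP BY aggregation above
library(data.table)

people <- data.table(id = 1:6,
                     dept = c('A', 'A', 'B', 'B', 'C', 'C'),
                     salary = c(50, 55, 60, 65, 70, 75))

people[, .(avg_salary = mean(salary)), by = dept]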

The first and main piece of advice in choosing a strategy is this: respect the trade-offs! Deploying an actual SQL database alongside R might give you a huge advantage, but it adds significant overhead to the development process. It all depends on the scope of the project. There are no general rules, but I can try to indicate some rules of thumb.

  • By default I would try to avoid involving a SQL database, unless I face project-specific arguments for SQL.

  • If the bottleneck is RAM and R is needed only for the aggregated data, then you should seriously consider using a SQL database. For example, MySQL will take care of paging, caching, and multi-threading for you - these can be important arguments in its favor.

  • If the data structures of the different sources differ significantly, then using SQL adds extra overhead, because you will have to manage the structures both in R and in SQL - try to avoid that. On the other hand, if there are many sources with the same data structure, then a database will give you a good improvement.

  • If you only need to process the source data once, then dealing with files is fine. But if you need to run the process repeatedly and save all of the outputs, changes, versions, and so on, then a database becomes a necessity.

This is just my humble opinion.

Vyga
  • To clarify, I do not consider `sqldf` as a database. – Jayden.Cameron Apr 05 '17 at 09:29
  • I am weighing the costs/benefits of using the `sqldf` package within `r` in order to create a database on my local machine, and then continuing to use the `SQLite` syntax via `sqldf` to run queries on the data stored in that database instead of the built-in `r` functions (`merge`, `aggregate`, `with`, `by`, and others, for example). Your bullet-point list helps me better understand some of the tradeoffs in using a database, so thank you. However, I do not fully understand how your recommendation for the `data.table` package fits in. – Jayden.Cameron Apr 05 '17 at 09:45
  • As Vyga has said, sqldf is not a database, nor does it allow you to "create a database on" your local machine. The package merely allows you to query data frames in R. The data.table package is an alternative to using data frames in R. It's built for efficiency and ease of use. I think Vyga merely implied that you can use it in addition to either choice (of R with or without a DB) instead of using data frames. – AYR Apr 05 '17 at 12:05
  • @Jayden.Cameron, note that `data.table` is a package for data manipulations such as `merge`, `sort`, `aggregate`, and so on. It is an alternative way to do the operations you are interested in. Sometimes the efficiency is stunning. In addition, you may want to look at `ddply` (from the `plyr` package) as a good alternative to the `by` function. – Vyga Apr 05 '17 at 13:41

It very much depends on the infrastructure of your environment, but in the world of "Big Data" I would recommend using both because each has advantages that are difficult to give up.

Most data cleansing and manipulation tasks can be performed on either platform, some at the cost of performance and some at the cost of resources.

In-Memory: R's environment lives mostly in RAM, which is much faster but not always necessary. If you have a 100 GB dataset, loading it into RAM would be infeasible. Most databases have by now introduced in-memory tables, so if there are specific tables you want faster access to, you can always load them into RAM.

Indexes & Partitions: It's much easier to query data that has been efficiently indexed and partitioned in a database than to churn through CSV files. Most exploratory analysis is done on partitions or clusters of the data, and giving this up is a huge performance trade-off.
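
As a hypothetical illustration (the database file and the salaries table are made-up names and assumed to already exist), adding an index from R through DBI is a one-liner:

# Index a frequently filtered column so repeated queries stay fast;
# 'local_analysis.sqlite' and the salaries table are placeholders
library(DBI)

con <- dbConnect(RSQLite::SQLite(), 'local_analysis.sqlite')
dbExecute(con, 'CREATE INDEX IF NOT EXISTS idx_salaries_dept ON salaries (department)')
dbDisconnect(con)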

Downloading and Storing: In R it is very easy to write a script that downloads data and uploads it to a database. In a database the data can be stored for fast access, and it can be compressed efficiently for performance and scalability.
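
A minimal sketch of that download-then-store step, using DBI with a local SQLite file (the URL, file name, and table name are placeholders I invented):

# Download a public dataset and persist it in a local SQLite database for reuse
library(DBI)
library(data.table)

raw <- fread('https://example.org/some_public_data.csv')   # placeholder URL

con <- dbConnect(RSQLite::SQLite(), 'local_analysis.sqlite')
dbWriteTable(con, 'public_data', raw, overwrite = TRUE)    # stored on disk between sessions
dbDisconnect(con)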

Table Views: There are many datasets, or basic manipulations of datasets, that you will want to keep for later use. In a database you can make use of table views, which can join and manipulate data across any number of tables. To obtain the same result in R, you would have to load all the relevant tables and perform the merges and manipulations every time you wanted to access that data.
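
To sketch what that looks like, here is a hypothetical view over salary and kids tables (mirroring the example in the first answer), assuming they have already been written to the same local SQLite file; all names are placeholders:

# Define the join once as a view, then query it like a table in any later session
library(DBI)

con <- dbConnect(RSQLite::SQLite(), 'local_analysis.sqlite')

dbExecute(con, "
  CREATE VIEW IF NOT EXISTS salary_with_kids AS
  SELECT s.id, s.salary, k.kids
  FROM salaries AS s
  JOIN kids AS k ON s.id = k.id
")

merged <- dbGetQuery(con, 'SELECT * FROM salary_with_kids')
dbDisconnect(con)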

Analysis: This is what R was built for. Many databases offer little to no support for even basic statistical analysis, so I would leave all of the statistical analysis in R.

I'm sure there are many more advantages and disadvantages that could be compared between R and databases. Again, if you are working with small amounts of data for fun, you can use R all the way. Otherwise, use both: it's easier, faster, and a lot more comfortable.

AYR
  • I greatly appreciate your feedback. However, as a newbie, I would push back at your final comment; I do not find it easy, fast, or comfortable to go back-and-forth between using a database and `r`, and I do not fully understand how they can be used together effectively. You mentioned that using `r` I can download data and upload it to a database where it can be compressed efficiently for performance and scalability. How? Using `sqldf` I can `ATTACH` a database to my local machine, which I can continue to use in `r`, but it is not a small file, and I think this is probably inefficient. – Jayden.Cameron Apr 05 '17 at 09:43
  • There is no going "back-and-forth" between the DB and R. Think of the DB as your file storage system and R as your platform. You can make use of the RODBC package to connect to a database and the sqlSave function to save data to the database (new table, append, etc.). I'm not talking about using sqldf; that's specifically for querying data objects in R using SQL, as opposed to querying a database. Regardless of the question and your environment, if you really want to progress in this field I strongly suggest becoming familiar with basic DB skills. You can easily set up a MySQL database locally. – AYR Apr 05 '17 at 11:51
  • I agree with you that I need to increase my basic DB knowledge. I'm afraid my initial question may not yield an answer which is 100% satisfactory to me, owing to my lack of DB knowledge. I've gone through a Coursera course on SQL, which taught me how to use a variety of queries. I've learned how to use `sqldf` to use those queries on dataframes in `r`. However, I am still clueless about how to create a DB locally. I will take your suggestion and start getting familiar with MySQL specifically. Any more pedagogical tools you may suggest are appreciated. – Jayden.Cameron Apr 05 '17 at 13:32