Given that you are looking for workflow best practices, a high premium should be placed on both reproducibility and transparency. Since your goal is data analysis rather than data collection or database management, there is no strong reason to create your own database, and a custom database would likely make your workflow and analysis less transparent. In short: if you don't need to build a database, don't.
It sounds like your workflow is as follows:
- Download data from public sources (ideally .csv or a similarly nice format)
- Clean and process the data
- Run analysis on the (potentially linked) cleaned data
I would recommend dividing your workflow into two distinct steps:
1. Download and Clean Data
If your files are all .csv (or other regular delimited files), then you just need the data.table package for this step. You can write a single R script to download, clean, and save the data you need. A minimal example is below:
# Download data
library(data.table)
salary_data <- fread('https://data.phila.gov/api/views/25gh-t2gp/rows.csv')
# Clean data (only looking at City Council salaries)
cleaned_data <- salary_data[Department == 'CITY COUNCIL']
# Saving cleaned data
save(cleaned_data, file = 'my_file_name.rda', compress = TRUE)
Ideally, you would only have to run this file once to generate the dataset you actually perform your statistical analysis on. If you decide to clean or process your data differently, just revisit this file, make the appropriate changes, and rerun it. I would recommend having one script for each file that you are downloading so that it is easy to see how you are processing the raw data straight from the source (transparency). Simply having this file satisfies reproducibility.
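To make the one-script-per-source recommendation concrete, a project following this pattern might be laid out as below; the file names are hypothetical, just for illustration:
# One cleaning script per raw data source (transparency)
clean_salary_data.R    # downloads/cleans salary data, saves salary_data.rda
clean_kids_data.R      # downloads/cleans kids data, saves kids_data.rda
# One script for the statistical analysis (Step 2)
analysis.R             # loads the .rda files, merges them, runs the models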
2. Statistical Analysis
If you need to combine your datasets, data.table provides a fast and transparent way of doing this. Simply load in your cleaned individual datasets, identify the key that you will use to merge them, and then merge them. Then run your analysis on the merged dataset. Below is an example of this functionality:
# dt1 has salary information for 10 people and dt2
# has the number of kids for the same 10 people
library(data.table)
dt1 <- data.table(id = 1:10, salary = sample(0:100000, 10))
# replace = TRUE is needed here: there are only 6 possible values for 10 draws
dt2 <- data.table(id = 1:10, kids = sample(0:5, 10, replace = TRUE))
save(dt1, file = 'dt1.rda', compress = TRUE)
save(dt2, file = 'dt2.rda', compress = TRUE)
# Loading and merging data
load(file = 'dt1.rda')
load(file = 'dt2.rda')
setkey(dt1, id)
setkey(dt2, id)
merged_dt <- merge(dt1, dt2)
# Doing regression analysis on merged data
reg <- lm(salary ~ kids, data = merged_dt)
This makes the merging procedure and subsequent analysis transparent and reproducible.
Summary
This procedure ensures that your data sources, data cleaning/processing, and analysis are well-documented, transparent, and reproducible. Furthermore, it scales with your computing resources. If you do not need to build a database, then don't.
What if the data is too big for my computer? If you need more space, just run the code you've already written on a dedicated server or an Amazon Web Services machine.
What if the data is too big for a dedicated server? Chances are the data is already stored in an actual database, and the only pieces of the workflow that change are the data download and (potentially) some of the processing, which become a SQL query to the database (most likely through the DBI package, which lets you run SQL queries from R). The query result should then be small enough to handle locally or on a dedicated server.
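A minimal sketch of that query step is below; the driver, database name, and table/column names are assumptions for illustration, so swap in whatever backend you actually have (e.g. RPostgres, odbc):
# Querying a database instead of downloading a flat file
library(DBI)
# Hypothetical SQLite database, purely for illustration
con <- dbConnect(RSQLite::SQLite(), dbname = 'city_data.sqlite')
# Let the database do the filtering so the result fits in memory
salary_data <- dbGetQuery(con, "SELECT * FROM salaries WHERE department = 'CITY COUNCIL'")
dbDisconnect(con)
Then clean and save the result exactly as in Step 1.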
What if my data is too big for that? You should probably look into heavier-duty distributed computing frameworks such as Hadoop.
Supplemental Note: If your data is not in a regular delimited format (such as an Excel, SAS, or Stata file), then I would recommend using the download.file() function coupled with the readxl and haven packages from the tidyverse, which do a fantastic job of reading these less pleasant, but common, files. Note that these packages are installed with the tidyverse but are not attached by library(tidyverse), so load them explicitly:
library(readxl)
# download.file() writes the file to disk; its return value is just an
# exit status, so there is no need to assign it
download.file(url = 'http://www.nyc.gov/html/tlc/downloads/excel/current_medallion_drivers.xls', destfile = 'taxi_data_excel.xls')
taxi_data <- read_excel('taxi_data_excel.xls')
Then do your cleaning as usual.
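For SAS and Stata files specifically, the haven package follows the same pattern; the file names below are hypothetical:
# Hypothetical file names, for illustration only
library(haven)
stata_data <- read_dta('survey_data.dta')        # Stata file
sas_data <- read_sas('claims_data.sas7bdat')     # SAS file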