I have been trying to properly clean and format raw Item price data for (time-series) analysis and I'm curious how any of you professionals would tackle this kind of set up. Every two columns represents a list of dates and a list of prices. These dates are (unfortunately) independent of any other date in the same row (although may have many as the same by happenstance).
My tactic here is to make a new data frame with rows representing days, columns representing prices, and run a loop that matches the Item dates to the correct row and fills in the correct price.
However, I believe I may be doing this inefficiently and my online searches aren't giving me other examples of this procedure.
Below please find example code.
df <- structure(list(Date1 = c("10/31/2011", "11/7/2011", "11/14/2011", "11/21/2011", "11/28/2011", "12/5/2011", "12/12/2011", "12/19/2011", "1/2/2012", "1/9/2012", "1/16/2012"), Item1 = c(650L, 650L, 635L, 640L, 640L, 625L, 620L, 580L, 550L, 520L, 530L), Date2 = c("10/31/2011", "11/7/2011", "11/14/2011", "11/21/2011", "11/28/2011", "12/5/2011", "12/12/2011", "12/19/2011", "1/2/2012", "1/9/2012", "1/16/2012"), Item2 = c(590L, 590L, 590L, 580L, 580L, 580L, 580L, 580L, 460L, 460L, 395L), Date3 = c("12/5/2011", "12/12/2011", "12/19/2011", "1/2/2012", "1/9/2012", "1/16/2012", "1/23/2012", "1/30/2012", "2/6/2012", "2/13/2012", "2/20/2012"), Item3 = c(775L, 775L, 775L, 750L, 750L, 750L, 750L, 750L, 725L, 725L, 740L), Date4 = c("10/31/2011", "11/7/2011", "11/14/2011", "11/21/2011", "11/28/2011", "12/5/2011", "12/12/2011", "12/19/2011", "1/2/2012", "1/9/2012", "1/16/2012"), Item4 = c(660L, 700L, 700L, 700L, 700L, 700L, 650L, 650L, 650L, 650L, 610L), Date5 = c("10/31/2011", "11/7/2011", "11/14/2011", "11/21/2011", "11/28/2011", "12/5/2011", "12/12/2011", "12/19/2011", "1/2/2012", "1/9/2012", "1/16/2012"), Item5 = c(705L, 705L, 705L, 650L, 650L, 650L, 650L, 555L, 555L, 555L, 555L), Date6 = c("10/31/2011", "11/7/2011", "11/14/2011", "11/21/2011", "11/28/2011", "12/5/2011", "12/12/2011", "12/19/2011", "1/2/2012", "1/9/2012", "1/16/2012"), Item6 = c(612L, 612L, 612L, 612L, 612L, 612L, 612L, 612L, 612L, 612L, 612L), Date7 = c("10/31/2011", "11/7/2011", "11/14/2011", "11/21/2011", "11/28/2011", "12/5/2011", "12/12/2011", "12/19/2011", "1/2/2012", "1/9/2012", "1/16/2012"), Item7 = c(630L, 630L, 625L, 635L, 625L, 615L, 620L, 560L, 550L, 540L, 530L), Date8 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Item8 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Date9 = c("10/31/2011", "11/7/2011", "11/14/2011", "11/21/2011", "11/28/2011", "12/5/2011", "12/12/2011", "12/19/2011", "1/2/2012", "1/9/2012", "1/16/2012"), Item9 = c(622L, 622L, 650L, 650L, 650L, 660L, 660L, 660L, 665L, 665L, 665L), Date10 = c("10/31/2011", "11/7/2011", "11/14/2011", "11/21/2011", "11/28/2011", "12/5/2011", "12/12/2011", "12/19/2011", "1/2/2012", "1/9/2012", "1/16/2012"), Item10 = c(1040L, 1040L, 1040L, 1040L, 1040L, 1040L, 1040L, 1040L, 1040L, 1040L, 1040L)), .Names = c("Date1", "Item1", "Date2", "Item2", "Date3", "Item3", "Date4", "Item4", "Date5", "Item5", "Date6", "Item6", "Date7", "Item7", "Date8", "Item8", "Date9", "Item9", "Date10", "Item10"), row.names = 95:105, class = "data.frame")
df
class(df)
# visual inspection for first and last date (10/31/2011, 2/20/2012)
mdyyyy <- function(x){as.Date(x,"%m/%d/%Y")}
days <- seq.Date(from = mdyyyy("10/31/2011"), # first date
to = mdyyyy("2/20/2012"), # last date
by = "day")
head(days)
datecolumns <- seq(1,ncol(df),by=2) # (odds) date columns
pricecolumns <- seq(2,ncol(df),by=2) # (evens) index columns
# Creating a new, cleaned matrix of data where the
# rows = days and columns = indices
newdat <- matrix(NA,
length(days),
ncol(df[,pricecolumns])) # indices wide
# Name rows
rownames(newdat) <- format(days,"%m/%d/%Y")
# Each row is a new day
head(newdat[,1:10])
# Placing prices into the appropriate rows
for(i in 1:length(datecolumns)){
pricedates <- 0 # initialize/reset
pricedates <- mdyyyy(df[,datecolumns[i]]) # column's price dates
rowlocations <- 0 # initialize/reset
rowlocations <- match(pricedates, days) # date's new row number
for(j in 1:length(rowlocations)){
# within each cell, place appropriate price
newdat[rowlocations[j],i] <- df[j,pricecolumns[i]]
}
}
colnames(newdat) <- colnames(df[,pricecolumns])
head(newdat)
Afterwards I've been looking to the xts package to help me make this into something I can analyze by doing apply.monthly() and rollapply() since the raw data is much more extensive.
Thank you very much for your thoughts and criticisms.