2

I have the following data (sample) with some leap minutes (e.g. 6:32 and 6:33 are missing). For those cases, count equals 0 but the database just didn't report them and leaps the minutes.

count   time
47  15/12/2014 06:30
3   15/12/2014 06:31
431 15/12/2014 06:34
320 15/12/2014 06:35
42  15/12/2014 06:36
13  15/12/2014 06:37
383 15/12/2014 06:38
160 15/12/2014 06:39

I have tried to follow other posts (I,II,III) where they used the xts package but it didn't work. I tried my own approach but it didn't work either:

sort.df <- df[order(df$time),]
time.min <- min(sort.df$time)
time.max <- max(sort.df$time)
all.dates <- seq(time.min, time.max, by="min") # I create a list of all the minutes. 
all.dates.frame <- data.frame(list(time=all.dates))
merged.data <- merge(all.dates.frame, sorted.data, all=T) 

All I get is all the minutes duplicated with NA values. Anyone know what I am doing wrong? Any help/ideas hugely appreciated!

Jaap
  • 71,900
  • 30
  • 164
  • 175
user3507584
  • 2,670
  • 5
  • 30
  • 54
  • Are you working on many data frames at the same time or is it just one data frame with where you need to adjust the minutes? And what is the format of your "time" column? – talat Dec 15 '14 at 10:41
  • This bit will be inside a loop where I read csv files, do some algebra and plot it. So I will work with many data frames but not at the same time. – user3507584 Dec 15 '14 at 10:43
  • Please edit your question to give the output of `dput(head(df))`. – Stephan Kolassa Dec 15 '14 at 10:45
  • @StephanKolassa , please see the changes. Thanks for the help – user3507584 Dec 15 '14 at 10:49

3 Answers3

4

How about this - it works for a small sample data:

Your input data:

df <- read.table(header=T, text='count   time
47  "15/12/2014 06:30"
3   "15/12/2014 06:31"
431 "15/12/2014 06:34"
320 "15/12/2014 06:35"
42  "15/12/2014 06:36"
13  "15/12/2014 06:37"
383 "15/12/2014 06:38"
160 "15/12/2014 06:39"')

Format the "time" column:

df$time <- as.POSIXct(df$time, format = "%d/%m/%Y %H:%M")

Create a new data.frame with all the minutes:

newdf <- data.frame(time = seq(min(df$time), max(df$time), by = "mins"))

Then merge it with the original data:

merge(newdf, df, by = "time", all.x = TRUE)
#                  time count
#1  2014-12-15 06:30:00    47
#2  2014-12-15 06:31:00     3
#3  2014-12-15 06:32:00    NA
#4  2014-12-15 06:33:00    NA
#5  2014-12-15 06:34:00   431
#6  2014-12-15 06:35:00   320
#7  2014-12-15 06:36:00    42
#8  2014-12-15 06:37:00    13
#9  2014-12-15 06:38:00   383
#10 2014-12-15 06:39:00   160
talat
  • 62,625
  • 18
  • 110
  • 141
  • It worked perfectly. Thanks. Just one question, was it the function of the all.x = TRUE? I think that was the bit missing – user3507584 Dec 15 '14 at 11:05
  • 1
    Yes probably that was the missing bit – talat Dec 15 '14 at 11:12
  • Answer: meaning of `all.x` [where x is the first data frame and y the second data frame merged] logical; if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have NAs in those columns that are usually filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output – user3507584 Dec 15 '14 at 11:15
2

Much of the manipulation you are doing is automatically performed if you use a time series representation such as zoo or xts. There are examples of this in the zoo vignettes but here it is again. g is a grid of times and based on it we merge a zero-width series with such times together with z to get the result:

# test data
Lines <- "count,time
47,15/12/2014 06:30
3,15/12/2014 06:31
431,15/12/2014 06:34
320,15/12/2014 06:35
42,15/12/2014 06:36
13,15/12/2014 06:37
383,15/12/2014 06:38
160,15/12/2014 06:39"

library(zoo)
df <- read.csv(text = Lines)

# convert to zoo
fmt <- "%d/%m/%Y %H:%M"
z <- read.zoo(df, index = 2, tz = "", format = fmt)

# create grid and merge 0-width series based on it with z
g <- seq(start(z), end(z), by = "min") # grid of times
merge(z, zoo(, g))

giving:

2014-12-15 06:30:00 2014-12-15 06:31:00 2014-12-15 06:32:00 2014-12-15 06:33:00 
                 47                   3                  NA                  NA 
2014-12-15 06:34:00 2014-12-15 06:35:00 2014-12-15 06:36:00 2014-12-15 06:37:00 
                431                 320                  42                  13 
2014-12-15 06:38:00 2014-12-15 06:39:00 
                383                 160 

If we were starting out from an input file rather than a data frame df then we could combine the read.csv and read.zoo statements into just a read.zoo statement:

z <- read.zoo(text = Lines, header = TRUE, sep = ",", index = 2, tz = "", format =fmt)
G. Grothendieck
  • 211,268
  • 15
  • 177
  • 297
1

This is now conveniently implemented in the package padr. If your data frame is prepared as is done by docendo (with date-time saved as POSIXct), this is all you need:

library(padr)
pad(df)

See vignette("padr") for its working.

Edwin
  • 2,894
  • 19
  • 22