3

I'm trying to convert characters to dates. The dates have heterogeneous formats, and I'd prefer to not have to code each date format separately (and specify which formats belong to which elements, through, e.g., grepl).

Here's my test data:

test <- c(
  "2012-11-11", "12-5-23", "12/5/86", "2015-12-16 1300", 
  "8/6/92 3:00", "11/6/14 4", "10/31/14 52", 
  "06/15/2014 14:37", "2/10/06", "95-06-26", "82-10-03"
)

Desired result:

as.POSIXct(c("2012-11-11 00:00:00 UTC", "2012-05-23 00:00:00 UTC", "1986-12-05 00:00:00 UTC", "2015-12-16 13:00:00 UTC", "1992-08-06 03:00:00 UTC", "2011-06-14 04:00:00 UTC", "2014-10-31 00:52:00 UTC", "2014-06-15 14:37:00 UTC", "2006-02-10 00:00:00 UTC", "1995-06-26 00:00:00 UTC", "1982-10-03 00:00:00 UTC"), tz="UTC")

I realize that some of the weird time in test (like 4, or 53) might be impossible unequivocally parse, but for those I mostly want to make sure that the random time doesn't screw up the date.

Here's my best attempt:

orders <- paste(rep(c("ymd", "mdy", "Ymd"),each=3), c("HM","H","M"))
lubridate::parse_date_time(test, orders=orders, truncated=2)

[1] "2012-11-11 00:00:00 UTC" "0012-05-23 00:00:00 UTC" "1986-12-05 00:00:00 UTC" "2015-12-16 13:00:00 UTC" "0092-08-06 03:00:00 UTC" "2011-06-14 04:00:00 UTC"
[7] "2014-10-31 00:52:00 UTC" "2014-06-15 14:37:00 UTC" "2006-02-10 00:00:00 UTC" "0095-06-26 00:00:00 UTC" "0082-10-03 00:00:00 UTC"

The problem is that it gets the century wrong on the 2-digit years. What surprising is that this works fine:

parse_date_time(test[2], orders=orders[1], truncated=2)
[1] "2012-05-23 UTC"

In ?parse_date_time there is a note:

NOTE: ymd family of functions are based on strptime which currently fails to parse %y-%m formats.

But that shouldn't apply here because 1) I have day and 2) it seems to work when I only have a date with a year w/o century. I think the problem in my specification is in how it interacts with the guessing/ training.

Lubridate is so close to making this tiresome task a lot easier, I think. Is there a way to get lubridate , or any other method, to generically parse the dates in test?

rbatt
  • 4,407
  • 4
  • 20
  • 39
  • It's simply not possible. 12-1-5 – Casimir et Hippolyte Dec 16 '15 at 19:25
  • @CasimiretHippolyte there are cases that are ambiguous. But the context matters, too. I keep thinking about this. The formats in `test` don't always exist in the exact same data set; so maybe my example is misleading. Any given data set might have 1-3 formats, but within the data set they don't usually have variable formats that would lead to ambiguity. E.g., might have 12-12-12, 12-1-23, 1923-12-14 in one. – rbatt Dec 16 '15 at 19:29

1 Answers1

1

This answer only addresses the particular symptom of the wrong century; although that's the only problem I noticed in my test data. I would be happy to see someone else's approach to robustly converting dates; if I've learned anything, it that there are always more surprises in date formats, and that they usually have solutions. But for now, by heavily relying on the linked answer, this is my best:

foo <- function(x, orders, year=1940, ...){
  requireNamespace("lubridate", quietly=TRUE)
  x <- lubridate::parse_date_time(x, orders=orders, ...)
  m <- lubridate::year(x) %% 100
  year(x) <- ifelse(m > year %% 100, 1900+m, 2000+m)
  x
}

orders <- paste(rep(c("ymd", "mdy", "Ymd"),each=3), c("HM","H","M"))
foo(test, orders, truncated=2)

 [1] "2012-11-11 00:00:00 UTC" "2012-05-23 00:00:00 UTC" "1986-12-05 00:00:00 UTC" "2015-12-16 13:00:00 UTC" "1992-08-06 03:00:00 UTC" "2011-06-14 04:00:00 UTC"
 [7] "2014-10-31 00:52:00 UTC" "2014-06-15 14:37:00 UTC" "2006-02-10 00:00:00 UTC" "1995-06-26 00:00:00 UTC" "1982-10-03 00:00:00 UTC"

the year argument should basically be the earliest year present in your vector of character dates, x.

Community
  • 1
  • 1
rbatt
  • 4,407
  • 4
  • 20
  • 39