38

If a date vector has two-digit years, mdy() turns years between 00 and 68 into 21st Century years and years between 69 and 99 into 20th Century years. For example:

library(lubridate)    
mdy(c("1/2/54","1/2/68","1/2/69","1/2/99","1/2/04"))

gives the following output:

Multiple format matches with 5 successes: %m/%d/%y, %m/%d/%Y.
Using date format %m/%d/%y.
[1] "2054-01-02 UTC" "2068-01-02 UTC" "1969-01-02 UTC" "1999-01-02 UTC" "2004-01-02 UTC"

I can fix this after the fact by subtracting 100 from the incorrect dates to turn 2054 and 2068 into 1954 and 1968. But is there a more elegant and less error-prone method of parsing two-digit dates so that they get handled correctly in the parsing process itself?

Update: After @JoshuaUlrich pointed me to strptime I found this question, which deals with an issue similar to mine, but using base R.

It seems like a nice addition to date handling in R would be some way to handle century selection cutoffs for two-digit dates within the date parsing functions.

Community
  • 1
  • 1
eipi10
  • 81,881
  • 20
  • 176
  • 248
  • 4
    Technically, the dates are being parsed correctly, since the documentation (`?strptime`) states that: "On input, values 00 to 68 are prefixed by 20 and 69 to 99 by 19 - that is the behaviour specified by the 2004 and 2008 POSIX standards". `?parse_date` briefly tells you to see `?strptime` for the formats. – Joshua Ulrich Sep 07 '12 at 18:56
  • 2
    I should have been more precise. I didn't mean to imply that `lubridate` has a bug, but merely that because of the ambiguity of two-digit years, the package's natural behavior results in incorrect four-digit years ("incorrect" in the sense of "not the desired result") under some relatively common situations. I was hoping that there was some way within `lubridate` to specify a "switch" or "cutoff" value that will give the desired century for given ranges of two-digit dates. – eipi10 Sep 07 '12 at 22:01
  • 2
    Suggest you submit a feature request to lubridate's github page. – Spacedman Oct 18 '12 at 09:09
  • As suggested by Spacedman, I've added a feature request to @Hadley's github page. – eipi10 Oct 23 '12 at 15:54
  • 1
    Note that this is no longer lubridate's behavior. For your example, the output is now `[1] "2054-01-02 UTC" "2068-01-02 UTC" "2069-01-02 UTC" "2099-01-02 UTC" "2004-01-02 UTC"`. – Frank Aug 07 '15 at 01:17
  • How can force lubridate to use a different cutoff year instead of 1969? – skan Jun 27 '17 at 15:06
  • 4
    @skan, `parse_date_time2` has a cutoff_2000 option. Here you can specify a different cutoff year. – phiver Jan 20 '18 at 19:44
  • @phiver's tip is great. Unfortunately, one can't use multiple date formats in the `parse_date_time`'s `orders` argument. So (e.g.) can't match both two and four digit years with `c("dmY", "dmy")`. (Workaround would be to copy date column first and perform operation twice, but it's clumsy) – EcologyTom Jul 17 '18 at 14:27

3 Answers3

34

Here is a function that allows you to do this:

library(lubridate)
x <- mdy(c("1/2/54","1/2/68","1/2/69","1/2/99","1/2/04"))


foo <- function(x, year=1968){
  m <- year(x) %% 100
  year(x) <- ifelse(m > year %% 100, 1900+m, 2000+m)
  x
}

Try it out:

x
[1] "2054-01-02 UTC" "2068-01-02 UTC" "1969-01-02 UTC" "1999-01-02 UTC"
[5] "2004-01-02 UTC"

foo(x)
[1] "2054-01-02 UTC" "2068-01-02 UTC" "1969-01-02 UTC" "1999-01-02 UTC"
[5] "2004-01-02 UTC"

foo(x, 1950)
[1] "1954-01-02 UTC" "1968-01-02 UTC" "1969-01-02 UTC" "1999-01-02 UTC"
[5] "2004-01-02 UTC"

The bit of magic here is to use the modulus operator %% to return the fraction part of a division. So 1968 %% 100 yields 68.

Andrie
  • 163,419
  • 39
  • 422
  • 472
3

I just experienced this exact same bug / feature.

I ended up writing the following two quick functions to help convert from excel-type dates (which is where i get this most) to something R can use.

There's nothing wrong with the accepted answer -- it's just that i prefer not to load up on packages too much.

First, a helper to split and replace the years ...

year1900 <- function(dd_y, yrFlip = 50)
{
    dd_y <- as.numeric(dd_y)
    dd_y[dd_y > yrFlip] <- dd_y[dd_y > yrFlip] + 1900
    dd_y[dd_y < yrFlip] <- dd_y[dd_y < yrFlip] + 2000
    return(dd_y)
}

which is used by a function that 'fixes' your excel dates, depending on type:

XLdate <- function(Xd, type = 'b-Y')
{
    switch(type,
        'b-Y' = as.Date(paste0(substr(Xd, 5, 9), "-", substr(Xd, 1, 3), "-01"), format = "%Y-%b-%d"),
        'b-y' = as.Date(paste0(year1900(substr(Xd, 5, 6)), "-", substr(Xd, 1, 3), "-01"), 
                        format = "%Y-%b-%d"),
        'Y-b' = as.Date(paste0(substr(Xd, 1, 3), "-", substr(Xd, 5, 9), "-01"), format =     "%Y-%b-%d")
        )
}

Hope this helps.

ricardo
  • 7,345
  • 6
  • 39
  • 64
0

Another option would be:

xxx <- c("01-Jan-54","01-Feb-68","01-Aug-69","01-May-99","01-Jun-04", "
       31-Dec-68","01-Jan-69", "31-Dec-99")

.

dmy(paste0(sub("\\d\\d$","",xxx) , ifelse( (tt <- 
   sub("\\d\\d-\\D\\D\\D-","",xxx)  ) > 20 ,paste0("19",tt),paste0("20",tt))))

Though no solution is elegant nor short. I think it would be better if lubridate just added an option to specify the cutoff date.

skan
  • 6,456
  • 11
  • 47
  • 83