
I am trying to extract a date from a number. The date is stored as the first 6 digits of an 11-digit personal ID number (day-month-year). Unfortunately, the cloud-based database (RedCap) exports this as a number, so the leading zero is lost for those born on the first nine days of the month, and they end up with a 10-digit ID number instead of an 11-digit one. I managed to extract the 6- or 5-digit number corresponding to the date, e.g. 311230 for 31 December 1930, or 11230 for 1 December 1930. I end up with two problems that I have not been able to solve.

Let's say we use the following numbers:

dato <- c(311230, 311245, 311267, 311268, 310169, 201104, 51230, 51269, 51204)

I convert these to strings and then apply the as.Date() function:

datostr <- as.character(dato)
datofinal <- as.Date(datostr, "%d%m%y")
datofinal

The problems I have are:

  • Five-digit numbers (e.g. 11230) get reported as NA.
  • Six-digit numbers are recognized, but dates before 1 January 1969 get reported with 100 years added, e.g. 010160 gets converted to 2060-01-01.

I am sure this must be easy for those who are more knowledgeable about R, but I am struggling to solve it. Any help is greatly appreciated.

Greetings Bjorn

  • This may help [Convert Excel numeric to date](https://stackoverflow.com/questions/47093228/convert-excel-numeric-to-date) – NelsonGon Jun 29 '20 at 14:44
  • @NelsonGon, I think not: Excel dates are numeric day counts, whereas these appear to be number-encoded `dmy`. – r2evans Jun 29 '20 at 15:00
  • *"Unfortunately the cloud-based database (RedCap) output of this gets formatted as a number"*, losing the century ... this is the problem that likely needs to be addressed *first*. Output as a number is one thing, but losing the century is (IMO) egregious. – r2evans Jun 29 '20 at 15:07
  • Related: [Add correct century to dates with year provided as “Year without century”, %y](https://stackoverflow.com/questions/9508747/add-correct-century-to-dates-with-year-provided-as-year-without-century-y); [Is there a more elegant way to convert two-digit years to four-digit years with lubridate?](https://stackoverflow.com/questions/12323693/is-there-a-more-elegant-way-to-convert-two-digit-years-to-four-digit-years-with) – Henrik Jun 29 '20 at 15:46

3 Answers


If your 5-digit numbers really just need to be zero-padded, then `sprintf` will do it:

dato_s <- sprintf("%06d", dato)
dato_s
# [1] "311230" "311245" "311267" "311268" "310169" "201104" "051230" "051269" "051204"
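If the full ID number is still available, the same padding idea can be applied one step earlier. This is only a sketch: the `id` values below are made up for illustration, and it assumes the IDs were read in as plain numerics. Values this large do not fit in R's integer type, so the `%d` format is avoided in favour of `%.0f`.

```r
# Hypothetical IDs as they might come out of the export:
# the second one has lost its leading zero (10 digits instead of 11).
id <- c(31123012345, 5123012345)

# Pad back to 11 characters; "%011.0f" formats the double without decimals
id_s <- sprintf("%011.0f", id)

# The first six characters are the ddmmyy part
dato_s <- substr(id_s, 1, 6)
dato_s
```

Working on the padded string avoids having to treat the 5-digit and 6-digit cases separately.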

As for your question about dates before 1969, take a look at `?strptime` for the `%y` pattern:

 '%y' Year without century (00-99).  On input, values 00 to 68 are
      prefixed by 20 and 69 to 99 by 19 - that is the behaviour
      specified by the 2018 POSIX standard, but it does also say
      'it is expected that in a future version the default century
      inferred from a 2-digit year will change'.

So if those years belong to a different century, you need to add the century yourself before passing the string to `as.Date` (which uses `strptime` patterns).

dato_d <- as.Date(gsub("([0-4][0-9])$", "20\\1",
                       gsub("([5-9][0-9])$", "19\\1", dato_s)),
                  format = "%d%m%Y")
dato_d
# [1] "2030-12-31" "2045-12-31" "1967-12-31" "1968-12-31" "1969-01-31" "2004-11-20"
# [7] "2030-12-05" "1969-12-05" "2004-12-05"

In this case, I'm assuming 50-99 will be 19xx and everything else 20xx. If you need the 40s or 30s to be 19xx as well, adjust the patterns: widen the `19` pattern (e.g., `[3-9]`) and narrow the `20` pattern (e.g., `[0-2]`), making sure every decade is matched by exactly one pattern, not "neither" and not "both".
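The decade cutoff can also be written as an explicit pivot year rather than a pair of regex character classes. The following is a sketch only: `add_century` and its `pivot` argument are names invented here, and it assumes the input is already a zero-padded six-character `ddmmyy` string.

```r
# Hypothetical helper: two-digit years >= pivot become 19xx, the rest 20xx
add_century <- function(ddmmyy, pivot = 50) {
  yy <- as.integer(substr(ddmmyy, 5, 6))
  century <- ifelse(yy >= pivot, "19", "20")
  full <- paste0(substr(ddmmyy, 1, 4), century, substr(ddmmyy, 5, 6))
  as.Date(full, format = "%d%m%Y")
}

add_century(c("311230", "201104"), pivot = 30)
# [1] "1930-12-31" "2004-11-20"
```

With a single pivot there is no way for a decade to fall into both branches or neither.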

Borrowing from Allan's answer, I like that assumption of now() (since you did mention "born on"). Without lubridate, try this:

dato_s <- sprintf("%06d", dato)
dato_d <- as.Date(dato_s, format = "%d%m%y")
dato_d[ dato_d > Sys.Date() ] <-
  as.Date(sub("([0-9]{2})$", "19\\1", dato_s[ dato_d > Sys.Date() ]), format = "%d%m%Y")
dato_d
# [1] "1930-12-31" "1945-12-31" "1967-12-31" "1968-12-31" "1969-01-31" "2004-11-20"
# [7] "1930-12-05" "1969-12-05" "2004-12-05"
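The same idea can be wrapped in a small function that takes the reference date as an argument, which helps if "today" is not the right cutoff (for example, if the data were collected some years ago). This is a sketch under assumptions: `fix_century` and `ref` are names made up here, and the input is expected to be zero-padded `ddmmyy` strings. It also skips `NA` values, which would otherwise break the subscripted assignment.

```r
# Hypothetical helper: parse ddmmyy, then move any date after `ref`
# back into the 1900s by inserting "19" before the two-digit year.
fix_century <- function(ddmmyy, ref = Sys.Date()) {
  d <- as.Date(ddmmyy, format = "%d%m%y")
  future <- !is.na(d) & d > ref
  d[future] <- as.Date(sub("([0-9]{2})$", "19\\1", ddmmyy[future]),
                       format = "%d%m%Y")
  d
}

fix_century(c("311230", "201104"), ref = as.Date("2020-06-29"))
# [1] "1930-12-31" "2004-11-20"
```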
r2evans

You can make this a bit easier using lubridate, and noting that no-one can have a date of birth that is in the future of the current time:

library(lubridate)

dato <- dmy(sprintf("%06d", dato))
dato[dato > now()] <- dato[dato > now()] - years(100)

dato
#> [1] "1930-12-31" "1945-12-31" "1967-12-31" "1968-12-31" "1969-01-31"
#> [6] "2004-11-20" "1930-12-05" "1969-12-05" "2004-12-05"

Of course, without further information, neither this method nor any other will be able to pick out the edge cases of people who are aged over 100. This might be easy to determine from the context.

Created on 2020-06-29 by the reprex package (v0.3.0)

Allan Cameron

Converting five-digit "numbers" to six digits is straightforward: `x <- stringr::str_pad(x, 6, pad = "0")` or similar will do the trick.
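If you would rather not load stringr, base R's `formatC` is one such "similar" option. A minimal sketch, assuming `x` holds the whole-number date values from the question:

```r
x <- c(311230, 51230)

# Pad to width 6 with leading zeros; format = "d" prints as an integer
x_s <- formatC(x, width = 6, format = "d", flag = "0")
x_s
# [1] "311230" "051230"
```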

Your problem with years is the Millennium bug revisited. You'll have to consult with whoever compiled your data to see what assumptions they used.

I suspect all dates with a two-digit year of 68 or lower are affected, not just those before 01Jan1960. That's because `as.Date` relies on the `strptime` rule for `%y`, which maps 00-68 to 20xx and 69-99 to 19xx. The `origin` argument of `as.Date` only applies to numeric input, so it will not help here; you have to supply the century yourself before converting. And then start using four-digit years in the future! ;)

Limey
  • Thanks for great suggestions everybody (@Limey, r2evans, Allan Cameron, Bob Jarvis) The problem with missing leading zero was elegantly solved by both **datostr Sys.Date(), format(datofinal, "19%y-%m-%d"), format(datofinal)))** Thanks again for very helpful advice. Bjorn – Bjorn Blomberg Jun 30 '20 at 05:57