3

When I create an rpart tree that uses a date cutoff at a node, the plotting functions I use (both rpart.plot and fancyRpartPlot) print the dates in scientific notation, which makes the result hard to interpret. Here's the fancyRpartPlot output:

[Figure: fancyRpartPlot of the tree, with the date splits printed in scientific notation]

Is there a way to print this tree with more interpretable date values? This tree plot is meaningless as all those dates look the same.

Here's my code for creating the tree and plotting two ways:

library(rpart) ; library(rpart.plot) ; library(rattle)
my_tree <- rpart(a ~ ., data = dat)
rpart.plot(my_tree)
fancyRpartPlot(my_tree)

Using this data:

# define a random date/time selection function
generate_days <- function(N, st="2012/01/01", et="2012/12/31") {
  st = as.POSIXct(as.Date(st))
  et = as.POSIXct(as.Date(et))
  dt = as.numeric(difftime(et,st,units="secs"))
  ev = runif(N, 0, dt)
  rt = st + ev
  rt
}

set.seed(1)
dat <- data.frame(
  a = runif(100),
  b = rpois(100, 5),
  c = sample(c("hi","med","lo"), 100, TRUE),
  d = generate_days(100)
)
Sam Firke

3 Answers

4

From a practical standpoint, perhaps you'd like to just use days from the start of the data:

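# Days since the start of the data; explicit units avoid difftime's automatic unit choice
dat$d <- as.numeric(difftime(dat$d, as.POSIXct(as.Date("2012/01/01")), units = "days"))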
my_tree <- rpart(a ~ ., data = dat)
rpart.plot(my_tree,branch=1,extra=101,type=1,nn=TRUE)

[Figure: rpart.plot of the tree, with d expressed as days since 2012-01-01]

This reduces the number to something manageable and meaningful (though perhaps not as meaningful as a specific date). You may even want to round it to the nearest day or week. (I can't install GTK+ on my computer, so I can't use fancyRpartPlot.)
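As a rough sketch of the rounding idea, the snippet below turns timestamps into whole days and whole weeks since an origin, then maps a split value back to a calendar date. The two timestamps and the split value `243.5` are made up for illustration, and `origin` is assumed to be the first day covered by the data.

```r
# Assumption: the data start on 2012-01-01, as in the question
origin <- as.Date("2012-01-01")

# Two example timestamps standing in for dat$d (hypothetical values)
d <- as.POSIXct(c("2012-03-15 08:30:00", "2012-08-31 22:53:15"), tz = "UTC")

# Whole days since the origin (time of day is dropped)
days_since <- as.numeric(as.Date(d, tz = "UTC") - origin)

# Whole weeks since the origin
weeks_since <- days_since %/% 7

# A split reported by the tree, e.g. "d >= 243.5" (hypothetical value),
# can be mapped back to a calendar date by adding it to the origin
split_days <- 243.5
split_date <- origin + floor(split_days)
format(split_date, "%Y-%m-%d")
```

With the predictor stored this way, the plot shows small numbers, and a split is turned back into a date only when you need to report it.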

Sam Dickson
    +1 as this is the best option so far and beats the current scientific notation. I'm still hoping to get the date value printed as a date in the tree, so will keep the question open. – Sam Firke Jan 15 '16 at 18:00
1

One possible way is to use the digits option of print to show the split values in full, then convert them to dates with as.POSIXlt:

> print(my_tree,digits=100)
n= 100

node), split, n, deviance, yval
      * denotes terminal node

 1) root 100 7.0885590 0.5178471
   2) d>=1346478795.049611568450927734375 33 1.7406368 0.4136051
     4) b>=4.5 23 1.0294497 0.3654257 *
     5) b< 4.5 10 0.5350040 0.5244177 *
   3) d< 1346478795.049611568450927734375 67 4.8127122 0.5691901
     6) d< 1340921905.3460228443145751953125 55 4.1140164 0.5368048
      12) c=hi 28 1.8580913 0.4779574
        24) d< 1335890083.3241622447967529296875 18 0.7796261 0.3806526 *
        25) d>=1335890083.3241622447967529296875 10 0.6012662 0.6531062 *
      13) c=lo,med 27 2.0584052 0.5978317
        26) d>=1337494347.697483539581298828125 8 0.4785274 0.3843749 *
        27) d< 1337494347.697483539581298828125 19 1.0618892 0.6877082 *
     7) d>=1340921905.3460228443145751953125 12 0.3766236 0.7176229 *

## Get date on first node
> as.POSIXlt(1346478795.049611568450927734375,origin="1970-01-01")
[1] "2012-08-31 22:53:15 PDT"

The digits option is also available in rpart.plot and fancyRpartPlot:

rpart.plot(my_tree,digits=10)
fancyRpartPlot(my_tree, digits=10)
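Instead of copying the numbers from the printed output, the cut points can probably be read from the fitted object directly. The sketch below assumes the `splits` matrix of an rpart object, whose row names are the split variables and whose `index` column holds the cut point for a continuous variable. The data set `toy` is invented for illustration, and the time zone is fixed to UTC so the result does not depend on the local setting.

```r
library(rpart)

set.seed(1)
# Toy data (an assumption for illustration): the response jumps after 1 July 2012
d <- as.POSIXct("2012-01-01", tz = "UTC") + runif(100, 0, 365 * 86400)
toy <- data.frame(
  a = as.numeric(d > as.POSIXct("2012-07-01", tz = "UTC")) + rnorm(100, sd = 0.1),
  d = d
)
toy_tree <- rpart(a ~ d, data = toy)

# Rows of $splits are named after the split variable; "index" is the cut point.
# Note: $splits also lists competitor and surrogate splits, not only primary ones.
is_d <- rownames(toy_tree$splits) == "d"
cut_secs <- toy_tree$splits[is_d, "index"]

# The cut points are seconds since 1970-01-01, so convert them back to date-times
cut_dates <- as.POSIXct(cut_secs, origin = "1970-01-01", tz = "UTC")
format(cut_dates, "%Y-%m-%d")
```

The same conversion applies to the first split shown above: 1346478795 seconds corresponds to 2012-09-01 05:53:15 in UTC, which is 2012-08-31 22:53:15 PDT.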
fishtank
0

I don't know how important the specific chronological date is in your model, but an alternative is to break your dates down by their characteristics. In other words, create indicator (1/0) variables for the year (2012, 2013, 2014, ...), the day of the week (Mon, Tue, Wed, ...), and maybe even the day of the month (1, 2, 3, ..., 31). This adds many more variables to split on, but it avoids the problem of working with a fully formatted date.
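A minimal sketch of this kind of feature extraction, using two made-up timestamps in place of dat$d. The column names are illustrative only. The weekday is stored as a factor instead of separate 1/0 columns, on the assumption that rpart can split on factor levels directly.

```r
# Two example timestamps standing in for dat$d (hypothetical values)
d <- as.POSIXct(c("2012-03-15 08:30:00", "2012-08-31 22:53:15"), tz = "UTC")

# Locale-independent weekday labels; POSIXlt's wday runs from 0 (Sunday) to 6
wday_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

features <- data.frame(
  year    = as.integer(format(d, "%Y")),
  month   = as.integer(format(d, "%m")),
  mday    = as.integer(format(d, "%d")),
  weekday = factor(wday_labels[as.POSIXlt(d)$wday + 1], levels = wday_labels)
)
features
```

These columns would replace d in the model formula, so the splits read as conditions such as a month threshold or a set of weekdays.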