3

Is it possible to differentiate an ECDF? Take the one obtained in the following for example example.

set.seed(1)

a <- sort(rnorm(100))
b <- ecdf(a)

plot(b)

enter image description here

I would like to take the derivative of b in order to obtain its probability density function (PDF).

李哲源
  • 59,090
  • 15
  • 146
  • 206
MaxPlank
  • 125
  • 2
  • 9

1 Answers1

5
n <- length(a)  ## `a` must be sorted in non-decreasing order already
plot(a, 1:n / n, type = "s")  ## "staircase" plot; not "line" plot

However I'm looking to find the derivative of b

In samples-based statistics, estimated density (for a continuous random variable) is not obtained from ECDF by differentiation, because the sample size is finite and and ECDF is not differentiable. Instead, we estimate the density directly. I guess plot(density(a)) is what you are really looking for.


a few days later..

Warning: the following is just a numerical solution without statistical ground!

I take it as an exercise to learn about R package scam for shape constrained additive models, a child package of mgcv by Prof Wood's early PhD student Dr Pya.

The logic is as such:

  • using scam::scam, fit a monotonically increasing P-spline to ECDF (you have to specify how many knots you want); [Note that monotonicity is not the only theoretical constraint. It is required that the smoothed ECDF are "clipped" on its two edges: the left edge at 0 and the right edge at 1. I am currently using weights to impose such constraint, by giving very large weight at two edges]
  • using stats::splinefun, reparametrize the fitted spline with a monotonic interpolation spline through knots and predicted values at knots;
  • return the interpolation spline function, which can also evaluate the 1st, 2nd and 3rd derivatives.

Why I expect this to work:

As sample size grows,

  • ECDF converges to CDF;
  • P-spline is consistent so a smoothed ECDF will be increasingly unbiased for ECDF;
  • the 1st derivative of smoothed ECDF will be increasingly unbiased for PDF.

Use with caution:

  • You have to choose number of knots yourself;
  • the derivative is NOT normalized so that the area under the curve is 1;
  • the result can be rather unstable, and is only good for large sample size.

function arguments:

  • x: a vector of samples;
  • n.knots: number of knots;
  • n.cells: number of grid points when plotting derivative function

You need to install scam package from CRAN.

library(scam)

test <- function (x, n.knots, n.cells) {

  ## get ECDF
  n <- length(x)
  x <- sort(x)
  y <- 1:n / n
  dat <- data.frame(x = x, y = y)  ## make sure `scam` can find `x` and `y`

  ## fit a monotonically increasing P-spline for ECDF
  fit <- scam::scam(y ~ s(x, bs = "mpi", k = n.knots), data = dat,
                    weights = c(n, rep(1, n - 2), 10 * n))
  ## interior knots
  xk <- with(fit$smooth[[1]], knots[4:(length(knots) - 3)])
  ## spline values at interior knots
  yk <- predict(fit, newdata = data.frame(x = xk))
  ## reparametrization into a monotone interpolation spline
  f <- stats::splinefun(xk, yk, "hyman")

  par(mfrow = c(1, 2))

  plot(x, y, pch = 19, col = "gray")  ## ECDF
  lines(x, f(x), type = "l")          ## smoothed ECDF
  title(paste0("number of knots: ", n.knots,
               "\neffective degree of freedom: ", round(sum(fit$edf), 2)),
        cex.main = 0.8)

  xg <- seq(min(x), max(x), length = n.cells)
  plot(xg, f(xg, 1), type = "l")     ## density estimated by scam
  lines(stats::density(x), col = 2)  ## a proper density estimate by density

  ## return smooth ECDF function
  f
  }

## try large sample size
set.seed(1)
x <- rnorm(1000)
f <- test(x, n.knots = 20, n.cells = 100)

test

f is a function as returned by stats::splinefun (read ?splinefun).

A naive, similar solution is to do interpolation spline on ECDF without smoothing. But this is a very bad idea, as we have no consistency.

g <- splinefun(sort(x), 1:length(x) / length(x), method = "hyman")
curve(g(x, deriv = 1), from = -3, to = 3)

enter image description here

A reminder: it is highly recommended to use stats::density for a direct density estimation.

李哲源
  • 59,090
  • 15
  • 146
  • 206
  • Could you point to some follow up steps on "density estimation"? Currently I am working with a data set which features some very fine structures that I do not consider negligible. However using `stats::density` will smooth these structures out and in general this function seems to have a lot of internal settings that make me uncomfortable about what is actually being represented. So what you mean by "direct density **estimation**" is basically to get a rough idea of what your distribution is? – algae Feb 04 '20 at 02:44