1

I want to compare a reference distribution d_1 with a sample d_2 drawn proportionally to size w_1 using the Kolmogorov–Smirnov distance.

Given that d_2 is weighted, I was considering accounting for this using the Weighted Empirical Cumulative Distribution Function in R (using ewcdf {spatstat}).

The example below shows that I am probably miss-specifying the weights, because when lenght(d_1) == lenght(d_2) the Kolmogorov–Smirnov is not giving a value of 0.

Can someone help me with this? For clarity, see the reproducible example below.

#loop for testing sample sizes 1:length(d_1)
d_stat <- data.frame(1:1000, rep(NA, 1000))
names(d_stat) <- c("sample_size", "ks_distance")

for (i in 1:1000) {

#reference distribution
d_1 <- rpois(1000, 500)
w_1 <- d_1/sum(d_1)
m_1 <- data.frame(d_1, w_1)

#sample from the reference distribution
m_2 <-m_1[(sample(nrow(m_1), size=i, prob=w_1, replace=F)),]
d_2 <- m_2$d_1
w_2 <- m_2$w_1

#ewcdf for the reference distribution and the sample
f_d_1 <- ewcdf(d_1)
f_d_2 <- ewcdf(d_2, 1/w_2, normalise=F, adjust=1/length(d_2))

#kolmogorov-smirnov distance
d_stat[i,2] <- max(abs(f_d_1(d_2) - f_d_2(d_2)))
}

d_stat[1000,2]
Gion Mors
  • 512
  • 1
  • 3
  • 16

2 Answers2

1

I don’t quite understand what you are trying to do here. Why would you expect ewcdf(d_1) and ewcdf(d_2, w_2, normalise=F) to give the same result for i=1000? The first one is the usual ecdf which jumps at the unique values of the input vector with a jump size determined by the number of times the value is repeated (more ties – larger jumps). The second one jumps at the same unique values with a height determined by the sum of the weights you have provided.

What does give identical results is ewcdf(d_2, w_2) and ewcdf(d_1, w_1), but this is not the same as ewcdf(d_1). To understand why the latter two are different, I would suggest a much smaller handmade example with a couple of ties:

library(spatstat)
#> Loading required package: spatstat.data
#> Loading required package: nlme
#> Loading required package: rpart
#> 
#> spatstat 1.60-1.006       (nickname: 'See Above') 
#> For an introduction to spatstat, type 'beginner'
x <- c(1,2,3,3,4)
e <- ewcdf(x)

This is the usual ecdf which jumps with value 1/5 at x=1, 1/5 at x=2, 2*1/5 at x=3 and 1/5 at x=4:

plot(e)

Now you define the weights as:

w <- x/sum(x)
w
#> [1] 0.07692308 0.15384615 0.23076923 0.23076923 0.30769231

Thus the ewcdf will jump with value 1/13 at x=1, 2/13 at x=2, 2*3/13 at x=3 and 4/13 at x=4 (with the usual ecdf overlayed in red):

plot(ewcdf(x, w, normalise = FALSE), axes = FALSE)
axis(1)
axis(2, at = (0:13)/13, labels = c("0", paste(1:13, 13, sep = "/")), las = 2 )
abline(h = cumsum(c(1,2,6,4)/13), lty = 3, col = "gray")
plot(e, add = TRUE, col = "red")

Ege Rubak
  • 3,547
  • 1
  • 8
  • 17
  • Thank you for the detailed explaination. It appears that I totally misunderstud the concept of weights in the ewcdf. What I am trying to test here is how well a sample drawn proportionally to size can approximate the distribution of the entire population. Normally, if I would like to compare the mean, I would compare the mean of the population with a weighted mean of the sample. It doesn't look like I can do something similar with the ecdf. – Gion Mors Sep 03 '19 at 08:59
  • Just to be sure, my idea is to compare the true ecdf on the full dataset with the weighted ecdf on a sampled dataset to see if the sampling strategy manages to retrieve the true distribution. I've edited my question: weights of sampled units are of course the inverse of their probability to be selected – Gion Mors Sep 03 '19 at 09:55
  • Could you be even more specific? If you have a population of {1,2,3,3,4} then you want to take a sample from {1,2,3,4} with sampling prob. 1/5, 1/5, 2/5, 1/5? – Ege Rubak Sep 03 '19 at 11:01
  • Thank you for your patience. With a population of {1,2,3,3,4}, I want to take a sample with probability of {1/13,2/13,3/13,3/13,4/13}, in other words, the probability of sampling from {1,2,3,4} would be {1/13,2/13,6/13,4/13}. – Gion Mors Sep 03 '19 at 11:07
  • So the numbers have a numeric meaning, and sampling from {11,12,13,13,14} would be fundamentally different? I'm really not sure where to go from here, but it appears that comparing with the ecdf is wrong since it doesn't care about the actual value of the numbers when calculating the jump sizes. What do the numbers specifically refer to? – Ege Rubak Sep 03 '19 at 18:34
  • The values are estimated population counts per 100x100m. I have the statistical distribution of the true population and want to know, given different sample sizes, how well the samples can retrieve the distribution of the true population. I dont want to use mean of variance to compare the two becasue the true distribution is far from being normal. I thougt using the Kormogorov-Smirnov distance could be a good idea but I dont manage to weight my samples to account for non-randomness. This is what I was hoping to achieve with the ewcdf. – Gion Mors Sep 04 '19 at 09:20
1

Your code generates some data d1 and associated numeric weights w1. These data are then treated as a reference population. The code takes a random sample d2 from this population of values d1, with sampling probabilities proportional to the associated weights w1. From the sample, you compute the weighted empirical distribution function f_d_2 of the sampled values d2, with weights inversely proportional to the original sampling probabilities. This function f_d_2 is a correct estimate of the original population distribution function, by the Horvitz-Thompson principle. But it's not exactly equal to the original population distribution, because it's a sample. The Kolmogorov-Smirnov test statistic should not be zero; it should be a small value.

Adrian Baddeley
  • 976
  • 3
  • 6
  • Thank you for the explaination, Adrian. I just did a small change in the code to adjust the weighted empirical distribution function for the relative sample size. A related question, that could be interesting for other people interested in the Kolmogorov-Smirnov — how would you compute the p-value for the test? – Gion Mors Sep 05 '19 at 13:27