
I have genetic data for SNPs that has been divided into 5 quantiles. I want to find the median of these quantiles for each SNP (i.e. each person).

I used this command to create a column for median values:

data$median<-apply(data[,2:181],1, median, na.rm=TRUE)

Then I wanted to count how many cases and controls I have for each of my phenotypes, but it looks like it's calculating the median incorrectly. My command is as follows:

table(data$anyMI, data$median)

The output is showing:

        1   1.5     2   2.5     3   3.5     4   4.5     5
  0  2044    62  7470   221 11163   248  8389    74  1659
  1   102     3   357    11   557    21   404     2    85

I'm not sure why I'm getting half values, when it should only be 1-5, whole numbers. What is going wrong here and why is it showing half-values?

  • If you have an even number of observations, you can get a half value. See `median(1:4)`, which returns 2.5 because the median lies halfway between 2 and 3. – phiver Jun 26 '18 at 13:14
  • You used 0 as the 2nd argument in the apply function. This should be 1 to iterate over rows and 2 to iterate over columns. – Lennyy Jun 26 '18 at 13:15
  • Sorry, I actually had 1, not sure why I typed 0. But the output shown is with 1. – Talia Delamare Jun 26 '18 at 13:22

2 Answers


By definition, a median is a value such that half of your sample lies above it and the other half below it. As phiver said, if you have an even number of values, call the largest value in the lower half x and the smallest value in the upper half y; any value between x and y then satisfies the definition.

By default, R returns median = (x + y)/2 in that case.

If you want a value that actually occurs in your dataset, you can use an odd number of observations (by removing one, for instance) or round the result.
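A minimal sketch of a related option, using made-up toy data rather than the asker's `data`: `quantile()` with `type = 1` returns an observed value (the lower of the two middle values when the count is even), so it never produces halves. The matrix `quintiles` and the helper `lower_median` are hypothetical names introduced only for illustration.

```r
# Toy data (hypothetical): 3 people in rows, 4 quintile-coded SNP columns
quintiles <- matrix(c(1, 2, 3, 4,
                      2, 2, 5, 5,
                      3, 3, 3, 4),
                    nrow = 3, byrow = TRUE)

# Default median: averages the two middle values when the count is even
default_medians <- apply(quintiles, 1, median, na.rm = TRUE)
default_medians
#[1] 2.5 3.5 3.0

# Lower median: type = 1 always returns one of the observed values
lower_median <- function(v) {
  quantile(v, probs = 0.5, type = 1, na.rm = TRUE, names = FALSE)
}
lower_medians <- apply(quintiles, 1, lower_median)
lower_medians
#[1] 2 2 3
```

If rounding is preferred instead, note that `round()` in R sends exact halves to the even digit, so `round(2.5)` gives 2 while `round(3.5)` gives 4; `floor()` or `ceiling()` may behave more predictably here.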

Arault

According to the standard definition, the median

  1. of an odd number of observations is the middle value

    median(1:5)
    #[1] 3
    
  2. of an even number of observations is the (arithmetic) mean of the two middle numbers

    median(1:4)
    #[1] 2.5
    

See e.g. the definition of the statistical median on Wolfram MathWorld.
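Both cases can be checked by hand. The sketch below sorts a small made-up vector and picks the middle element(s); `manual_median` is a hypothetical helper written for illustration, not a description of how R implements `median` internally.

```r
manual_median <- function(v) {
  s <- sort(v)                      # sort() drops NA values by default
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                  # odd count: the single middle value
  } else {
    mean(s[c(n / 2, n / 2 + 1)])    # even count: mean of the two middle values
  }
}

manual_median(c(5, 1, 3))
#[1] 3

manual_median(c(4, 1, 3, 2))
#[1] 2.5
```

With quintile codes 1 to 5, the even case produces a half value exactly when the two middle values are adjacent codes such as 2 and 3.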


On a more mathematical (but perhaps interesting) side-note:

A different definition of the median of N observations is given through

$$\operatorname{median}(x) = \underset{y}{\arg\min} \sum_{i=1}^{N} \lVert x_i - y \rVert_2$$

where the median of x is defined as the y that minimises the sum of L2 distances to all observations.

We can verify that this indeed gives us the same median as `median`:

x <- c(1, 1:4)
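# Among the observed values, pick the y with the smallest sum of absolute distances
# (in one dimension the L2 distance |x_i - y| is just the absolute difference)
x[which.min(sapply(x, function(y) sum(abs(x - y))))]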
#[1] 2

median(x)
#[1] 2

The interesting thing about the alternative definition is that it allows the extension of the univariate median to the geometric median of a set of points in higher dimensional space. Think: What is the median of three points in 3-d Euclidean space?
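One way to explore that question numerically is to minimise the summed Euclidean distances with base R's `optim`. This is only a rough sketch: the three points in `pts` are made up, `total_dist` is a hypothetical helper, and the default Nelder-Mead search returns an approximation rather than an exact solution.

```r
# Three made-up points in 3-d space, one per row
pts <- rbind(c(0, 0, 0),
             c(1, 0, 0),
             c(0, 1, 0))

# Sum of Euclidean (L2) distances from a candidate point y to every row of pts
total_dist <- function(y) {
  sum(sqrt(rowSums(sweep(pts, 2, y)^2)))
}

# Start from the centroid and search for the point with the smallest total distance
fit <- optim(par = colMeans(pts), fn = total_dist)

fit$par     # approximate geometric median
fit$value   # total distance at that point
```

Unlike the univariate case, the result is generally not one of the observed points, and it differs from the centroid (the coordinate-wise mean).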

Maurits Evers