18

How can I use AWK to compute the median of a column of numerical data?

I can think of a simple algorithm but I can't seem to program it:

What I have so far is:

sort | awk 'END{print NR}' 

And this gives me the number of elements in the column. I'd like to use this to print a certain row (NR/2). If NR/2 is not an integer, then I round up to the nearest integer and that is the median, otherwise I take the average of (NR/2)+1 and (NR/2)-1.

Johnsyweb
  • 121,480
  • 23
  • 172
  • 229
Nick
  • 615
  • 2
  • 7
  • 9

5 Answers5

22

With awk you have to store the values in an array and compute the median at the end, assuming we look at the first column:

sort -n file | awk ' { a[i++]=$1; } END { print a[int(i/2)]; }'

Sure, for real median computation do the rounding as described in the question:

sort -n file | awk ' { a[i++]=$1; }
    END { x=int((i+1)/2); if (x < (i+1)/2) print (a[x-1]+a[x])/2; else print a[x-1]; }'
maxschlepzig
  • 27,589
  • 9
  • 109
  • 146
19

This awk program assumes one column of numerically sorted data:

#/usr/bin/env awk
{
    count[NR] = $1;
}
END {
    if (NR % 2) {
        print count[(NR + 1) / 2];
    } else {
        print (count[(NR / 2)] + count[(NR / 2) + 1]) / 2.0;
    }
}

Sample usage:

sort -n data_file | awk -f median.awk
Johnsyweb
  • 121,480
  • 23
  • 172
  • 229
  • 3
    You can also use `asort` inside awk to sort the array. – Vatine May 29 '11 at 09:15
  • @Vatine: Indeed you can. @Nick said he was using `sort` anyway, so I kept it simple. – Johnsyweb May 29 '11 at 09:27
  • 2
    @Vatine asort() is GNU-awk specific and would make the code a bit more complicated. – Ed Morton Dec 11 '12 at 02:10
  • The `NR % 2` case is wrong, you need to subtract one from the index. For instance, if `NR = 3`, `count[1]` should be printed, but `(3 + 1) / 2 = 2`. – Ruud Jan 23 '16 at 14:53
  • 1
    @RuudvA: That would be true were the the array zero-based but the first time `count[NR] = $1;` is called `NR == 1`. I believe this code to be correct (but, nearly five years later, I don't like `count` as a variable name). – Johnsyweb Jan 24 '16 at 21:39
  • I'm finding that `cat data_file | awk -f median.awk` (i.e., with no call to `sort`) is working, even though my data_file contains unsorted rows of integers. Why do you say the script assumes numerically-sorted data? I get an answer of '4' when my data_file contains . – Kevin May 26 '17 at 18:12
  • Was having trouble with the formatting and now I'm blocked from editing my comment because 5 minutes have passed. Was going to say that my data_file contains (as rows) these numbers: [12 3 4 4 1 4 4 2 4 5 6]. – Kevin May 26 '17 at 18:19
  • @Kevin: It's nice to be reminded of an answer I wrote six years ago! You get the right answer by luck with your input because there are 11 values and 4 is the value at position 6. Were your file to be ordered thus: [4 4 4 4 1 12 2 3 5 6 4], you'd get a very different result. See https://en.wikipedia.org/wiki/Median#Medians_for_samples for more information. – Johnsyweb May 27 '17 at 03:54
  • Thanks for the clarification. That's a subtle error for someone who isn't fully conversant with awk syntax, but I think I see how that works now. Cheers. – Kevin Jun 01 '17 at 22:05
  • 1
    I've compared performance and sort -kn3 is considerably faster than awk '{print $0|"sort -nk3 "}' (14 seconds compared with 66 seconds for a file with 1 million rows and 3 columns). Sorting before computing the median is the faster way to compute the result. As discussed [here](https://stackoverflow.com/questions/17048188/how-to-use-awk-sort-by-column-3). – Tom Kelly Oct 06 '20 at 07:17
4

OK, just saw this topic and thought I could add my two cents, since I looked for something similar in the past. Even though the title says awk, all the answers make use of sort as well. Calculating the median for a column of data can be easily accomplished with datamash:

> seq 10 | datamash median 1
5.5

Note that sort is not needed, even if you have an unsorted column:

> seq 10 | gshuf | datamash median 1
5.5

The documentation gives all the functions it can perform, and good examples as well for files with many columns. Anyway, it has nothing to do with awk, but I think datamash is of great help in cases like this, and could also be used in conjunction with awk. Hope it helps somebody!

Vinicius Placco
  • 1,573
  • 2
  • 9
  • 23
2

This AWK based answer to a similar question on unix.stackexchange.com gives the same results as Excel for calculating the median.

Community
  • 1
  • 1
Brad Parks
  • 54,283
  • 54
  • 221
  • 287
1

If you have an array to compute median from (contains one-liner of Johnsyweb solution):

array=(5 6 4 2 7 9 3 1 8) # numbers 1-9
IFS=$'\n'
median=$(awk '{arr[NR]=$1} END {if (NR%2==1) print arr[(NR+1)/2]; else print (arr[NR/2]+arr[NR/2+1])/2}' <<< sort <<< "${array[*]}")
unset IFS
arenaq
  • 2,078
  • 1
  • 19
  • 29