Awk: Removing duplicate lines without sorting after matching conditions

Question

I have a list of devices from which I need to remove duplicates (keeping only the first occurrence) while preserving order and matching a condition. In this case I'm looking for a specific string and then printing the field with the device name. Here is some example raw data from the sar application:

10:02:01 AM       sdc      0.70      0.00      8.13     11.62      0.00      1.29      0.86      0.06
10:02:01 AM       sda      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
10:02:01 AM       sdb      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:          sdc      1.31      3.73     99.44     78.46      0.02     17.92      0.92      0.12
Average:          sda      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:          sdb      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
10:05:01 AM       sdc      2.70      0.00     39.92     14.79      0.02      5.95      0.31      0.08
10:05:01 AM       sda      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
10:05:01 AM       sdb      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
10:06:01 AM       sdc      0.83      0.00     10.00     12.00      0.00      0.78      0.56      0.05
11:04:01 AM       sda      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:04:01 AM       sdb      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:          sdc      0.70      2.55      8.62     15.91      0.00      1.31      0.78      0.05
Average:          sda      0.12      0.95      0.00      7.99      0.00      0.60      0.60      0.01
Average:          sdb      0.22      1.78      0.00      8.31      0.00      0.54      0.52      0.01

The following gives me the list of devices from lines containing the word "Average", but the output comes out reordered (here, sorted):

sar -dp | awk '/Average/ {devices[$2]} END {for (device in devices) {print device}}'
sda
sdb
sdc

The following gives me exactly what I want (command from here):

sar -dp | awk '/Average/ {print $2}' | awk '!devices[$0]++'
sdc
sda
sdb

Maybe I'm missing something painfully obvious but I can't figure out how to do the same in one awk command, that is without piping the output of the first awk into the second awk.

Jotne · Accepted Answer · 2014-07-17T17:57:00.087

3

You can do:

sar -dp | awk '/Average/ && !devices[$2]++ {print $2}'
sdc
sda
sdb

The problem is the `for (device in devices)` loop. awk does not guarantee any particular order when iterating over an array with `for ... in`, so the devices do not come out in input order.
I have read a long explanation of why somewhere, but I do not have the link.
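
If the END-loop form is still wanted, one way to keep first-seen order is to store each new device under a running counter and loop over the counter instead of using `for ... in`. This is only a minimal sketch: it assumes the same field layout as the `sar -dp` output in the question, and the names `seen`, `order`, `n` and the inline sample data are made up for illustration.

```shell
# Sample lines standing in for `sar -dp` output (illustrative data only).
sample='10:02:01 AM sdc 0.70
Average: sdc 1.31
Average: sda 0.00
Average: sdb 0.00
Average: sdc 0.70
Average: sda 0.12'

# seen[] remembers which devices were already recorded;
# order[] stores them under 1, 2, 3, ... in first-seen order.
result=$(printf '%s\n' "$sample" | awk '
  /Average/ && !seen[$2]++ { order[++n] = $2 }
  END { for (i = 1; i <= n; i++) print order[i] }
')
printf '%s\n' "$result"
```

The numeric loop in END walks the indices in ascending order, so the output does not depend on how the awk implementation lays out its hash table. For this particular task the one-liner above is simpler. The extra array only pays off when the list is needed again in END.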

edited Jul 17 '14 at 17:57

answered Jul 17 '14 at 17:51

Jotne


awk makes no claims about the order of keys retrieved from an array, as far as I know. Though in GNU awk 4 you can tell it which sort order to use when retrieving keys (but I don't know if "input order" is an option). – Etan Reisner Jul 17 '14 at 17:58
2

Awk arrays are stored as hash tables for efficiency. The `in` operator retrieves the elements from the array in the order they are stored in memory, i.e. in whatever order the hashing algorithm arranges them. If you need an array traversed in a specific order you need to decide which order (insertion order? alphabetical? numerical? by element? by index? something else?) and program that order somehow. With GNU awk you can assign an order by populating `PROCINFO["sorted_in"]`, see http://www.gnu.org/software/gawk/manual/gawk.html#Scanning-an-Array. – Ed Morton Jul 17 '14 at 18:15
@EdMorton Thanks for the refresher. My memory is somewhat limited and for some reason has started to remove stuff by itself without telling me :) This is the link to `sorted_in`: http://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Array-Traversal – Jotne Jul 17 '14 at 18:23
1

@Jotne tell me about it. I learned French in school and a few years ago started learning Spanish which I eventually realized was just pushing the French out of my brain to make room. The net result is that I now can speak neither of them and am just barely holding onto English.... – Ed Morton Jul 17 '14 at 18:25

score 1 · Answer 2 · answered Jul 17 '14 at 17:52

awk '/Average/ && !devices[$2]++ {print $2}' sar.in

You just need to combine the two tests. The only caveat is that in the second awk of the original pipeline the whole line ($0) is field two of the original input, so you need to replace $0 with $2.
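
To see why the combined condition prints each device only once, it can help to trace the counter as it changes. The following is a small sketch with made-up input, not the asker's data; the array name `count` and the variable `before` are illustrative only.

```shell
# Made-up input: four Average lines and one timestamped line.
sample='Average: sdc
Average: sda
Average: sdc
10:05:01 AM sdc
Average: sda'

# For each Average line, show the counter value before the post-increment
# and whether the test !count[$2]++ would let the line through.
trace=$(printf '%s\n' "$sample" | awk '
  /Average/ {
    before = count[$2]++    # 0 the first time a device is seen, then 1, 2, ...
    print $2, before + 0, (before ? "skip" : "print")
  }
')
printf '%s\n' "$trace"
```

The post-increment returns the old value, so the negated test is true only while that value is still 0, that is, on the first occurrence of each device. The timestamped line never reaches the counter because it fails the /Average/ test first.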

answered Jul 17 '14 at 17:52

Etan Reisner


2

This looks very like my post :) – Jotne Jul 17 '14 at 17:53
