128

I have a utility script in Python:

#!/usr/bin/env python
import sys
unique_lines = []
duplicate_lines = []
for line in sys.stdin:
  if line in unique_lines:
    duplicate_lines.append(line)
  else:
    unique_lines.append(line)
    sys.stdout.write(line)
# optionally do something with duplicate_lines

This simple functionality (uniq without needing to sort first, stable ordering) must be available as a simple UNIX utility, mustn't it? Maybe a combination of filters in a pipe?

Reason for asking: I need this functionality on a system on which I cannot execute Python from anywhere.

Matthias Braun
Robottinosino
  • Unrelated: you should really use a set rather than a list in that Python script; checking for membership in a list is a linear-time operation. – Nicholas Riley Jul 17 '12 at 23:18
  • I removed "Python" from your tags and title since this really has nothing to do with Python. – Michael Hoffman Jul 17 '12 at 23:20
  • If this had to be done in Python, a better approach would involve the unique_everseen itertools recipe: http://docs.python.org/library/itertools.html#recipes – iruvar Jul 23 '12 at 17:02

8 Answers

303

The UNIX Bash Scripting blog suggests:

awk '!x[$0]++'

This command tells awk which lines to print: a pattern with no action prints every line for which the pattern evaluates to true. The variable $0 holds the entire contents of a line, and square brackets are array access. So, for each line of the file, the node of the array x for that line is incremented, and the line is printed if the content of that node was not (!) previously set. Because ++ is a postfix increment, the test sees the node's value from before the increment.
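For example, on some made-up input:

$ printf 'apple\nbanana\napple\ncherry\nbanana\n' | awk '!x[$0]++'
apple
banana
cherry

Each line is printed only the first time it is seen, and the relative order of first occurrences is preserved.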

jameshfisher
Michael Hoffman
  • I did a loop of 1000 runs with `sort -u` and that `awk` one, and both ran in about 3s (awk took 0.15s more on average). So I think it will work perfectly, thx! – Aquarius Power Jun 04 '14 at 11:24
  • @AquariusPower But doesn't awk become faster than sort if you increase the size of the unordered input file? – jarno May 12 '15 at 18:01
  • Perhaps this command would be easier to understand: `awk '!($0 in x){x[$0]++; print $0}'` – jarno May 12 '15 at 20:02
  • `!x[$0]` does not test whether `x[$0]` is unset, but whether it is zero or the empty string; `($0 in x)` tests whether `x[$0]` is set. However, unset variables evaluate to zero (or the empty string) in awk when read, so the test works. Also, the postfix `++` yields the old value, so the logical not (`!`) tests the value from before the increment, which is crucial in the script. – jarno May 12 '15 at 21:22
  • One of the most compact and finest scripts I've stumbled across. Kudos! – Dhaval Patel Dec 21 '15 at 05:46
  • Surely it would be less obfuscated to name that array e.g. `seen` instead of `x`, to avoid giving newbies the impression that awk syntax is line noise. – Josip Rodin Dec 21 '15 at 10:43
  • Keep in mind that this will load the entire file into memory, so don't try this on a 3GB text file without lots of RAM to spare. – Hitechcomputergeek Jun 02 '17 at 15:39
  • How to keep all the empty lines? – ElpieKay May 21 '18 at 12:18
  • @Hitechcomputergeek This won't necessarily load the whole file into memory, only the unique lines. This of course could end up being the whole file though if all the lines are unique. – deltaray Jul 11 '18 at 17:33
  • Thank You! This is the finest and smartest solution to find unique elements within an array when I'm parsing tags in a delimited file. – Lanti Aug 21 '18 at 13:00
  • getting error as x[: event not found – Chandan Choudhury Sep 01 '18 at 16:40
  • @ChandanChoudhury The quotation marks are not optional. – Michael Hoffman Sep 01 '18 at 18:26
  • I had to use `awk '!mem[$0]++ { print $0; fflush() }'`, because the buffering otherwise broke the point of the script I was developing. – Deiwin Sep 08 '18 at 16:53
  • https://stackoverflow.com/a/1444448/44620 has a detailed description of how it works. – Jonas Elfström Oct 26 '18 at 14:58
  • Maybe a quick way to do this inline? – MappaM Mar 04 '20 at 13:29
  • This will work if you also want to retain empty lines: `awk 'length==0 || !x[$0]++'`. – PhilippVerpoort Apr 02 '20 at 20:06
  • The Stackoverflow school of Bashcraft and AWKary! The trio would be so proud!!! – sErVerdevIL Aug 07 '20 at 05:33
  • Worth putting in your .bash_aliases if you find yourself using it often. `alias unique='awk "!seen[\$0]++"'` Then you can just `echo "$values" | unique` – Cameron Basham Jan 15 '21 at 20:02
72

A late answer - I just ran into a duplicate of this - but perhaps worth adding...

The principle behind @1_CR's answer can be written more concisely, using cat -n instead of awk to add line numbers:

cat -n file_name | sort -uk2 | sort -n | cut -f2-
  • Use cat -n to prepend line numbers
  • Use sort -u to remove duplicate data (-k2 says 'start at field 2 for the sort key')
  • Use sort -n to sort by prepended number
  • Use cut to remove the line numbering (-f2- says 'select field 2 till end')
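
For example, tracing a small made-up input through the pipeline:

$ printf 'b\na\nb\nc\na\n' | cat -n | sort -uk2 | sort -n | cut -f2-
b
a
c

The duplicate 'b' and 'a' lines are dropped, and the surviving lines come back in their original order.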
aksh1618
Digital Trauma
  • Easy to understand, and this is often valuable. Any idea how it performs on big files compared with Michael Hoffman's shorter solution above? – Sopalajo de Arrierez Jan 01 '15 at 02:50
  • More readable/maintainable. I needed the same but with a reverse sort, to keep only the last occurrence of each unique value. Using both `--reverse` and `--unique` in the same sort command doesn't return the results one might expect. Apparently sort does a premature optimization by first applying `--unique` to the input (in order to reduce processing in subsequent steps), which removes data needed for the `--reverse` step too early. To fix this, insert a `sort --reverse -k2` as the first sort in the pipeline: `cat -n file_name | sort -rk2 | sort -uk2 | sort -nk1 | cut -f2-` – Petru Zaharia Apr 24 '17 at 09:36
  • Took just 60 seconds for a 900MB+ text file with so many (randomly placed) duplicate lines that the result is only 39KB. Sufficiently fast. – ynn Jul 24 '19 at 14:09
  • "Pipe" version: `cat file_name | cat -n | sort -uk2 | sort -nk1 | cut -f2-`. – Victor Yarema Jan 15 '20 at 18:01
  • "Pipe" version for keeping the last occurrence instead of the first: `cat file_name | cat -n | sort -rk2 | sort -uk2 | sort -nk1 | cut -f2-`. – Victor Yarema Jan 15 '20 at 18:02
8

To remove duplicates from two files:

awk '!a[$0]++' file1.csv file2.csv
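
For example, with two hypothetical CSV files:

$ cat file1.csv
x,1
y,2
$ cat file2.csv
y,2
z,3
$ awk '!a[$0]++' file1.csv file2.csv
x,1
y,2
z,3

awk reads the files in sequence, so the deduplicated concatenation of both goes to stdout; neither file is modified.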
AzizSM
5

Michael Hoffman's solution above is short and sweet. For larger files, a Schwartzian-transform approach, adding an index field with awk and then running multiple rounds of sort and uniq, involves less memory overhead, because sort can spill to temporary files rather than holding everything in RAM. The following snippet works in bash:

awk '{print(NR"\t"$0)}' file_name | sort -t$'\t' -k2,2 -s | uniq --skip-fields 1 | sort -n -k1,1 -t$'\t' | cut -f2- -d$'\t'
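
For example, tracing three made-up lines through the first two stages:

$ printf 'b\na\nb\n' | awk '{print(NR"\t"$0)}' | sort -t$'\t' -k2,2 -s
2	a
1	b
3	b

uniq --skip-fields 1 then keeps only the first line of each run ('2 a' and '1 b'), the numeric sort restores the original input order, and cut strips the index field.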
iruvar
4

Now you can check out this small tool written in Rust: uq.

It performs uniqueness filtering without having to sort the input first, and can therefore be applied to a continuous stream.
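
A minimal usage sketch, assuming uq acts as a plain stdin-to-stdout filter (the log file name is made up; check the tool's documentation for its actual options):

$ tail -f access.log | uq

Each distinct line is emitted the first time it appears, which is something the sort-based approaches cannot do on an unbounded stream.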

Shou Ya
2

Thanks 1_CR! I needed a "uniq -u" (remove duplicates entirely) rather than uniq (leave one copy of duplicates). The awk and perl solutions can't really be modified to do this, but yours can! I may have also needed the lower memory use since I will be uniq'ing something like 100,000,000 lines 8-). Just in case anyone else needs it, I just put a "-u" in the uniq portion of the command:

awk '{print(NR"\t"$0)}' file_name | sort -t$'\t' -k2,2 -s | uniq -u --skip-fields 1 | sort -n -k1,1 -t$'\t' | cut -f2- -d$'\t'
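
For example, the difference between the two uniq modes on a tiny made-up input (shown here with a simple sort | uniq rather than the full index-preserving pipeline):

$ printf 'a\nb\na\nc\n' | sort | uniq
a
b
c
$ printf 'a\nb\na\nc\n' | sort | uniq -u
b
c

Plain uniq keeps one copy of each line, while uniq -u drops every line that occurs more than once.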
hwertz
-1

I just wanted to remove duplicates on consecutive lines, not everywhere in the file. So I used:

awk '{
  if (NR == 1 || $0 != PREVLINE) print $0;
  PREVLINE = $0;
}'
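
For example (this behaves like plain uniq, which also collapses only adjacent duplicates):

$ printf 'a\na\nb\na\n' | awk '{ if (NR == 1 || $0 != PREVLINE) print $0; PREVLINE = $0 }'
a
b
a

Note that the second 'a' reappears because it is no longer adjacent to the first run.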
Bence Kaulics
speedolli
-1

The uniq command even works in an alias: http://man7.org/linux/man-pages/man1/uniq.1.html

Master James