
I have a file.txt with multiple lines, and I would like to remove the duplicate lines without sorting the file. What command can I use in Unix bash?

sample of file.txt

orangejuice;orange;juice_apple
pineapplejuice;pineapple;juice_pineapple
orangejuice;orange;juice_apple

sample of output:

orangejuice;orange;juice_apple
pineapplejuice;pineapple;juice_pineapple
choroba
t28292
  • I'd like to see this closed as duplicate, too, but I hope there is a better question to link to. – tripleee Aug 11 '13 at 10:00
  • [Linux Bash commands to remove duplicates from a CSV file](https://stackoverflow.com/q/25393281/608639). Change the delimiter. – jww Jul 13 '18 at 09:39

2 Answers


One way using awk:

```shell
awk '!a[$0]++' file.txt
```
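Here `a[$0]` counts how many times each whole line has been seen, and `!a[$0]++` is true only on a line's first occurrence, so only first occurrences are printed and the original order is preserved. A quick sketch against the sample file from the question:

```shell
# Recreate the sample file from the question
printf '%s\n' \
  'orangejuice;orange;juice_apple' \
  'pineapplejuice;pineapple;juice_pineapple' \
  'orangejuice;orange;juice_apple' > file.txt

# Print each line only the first time it appears; order is preserved
awk '!a[$0]++' file.txt
# orangejuice;orange;juice_apple
# pineapplejuice;pineapple;juice_pineapple
```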
Steve
  • You can't write this to a file via an alias sourced from .bashrc? With `> output.txt` it has only one line? – Master James Oct 06 '17 at 10:45
  • `alias RDL="awk '!a[\$0]++' cleanList.txt > cleanList2.txt"` gives `bash: !a[\$0]++': event not found`; `alias RDL="awk '\!a[$0]++' cleanList.txt > cleanList2.txt"` followed by `RDL` gives `awk: cmd. line:1: \!a[bash]++ awk: cmd. line:1: ^ backslash not last character on line`; `alias RDL="awk '\\!a[$0]++' cleanList.txt > cleanList2.txt"` ??? – Master James Oct 06 '17 at 10:53
  • Found this `cat -n file_name | sort -uk2 | sort -nk1 | cut -f2-` at https://stackoverflow.com/questions/11532157/unix-removing-duplicate-lines-without-sorting – Master James Oct 06 '17 at 10:59
  • Better yet, the `uniq` command even works in an alias: http://man7.org/linux/man-pages/man1/uniq.1.html – Master James Oct 06 '17 at 11:05
  • @MasterJames: You'll need to single quote that expression, then escape the single quotes like: `alias RDL='awk '\''!a[$0]++'\'' cleanList.txt > cleanList2.txt'`. See: https://stackoverflow.com/a/9899594/751863. Alternatively, just use a function. – Steve Oct 07 '17 at 13:55
  • Thanks for that. Working! [as well... better] – Master James Oct 21 '17 at 10:17
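A sketch of the function alternative Steve mentions, using the `RDL` name and file names from the comments above; a function sidesteps the alias-quoting and history-expansion problems entirely:

```shell
# Defining this in .bashrc avoids escaping '!' inside alias quotes,
# because function bodies are not subject to history expansion on definition
RDL() {
    awk '!a[$0]++' cleanList.txt > cleanList2.txt
}
```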

You can use Perl for this:

```shell
perl -ne 'print unless $seen{$_}++' file.txt
```

The `-n` switch makes Perl process the file line by line. Each line (`$_`) is used as a key in the `%seen` hash; because the post-increment `++` returns the old value (0 on the first encounter), the condition is false only the first time a line is met, so each line is printed exactly once.

choroba
  • This in an alias, when output to a file with `> output.txt`, creates an empty file? `alias RDL="perl -ne 'print unless $seen{$_}++' cleanList.txt > cleanList2.txt"` then `RDL` gives `Can't modify anonymous hash ({}) in postincrement (++) at -e line 1, near "}++" Execution of -e aborted due to compilation errors.` – Master James Oct 06 '17 at 10:46
  • Found this `cat -n file_name | sort -uk2 | sort -nk1 | cut -f2-` at https://stackoverflow.com/questions/11532157/unix-removing-duplicate-lines-without-sorting – Master James Oct 06 '17 at 10:59
  • The `uniq` command even works in an alias: http://man7.org/linux/man-pages/man1/uniq.1.html – Master James Oct 06 '17 at 11:05
  • @MasterJames: The OP wanted to process the file "without sorting", which `uniq` can't do. – choroba Oct 06 '17 at 12:19
  • I see now that `uniq` only removes repeated adjacent lines, not all duplicates, from the input. It only skips a line when it is the same as the line before it (repeats, not dups). That happens to be fine for my situation, which is why I didn't notice that `sort` makes duplicates adjacent so that `uniq` can skip them; without `sort`, non-adjacent duplicates are not removed. Thanks for the clarity. – Master James Oct 10 '17 at 04:43
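The difference is easy to see with a tiny example (a sketch; `a b a` stands in for any file with a non-adjacent duplicate):

```shell
printf 'a\nb\na\n' | uniq            # a b a : non-adjacent dup survives
printf 'a\nb\na\n' | sort | uniq     # a b   : dup removed, but order is sorted
printf 'a\nb\na\n' | awk '!x[$0]++'  # a b   : dup removed, original order kept
```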