I have files containing lots of lines of data, some of which are duplicated. I want to delete duplicate lines if they follow each other.

For example, if the input file contained this:

string1
string2
string2
string3
string1
string4
string4
string4

I would want the output file to read:

string1
string2
string3
string1
string4

I am fairly new to bash scripts. I am presuming awk is the way to go but I am a bit stumped. Any help appreciated.

ghfunk

3 Answers

The command uniq does precisely this.

It is very often used in combination with sort so that duplicates will be adjacent.
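As a minimal sketch of both uses, the snippet below feeds the sample data from the question through a pipe. The helper name `sample` is made up for this illustration; it only stands in for the real input file.

```shell
# Hypothetical helper: prints the sample input from the question.
sample() {
  printf '%s\n' string1 string2 string2 string3 string1 string4 string4 string4
}

# uniq drops a line only when it equals the line directly before it,
# so the second "string1" survives.
sample | uniq
# string1
# string2
# string3
# string1
# string4

# With sort in front, all equal lines become adjacent, so every repeat is removed.
sample | sort | uniq
# string1
# string2
# string3
# string4
```

For a real file, the first form would be something like `uniq input.txt > output.txt`, where both file names are placeholders.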

tripleee

You can use awk:

awk '$0==b{next}{b=$0;print}' a.txt
string1
string2
string3
string1
string4

I'm using the variable b, which stands for buffer. If the current line equals the line held in the buffer, awk skips it. Otherwise it stores the line in the buffer and prints it.
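The same logic can be spelled out with one rule per line. This is only a sketch: `prev` is a more descriptive stand-in for `b`, and the wrapper function `dedupe_adjacent` is a name invented here so the filter can read from a pipe.

```shell
# Hypothetical wrapper around the awk one-liner; reads standard input.
dedupe_adjacent() {
  awk '
    $0 == prev { next }     # identical to the last kept line: skip it
    { prev = $0; print }    # new value: remember it, then print it
  '
}

printf '%s\n' string1 string2 string2 string3 string1 string4 string4 string4 | dedupe_adjacent
# string1
# string2
# string3
# string1
# string4
```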

hek2mgl
  • You made it, I prefer this to my answer. Good one. – fedorqui 'SO stop harming' Jun 11 '14 at 15:24
  • @fedorqui Thx! The one from anubhava is even shorter, as he inverted the logic. – hek2mgl Jun 11 '14 at 15:25
  • @fedorqui anubhava's solution is in fact the same as yours. In your deleted solution, the `!a[$0]++` doesn't make any sense; if you just remove it, it would be the same as anubhava's. – Kent Jun 11 '14 at 15:28
  • @Kent I see... I was thinking of [How can I delete duplicate lines in a file in Unix?](http://stackoverflow.com/questions/1444406/how-can-i-delete-duplicate-lines-in-a-file-in-unix) and then noticed it wasn't the way. But people are so fast here, and other, better answers were already posted, so I just deleted it instead of trying to reformulate it :) – fedorqui 'SO stop harming' Jun 11 '14 at 15:30
  • I also tried posting an answer. I came up with the same one as yours @fedorqui (without the `!a[$0]++`), but you posted first. I thought I should comment on your answer, but after I wrote the comment, you deleted it! Then anubhava's answer came. I even thought I could write the check as a positive check, which collided with hek2mgl's again! You guys are too damn fast! – Kent Jun 11 '14 at 15:34
  • @Kent I hope your brain didn't explode after this situation :D In general, it is always complicated to know what to do. Sometimes we write some answers within a couple of minutes and any small change can make the answer collide with another one... I am learning to delete on the spot, to avoid strange situations. Also, you are so polite commenting first and giving good ideas! :) – fedorqui 'SO stop harming' Jun 11 '14 at 15:49

This awk should also work:

awk '$1!=p{print} {p=$1}' file
string1
string2
string3
string1
string4

Or you can shorten this even further:

awk '$1!=p; {p=$1}' file
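
Note that `$1` compares only the first whitespace-separated field, so lines that differ after the first space are treated as duplicates. A whole-line variant, as a sketch: it uses `$0` instead of `$1`, and the `NR == 1 ||` guard is an addition of this example so that an empty first line is still printed (an unset `p` compares equal to an empty line). The function name `dedupe_lines` is hypothetical.

```shell
# Hypothetical wrapper: compares whole lines ($0) and always keeps line 1.
dedupe_lines() {
  awk 'NR == 1 || $0 != p; { p = $0 }'
}

# With $1, "foo baz" is dropped because its first field matches "foo bar".
printf '%s\n' 'foo bar' 'foo baz' 'foo baz' | awk '$1!=p; {p=$1}'
# foo bar

# With $0, only the true adjacent duplicate is dropped.
printf '%s\n' 'foo bar' 'foo baz' 'foo baz' | dedupe_lines
# foo bar
# foo baz
```
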
anubhava
  • Brilliant, just what I needed. Thanks for the fast response everyone. – ghfunk Jun 11 '14 at 15:36
  • Nice awk solution, although I didn't know about `uniq`, which does just what I want. – ghfunk Jun 11 '14 at 15:46
  • @ghfunk Note that this would only work as long as the _strings_ do not contain any spaces. – devnull Jun 11 '14 at 15:47
  • It can be easily changed to use `$0` instead of `$1` for including spaces. – anubhava Jun 11 '14 at 15:52
  • That might be obvious to those who already know the answer, not to somebody asking this. – devnull Jun 11 '14 at 15:53
  • Besides, `awk` is clearly the wrong tool for this specific case and is several times slower than `uniq`. – devnull Jun 11 '14 at 15:55
  • @devnull: Just curious if you have run any benchmarks on `sort+uniq` vs `awk` – anubhava Jun 11 '14 at 15:58
  • `sort` is not even required. It seems that you need to read the `uniq` manual. Regarding the benchmarks, there are some statistics available on the linked duplicate. – devnull Jun 11 '14 at 15:59
  • So it makes it evident that you need to RTFM. See `man uniq`. Just because a flawed answer gathers upvotes doesn't necessarily make it a good or great answer. Good luck! – devnull Jun 11 '14 at 16:01
  • I know very well what `uniq` does and how to use it. I mentioned `sort` since tripleee suggested using `sort` with `uniq`. – anubhava Jun 11 '14 at 16:03
  • Thanks for all your responses, I am learning a lot. `uniq` does just what I want. – ghfunk Jun 11 '14 at 16:07
  • @ghfunk: Yes `uniq` (without sort) is indeed the simplest solution for your problem. – anubhava Jun 11 '14 at 16:08