
I have a long vocabulary list, one word per line. Sometimes a word is duplicated, appearing two or more times. I need a simple bit of code that keeps the first occurrence of a word but removes every later duplicate (along with its line).

  1. I don't want to remove any special characters or rearrange anything, only remove the duplicate words (each on its own line). Keeping the same word order is important.

  2. It doesn't matter if it overwrites the original file or saves to a new one, whichever is "more efficient".

  3. This is a list separated by line, not an array, not separated by space or comma.

  4. I have no code to start with; I'm hoping to solve this with BASH...

    • sed would be first choice

    • grep would be second choice

    • Third choice would be something like a for loop

Eg: file.txt

apple
banana
car
bicycle
apple
tree
banana
apple
motorcycle

...should become:

apple
banana
car
bicycle
tree
motorcycle

I see some solutions for ARRAYS, but not simple lists, and answers in Python, JS, and C, but not BASH. If this has already been answered, show me where and I will gladly delete this question. The suggested duplicate uses awk, which is outside the scope of this question, though it is related and useful.

Jesse Steele
  • Not a duplicate, that answer is "by comma", this is by line (in my title). – Jesse Steele Oct 24 '18 at 04:27
  • The *question* mentions commas but many of the answers solve exactly your problem. I can dig for a better duplicate if you like; I'm pretty sure there are multiple questions exactly like this. – tripleee Oct 24 '18 at 04:28
  • @tripleee: "Keeping the same word order is important.". Most of the solutions there use `uniq`, which requires the file to be sorted. The only solution that addresses this scenario that I noticed is the 2-vote AWK one (not `bash` as the OP suggested would be the preference). – Amadan Oct 24 '18 at 04:32
  • Okay, using uniq? That didn't answer that specific question about commas, tho. Should I delete my question? Should I make a new question and answer it? I will leave this question a short while, then do what you suggest. Thanks so much!! – Jesse Steele Oct 24 '18 at 04:32
  • I added another duplicate with more solutions. It's fine to have duplicates, the site will eventually remove the question if it doesn't receive traffic. – tripleee Oct 24 '18 at 04:35
  • @tripleee That one is a much better dupe target... Just there's no bash solution :) as that question prefers sed/awk. – Amadan Oct 24 '18 at 04:41
  • @Amadan Then post yours as an answer to that one instead. Having multiple questions with the same topic with different answers is just inefficient and confusing. It's the [DRY principle.](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself) – tripleee Oct 24 '18 at 04:43
  • [bash remove duplicate lines in a text file site:stackoverflow.com](https://www.google.com/search?q=bash+remove+duplicate+lines+in+a+text+file+site%3Astackoverflow.com) – jww Oct 24 '18 at 04:44
  • @tripleee: I guess I'm a bit more literal-minded - that question says "I want to use `sed` or `awk`", this one says "I see some solutions for [various], but not BASH". – Amadan Oct 24 '18 at 04:45
  • @Amadan That's a red herring anyway; the OP mentions `sed` and `grep` as possibilities. Questions are not supposed to prescribe an answer in any event; often, the best answer is to change the OP's approach slightly. – tripleee Oct 24 '18 at 04:49
  • https://stackoverflow.com/questions/11532157/remove-duplicate-lines-without-sorting?noredirect=1 has many peculiar answers too. – tripleee Oct 24 '18 at 04:51
  • Also https://stackoverflow.com/questions/24324350/how-to-remove-common-lines-between-two-files-without-sorting – tripleee Oct 24 '18 at 04:52
  • @Amadan actually read carefully, ty. – Jesse Steele Oct 24 '18 at 06:00
  • @tripleee you say questions shouldn't suggest what the answer could include, even for clarity? This wasn't exactly a complex sed/grep lesson. But okay, I'll take that part out just for you. – Jesse Steele Oct 24 '18 at 06:01
  • That's not a criticism of your question; it's just that when we answer, we often cannot assume that the asker has a good grasp of what a sensible solution should look like. – tripleee Oct 24 '18 at 06:16
  • @tripleee really, thank you. I really understand what you mean. I feel the love today, with this question I asked, and learned a TON about coding in Linux. It's interesting, I just mailed my US election absentee ballot AND earned my ability to vote on SO, all today. You were part of that and I appreciate what you taught me. – Jesse Steele Oct 24 '18 at 06:20

4 Answers


This might work for you (GNU sed):

sed -nr 'G;/^([^\n]+\n)([^\n]+\n)*\1/!{P;h}' file

Keep a list of unique keys in the hold space; if the current key is not already in the list, print it and add it to the list.
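
For example, run against the question's file.txt this should reproduce the expected output (a sketch assuming GNU sed; -r enables extended regular expressions and is spelled -E in newer releases):

sed -nr 'G;/^([^\n]+\n)([^\n]+\n)*\1/!{P;h}' file.txt

which prints:

apple
banana
car
bicycle
tree
motorcycle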

potong

If you weren't overly concerned about maintaining the order, you could just use the very simple:

sort -u inputFileName >outputFileName

This would get rid of all duplicates, sorting in the process.

For maintaining the order based on first occurrence, it becomes more complex (and memory hungry). Using associative arrays in awk is one way, as per the following example:

pax> cat infile
zanzibar
apple
banana
apple
carrot
banana
sausage
apple

pax> awk '{if(x[$1]==0){x[$1]=1;print}}' infile
zanzibar
apple
banana
carrot
sausage

The way this works is that, the first time awk sees a word, it records the fact that it has seen it and outputs the word. Later instances of that word do nothing because the word has already been marked as seen.
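
As an aside (not part of this answer, just the common compact form of the same idea), the seen/not-seen check is often written as:

awk '!seen[$0]++' infile

Here seen[$0] is empty (treated as 0) the first time a line appears, so !seen[$0] is true and awk's default action prints the line; the ++ then bumps the counter so every later occurrence is suppressed.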

paxdiablo
  • maintaining order is important, I will edit the question to say this. Thanks for this. – Jesse Steele Oct 24 '18 at 04:27
  • @JesseSteele -- the `awk` solution does that... – David C. Rankin Oct 24 '18 at 04:39
  • @DavidC.Rankin the problem with awk is that I didn't mention it in the original question. BUT, it really did solve the problem simply and has opened my mind to using awk in the future. Previously, awk was not always installed with my Linux distro, but in U 18.04 it seems to be. (FYI, this article was what David referred to, but has been taken down from this question: https://stackoverflow.com/questions/9377040/remove-duplicate-entries-using-a-bash-script) – Jesse Steele Oct 24 '18 at 06:08

Pure bash:

#!/bin/bash
# Associative array acting as a "seen" set, keyed by each line's text
declare -g -A lines
# Read stdin line by line; IFS='' and -r keep whitespace and backslashes intact
while IFS='' read -r line
do
  if [[ "${lines["$line"]}" -ne 1 ]]   # unset entries evaluate as 0, so unseen lines pass
  then
    echo "$line"          # first occurrence: print it...
    lines["$line"]=1      # ...and mark it as seen
  fi
done

EDIT: If you make it into a standalone executable script, you could do it with dedupe.sh < file.txt. If you want to hard-code the file name in there, you can do so like this:

while ....
  ...
done < file.txt
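
A minimal way to try it out (dedupe.sh is the name used above; the output file name is just an example):

chmod +x dedupe.sh
./dedupe.sh < file.txt > deduped.txt
mv deduped.txt file.txt    # optional: overwrite the original, as the question allows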
Amadan
  • Now that is a good use of an associative array as a *frequency array*! Strange... I voted for this question and then I guess somebody downvoted, perhaps because the concept of a frequency array is bewildering to them.. – David C. Rankin Oct 24 '18 at 04:42
  • @DavidC.Rankin Well, it's just a seen/not-seen array now, I'm not actually counting anything, though it could be adapted for frequency array very easily. – Amadan Oct 24 '18 at 04:49
  • Well, you are actually storing `lines["$line"]=1` after each `line` is seen. Which provides the seen/not see check as a 0/1. I guess for a pure frequency array you would need `((lines["$line"]++))` instead of `=1`. – David C. Rankin Oct 24 '18 at 04:52
  • If you put the `lines["$line"]=1` inside the `if`, it *may* perform better. I'm also not sure how well the performance would be for large files since `bash` read loops are sometimes less than speedy :-) Not worth a downvote of course, since OP didn't specify file size. – paxdiablo Oct 24 '18 at 05:00
  • @paxdiablo: Thanks, yeah, that makes sense. (Getting further away from frequency array, sorry David :D ) – Amadan Oct 24 '18 at 05:01
  • @Amadan this is AWESOME, BUT I don't know where file.txt fits into it. Can you edit it or tell me how to make it work that way. This was third choice and I appreciate this knowledge of BASH you shared here! wow. – Jesse Steele Oct 24 '18 at 06:14
  • @DavidC.Rankin I also voted for it, now that I can vote!!! (15 today). This answer really did cut to the heart of my OP. Had I not put `sed` as my "first choice" already, I would have made this the "correct" answer. This answer made my question here NOT a duplicate for sure! – Jesse Steele Oct 24 '18 at 06:28

Once you sort the file using sort, you can then remove adjacent duplicate lines using uniq.

Man page: uniq

sort unsorted.txt | uniq >> sorted_deduped.txt

Peter Halligan
  • The OP very specifically and repeatedly ruled out sorting the file. (And if you do, `sort -u` will beat `sort | uniq`) – tripleee Oct 24 '18 at 04:41
  • It only mentioned a preference when I read the question; it has been edited since. – Peter Halligan Oct 24 '18 at 04:44
  • The original also had it, though less directly: "I don't want to remove any special characters _or rearrange anything_" – Amadan Oct 24 '18 at 04:46
  • Obviously not clearly enough as I was not the only person to answer using a sort. – Peter Halligan Oct 24 '18 at 04:49
  • @Amadan thank you for reading and understanding. Peter has a point that it could be clarified better. But, IMHO, various responses to this question prove that many people on SO skim in their reading ;-) ty – Jesse Steele Oct 24 '18 at 06:26