I have a long list of lines with many cases like this: lines that share the same first word (the first string before a space) but differ in the rest. I need to keep only one line for each unique first word.

john jane
john 123
john jim jane
jane john
jane 123
jane 456
jim
jim 1

The result should be:

john jane
jane john
jim

So, if several lines start with the same first word, delete all but one of them.

I can delete fully duplicated lines, but I am still left with lines like those in the example above:

^(.*)(\r?\n\1)+$

This regex deletes only identical lines, which is not what the example requires. Is there a regex or a Notepad++ macro to solve this?

Jim8645

3 Answers

If you have awk:

awk '!seen[$1]++' infile.txt

Adapted from this thread: Unix: removing duplicate lines without sorting
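
For readers unfamiliar with the idiom: `seen[$1]++` evaluates to 0 the first time a given first field appears, so `!seen[$1]++` is true only for that first occurrence, and awk prints the line by default. A rough Python equivalent is sketched below; it is only an illustration, and the function name `keep_first_by_key` is made up for this example.

```python
def keep_first_by_key(lines):
    """Keep only the first line for each distinct first word, preserving order."""
    seen = set()
    kept = []
    for line in lines:
        fields = line.split()
        # awk's $1 is the empty string on a blank line; mirror that here.
        key = fields[0] if fields else ""
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


sample = [
    "john jane", "john 123", "john jim jane",
    "jane john", "jane 123", "jane 456",
    "jim", "jim 1",
]
print("\n".join(keep_first_by_key(sample)))
```

Unlike the regex approaches, this does not require lines with the same first word to be adjacent; the cost is one remembered entry per distinct first word.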

Sundeep

With Notepad++ (assuming lines with the same first word are consecutive):

search: ^(\S++).*\K(?:\R\1(?:\h.*|$))+
replacement: nothing

pattern details:

^             # start of the line
(\S++)        # the first "word" (all that isn't a whitespace) captured in group 1
.*            # all characters until the end of the line
\K            # remove characters matched before from the match result
(?:
    \R        # a newline
    \1        # reference to the capture group 1 (same first word)
    (?:
        \h.*  # a horizontal whitespace 
      |       # OR
        $     # the end of the line
    )
)+            # repeat one or more times
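
The idea of collapsing each run of consecutive lines that share a first word can also be sketched outside a regex engine. The Python snippet below is an illustration only, not what Notepad++ does internally. Python's `re` module does not support `\K`, `\R` or `\h`, so the sketch uses `itertools.groupby` instead. The names `first_word` and `collapse_runs` are hypothetical, and blank lines and leading whitespace are handled slightly differently than in the pattern.

```python
from itertools import groupby


def first_word(line):
    """Return the first whitespace-delimited word, or '' for a blank line."""
    parts = line.split(None, 1)
    return parts[0] if parts else ""


def collapse_runs(lines):
    """Keep the first line of each run of consecutive lines with the same first word."""
    # groupby only merges adjacent items, which mirrors the "consecutive" assumption.
    return [next(group) for _, group in groupby(lines, key=first_word)]


sample = ["john jane", "john 123", "jane john", "jane 456", "jim", "jim 1", "john again"]
print(collapse_runs(sample))
```

In this sample the final `john again` line survives because it is not adjacent to the earlier `john` lines, which is the same limitation the pattern has.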
Casimir et Hippolyte
  • Confirmed, it works for my file. It works in UltraEdit as well; I use it because Notepad++ can't handle very large files. – Jim8645 Jul 15 '16 at 12:26
  • @Jim8645: note that if you use Unix/Linux, spasic's awk approach is interesting for large files since it doesn't need to load the whole file into memory. – Casimir et Hippolyte Jul 15 '16 at 12:31

In Perl:

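s/^((\S+).*)\n(?:\2(?!\S).*\n)*/$1\n/gm

You can give it a try with this:

#!/usr/bin/perl

use warnings;
use strict;

# Sample input; lines sharing a first word must be consecutive.
my $file = "john jane
john 123
john jim jane
jane john
jane 123
jane 456
jim
jim 1
";

# Keep the first line of each run; (?!\S) stops "john" from also matching "johnny".
$file =~ s/^((\S+).*)\n(?:\2(?!\S).*\n)*/$1\n/gm;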

print $file;
José Castro