
I have a stupidly large text file (i.e. 40 gigabytes as of today) that I would like to filter for unique lines without sorting the file.

The file has unix line endings, and all content matches [[:print:]]. I tried the following awk script to display only unique lines:

awk 'a[$0] {next} 1' stupid.txt > less_stupid.txt

The thought was that I'd populate an array by referencing its elements, using the contents of the file as keys, then skip lines that were already in the array. But this fails for two reasons -- firstly, because it inexplicably just doesn't work (even on small test files), and secondly because I know that my system will run out of memory before the entire set of unique lines is loaded into memory by awk.

After searching, I found this answer which recommended:

awk '!x[$0]++'

And while this works on small files, it also will run out of memory before reading my entire file.

What's a better (i.e. working) solution? I'm open to just about anything, though I'm more partial to solutions in languages I know (bash & awk, hence the tags). In trying to visualize the problem, the best I've come up with would be to store an array of line checksums or MD5s rather than the lines themselves, but that only saves a little space and runs the risk of checksum collisions.

Any tips would be very welcome. Telling me this is impossible would also be welcome, so that I stop trying to figure it out. :-P

Graham
  • Why do you want to avoid a sort? Do you need something that works faster than sorting? Would the sorted file take too much space? Or is there some other consideration that makes a sort infeasible? – user2357112 supports Monica Jun 18 '15 at 04:00
  • If you can't store the full set of unique lines in memory (and the lines are short enough that even a reasonable sum of each of them (sha256 or whatever) doesn't fit into memory) then I'm not sure this is possible. – Etan Reisner Jun 18 '15 at 04:05
  • @user2357112, I don't want to sort because the order of the lines in the original file is important. I could potentially timestamp the lines going in to the file, but then I wouldn't be able to determine unique lines because times would be different. Repeats usually happen within a few thousand lines of each other, so that might help. – Graham Jun 18 '15 at 04:08
  • Is it crucial that the output come out in some particular order, then? For example, always outputting lines in order of first occurrence? – user2357112 supports Monica Jun 18 '15 at 04:10
  • Your original script fails because you never alter `a[$0]`, unlike the answer you found. – Jonathan Leffler Jun 18 '15 at 04:11
  • @user2357112, the output should come out in the same order that it went in. The second/third/etc instance of any line that has been seen, should be ignored. – Graham Jun 18 '15 at 04:12
  • @JonathanLeffler, AH, okay, I misunderstood how that worked; I thought that simply referring to an array subscript would cause it to exist. Thanks for clarifying. – Graham Jun 18 '15 at 04:12
  • You said the set of unique lines is too large for memory. Is the set of duplicate lines also too large? That is, if you sort and find the lines with duplicates, will that list fit into memory? – Etan Reisner Jun 18 '15 at 04:12
  • When you have two lines, say at lines 1000 and 2000, which are 'the same', which line do you delete? The first or the second? I worry about your file design if position is all that identifies different lines and order matters. – Jonathan Leffler Jun 18 '15 at 04:13
  • Referring to it probably does cause it to exist — as an empty/zero value which compares false and therefore doesn't trigger the next clause, etc. The alternative increments the value from 0 to 1 on the first time, and prints if the value was 0 (uninitialized). – Jonathan Leffler Jun 18 '15 at 04:17
  • @JonathanLeffler, so something like `$0 in a{next} a[$0]; 1` would work. Though the other one is much shorter. :) Oh, and regarding which one to drop, I always only want to keep the first line of a set of duplicates. – Graham Jun 18 '15 at 04:30
  • @EtanReisner, that's a great question - I might be able to do multiple passes, one to identify duplicated data, the other to prune it. Thanks for that, I'll investigate! – Graham Jun 18 '15 at 04:31
  • The question seems to be more about how to deal with **huge files**, because memory is the bottleneck when you work with such huge files. I feel "how to sort/uniq" is not the real concern here. Maybe you should look at http://www.slideshare.net/directi/mapreducedirecti?qid=9d03d57b-beda-4248-b914-745a723334be&v=qf1&b=&from_search=1. Dealing with gigantic data is why technologies like MapReduce exist. At least, this would give you some more options to explore outside the realms of the usual `sort/uniq` memory pitfalls. – slayedbylucifer Jun 18 '15 at 07:06

6 Answers


The awk '!x[$0]++' trick is one of the most elegant ways to de-duplicate a file or stream without sorting. However, it is memory-inefficient and unsuitable for large files, since it stores every unique line in memory.

A much more memory-efficient approach is to store a fixed-length hash of each line in the array rather than the line itself. You can achieve this in one line of Perl, and it is quite similar to the awk script:

perl -ne 'use Digest::MD5 qw(md5_base64); print unless $seen{md5_base64($_)}++' huge.txt

Here I used md5_base64 instead of md5_hex because the base64 encoding takes 22 bytes, while the hex representation takes 32.

However, since Perl hashes still require around 120 bytes per key, you may quickly run out of memory on your huge file.

The solution in this case is to process the file in chunks, splitting manually or using GNU Parallel with the --pipe, --keep-order and --block options (taking advantage of the fact that duplicate lines are not far apart, as you mentioned). Here is how you could do it with parallel:

cat huge.txt | pv | 
parallel --pipe --keep-order --block 100M -j4 -q \
perl -ne 'use Digest::MD5 qw(md5_base64); print unless $seen{md5_base64($_)}++' > uniq.txt

The --block 100M option tells parallel to process the input in chunks of 100MB. -j4 means start 4 processes in parallel. The important argument here is --keep-order, since you want the output to keep the original line order. I have included pv in the pipeline to get some nice statistics while the long-running process executes.

In a benchmark I performed on a 1GB file of random data, I reached 130MB/sec throughput with the above settings, which means you could de-duplicate your 40GB file in roughly 5 minutes (if your disk can keep up with that write rate).

Other options include:

  • Use an efficient trie structure to store keys and check for duplicates. For example, marisa-trie is a very efficient implementation, written in C++ with Python wrappers.
  • Sort your huge file with an external merge sort or distribution/bucket sort
  • Store your file in a database and use SELECT DISTINCT on an indexed column containing your lines or, more efficiently, MD5 sums of your lines (a rough sketch follows this list)
  • Or use Bloom filters
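
To make the database option concrete, here is a rough, untested sketch using sqlite3 (the database name and the "lines" table are made up; huge.txt and uniq.txt reuse the names from above). It relies on SQLite's implicit rowid to remember the original order and keeps only the first occurrence of each line; tab is safe as the import separator here because the question says every character matches [[:print:]], which excludes tabs:

sqlite3 dedupe.db <<'EOF'
CREATE TABLE lines(line TEXT);
.mode tabs
.import huge.txt lines
.output uniq.txt
-- keep the first occurrence of each distinct line, in original input order
SELECT line FROM lines GROUP BY line ORDER BY MIN(rowid);
EOF

This trades memory for a lot of disk space and time, since SQLite will spill both the table and the GROUP BY to disk.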

And for the Bloom filter option, here is an example using Perl's Bloom::Faster module:

perl -e 'use Bloom::Faster; my $f = new Bloom::Faster({n => 100000000, e => 0.00001}); while(<>) { print unless $f->add($_); }' huge.txt > uniq.txt

You can install Bloom::Faster from CPAN (run sudo cpan and then install "Bloom::Faster" at the prompt).

Explanation:

  • You have to specify the probabilistic error rate e and the number of available buckets n. The memory required for each bucket is about 2.5 bytes. If your file has 100 million unique lines then you will need 100 million buckets and around 260MB of memory.
  • The $f->add($_) function adds the hash of a line to the filter and returns true if the key (i.e. the line here) is a duplicate.
  • You can estimate the number of unique lines in your file by parsing a small section of it with dd if=huge.txt bs=400M count=1 | awk '!a[$0]++' | wc -l (a 400MB sample) and multiplying that number by 100 (for the 40GB total). Then set the n option a little higher to be on the safe side, as sketched below.
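
For example, a quick sketch of that estimate (the 400M sample size and the x100 factor come straight from the 40GB total; adjust both for your data):

sample_uniq=$(dd if=huge.txt bs=400M count=1 2>/dev/null | awk '!a[$0]++' | wc -l)
echo "estimated unique lines: $(( sample_uniq * 100 )) -- set n somewhat higher than this"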

In my benchmarks, this method achieved a 6MB/s processing rate. You may combine this approach with the GNU parallel suggestion above to utilize multiple cores and achieve a higher throughput.

henfiber
  • This is great. I've just sorted through 50+ files, all over 25M lines, in less than 10 min! – user2117258 Nov 08 '16 at 02:20
  • @user2117258. Glad it helped. Which method from the suggested ones did you use? Just take notice, that the "bloom filter" approach may wrongly remove a small fraction (equal to the error rate `e`) of non-duplicate lines. The `parallel + perl` approach will remove only duplicate lines, but only those in the same "chunk" (which is defined by the `--block 100M` option in `parallel`). If the duplicate lines are close enough, this is going to work. You may re-run the process multiple times, ideally with different `--block` sizes, to detect previously-missed duplicate lines in the new chunks. – henfiber Nov 08 '16 at 12:43

I don't have your data (or anything like it) handy, so I can't test this, but here's a proof of concept for you:

$ t='one\ntwo\nthree\none\nfour\nfive\n'
$ printf "$t" | nl -w14 -nrz -s, | sort -t, -k2 -u | sort -n | cut -d, -f2-
one
two
three
four
five

Our raw data includes one duplicated line. The pipes function as follows:

  • nl adds line numbers. It's a standard, low-impact unix tool.
  • sort the first time 'round sorts on the SECOND field -- what would have been the beginning of the line before nl. Adjust this as required for your data.
  • sort the second time puts things back in the order defined by the nl command.
  • cut merely strips off the line numbers. There are multiple ways to do this, but some of them depend on your OS. This one's portable, and works for my example.

Now... For obscenely large files, the sort command will need some additional options. In particular, --buffer-size and --temporary-directory. Read man sort for details about this.
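
For the 40GB file that might look something like the sketch below (the file names, buffer size, and temp directory are assumptions, so adjust them to your machine; -ba makes nl number blank lines too, LC_ALL=C speeds up the comparisons, and see the comments below about which member of a set of duplicates sort -u keeps):

export LC_ALL=C
nl -ba -w14 -nrz -s, stupid.txt \
    | sort -t, -k2 -u --buffer-size=8G --temporary-directory=/big/tmp \
    | sort -n --buffer-size=8G --temporary-directory=/big/tmp \
    | cut -d, -f2- > less_stupid.txt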

I can't say I expect this to be fast, and I suspect you'll be using a ginormous amount of disk IO, but I don't see why it wouldn't at least work.

ghoti
  • I don't know how portable `nl` is; you could easily number lines with `awk`. Personally, I'd use a fixed-length zero-filled number format (`nl -w12 -nrz`); that will allow you to be more precise about where the data starts and lets you use `cut` to remove the numbers. But a definite +1 – rici Jun 18 '15 at 04:35
  • @rici, all great points, thanks. In my experience, `nl` has been on everything from SunOS and HP/UX to Linuces and *BSD, though it was missing from MINIX 2.0. Good idea regarding the number format - I expect that would also speed up `sort`. – ghoti Jun 18 '15 at 04:44
  • Looks good as long as `sort` is guaranteed to keep the lowest line number of duplicate lines. I'm not sure it is guaranteed to do so. You'll definitely only have one line from each set of duplicates, but I'm not sure that it is guaranteed to be the lowest number from each set of duplicates. – Jonathan Leffler Jun 18 '15 at 04:59
  • An additional sub-sort on the first field might be necessary to deal with the issue @JonathanLeffler pointed out. `sort -t, -k2 -k1,1 -u` or something like that. (Though that might be default operation anyway.) – Etan Reisner Jun 18 '15 at 09:55

Assuming you can sort the file in the first place (i.e. that you can get sort file to work), then I think something like this might work (it depends on whether a large awk script file is better than a large awk array in terms of memory usage, etc.).

sort file | uniq -dc | awk '{gsub("\"", "\\\"", $0); print "$0==\""substr($0, index($0, $1) + length($1) + 1)"\"{x["NR"]++; if (x["NR"]>1){next}}"} END{print 7}' > dedupe.awk
awk -f dedupe.awk file

Which on a test input file like:

line 1
line 2
line 3
line 2
line 2
line 3
line 4
line 5
line 6

creates an awk script of:

$0=="line 2"{x[1]++; if (x[1]>1){next}}
$0=="line 3"{x[2]++; if (x[2]>1){next}}
7

and run as awk -f dedupe.awk file outputs:

line 1
line 2
line 3
line 4
line 5
line 6

If the size of the awk script itself is a problem (probably unlikely) you could cut that down by using another sentinel value something like:

sort file | uniq -dc | awk 'BEGIN{print "{f=1}"} {gsub("\"", "\\\"", $0); print "$0==\""substr($0, index($0, $1) + length($1) + 1)"\"{x["NR"]++;f=(x["NR"]<=1)}"} END{print "f"}'

which cuts seven characters off each line (six if you remove the space from the original too) and generates:

{f=1}
$0=="line 2"{x[1]++;f=(x[1]<=1)}
$0=="line 3"{x[2]++;f=(x[2]<=1)}
f

This solution will probably run slower though because it doesn't short-circuit the script as matches are found.

If runtime of the awk script is too great it might even be possible to improve the time by sorting the duplicate lines based on match count (but whether that matters is going to be fairly data dependent).
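
A sketch of that variation, reusing the generator above but feeding it the duplicate list sorted by descending count (the generated rules are mutually exclusive, so reordering them only changes how soon a match short-circuits, not the output):

sort file | uniq -dc | sort -k1,1rn \
    | awk '{gsub("\"", "\\\"", $0); print "$0==\""substr($0, index($0, $1) + length($1) + 1)"\"{x["NR"]++; if (x["NR"]>1){next}}"} END{print 7}' > dedupe.awk
awk -f dedupe.awk file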

Etan Reisner
  • Wow. You and ghoti have both come up with reasonable-sounding solutions to something which others have told me is impossible. Thanks, I will test and see how it goes. – Graham Jun 18 '15 at 04:35
  • I'm always nervous about code that writes code, but .. this is elegant. Nicely done. +1. – ghoti Jun 18 '15 at 04:50
  • @ghoti Yeah, not usually my go-to solution either (though it definitely has its place) but this case just lent itself to it really nicely. – Etan Reisner Jun 18 '15 at 07:20
  • @EtanReisner: it's a very pretty solution but the resulting script is likely to be close to the size of the file, no? You'd need a lot of memory to handle an awk script of 40GB. Also, the linear search through all the lines makes this algorithm O(n^2), right? and although we don't know how long the lines are, I suspect that `n` is quite large. But I like metaprogramming, so have an upvote. – rici Jun 18 '15 at 15:26
  • @rici That's part of why I asked if the known duplicate list is smaller than the known unique list (I assume it is by a fair margin so this should be a good bit smaller than the original input I would hope). Yes, this is going to perform fairly badly (hence my comment about sorting for frequency to try to help with that a bit). A better solution to this would be to take the list and sort by prefix and do fancier tree-based matching (what the trie and bloom filter ideas are essentially). Also I was operating under the theory that awk *might* "optimize" the script more than the array internally. – Etan Reisner Jun 18 '15 at 16:07
  • @EtanReisner: The array is a hashtable. It's possible that some `awk` might optimize a program all of whose conditions are simple string matches into an internal hash table for the initial action, but that optimization seems like a lot of work for generally little benefit. So my guess is that the array would be more efficient in both time and space. But it would certainly be worthwhile benchmarking. – rici Jun 18 '15 at 16:52
  • @rici I agree on basically all counts (and that optimization wasn't exactly what I was suggesting). However, an interpreting awk wouldn't need to store the script in memory itself at all (at the cost of repeated reading of the script from fd/cache) and that might be beneficial here (I have no idea how awk works internally). But without many more numbers speculation is hard-to-impossible. Interesting to think about nonetheless though. – Etan Reisner Jun 18 '15 at 17:02
  • @EtanReisner: Most awks precompile the program into something easier to execute; in some cases, even into a virtual machine. They don't rely on the file not changing during script execution, or (like bash) attempt to change behaviour as the file rewrites itself. – rici Jun 18 '15 at 17:32

I'd do it like this:

#! /bin/sh
usage ()
{
    echo "Usage:  ${0##*/} <file> [<lines>]" >&2
    exit 1
}


if [ $# -lt 1 -o $# -gt 2 -o ! -f "$1" ]; then usage; fi
if [ "$2" ]; then
    expr "$2" : '[1-9][0-9]*$' >/dev/null || usage
fi

LC_ALL=C
export LC_ALL

split -l ${2:-10000} -d -a 6 "$1"

for x in x*; do
    awk '!x[$0]++' "$x" >"y${x}" && rm -f "$x"
done

cat yx* | sort | uniq -d | \
    while IFS= read -r line; do
        fgrep -x -n "$line" /dev/null yx* | sort -n | sed 1d | \
            while IFS=: read -r file nr rest; do
                sed -i -e ${nr}d "$file"
            done
    done

cat yx* >uniq_"$1" && rm -f yx*

(proof of concept; needs more polishing before being used in production).

What's going on here:

  • split splits the file in chunks of 10000 lines (configurable); the chunks are named x000000, x000001, ...
  • awk removes duplicates from each chunk, without messing with the line order; the resulting files are yx000000, yx000001, ... (since awk can't portably do changes in place)
  • cat yx* | sort | uniq -d reassembles the chunks and finds the list of duplicates; because of the way the chunks were constructed, each duplicated line can appear at most once in each chunk
  • fgrep -x -n "$line" /dev/null yx* finds where each duplicated line lives; the result is a list of lines yx000005:23:some text
  • sort -n | sed 1d removes the first chunk from the list above (this is the first occurrence of the line, and it should be left alone)
  • IFS=: read -r file nr rest splits yx000005:23:some text into file=yx000005, nr=23, and the rest
  • sed -i -e ${nr}d "$file" removes line $nr from chunk $file
  • cat yx* reassembles the chunks; the zero-padded chunk names make sure the shell glob expands them in the right order.

This is probably not very fast, but I'd say it should work. Increasing the number of lines in each chunk from 10000 can speed things up, at the expense of using more memory. The operation is O(N^2) in the number of duplicate lines across chunks; with luck, this wouldn't be too large.

The above assumes GNU sed (for -i). It also assumes there are no files named x* or yx* in the current directory (that's the part that could use some cleanup, perhaps by moving the junk into a directory created by mktemp -d).

Edit: Second version, after feedback from @EtanReisner:

#! /bin/sh
usage ()
{
    echo "Usage:  ${0##*/} <file> [<lines>]" >&2
    exit 1
}


if [ $# -lt 1 -o $# -gt 2 -o ! -f "$1" ]; then usage; fi
if [ "$2" ]; then
    expr "$2" : '[1-9][0-9]*$' >/dev/null || usage
fi

tdir=$(mktemp -d -p "${TEMP:-.}" "${0##*/}_$$_XXXXXXXX") || exit 1
dupes=$(mktemp -p "${TEMP:-.}" "${0##*/}_$$_XXXXXXXX") || exit 1

trap 'rm -rf "$tdir" "$dupes"' EXIT HUP INT QUIT TERM

LC_ALL=C
export LC_ALL

split -l ${2:-10000} -d -a 6 "$1" "${tdir}/x"

ls -1 "$tdir" | while IFS= read -r x; do
    awk '!x[$0]++' "${tdir}/${x}" >"${tdir}/y${x}" && \
    rm -f "${tdir}/$x" || exit 1
done

find "$tdir" -type f -name 'yx*' | \
    xargs -n 1 cat | \
    sort | \
    uniq -d >"$dupes" || exit 1

find "$tdir" -type f -name 'yx*' -exec fgrep -x -n -f "$dupes" /dev/null {} + | \
    sed 's![^:]*/!!' | \
    sort -t: -n -k 1.3,1 -k 2,2 | \
    perl -e '
        while(<STDIN>) {
            chomp;
            m/^(yx\d+):(\d+):(.*)$/o;
            if ($dupes{$3}++)
                { push @{$del{$1}}, int($2) }
            else
                { $del{$1} ||= [] }
        }
        undef %dupes;

        chdir $ARGV[0];

        for $fn (sort <"yx*">) {
            open $fh, "<", $fn
                or die qq(open $fn: $!);
            $line = $idx = 0;
            while(<$fh>) {
                $line++;
                if ($idx < @{$del{$fn}} and $line == $del{$fn}->[$idx])
                    { $idx++ }
                else
                    { print }
            }
            close $fh
                or die qq(close $fn: $!);
            unlink $fn
                or die qq(remove $fn: $!);
        }
    ' "$tdir" >uniq_"$1" || exit 1
lcd047
  • This might work (I haven't thought it all through) but this is a tremendously costly approach. Using many times more disk space and requiring many times more passes through the file data to work (and many many calls to `sed`/etc.). – Etan Reisner Jun 18 '15 at 09:52
  • @EtanReisner If you look carefully, the space used on disk aside from the final result is only 2x the size of the input file (the initial file and the split chunks), plus the size of a chunk (at 10k lines per chunk, that's likely to be < 1MB). `sort` might need more, but that's `sort`, not my script. And since there can be only one duplicated line per chunk, `sed` won't be run _that_ many times. That's the entire point of de-duplicating the chunks first. Really, try to understand what's supposed to be going on, it isn't _that_ bad. – lcd047 Jun 18 '15 at 10:56
  • Only twice on disk total but each chunk gets duplicated during the processing loop. And yes, with splits of that size each chunk is likely to be small (but that might be a problem by itself) your glob might blow up your line length limit, directory traversal might slow down greatly since the fs might not handle a directory with that many files well, you might blow up the disk cache with out-of-order reads/writes. You are writing unique lines to the y* files. We know the total number of unique lines is greater than memory holds so it stands to reason each file will have most of those 10k lines. – Etan Reisner Jun 18 '15 at 11:21
  • I'll admit I still haven't thought this through fully so it may not be as bad as I think but it still seems strictly worse than some of the other solutions. But does have the benefit of trading disk for memory which may mean this sort of approach is the only possible approach (depending on data size, disk space and available memory). – Etan Reisner Jun 18 '15 at 11:22
  • @EtanReisner The `x` files are removed as soon as the `yx` ones are created. They aren't duplicated. Overflowing the command line buffer is a detail, it can be addressed with `xargs` and friends if needed (did you read the "proof of concept" warning above?). – lcd047 Jun 18 '15 at 11:28
  • They are duplicated while awk writes. They are removed after, I know. I'm not sure you could use xargs to handle that actually on the main `cat | sort` loop (but I could be wrong). In any case the repeated running of `sed` is going to dominate the runtime I expect. That's a *lot* of seeking through the file. Just deleting the last three lines in the file requires seeking through the entire remaining contents of the file three times. – Etan Reisner Jun 18 '15 at 11:35
  • Optimizing the sed runs into a single (or smaller N) amount of runs should be fairly simple and would probably help a lot. Also bash isn't exactly speedy at line-by-line reading of streams (but I don't know if it is slow enough to matter here but it might be). – Etan Reisner Jun 18 '15 at 11:37
  • @EtanReisner _They are duplicated while awk writes._ - Nope: `awk '!x[$0]++' "$x" >"y${x}" && rm -f "$x"`. _I'm not sure you could use xargs to handle that actually on the main `cat | sort` loop_ - `sort -n ... | xargs -n 1 cat | sort ...` – lcd047 Jun 18 '15 at 11:43
  • @EtanReisner _Optimizing the sed runs into a single (or smaller N) amount of runs should be fairly simple_ Now this is an interesting idea. I'll edit my answer later to include that (have to go right now). – lcd047 Jun 18 '15 at 11:44
  • That awk bit reads the original split file and writes a new file. So while awk is running and until the rm runs the file is duplicated. Yes, that is a *vanishingly* small period of time but it does exist and does come with the associated disk cache accesses/etc. – Etan Reisner Jun 18 '15 at 11:51
  • @EtanReisner Hence my note above _2x the size of the input file [...] plus the size of a chunk (at 10k lines per chunk, that's likely to be < 1MB)_. That's the extra chunk. Anyway, I posted an improved version. – lcd047 Jun 18 '15 at 14:55

If there's a lot of duplication, one possibility is to split the file into manageable pieces using split(1) and to use something conventional like sort/uniq to make a summary of unique lines for each piece. Each summary will be shorter than the piece it came from. After this, you can compare the summaries to arrive at an actual summary.
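
A rough sketch of that idea (the chunk size and file names are invented). It only builds the list of lines that are duplicated across pieces; pruning them from the original file while preserving order still needs another pass, which is essentially what lcd047's answer does:

split -l 10000000 -d -a 4 stupid.txt piece.
for p in piece.*; do
    sort -u "$p" > "$p.uniq"            # per-piece summary of unique lines
done
sort -m piece.*.uniq | uniq -d > cross_piece_dupes.txt   # lines occurring in more than one piece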

Noufal Ibrahim
  • If he can't hold the entire set of unique lines in memory I'm not sure how this would help. At some point he needs to be able to de-duplicate between all the files without sorting them. I suppose there might be a clever N-way de-duplication algorithm that doesn't need all the unique lines in memory at once but I don't know it if there is. – Etan Reisner Jun 18 '15 at 04:04
  • @EtanReisner This isn't a useless approach; removing duplicates across files can be done in almost reasonable time (see my answer). – lcd047 Jun 18 '15 at 08:25

Maybe not the answer you've been looking for, but here goes: use a Bloom filter (https://en.wikipedia.org/wiki/Bloom_filter). This sort of problem is one of the main reasons they exist.

Mircea