7

I want to write a bash script that finds duplicate files.

How can I add a size option?

user2913020

7 Answers

19

Don't reinvent the wheel; use the proper command:

fdupes -r dir

See http://code.google.com/p/fdupes/ (packaged on some Linux distros)
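Since the question mentions size: fdupes also has a -S/--size option to display the size of the duplicate files it reports, for example:

fdupes -r -S dir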

Gilles Quenot
  • `fdupes` has quite serious performance issues. I was trying to find dupes for around 200 very large video files. `fdupes` took around an hour to scan them. My script, which does a few simple tricks, took around 4 minutes. – Ondra Žižka May 17 '16 at 00:45
  • On Mac you can "brew install fdupes". I liked this because you can pass an extra command to auto-delete the second reference. Performance-wise, I ran this across a folder of 50,000 videos and images (some videos 1GB or larger) and it completed within 2 minutes or so. Be sure to do a test run first before deleting. – Mauvis Ledford Jan 02 '20 at 06:53
17
find . -not -empty -type f -printf "%s\n" | sort -rn | uniq -d |\
xargs -I{} -n1 find . -type f -size {}c -print0 | xargs -0 md5sum |\
sort | uniq -w32 --all-repeated=separate

This is how you'd want to do it. This code locates dups based on size first, then MD5 hash. Note the use of -size, in relation to your question. Enjoy. Assumes you want to search in the current directory. If not, change the find . to be appropriate for the directory(ies) you'd like to search.
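If you'd rather not rely on MD5 (see the comments below), the same size-then-hash approach should work with sha256sum; the only change, as far as I can tell, is widening the uniq prefix to the 64-character digest:

find . -not -empty -type f -printf "%s\n" | sort -rn | uniq -d |\
xargs -I{} -n1 find . -type f -size {}c -print0 | xargs -0 sha256sum |\
sort | uniq -w64 --all-repeated=separate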

chicks
Alex Atkinson
  • Why still use md5 when sha1 is often installed by default? md5 has had known collision issues since their discovery around a decade ago. – Gilles Quenot Dec 18 '20 at 00:10
  • Nice catch, @GillesQuenot, but both md5 and sha1 are broken. These days I'd suggest using sha256/sha512, but be pragmatic when weighing performance against security gains. md5sum/sha1 are still fine for some use cases. – Alex Atkinson Jan 25 '21 at 18:01
  • How can we amend this to delete all occurrences of duplicates except the first? Preferably keeping the output to inform us of what's happening. – Redsandro Mar 28 '21 at 17:45
2

find /path/to/folder1 /path/to/folder2 -type f -printf "%f %s\n" | sort | uniq -d

The find command looks in two folders for files, prints the file name only (stripping the leading directories) and the size, then sorts and shows only the dupes. This does assume there are no newlines in the file names.
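If you want to go a step further and confirm the candidates by content rather than by name and size alone, a hashing pass in the spirit of the md5sum pipeline above can be chained on (my own addition, not part of the original answer):

find /path/to/folder1 /path/to/folder2 -type f -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate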

Drake Clarris
1

Normally I use fdupes -r -S .. But when I search for duplicates among a smaller number of very large files, fdupes takes very long to finish, as it (I guess) computes a full checksum of each whole file.

I've avoided that by comparing only the first 1 megabyte. It's not super-safe, and you have to check whether it's really a duplicate if you want to be 100 % sure. But the chance of two different videos (my case) having the same first megabyte but different further content is rather theoretical.

So I have written this script. Another trick it uses to speed things up is that it stores the resulting hash for each path in a file. I rely on the fact that the files don't change.

I paste this code into a console rather than running it as a script - for that it would need some more work, but it gives you the idea:

touch md5-partial.txt   # hash cache; create it so the greps below don't complain on the first run

find -type f -size +3M -print0 | while IFS= read -r -d '' i; do
  echo -n '.'
  if grep -q "$i" md5-partial.txt; then
    echo -n ':'   # already hashed on a previous run, skip
    continue
  fi
  # Hash only the first 1 MB of the file
  MD5=$(dd bs=1M count=1 if="$i" status=none | md5sum)
  MD5=$(echo "$MD5" | cut -d' ' -f1)
  if grep "$MD5" md5-partial.txt; then echo -e "\nDuplicate: $i"; fi
  echo "$MD5 $i" >> md5-partial.txt
done

## Show the duplicates
#sort md5-partial.txt | uniq  --check-chars=32 -d -c | sort -b -n | cut -c 9-40 | xargs -I '{}' sh -c "grep '{}'  md5-partial.txt && echo"

Another bash snippet which I use to determine the largest duplicate files:

## Show wasted space
sort md5-partial.txt | uniq --check-chars=32 -d -c | while IFS= read -r LINE; do
  # uniq -c output: 7-char count plus a space, then the 32-char hash, a space, then the path
  HASH=$(echo "$LINE" | cut -c 9-40)
  FILE=$(echo "$LINE" | cut -c 42-)   # FILE, not PATH, so we don't clobber $PATH
  ls -l "$FILE" | cut -c 26-34        # size column of "ls -l"; column positions may vary
done

Both these scripts have a lot of room for improvement; feel free to contribute - here is the gist :)

Ondra Žižka
1

This might be a late answer, but there are much faster alternatives to fdupes now.

  1. fslint/findup
  2. jdupes, which is supposed to be a faster replacement for fdupes

I had time to do a small test. For a folder with 54,000 files and a total size of 17G, on a standard (8 vCPU/30G) Google Virtual Machine:

  • fdupes takes 2m 47.082s
  • findup takes 13.556s
  • jdupes takes 0.165s

However, my experience is that, if your folder is too large, the time might become very long too (hours, if not days), since pairwise comparisons (or sorting, at best) and extremely memory-hungry operations soon become unbearably slow. Running a task like this on an entire disk is out of the question.
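For reference, the basic jdupes invocation mirrors fdupes; a minimal example (the path is just a placeholder):

jdupes -r /path/to/dir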

Peacher Wu
0

You can make use of cmp to check whether same-named files in two folders have identical contents, like this:

#!/bin/bash

folder1="$1"
folder2="$2"
log=~/log.txt

for i in "$folder1"/*; do
    filename="${i##*/}"   # strip the leading directory to get the bare file name
    # cmp --silent succeeds only when both files exist and are byte-identical
    cmp --silent "$folder1/$filename" "$folder2/$filename" && echo "$filename" >> "$log"
done
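Assuming the script is saved as, say, compare.sh (the name is just an example), you would call it with the two folders to compare; identical same-named files are appended to ~/log.txt:

bash compare.sh /path/to/folder1 /path/to/folder2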
anubhava
0

If you can't use *dupes for any reason and the number of files is very high, the sort+uniq approach won't perform well. In that case you could use something like this:

find . -not -empty -type f -printf "%012s" -exec md5sum {} \; | awk 'x[substr($0, 1, 44)]++'

find will create a line for each file with the file size in bytes (I used 12 positions, but YMMV) followed by the md5 hash of the file (plus the name).
awk will filter the results without needing them to be sorted first: x[key]++ is false (zero) the first time a given size+hash key is seen, so only the second and later occurrences are printed. The 44 stands for 12 (for the file size) + 32 (the length of the hash). If you need some explanation about the awk program you can see the basics here.
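If you only want the duplicate file names without the size/hash prefix, one option (my own addition, assuming GNU md5sum's two-space separator and sizes that fit in 12 digits) is to strip the fixed-width key afterwards:

find . -not -empty -type f -printf "%012s" -exec md5sum {} \; | awk 'x[substr($0, 1, 44)]++' | cut -c 47-

The 47 is 12 (size) + 32 (hash) + 2 (md5sum's separator) + 1, i.e. the column where the file name starts.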

jlaraval