Does Anyone Make a Photo De-Duplicator For Linux? Something That Reads EXIF? 243
postbigbang writes "Imagine having thousands of images on disparate machines. Many are dupes, even across the disparate machines. It's impossible to delete all the dupes manually and create a single, accurate photo image base. Is there an app out there that can scan a file system, perhaps a target sub-folder system, and suck in the images -- WITHOUT creating duplicates -- perhaps by reading EXIF info or hashes? I have eleven file systems saved, and the task of eliminating dupes seems impossible."
findimagedupes in Debian (Score:5, Interesting)
whatever you decide on, it could probably be done in a hundred lines of perl
Funny you mention perl.
There's a tool written in perl called "findimagedupes" in Debian [debian.org]. Pretty awesome tool for large image collections, because it can identify duplicates even if they have been resized or messed with a little (e.g. by adding logos). Point it at a directory, and it'll find all the dupes for you.
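A minimal sketch of how you'd invoke it (the --recurse flag is from memory of the manpage, so verify locally; the directory is just an example):

```shell
# Guarded invocation: scan a tree for visually similar images if the
# tool is available, otherwise print the Debian install hint.
if command -v findimagedupes >/dev/null 2>&1; then
    # Prints groups of similar images, one group per line.
    findimagedupes --recurse "${1:-$HOME/Pictures}"
    status=ok
else
    status=missing
    echo "findimagedupes not installed; try: apt-get install findimagedupes"
fi
```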
Re:Seriously? (Score:4, Interesting)
Yeah. Thanks. It's a simple question. So far, I've seen scripting suggestions, which might be useful. I'm a nerd, but not wanting to do much code because I'm really rusty at it. Instead, I'm amazed that no one runs into this problem and has built an app that does this. That's all I'm looking for: consolidation.
Quick shell script using exiftool (Score:5, Interesting)
This will help find exact matches by exif data. It will not find near-matches unless they have the same exif data. If you want that, good luck. Geeqie [sourceforge.net] has a find-similar command, but it's only so good (image search is hard!). Apparently there's also a findimagedupes tool available, see comments above (I wrote this before seeing that and had assumed apt-cache search had already been exhausted).
I would write a script that runs exiftool on each file you want to test, removes the items that refer to timestamp, file name, path, etc., and then makes an md5 of what's left.
Something like this exif_hash.sh (sorry, slashdot eats whitespace so this is not indented):
#!/bin/sh
for image in "$@"; do
echo "$(exiftool "$image" | grep -ve '20..:..:' -e '19..:..:' -e File -e Directory | md5sum) $image"
done
And then run:
find [list of paths] -type f -print0 |xargs -0 ./exif_hash.sh |sort > output
If you have a really large list of images, do not run this through sort. Just pipe it into your output file and sort it later. It's possible that the sort utility can't deal with the size of the list (you can work around this by using grep '^[0-8]' output |sort >output-1 and grep -v '^[0-8]' output |sort >output-2, then cat output-1 output-2 > output.sorted or thereabouts; you may need more than two passes).
There are other things you can do to display these, e.g. awk '{print $1}' output |uniq -c |sort -n to rank them by hash.
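Once the output is sorted, identical hashes sit on adjacent lines, so you can pull out just the duplicated groups with a little awk. A minimal sketch (the hashes and paths here are invented sample data):

```shell
# Fake "hash filename" output standing in for the sorted exif_hash.sh results.
cat > /tmp/hashes.sorted <<'EOF'
aaa /photos/a.jpg
aaa /photos/copy-of-a.jpg
bbb /photos/b.jpg
EOF

# Collect filenames per hash; print only hashes seen more than once.
awk '{files[$1] = files[$1] " " $2; count[$1]++}
     END {for (h in count) if (count[h] > 1) print h ":" files[h]}' /tmp/hashes.sorted
```

That gives you one line per duplicate group, which is a lot easier to act on than raw sorted output.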
On Debian, exiftool is part of the libimage-exiftool-perl package. If you know perl, you can write this with far more precision (I figured this would be an easier explanation for non-coders).
Re:findimagedupes in Debian (Score:4, Interesting)
Why do I have this sneaking suspicion it runs in exponential time, varying as the size of the data set...
From what this user is talking about (multiple drives full of images), they may well have reached the point where it is impossible to sort out the dupes without one hell of a heavy hitting cluster to do the comparisons and sorting.
Re:write it yourself (Score:4, Interesting)
ExifTool is probably your best start:
http://www.sno.phy.queensu.ca/~phil/exiftool/
find . -type f -print0 | xargs -0 md5sum | sort | uniq -w32 -D
uniq's -w32 compares only the 32-character md5 at the start of each line, and -D (GNU --all-repeated) prints every line of each group of identical md5sums, so you see the duplicate files as a group.
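A self-contained sketch of that pipeline, with a throwaway directory and invented file names so you can see the grouping:

```shell
# Build a tiny tree containing one exact duplicate.
demo=$(mktemp -d)
printf 'same bytes' > "$demo/a.jpg"
printf 'same bytes' > "$demo/a-copy.jpg"
printf 'different'  > "$demo/b.jpg"

# md5sum prints 32 hex chars, two spaces, then the path, so uniq -w32
# compares hashes only; -D prints every member of each duplicated group.
find "$demo" -type f -print0 | xargs -0 md5sum | sort | uniq -w32 -D
```

Only a.jpg and a-copy.jpg show up; b.jpg has a unique hash and is filtered out.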
For multiple machines, copy each machine's hash list over to the next machine and concatenate it with the local list before sorting....
Yes, EXIF helps, but some editors carry over EXIF data from the original...
The serious will cmp files byte-for-byte as well before deleting.
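That byte-for-byte check is easy to script. A cautious sketch (file names invented for the demo) that removes a candidate only when cmp confirms it really is identical to the keeper:

```shell
# Throwaway directory with one true dupe and one false positive.
dir=$(mktemp -d)
printf 'pixels' > "$dir/keep.jpg"
printf 'pixels' > "$dir/dupe.jpg"
printf 'other'  > "$dir/unique.jpg"

# delete_if_identical KEEP CANDIDATE: remove CANDIDATE only when it is
# byte-for-byte identical to KEEP (cmp -s is silent, exits 0 on a match).
delete_if_identical() {
    if cmp -s "$1" "$2"; then
        rm -- "$2"
    fi
}

delete_if_identical "$dir/keep.jpg" "$dir/dupe.jpg"     # removed
delete_if_identical "$dir/keep.jpg" "$dir/unique.jpg"   # kept
ls "$dir"
```

The cmp guard means a hash collision (or a sloppy hash) can never cost you a file that only looked like a dupe.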