
Does Anyone Make a Photo De-Duplicator For Linux? Something That Reads EXIF?

Posted by timothy
from the which-ones-are-not-like-the-others? dept.
postbigbang writes "Imagine having thousands of images on disparate machines. Many are dupes, even among the disparate machines. It's impossible to delete all the dupes manually and create a singular, accurate photo image base. Is there an app out there that can scan a file system, perhaps a target sub-folder system, and suck in the images WITHOUT creating duplicates? Perhaps by reading EXIF info or hashes? I have eleven file systems saved, and the task of eliminating dupes seems impossible."

  • by nemesisrocks (1464705) on Thursday January 23, 2014 @06:47PM (#46051473) Homepage

    whatever you decide on, it could probably be done in a hundred lines of perl

    Funny you mention perl.

    There's a tool written in perl called "findimagedupes" in Debian. Pretty awesome tool for large image collections, because it could identify duplicates even if they had been resized, or messed with a little (e.g. adding logos, etc). Point it at a directory, and it'll find all the dupes for you.

  • Re:Seriously? (Score:4, Interesting)

    by postbigbang (761081) on Thursday January 23, 2014 @06:56PM (#46051557)

    Yeah. Thanks. It's a simple question. So far, I've seen scripting suggestions, which might be useful. I'm a nerd, but not wanting to do much code because I'm really rusty at it. Instead, I'm amazed that no one runs into this problem and has built an app that does this. That's all I'm looking for: consolidation.

  • by Khopesh (112447) on Thursday January 23, 2014 @07:07PM (#46051689) Homepage Journal

    This will help find exact matches by exif data. It will not find near-matches unless they have the same exif data. If you want that, good luck. Geeqie has a find-similar command, but it's only so good (image search is hard!). Apparently there's also a findimagedupes tool available, see comments above (I wrote this before seeing that and had assumed apt-cache search had already been exhausted).

    I would write a script that runs exiftool on each file you want to test. Remove the items that refer to timestamp, file name, path, etc., then make an md5 of what remains.

    Something like this (sorry, slashdot eats whitespace so this is not indented):

    for image in "$@"; do
    # hash the EXIF dump minus timestamp, file name and directory lines
    echo "$(exiftool "$image" | grep -v -e '20..:..:' -e '19..:..:' -e File -e Directory | md5sum) $image"
    done

    And then run:

    find [list of paths] -type f -print0 |xargs -0 [the script above] |sort > output

    If you have a really large list of images, do not run this through sort. Just pipe it into your output file and sort it later. It's possible that the sort utility can't deal with the size of the list (you can work around this by using grep '^[0-8]' output |sort >output-1 and grep -v '^[0-8]' output |sort >output-2, then cat output-1 output-2 > output.sorted or thereabouts; you may need more than two passes).

    There are other things you can do to display these, e.g. awk '{print $1}' output |uniq -c |sort -n to count how many files share each hash (the most-duplicated hashes end up last).

    On Debian, exiftool is part of the libimage-exiftool-perl package. If you know perl, you can write this with far more precision (I figured this would be an easier explanation for non-coders).
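One way to turn that output file into a list of duplicate groups is sketched below. It assumes only that each line starts with the hash, followed by whitespace and then whatever identifies the file. The function name print_dupe_groups is invented for this illustration and is not part of any tool mentioned in the thread.

```shell
#!/bin/sh
# print_dupe_groups: hypothetical helper, not part of exiftool or findimagedupes.
# Reads lines of the form "<hash> <rest of line>" on stdin and prints only the
# lines whose hash occurs more than once, with a blank line between groups.
print_dupe_groups() {
    LC_ALL=C sort | awk '
        NF == 0 { next }
        {
            hash = $1
            if (hash == prev) {
                # Second or later member of a group: flush the held first line once.
                if (held != "") {
                    if (printed) print ""
                    print held
                    held = ""
                    printed = 1
                }
                print
            } else {
                # New hash: hold the line until we know whether it has a twin.
                held = $0
                prev = hash
            }
        }'
}
```

Something like print_dupe_groups < output would then show only the files that share a hash, leaving the unique ones out. Because the input is sorted first, it should not matter whether the output file was already sorted.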

  • by msobkow (48369) on Thursday January 23, 2014 @07:12PM (#46051739) Homepage Journal

    Why do I have this sneaking suspicion it runs in exponential time, varying as the size of the data set...

    From what this user is talking about (multiple drives full of images), they may well have reached the point where it is impossible to sort out the dupes without one hell of a heavy hitting cluster to do the comparisons and sorting.

  • by niftymitch (1625721) on Thursday January 23, 2014 @11:31PM (#46053421)

    ExifTool is probably your best start:

    find . -type f -print0 | xargs -0 md5sum | sort -flags | uniq -flags

    uniq has flags that let you see lines with identical md5sums together.

    With multiple machines, drag the full checksum list to the next machine and
    concatenate that machine's local list onto it....

    Yes, EXIF helps, but some editors attach EXIF data from the original...
    The serious might cmp files as well before deleting.
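The md5sum pipeline and the cmp check from the comment above could be fleshed out roughly as follows. This is a sketch that assumes GNU findutils and coreutils (xargs -r, and uniq with -w and --all-repeated). The function names list_identical and same_bytes are made up for illustration.

```shell
#!/bin/sh
# Sketch only; assumes GNU findutils and coreutils. Function names are hypothetical.

# list_identical DIR: print "md5  path" lines for every file under DIR whose
# MD5 is shared with at least one other file, groups separated by blank lines.
list_identical() {
    find "$1" -type f -print0 |
        xargs -0 -r md5sum |
        LC_ALL=C sort |
        uniq -w32 --all-repeated=separate
}

# same_bytes A B: succeed only if the two files are byte-for-byte identical,
# as a last check before deleting either of them.
same_bytes() {
    cmp -s -- "$1" "$2"
}
```

Here uniq -w32 compares only the first 32 characters of each line, which is the length of an MD5 in hex, so lines are grouped by checksum and the file name is ignored. Nothing is deleted by either function; they only report.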
