Does Anyone Make a Photo De-Duplicator For Linux? Something That Reads EXIF? 243
postbigbang writes "Imagine having thousands of images on disparate machines. Many are dupes, even among the disparate machines. It's impossible to delete all the dupes manually and create a singular, accurate photo image base. Is there an app out there that can scan a file system, perhaps a target sub-folder system, and suck in the images -- WITHOUT creating duplicates? Perhaps by reading EXIF info or hashes? I have eleven file systems saved, and the task of eliminating dupes seems impossible."
fdupes -rd (Score:5, Informative)
I've had the same problem, compounded by my stupidly trying to make the world a better place by renaming files or sorting them into sub-directories.
fdupes will do a bit-wise comparison. -r = recurse. -d = delete.
fdupes would be the fastest way.
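For reference, fdupes matches files by size and hash first, then confirms with a byte-for-byte comparison before it calls anything a duplicate. A minimal sketch of that final check with plain coreutils (throwaway filenames invented for illustration):

```shell
#!/bin/sh
# cmp -s is silent and exits 0 only if the two files are byte-identical;
# this is the last step fdupes runs after sizes and hashes already match.
printf 'same bytes\n' > a.jpg
printf 'same bytes\n' > b.jpg
if cmp -s a.jpg b.jpg; then
    echo "duplicate"
else
    echo "different"
fi
```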
Re:write it yourself (Score:5, Informative)
ExifTool is probably your best start:
http://www.sno.phy.queensu.ca/~phil/exiftool/
Write a quick script. (Score:5, Informative)
If they are identical then their hashes should be identical.
So write a script that generates hashes for each of them and checks for duplicate hashes.
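A minimal sketch of that idea (the directory and filenames are invented for illustration):

```shell
#!/bin/sh
# Hash every file, sort by hash, then print only the groups of files
# whose hashes repeat; those are the byte-identical duplicates.
mkdir -p photos
printf 'pixels\n' > photos/img1.jpg
printf 'pixels\n' > photos/img2.jpg   # same content as img1.jpg
printf 'other\n'  > photos/img3.jpg

find photos -type f -print0 | xargs -0 md5sum | sort > hashes.txt
# -w32 compares only the 32-char md5; --all-repeated prints every member
uniq -w32 --all-repeated=separate hashes.txt
```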
fslint (Score:3, Informative)
fslint is a toolkit to find all redundant disk usage (duplicate files
for e.g.). It includes a GUI as well as a command line interface.
http://www.pixelbeat.org/fslin... [pixelbeat.org]
Geeqie (Score:5, Informative)
Works excellently for this.
Re:ZFS dedup (Score:3, Informative)
Have you read the zfs documentation? Setting zfs dedup does not remove duplicate files (per OP request, since there are eleven different file systems), but removes redundant storage for files which are duplicates. In other words, if you have the exact same picture in three different folders/subdirectories on the same file system, zfs will only allocate storage for one copy of the data, and point the three file entries to that one copy. Similar to how hard links work in ext2 and friends.
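The hard-link analogy is easy to see directly (a throwaway sketch, filenames invented):

```shell
#!/bin/sh
# Two directory entries, one copy of the data: both names resolve to the
# same inode, and the link count is 2. ZFS dedup does something similar
# transparently, at the block level, for independently written files.
printf 'photo data\n' > original.jpg
ln original.jpg copy.jpg
stat -c '%i %h %n' original.jpg copy.jpg
```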
General case (Score:5, Informative)
For the general case (any file), I've used this script:
#!/bin/sh
OUTF=rem-duplicates.sh;
echo "#! /bin/sh" > $OUTF;
find "$@" -type f -print0 |
xargs -0 -n1 md5sum |
sort --key=1,32 | uniq -w 32 -d --all-repeated=separate |
sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> $OUTF;
chmod a+x $OUTF; ls -l $OUTF
It should be straightforward to change "md5sum" to some other key -- e.g. EXIF Date + some other EXIF fields.
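For instance, keyed on metadata instead of content. The sketch below uses mtime + size via stat as a runnable stand-in; with exiftool you would emit CreateDate plus ImageSize instead (an assumption that those fields are stable across your copies):

```shell
#!/bin/sh
# Build a metadata key per file; keys shared by more than one file flag
# probable duplicates even when the bytes differ (e.g. re-saved JPEGs).
mkdir -p photos
printf '1234\n' > photos/a.jpg
printf '5678\n' > photos/b.jpg
touch -d '2014-01-23 12:00:00' photos/a.jpg photos/b.jpg

for f in photos/*.jpg; do
    printf '%s %s\n' "$(stat -c '%Y_%s' "$f")" "$f"
done | sort > keys.txt

cut -d' ' -f1 keys.txt | uniq -d    # keys shared by more than one file
```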
(Also, isn't this really a question for superuser.com or similar?)
Re:write it yourself (Score:5, Informative)
I second exiftool. Lots of options to rename files. If you rename files based on create time and perhaps other fields like resolution, you will end up with unique filenames, and then you can filter out the duplicates.
Here is a quick command which will rename every file in a directory according to CreateDate:
exiftool "-FileName&lt;CreateDate" -d "%Y%m%d_%H%M%S.%%e" DIR
If the files were all captured with the same device it is probably super easy since the exif info will be consistent. If the files are from lots of different sources...good luck.
Re:Don't reinvent the wheel: fdupes, md5deep, gqvi (Score:5, Informative)
Yeah, this Ask Slashdot should really be about teaching people how to search for packages in aptitude or whatever your package manager is...
Here are some others:
findimagedupes
Finds visually similar or duplicate images
findimagedupes is a commandline utility which performs a rough "visual diff" of
two images. This allows you to compare two images or a whole tree of images and
determine if any are similar or identical. On common image types,
findimagedupes seems to be around 98% accurate.
Homepage: http://www.jhnc.org/findimaged... [jhnc.org]
fslint :
kleansweep : ...
File cleaner for KDE
KleanSweep allows you to reclaim disk space by finding unneeded files. It can
search for files based on several criteria; you can look for:
* empty files
* empty directories
* backup files
* broken symbolic links
* broken executables (executables with missing libraries)
* dead menu entries (.desktop files pointing to non-existing executables)
* duplicated files
Homepage: http://linux.bydg.org/~yogin/ [bydg.org]
komparator :
directories comparator for KDE
Komparator is an application that searches and synchronizes two directories. It
discovers duplicate, newer or missing files and empty folders. It works on
local and network or kioslave protocol folders.
Homepage: http://komparator.sourceforge.... [sourceforge.net]
backuppc : (just in case this was related to your intended use case for some reason)
high-performance, enterprise-grade system for backing up PCs
BackupPC is disk based and not tape based. This particularity allows features
not found in any other backup solution:
* Clever pooling scheme minimizes disk storage and disk I/O. Identical files
across multiple backups of the same or different PC are stored only once
resulting in substantial savings in disk storage and disk writes. Also known
as "data deduplication".
I bet if you throw Picasa at your combined images directory, it might have some kind of "similar image" detection too, particularly since it sorts everything by EXIF timestamp.
That said, I've never had to use any of this stuff, because my habit was to rename my camera image dumps to a timestamped directory (e.g. 20140123_DCIM ) to begin with, and upload it to its final resting place on my main file server immediately, so I know all other copies I encounter on other household machines are redundant.
Re:Seriously? (Score:5, Informative)
See my earlier contribution: geeqie. It will even scan for image similarity, not just rudimentary hashing. Someone else mentioned gqview & that it was out of date - geeqie is what gqview became.
Re:You don't need software for this (Score:4, Informative)
Adjust as needed:
find ./ -type f -iname '*.jpg' -exec md5sum {} \; > image_md5.txt
cat image_md5.txt | cut -d" " -f1 | sort | uniq -d | while read md5; do grep $md5 image_md5.txt; done
Re:findimagedupes in Debian (Score:3, Informative)
Why do I have this sneaking suspicion it runs in exponential time, varying as the size of the data set...
It's actually pretty nifty how findimagedupes works. It creates a 16x16 thumbnail of each image (it's a little more complicated than that -- read more on the manpage [jhnc.org]), and uses this as a fingerprint. Fingerprints are then compared using an algorithm that looks like O(n^2).
I doubt the difference between O(2^n) and O(n^2) would make a huge impact anyway: the biggest bottleneck is going to be disk read and seek time, not comparing fingerprints. It's akin to running compression on a filesystem: read speed is an order of magnitude slower than the compression.
Re:General case (Score:4, Informative)
(Also, isn't this really a question for superuser.com or similar?)
Possibly ;-)
http://superuser.com/questions... [superuser.com]
http://en.wikipedia.org/wiki/List_of_duplicate_fil (Score:2, Informative)
http://en.wikipedia.org/wiki/List_of_duplicate_file_finders
Re:I think I wrote one of these. (Score:2, Informative)
This "just write it" is foolishness of the highest order. For many of us non-programmers, "just write it" is like telling someone living in Florida to "just build a plane and fly to that concert in Vienna after work tomorrow".
Computer literacy used to involve typing a terminal command. All the PC folks in the 80's and 90's did it. I can't be fucked to care if folks are too stupid to learn how to use their computers. If you can't "write it yourself" in this instance, which amounts to running an operation across a set of files, then sorting the result, then you do not know how to use a computer. You know how to use some applications and input devices. It's a big difference.
This is part of the reason Windows is successful: think of a problem, and there is likely a program out there that solves it already; if there isn't one, someone will soon write one.
Which is why it's a nightmare to administer Windows. MS had to create a fucking scripting terminal, PowerShell, because they ditched DOS and didn't expose OS features to a terminal... Now go press the Towel key to open Windows 8's start screen. Start typing... AT A NEUTERED TERMINAL... ugh. Sometimes, it's better to not have to wait for someone to create something for you, especially when it's something very easy to do. You would FIRE a secretary who could not sort a set of physical files by customer ID and remove duplicates, or add up totals with a calculator, etc. Your standard for a computer "operator" is so low it's pitiable.
If you paid attention to the thread, you'd have noticed that nothing you said about Windows is exclusive to windows. Indeed, a Google search for any OS would have turned up solutions for it. Some would be a few lines of BASH or Perl, Powershell, BATCH scripts, etc. Some would be 'free' programs, some of those would have adware, some would have malware. At least the ones in the FLOSS repositories wouldn't.
The OS exposes your computer's features to you. If you do not know how to write a simple set of instructions for it to follow, then you do not know how to use a computer.
Re:write it yourself (Score:3, Informative)
I don't know if it will work under Wine, but it's worth a try.
Re:Hashes should be relatively easy (Score:4, Informative)
md5 is a 128-bit hash. Assuming you're not trying to create collisions, the odds of getting a collision among n files is:
p = 1 - (2^128)! / ((2^128 - n)! * (2^128)^n)
This is an expression that starts at 0 and gradually goes to 1 as n goes to infinity.
These numbers are so big, I have no idea how to even solve for n to get something like p = 0.0001% without using a bignumber package, but I imagine n would have to be *REALLY* big in order to get a p significantly above 0.
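Actually, no bignum package is needed: the standard birthday-bound approximation (assuming md5 behaves like a random 128-bit function) is

p ≈ 1 - e^(-n(n-1) / 2^129)

so n ≈ sqrt(2^129 · ln(1/(1-p))). For p = 0.0001% that works out to roughly 2.6×10^16 files, and you'd need about 2.2×10^19 files before a collision becomes a coin flip.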
Visipics is excellent. (Score:3, Informative)
I use VisiPics for Windows. It's free software that actually analyses the content of images to find duplicates. This works very well, because images may not have EXIF data, or the same image may exist in different file sizes or formats.
I don't know if it will work under Wine, but it's worth a try.
Visipics is the only tool I have ever found that will reliably use image matching to dedupe; it is Windows only but I have used it on my own collections & it works very well indeed: http://www.visipics.info/ [visipics.info]
Now (v1.31) understands .raw as well as all other main image formats & can handle rotated images; brilliant little program!