Want to read Slashdot from your mobile device? Point it at m.slashdot.org and keep reading!

Does Anyone Make a Photo De-Duplicator For Linux? Something That Reads EXIF? 243

Posted by timothy on Thursday January 23, 2014 @06:32PM from the which-ones-are-not-like-the-others? dept.

postbigbang writes "Imagine having thousands of images on disparate machines. many are dupes, even among the disparate machines. It's impossible to delete all the dupes manually and create a singular, accurate photo image base? Is there an app out there that can scan a file system, perhaps a target sub-folder system, and suck in the images-- WITHOUT creating duplicates? Perhaps by reading EXIF info or hashes? I have eleven file systems saved, and the task of eliminating dupes seems impossible."

This discussion has been archived. No new comments can be posted.

Does Anyone Make a Photo De-Duplicator For Linux? Something That Reads EXIF?

Load All Comments

Search 243 Comments Log In/Create an Account

Comments Filter:

write it yourself (Score:2, Insightful)

by retchdog ( 1319261 ) writes:

exactly what you mean by deduplication is kind of vague, but whatever you decide on, it could probably be done in a hundred lines of perl (using CPAN libraries of course).
- Re:write it yourself (Score:5, Informative)
  
  by Anonymous Coward writes: on Thursday January 23, 2014 @06:38PM (#46051341)
  
  ExifTool is probably your best start:
  http://www.sno.phy.queensu.ca/~phil/exiftool/
  
  Parent Share
  twitter facebook
  - Re:write it yourself (Score:5, Informative)
    
    by shipofgold ( 911683 ) writes: on Thursday January 23, 2014 @07:01PM (#46051619)
    
    I second exiftool. Lots of options to rename files. If you rename files based on createtime and perhaps other fields like resolution you will end up with unique filenames and then you can filter the duplicates
    Here is a quick command which will rename every file in a directory according to createDate
    exiftool "-FileNameCreateDate" -d "%Y%m%d_%H%M%S.%%e" DIR
    If the files were all captured with the same device it is probably super easy since the exif info will be consistent. If the files are from lots of different sources...good luck.
    
    Parent Share
    twitter facebook
  - Re: (Score:3, Informative)
    
    by Anonymous Coward writes:
    
    I use VisiPics for Windows. It's a free software that actually analyses the content of images to find duplicates. This works very well because images may not have exif data or the same image may be different file sizes or formats.
    
    I don't know if it will work under Wine, but it's worth a try.
    - Visipics is excellent. (Score:3, Informative)
      
      by micronicos ( 344307 ) writes:
      
      I use VisiPics for Windows. It's a free software that actually analyses the content of images to find duplicates. This works very well because images may not have exif data or the same image may be different file sizes or formats.
      I don't know if it will work under Wine, but it's worth a try.
      Visipics is the only tool I have ever found that will reliably use image matching to dedupe; it is Windows only but I have used it on my own collections & it works very well indeed: http://www.visipics.info/ [visipics.info]
      Now (v1.31) understands .raw as well as all other main image formats & can handle rotated images; brilliant little program!
      - Re: (Score:3)
        
        by DMUTPeregrine ( 612791 ) writes:
        
        I've used <a href="http://www.duplicate-finder.com/photo.html">Duplicate Photo Finder</a> for a while, but VisiPics looks like it's probably better. That said, I have tested and Duplicate Photo Finder worked for me with WINE.
  - Re:write it yourself (Score:4, Interesting)
    
    by niftymitch ( 1625721 ) writes: on Thursday January 23, 2014 @11:31PM (#46053421)
    
    ExifTool is probably your best start:
    http://www.sno.phy.queensu.ca/~phil/exiftool/
    find . -print0 | xargs -0 md5sum | sort -flags | uniq -flags
    There are flags in uniq to let you see pairs of identical md5sums as a pair.
    Multiple machines drag the full file to the next machine and concat the
    local files....
    Yes exif helps. but some editors attach exif data from the original...
    The serious might cmp files as well before deleting.
    
    Parent Share
    twitter facebook
- Re: (Score:2)
  
  by postbigbang ( 761081 ) writes:
  
  Imagine tons of iterative backups of photos. Generations of backups. Now they need consolidation. Something that can look at file systems, vacuum the files-- but only one of each photo, even if there are many copies of that photo, as in myphoto(1).jpg, etc.
  - General case (Score:5, Informative)
    
    by xaxa ( 988988 ) writes: on Thursday January 23, 2014 @07:00PM (#46051603)
    
    For the general case (any file), I've used this script:
    #!/bin/sh
    OUTF=rem-duplicates.sh;
    echo "#! /bin/sh" > $OUTF;
    find "$@" -type f -print0 | xargs -0 -n1 md5sum | sort --key=1,32 | uniq -w 32 -d --all-repeated=separate | sed -r 's/^[0-9a-f]*( )*//;s/([^a-zA-Z0-9./_-])/\\\1/g;s/(.+)/#rm \1/' >> $OUTF;
    chmod a+x $OUTF; ls -l $OUTF
    It should be straightforward to change "md5sum" to some other key -- e.g. EXIF Date + some other EXIF fields.
    (Also, isn't this really a question for superuser.com or similar?)
    
    Parent Share
    twitter facebook
    - Re:General case (Score:4, Informative)
      
      by Forever Wondering ( 2506940 ) writes: on Thursday January 23, 2014 @07:31PM (#46051883)
      
      (Also, isn't this really a question for superuser.com or similar?)
      Possibly ;-)
      http://superuser.com/questions... [superuser.com]
      
      Parent Share
      twitter facebook
      - Re:General case (Score:4, Funny)
        
        by Yakasha ( 42321 ) writes: on Thursday January 23, 2014 @09:21PM (#46052753) Homepage
        
        (Also, isn't this really a question for superuser.com or similar?)
        Possibly ;-) http://superuser.com/questions... [superuser.com]
        So adapt the script to de-dupe stories?
        But then if we did that... what would we read on /.?
        
        Parent Share
        twitter facebook
        
        Re: (Score:2)
        
        by ihtoit ( 3393327 ) writes:
        
        oh, there'll be plenty of spamvertisements for penis extensions buried in the sea of rejected submissions somewhere...
  - Re: (Score:2)
    
    by vux984 ( 928602 ) writes:
    
    If the files are in fact identical internally, just backups and backups of backups then it should be pretty straightforward.
    Simplest would be simply to:
    start with an empty destination
    Compare each file in the source(s) tree(s) on each file in the destination by filesize in bytes, then if there is a match there, do a file compare using cmp. Copy it to the destination it if it doesn't match, otherwise move to the next file. Seems like something that would take 10-20 lines of command line script tops. Its a one
    - I wrote one myself (Score:4, Insightful)
      
      by tepples ( 727027 ) writes: <tepples.gmail@com> on Thursday January 23, 2014 @10:14PM (#46053045) Homepage Journal
      
      What I did in my deduplicator written in Python [pineight.com] was group the files by their and reject any file with a unique size. Then I'd hash the first few kilobytes of each file with MD5 (it's just a spot check so speed is more valuable than security against intentional collisions) and reject any file with a unique first few kilobytes. Finally I'd hash the whole file with a more secure hash.
      
      Parent Share
      twitter facebook
- Write a quick script. (Score:5, Informative)
  
  by khasim ( 1285 ) writes: <brandioch.conner@gmail.com> on Thursday January 23, 2014 @06:40PM (#46051377)
  
  If they are identical then their hashes should be identical.
  So write a script that generates hashes for each of them and checks for duplicate hashes.
  
  Parent Share
  twitter facebook
- Re: (Score:2)
  
  by thorgil ( 455385 ) writes:
  
  or python, using 10 lines.
- findimagedupes in Debian (Score:5, Interesting)
  
  by nemesisrocks ( 1464705 ) writes: on Thursday January 23, 2014 @06:47PM (#46051473) Homepage
  
  whatever you decide on, it could probably be done in a hundred lines of perl
  Funny you mention perl.
  There's a tool written in perl called "findimagedupes" in Debian [debian.org]. Pretty awesome tool for large image collections, because it could identify duplicates even if they had been resized, or messed with a little (e.g. adding logos, etc). Point it at a directory, and it'll find all the dupes for you.
  
  Parent Share
  twitter facebook
  - Re:findimagedupes in Debian (Score:4, Interesting)
    
    by msobkow ( 48369 ) writes: on Thursday January 23, 2014 @07:12PM (#46051739) Homepage Journal
    
    Why do I have this sneaking suspicion it runs in exponential time, varying as the size of the data set...
    From what this user is talking about (multiple drives full of images), they may well have reached the point where it is impossible to sort out the dupes without one hell of a heavy hitting cluster to do the comparisons and sorting.
    
    Parent Share
    twitter facebook
    - Re: (Score:3)
      
      by complete loony ( 663508 ) writes:
      
      What you want, is a first pass which identifies some interesting points in the image. Similar to microsoft's photosynth. Then you can compare this greatly simplified data for similar sets of points. Allowing you to ignore the effects of scaling or cropping.
      A straight hash won't identify similarities between images, and would be totally confused by compression artefacts.
      - SIFT is patented (Score:3)
        
        by tepples ( 727027 ) writes:
        
        What you want, is a first pass which identifies some interesting points in the image.
        There is an algorithm for that called SIFT (scale-invariant feature transform), but it's patented and apparently unavailable for licensing in free software.
    - Re: (Score:3, Informative)
      
      by nemesisrocks ( 1464705 ) writes:
      
      Why do I have this sneaking suspicion it runs in exponential time, varying as the size of the data set...
      It's actually pretty nifty how findimagedupes works. It creates a 16x16 thumbnail of each image (it's a little more complicated than that -- read more on the manpage [jhnc.org]), and uses this as a fingerprint. Fingerprints are then compared using an algorithm that looks like O(n^2).
      I doubt the difference between O(2^n) and O(n^2) would make a huge impact anyway: the biggest bottleneck is going to be disk read and seek time, not comparing fingerprints. It's akin to running compression on a filesystem: read speed is
      - Re: (Score:2)
        
        by sexconker ( 1179573 ) writes:
        
        Why do I have this sneaking suspicion it runs in exponential time, varying as the size of the data set...
        It's actually pretty nifty how findimagedupes works. It creates a 16x16 thumbnail of each image (it's a little more complicated than that -- read more on the manpage [jhnc.org]), and uses this as a fingerprint. Fingerprints are then compared using an algorithm that looks like O(n^2).
        I doubt the difference between O(2^n) and O(n^2) would make a huge impact anyway: the biggest bottleneck is going to be disk read and seek time, not comparing fingerprints. It's akin to running compression on a filesystem: read speed is an order of magnitude slower than the compression.
        O(n^2) vs O(2^n) is a huge difference eve for very small datasets (hundreds of pictures).
        You have to read all the images and generate the hashes, but that's Theta(n).
        Comparing one hash to every other has is Theta(n^2).
        If the hashes are small enough to all live in memory (or enough of them that you can intelligently juggle your comparisons without having to wait on the disk too much), then you'll be fine for tens of thousands of pictures.
        But photographers can take thousands of pictures per shoot, hundreds of
        
        Re: (Score:2)
        
        by TsuruchiBrian ( 2731979 ) writes:
        
        O(n^2) vs O(2^n) is a huge difference eve for very small datasets (hundreds of pictures).
        Hopefully it's actually something like O(p * 2^n) vs O(p * n^2) where n is the thumbnail size and p is the number of images.
    - Re: (Score:2)
      
      by safetyinnumbers ( 1770570 ) writes:
      
      I've used findimagedups. IIRC, it rescales each image to a standard size (64x64 or something) then filters and normalizes it down to a 1-bit-depth image.
      
      It then builds a database of these 'hashes'/'signatures' and can output a list of files that have a threshold of bits in common.
      
      That's how it can ignore small changes, it loses most detail and then can ignore a threshold of differences.
      
      It would fail if an image was cropped or rotated, for instance. It could handle picture orientation it it was modified
  - Re: (Score:2)
    
    by buchner.johannes ( 1139593 ) writes:
    
    The real answer is to make a hash over the image content. The ImageHash python package [python.org] comes with a program to discover duplicate images. It is more powerful than what is needed here: It can find images that looks similar (different format, resolution, etc.).
    I think the ImageHash package uses a better algorithm than findimagedupes (description here [github.com], actually you can choose between several), and is shorter in code.
- Why use perl? (Score:2)
  
  by Arker ( 91948 ) writes:
  
  Why use perl when a bash script will do?
- Re: (Score:2)
  
  by A nonymous Coward ( 7548 ) writes:
  
  I wrote a file deduplicator. Build a table of file size ---> name. If two files have the same size, run md5sum on them or just use cmp -s. It's a trivial program.
  But if you have photos which you consider duplicates but which have different sizes or checksums, then it's a visual gig and lots of boring tedious work,
- - One liner (Score:2, Offtopic)
    
    by tepples ( 727027 ) writes:
    
    Here's Gus [youtube.com], and here's your one liner:
    wget http://pineight.com/pc/dedupe.py.zip; unzip dedupe.py.zip; python dedupe.py
fdupes -rd (Score:5, Informative)

by Anonymous Coward writes: on Thursday January 23, 2014 @06:37PM (#46051337)

I've had the same problem as I stupidly try to make the world a better place by renaming or putting them in sub-directories.
fdupes will do a bit-wise comparison. -r = recurse. -d = delete.
fdupes would be the fastest way.

Share
twitter facebook
- Re: (Score:2)
  
  by Xolotl ( 675282 ) writes:
  
  fdupes is excellent and I second that (please mod the parent up!)
  The only drawback to fdupes is that the files must be identical, so two identical images but where one has some additional metadata e.g. inside the EXIF won't be deduplicated.
fslint (Score:3, Informative)

by innocent_white_lamb ( 151825 ) writes: on Thursday January 23, 2014 @06:40PM (#46051385)

fslint is a toolkit to find all redundant disk usage (duplicate files
for e.g.). It includes a GUI as well as a command line interface.
http://www.pixelbeat.org/fslin... [pixelbeat.org]

Share
twitter facebook
Fuzzy Hashing (Score:3)

by Oceanplexian ( 807998 ) writes: on Thursday January 23, 2014 @06:41PM (#46051399) Homepage

I would try running all the files through ssdeep [sourceforge.net].

You could script it to find a certain % match that you're satisfied with. Only catch to this is that it could be a very time-intensive process to scan a huge number of files. Exif might be a faster option which could be cobbled together in Perl pretty quickly, but that wouldn't catch dupes that had their exif stripped or have slight differences due to post-processing.

Share
twitter facebook
- Re: (Score:2)
  
  by stenvar ( 2789879 ) writes:
  
  That's useless for many kinds of compressed files, like images and audio.
I think I wrote one of these. (Score:2)

by paradxum ( 67051 ) writes:

I'm pretty sure I wrote something like this in perl/bash in like 20 minutes.
1 - do an md5sum of each file and toss it in a file
2 - sort
3 - perl (or you language of choice) program, basicly:
sum = "a"
newsum = next line
if newsum == sum delete file
else sum = newsum
- Re: (Score:3, Insightful)
  
  by Cummy ( 2900029 ) writes:
  
  Why do people on this site believe that everyone who is interested in tech is a programmer? This"just write it" is foolishness of the highest order. For many of us non-programers "just write it" is like telling some one living in Florida to "just build a plane and fly to that concert in Vienna after work tomorrow". If that seems like a ridiculous ask, then so is asking a person without the skill to write a script for that. So it can be done in 20 minutes, use that 20 minutes to help someone by writing the p
  - Re: (Score:2, Informative)
    
    by VortexCortex ( 1117377 ) writes:
    
    This"just write it" is foolishness of the highest order. For many of us non-programers "just write it" is like telling some one living in Florida to "just build a plane and fly to that concert in Vienna after work tomorrow".
    Computer literacy used to involve typing a terminal command. All the PC folks in the 80's and 90's did it. I can't be fucked to care if folks are too stupid to learn how to use their computers. If you can't "write it yourself" in this instance, which amounts to running an operation across a set of files, then sorting the result, then you do not know how to use a computer. You know how to use some applications and input devices. It's a big difference.
    This is part of the reason Windows is successful: think of a problem, there is likely program out there that solves it already, and if there isn't one someone will soon write one
    Which is why it's a nightmare to administer windows.
    - Re: (Score:2)
      
      by kesuki ( 321456 ) writes:
      
      "some of those would have adware, some would have malware. At least the ones in the FLOSS repositories wouldn't."
      repositories are a layer of security. yet malware repos are widely promoted on some websites of so called help doing things like playing back movies configuring firewalls etc, also trusted repos are in fact compromized sometimes like http://www.techrepublic.com/blog/linux-and-open-source/linux-repository-hit-by-malware-attack/2989/ [techrepublic.com]
      i remember one site no link as i forgot where i found it, was a gu
    - Re: (Score:2)
      
      by tftp ( 111690 ) writes:
      
      Computer literacy used to involve typing a terminal command. All the PC folks in the 80's and 90's did it.
      Yes, all the 0.07% of the population. The rest was in fear of the computer, for a good reason. Back then computers were not very useful unless you were a programmer, or your specific need was covered (MS Word, Excel, WP.)
      If you can't "write it yourself" in this instance, which amounts to running an operation across a set of files, then sorting the result, then you do not know how to use a computer
    - Re: (Score:2)
      
      by DMUTPeregrine ( 612791 ) writes:
      
      Image deduplication is a much harder problem than you (and many of the posters here) seem to think. It's certainly not terrifically hard, but it's not as simple as comparing file size and content hash.
      What if the image was resized?
      What if a watermark was added?
      What if the image was saved in a different format, eg PNG and JPEG?
      What if the image had its lighting curves adjusted?
      etc.
      You may still want to find these duplicates, but size/hash methods will fail.
      The findimagedupes tool works well in most of these
  - Re: (Score:2)
    
    by jedidiah ( 1196 ) writes:
    
    > This is part of the reason Windows is successful: think of a problem, there is likely program out there that solves it already
    No. Windows is successful because it's the followup to a product that already owned the market: MS-DOS.
    Now you want to talk about nasty user hostile shit, MS-DOS has "script it yourself" Unix beat by a wide margin.
  - Re: (Score:2)
    
    by turbidostato ( 878842 ) writes:
    
    "Why do people on this site believe that everyone who is interested in tech is a programmer?"
    A Bash one-liner or even a 100-line script doesn't make you a programmer.
    On the other hand, if asked "how I do move this car from here to a town 100 miles away" the answer is "the most cheap and efficient way is for you to drive it there" and whinning "why do people on this site believe that I should learn to drive" is just that: whinning.
    Oh, and learning to drive will help you a lot of times, not, only on this task
- Re: (Score:2)
  
  by BetterThanCaesar ( 625636 ) writes:
  Step one is to compare file sizes. Since file sizes need to be identical in order for the files to be identical, and file sizes are already calculated and stored as metadata, this will greatly reduce the time needed.
  
  List all files with their respective sizes.
  Sort
  For each consecutive file in the list with the same size as the previous file, compare the MD5 hashes.
- - Re: (Score:2)
    
    by ihtoit ( 3393327 ) writes:
    
    or, in DOS/Win7 CLI: "dir /s /os >filelist" returns the entire tree contents from the current directory sorted in ascending file size order to the text file "filelist". 10,070 files/6359 folders (random tree search on my hard drive) took 16 seconds.
    Import tab-delimited list into your favourite spreadsheet.
    Do what you need to do.
Geeqie (Score:5, Informative)

by zakkie ( 170306 ) writes: on Thursday January 23, 2014 @06:42PM (#46051425) Homepage

Works excellently for this.

Share
twitter facebook
- Re: (Score:3)
  
  by subreality ( 157447 ) writes:
  
  +1. The reason: it has a fuzzy-matching dedupe feature. It'll crawl all your images, then show them grouped by similarity and let you choose which ones to delete. It seems to do a pretty good job with recompressed or slightly cropped images.
  Open it up, right click a directory, Find Duplicates Recursive.
  fdupes is also good to weed out the bit-for-bit identical files first.
Don't reinvent the wheel: fdupes, md5deep, gqview (Score:3)

by Jody Bruchon ( 3404363 ) writes: on Thursday January 23, 2014 @06:43PM (#46051443)

fdupes will work and is faster than writing a homemade script for the job. The big problem is "across multiple machines" which might require use of, say, sshfs to bring all the machines' data remotely onto one temporarily for duplicate scanning. fdupes checks sizes first, and only then starts trying to hash anything, so obvious non-duplicates don't get hashed at all. Significant time savings. Across multiple machines, another option is using md5deep to build recursive hash lists.

The only tool so far that I've used for image duplicate finding that checks CONTENT rather than bitwise 1:1 duplicate checking is GQview on Linux. It works fairly well, though it's a bit dated by now it's still a good viewer program. Add -D_FILE_OFFSET_BITS=64 to the CFLAGS if you compile it yourself on a 32-bit machine today though.

Share
twitter facebook
- Re:Don't reinvent the wheel: fdupes, md5deep, gqvi (Score:5, Informative)
  
  by rwa2 ( 4391 ) * writes: on Thursday January 23, 2014 @07:04PM (#46051655) Homepage Journal
  
  Yeah, this Ask Slashdot should really be about teaching people how to search for packages in aptitude or whatever your package manager is...
  Here are some others:
  findimagedupes
  Finds visually similar or duplicate images
  findimagedupes is a commandline utility which performs a rough "visual diff" to
  two images. This allows you to compare two images or a whole tree of images and
  determine if any are similar or identical. On common image types,
  findimagedupes seems to be around 98% accurate.
  Homepage: http://www.jhnc.org/findimaged... [jhnc.org]
  fslint :
  kleansweep :
  File cleaner for KDE
  KleanSweep allows you to reclaim disk space by finding unneeded files. It can
  search for files basing on several criterias; you can seek for:
  * empty files
  * empty directories
  * backup files
  * broken symbolic links
  * broken executables (executables with missing libraries)
  * dead menu entries (.desktop files pointing to non-existing executables)
  * duplicated files ...
  Homepage: http://linux.bydg.org/~yogin/ [bydg.org]
  komparator :
  directories comparator for KDE
  Komparator is an application that searches and synchronizes two directories. It
  discovers duplicate, newer or missing files and empty folders. It works on
  local and network or kioslave protocol folders.
  Homepage: http://komparator.sourceforge.... [sourceforge.net]
  backuppc : (just in case this was related to your intended use case for some reason)
  high-performance, enterprise-grade system for backing up PCs
  BackupPC is disk based and not tape based. This particularity allows features #
  not found in any other backup solution:
  * Clever pooling scheme minimizes disk storage and disk I/O. Identical files
  across multiple backups of the same or different PC are stored only once
  resulting in substantial savings in disk storage and disk writes. Also known
  as "data deduplication".
  I bet if you throw Picasa at your combined images directory, it might have some kind of "similar image" detection too, particularly since its sorts everything by exif timestamp.
  That said, I've never had to use any of this stuff, because my habit was to rename my camera image dumps to a timestamped directory (e.g. 20140123_DCIM ) to begin with, and upload it to its final resting place on my main file server immediately, so I know all other copies I encounter on other household machines are redundant.
  
  Parent Share
  twitter facebook
  - Re: (Score:2)
    
    by rgbe ( 310525 ) writes:
    
    I use fslint. It does more than just find duplicate images.
- - Re: (Score:2)
    
    by Jody Bruchon ( 3404363 ) writes:
    
    I always use: fdupes -nrSd *
Anti-Twin (Score:2)

by MatthiasF ( 1853064 ) writes:

Requires WINE but should work fine on Linux.

http://www.anti-twin.com/ [anti-twin.com]
fdupes (Score:2)

by ender8282 ( 1233032 ) writes:

Under *buntu
sudo apt-get install fdupes
man fdupes:
fdupes - finds duplicate files in a given set of directories
Photo managers (Score:3)

by MrEricSir ( 398214 ) writes: on Thursday January 23, 2014 @06:53PM (#46051515) Homepage

As a former Shotwell dev I might point out that most photo manager apps can do this.

Share
twitter facebook
- Re: (Score:2)
  
  by ihtoit ( 3393327 ) writes:
  
  oh? Like, say, Irfanview? Not that I've ever had the urge to go looking...
Consider git-annex (Score:2)

by dondelelcaro ( 81997 ) writes:

In addition to the other methods (ZFS, fdupes, etc), I personally use git-annex.
Git annex can even run on android, so I keep at least two copies of my photos spread throughout all of my computers and removable devices.
DigicaMerge (Score:2)

by jalet ( 36114 ) writes:

See http://www.librelogiciel.com/s... [librelogiciel.com]
I haven't modified nor used it in years (I don't own a digital camera anymore...) so I ignore if it still works with up to date libraries, but its "--nodupes" option does what you want, and its numerous other command line options (http://www.librelogiciel.com/software/DigicaMerge/commandline) help you solve the main problems of managing directories full of pictures.
It's Free Software, licensed under the GNU GPL of the Free Software Foundation.
Hoping this helps
Quick shell script using exiftool (Score:5, Interesting)

by Khopesh ( 112447 ) writes: on Thursday January 23, 2014 @07:07PM (#46051689) Homepage Journal

This will help find exact matches by exif data. It will not find near-matches unless they have the same exif data. If you want that, good luck. Geeqie [sourceforge.net] has a find-similar command, but it's only so good (image search is hard!). Apparently there's also a findimagedupes tool available, see comments above (I wrote this before seeing that and had assumed apt-cache search had already been exhausted).
I would write a script that runs exiftool on each file you want to test. Remove the items that refer to timestamp, file name, path, etc. make a md5.
Something like this exif_hash.sh (sorry, slashdot eats whitespace so this is not indented):
#!/bin/sh for image in "$@"; do echo "`exiftool |grep -ve 20..:..: -e 19..:..: -e File -e Directory |md5sum` $image" done
And then run:
find [list of paths] -typef -print0 |xargs -0 exif_hash.sh |sort > output
If you have a really large list of images, do not run this through sort. Just pipe it into your output file and sort it later. It's possible that the sort utility can't deal with the size of the list (you can work around this by using grep '^[0-8]' output |sort >output-1 and grep -v '^[0-8]' output |sort >output-2, then cat output-1 output-2 > output.sorted or thereabouts; you may need more than two passes).
There are other things you can do to display these, e.g. awk '{print $1}' output |uniq -c |sort -n to rank them by hash.
On Debian, exiftool is part of the libimage-exiftool-perl package. If you know perl, you can write this with far more precision (I figured this would be an easier explanation for non-coders).

Share
twitter facebook
Use fslint or fslint-gui (Score:2)

by Y2K is bogus ( 7647 ) writes:

fslint is the tool you are looking for.
File system lint FSlint (Score:2)

by Script Cat ( 832717 ) writes:

This will find duplicate files in the general sense:
http://packages.debian.org/sid... [debian.org]
sigh.... (Score:2)

by djsmiley ( 752149 ) writes:

fdupes.
Done :)
http://en.wikipedia.org/wiki/List_of_duplicate_fil (Score:2, Informative)

by Anonymous Coward writes:

http://en.wikipedia.org/wiki/List_of_duplicate_file_finders
Follow the FBI's lead (Score:2)

by Mike Buddha ( 10734 ) writes:

They use a database of hashes of kiddie porn to identify offending material without forcing anyone to look at the stuff. Seems like it would be ready to use Perl to crawl your filesystem and identify dupes.
- Re: (Score:2)
  
  by bluefoxlucid ( 723572 ) writes:
  
  Wrong. OSI explained to us that a person is "victimized" again every time someone looks at an image of them in child porn, and the hash of images is used so that they don't feel that pang in their stomach when an FBI investigator double-clicks 0FEDCABE1.jpg.
Did this years ago (Score:2)

by Enry ( 630 ) writes:

I wrote a shell script that looked at the datestamp for each photo and then moved it to a directory called YYYY/MM/DD (so 2000/12/25). I'm going off the assumption that there weren't two photos taken on the same day with the same filenames. So far that seems to be working.
My solution (Score:3)

by alantus ( 882150 ) writes: on Thursday January 23, 2014 @09:08PM (#46052645)

#!/usr/bin/perl # $Id: findDups.pl 218 2014-01-24 01:04:52Z alan $ # # Find duplicate files: for files of the same size compares md5 of successive chunks until they differ # use strict; use warnings; use Digest::MD5 qw(md5 md5_hex md5_base64); use Fcntl; use Cwd qw(realpath); my $BUFFSIZE = 131072; # compare these many bytes at a time for files of same size my %fileByName; # all files, name => size my %fileBySize; # all files, size => [fname1, fname2, ...] my %fileByHash; # only with duplicates, hash => [fname1, fname2, ...] if ($#ARGV < 0) { print "Syntax: findDups.pl <file|dir> [...]\n"; exit; } # treat params as files or dirs foreach my $arg (@ARGV) { $arg = realpath($arg); if (-d $arg) { addDir($arg); } else { addFile($arg); } } # get filesize after adding dirs, to avoid more than one stat() per file in case of symlinks, duplicate dirs, etc foreach my $fname (keys %fileByName) { $fileByName{$fname} = -s $fname; } # build hash of filesize => [ filename1, filename2, ...] foreach my $fname (keys %fileByName) { push(@{$fileBySize{$fileByName{$fname}}}, $fname); } # for files of the same size: compare md5 of each successive chunk until there is a difference foreach my $size (keys %fileBySize) { next if $#{$fileBySize{$size}} < 1; # skip filesizes array with just one file my %checking; foreach my $fname (@{$fileBySize{$size}}) { if (sysopen my $FH, $fname, O_RDONLY) { $checking{$fname}{fh} = $FH; # file handle $checking{$fname}{md5} = Digest::MD5->new; # md5 object } else { warn "Error opening $fname: $!"; } } my $read=0; while (($read < $size) && (keys %checking > 0)) { my $r; foreach my $fname (keys %checking) { # read buffer and update md5 my $buffer; $r = sysread($checking{$fname}{fh}, $buffer, $BUFFSIZE); if (! defined($r)) { warn "Error reading from $fname: $!"; close $checking{$fname}{fh}; delete $checking{$fname}; } else { $checking{$fname}{md5}->add($buffer); } } $read += $r; FILE1: foreach my $fname1 (keys %checking) { # remove files without dups my $duplicate = 0; FILE2: foreach my $fname2 (keys %checking) { # compare to each checking file next if $fname1 eq $fname2; if ($checking{$fname1}{md5}->clone->digest eq $checking{$fname2}{md5}->clone->digest) { $duplicate = 1; next FILE1; # skip to next file } }
Read the rest of this comment...

Share
twitter facebook
- Re: (Score:2)
  
  by MightyYar ( 622222 ) writes:
  
  I'm replying to you because one of my two solutions has the same name :)
  https://github.com/caluml/find... [github.com]
  I have another solution, written in Python. It is pretty efficient but very limited. It walks two folders, sorting files into bins according to size. If any bins match between the two folders, it does a hash once on each file in each bin and then compares them. That way, the files are not read repeatedly and hashes are only done if necessary. It could be sped up further by only doing partial file matches
  - Re: (Score:2)
    
    by alantus ( 882150 ) writes:
    
    My script is can save some I/O and cpu cycles, but has to keep more files open at a time (could run out of filedescriptors in extreme cases).
    The script you describe must be shorter and easier to understand, but I would only use it for smaller files, where discarding duplicates before reading the whole file doesn't make a big difference.
    The next step is to create some UI that allows deleting duplicates easily.
    - Re: (Score:2)
      
      by MightyYar ( 622222 ) writes:
      
      I was using it where one of the directories was mounted over the network, so I didn't want to read the files unless I had to... a directory listing is a pretty cheap operation. One problem that I ran into was that Macs can add resource forks to some files, so if one of the folders was on a Mac you could have weird file sizes. For photos and pdfs and such, the resource fork is disposable so it was driving me nuts... some "unique" files were not unique at all.
- Re: (Score:2)
  
  by fisted ( 2295862 ) writes:
  
  what a long and convoluted pain.
  consider the POSIX shell variant [slashdot.org]
  - Re: (Score:2)
    
    by alantus ( 882150 ) writes:
    
    It is long and convoluted in the same way that an airplane is long convoluted compared with a bicycle ;)
digiKam is what you want. (Score:2)

by Lurching ( 1242238 ) writes:

DigiKam will do everything you want. It works by creating hashes. You set your level of similarity and digiKam will find the files. It can handle multiple locations, and even "albums" on removable media. If you have a lot of images it can be slow, but if you limit any particular search you can greatly improve performance. It is available for Linux and Windows both.
Obligatory (Score:2)

by multimediavt ( 965608 ) writes:

Google: image duplicate finder [google.com]
Sorry, no app for that (Score:2)

by fisted ( 2295862 ) writes:

However it's fairly easy to do with a unix shell and only standard tools...
Something along the lines of:find /path/to/pics -type f -print0 | xargs -0 md5 | sort | while read hash r; do if ! [ "$lasthash" = "$hash" ]; then echo "$rest" fi lasthash="$hash" done | while read dupe; do echo rm -- "$dupe" done
That would, once the echo is removed, delete all files that are dupes (except one of each).
Typed it right into the /. comment box, though, so it's probably wrong somewhere. Only intended t
If all else fails.... (Score:2)

by ewieling ( 90662 ) writes:

I have used http://www.duplicate-finder.com/photo.html (MS Windows only) because I could not find anything on Linux with similar functionality. It does work very well, it can find similar, but not identical images, such as the same picture saved in a different format or with different compression settings. It tends to slow down when working directories with multiple thousands of images.
fdupes (Score:2)

by the_other_chewey ( 1119125 ) writes:

No need to roll your own. If the redundant files are identical (the
problem as stated lets me assume that), use fdupes.

"Searches the given path for duplicate files. Such files are found by
comparing file sizes and MD5 signatures, followed by a byte-by-byte
comparison."

It's fast, accurate, and generates a list of duplicate files to handle
yourself - or automatically deletes all except the first of duplicate
files found.

I've used it myself with tens of thousands of pictures to exactly do
what the OP wan
- Re: (Score:3, Informative)
  
  by Anonymous Coward writes:
  
  Have you read the zfs documentation? Setting zfs dedup does not remove duplicate files (per OP request, since there are eleven different file systems), but removes redundant storage for files which are duplicates. In other words, if you have the exact same picture in three different folders/subdirectories on the same file system, zfs will only allocate storage for one copy of the data, and point the three file entries to that one copy. Similar to how hard links work in ext2 and friends.
  - Count hard links (Score:2)
    
    by tepples ( 727027 ) writes:
    
    if you have the exact same picture in three different folders/subdirectories on the same file system, zfs will only allocate storage for one copy of the data, and point the three file entries to that one copy. Similar to how hard links work in ext2 and friends.
    I think the idea is to use some utility to query ZFS and find files that ZFS has deduplicated. Similar to how one can count hard links to each inode in ext2 and friends.
- - Re: (Score:2)
    
    by ihtoit ( 3393327 ) writes:
    
    I don't know what the price of RAM is doing these days, but I did buy a 4GB upgrade for my laptop last September, cost £19 for the module. ...oh here we go: 8GB Integral PC3-12800 desktop is going for £55 at PC World Retail. 32GB bankfiller would hit £220, you could beat that with a little shopping around I'm sure.
    Laptop SODIMM: same price.
    Seems a bit high to me...
- Re: (Score:2)
  
  by mlts ( 1038732 ) writes:
  
  One can use NTFS and turn on deduplication, then manually fire off the background "optimization" task. It isn't a "presto!", but after a good long while, it will find and merge duplicate files, or duplicate blocks of different files.
  Caveat: This is only Windows 8 and newer, or Windows Server 2012 and newer.
  - Re: (Score:2)
    
    by ericloewe ( 2129490 ) writes:
    
    It's far from ideal. You do get (most of) the storage benefits, but it doesn't help with organisation.
    Filesystem-level deduplication is meant to save space from blocks that several files use (several full image backups will undoubtedly share a large portion of files that belong to the OS and common applications, for instance).
- Re: (Score:2)
  
  by DiSKiLLeR ( 17651 ) writes:
  
  whalah is not a word.... seriously. wtf people. It's voilÃ.
  As for ZFS, sure, I recommend ZFS. But I'm not sure how i feel about ZFS's dedupe. Besides, the multiple files are still there even if it no longer takes up extra space.
  You'd want a script that finds dupes by hash but that will only detect images that are identical copies, not 'simliar' say an image has been cropped or retouched or resized. A program that can find image dupes even with changes like tineye.com would be ideal. Anything like that
  - Re: (Score:2)
    
    by wonkey_monkey ( 2592601 ) writes:
    
    seriously. wtf people. It's voilÃ.
    Well, you tried.
    Quoted for funniness.
    - Re: (Score:3)
      
      by Zontar The Mindless ( 9002 ) writes:
      
      Et voilà! L'UTF, c'est votre ami.
      - Re: (Score:2)
        
        by wonkey_monkey ( 2592601 ) writes:
        
        Zut alors! Mais il n'est pas UTF, maintenant:
        Et voilà! L'UTF, c'est votre ami.
- Re:Seriously? (Score:4, Interesting)
  
  by postbigbang ( 761081 ) writes: on Thursday January 23, 2014 @06:56PM (#46051557)
  
  Yeah. Thanks. It's a simple question. So far, I've seen scripting suggestions, which might be useful. I'm a nerd, but not wanting to do much code because I'm really rusty at it. Instead, I'm amazed that no one runs into this problem and has built an app that does this. That's all I'm looking for: consolidation.
  
  Parent Share
  twitter facebook
  - Re:Seriously? (Score:5, Informative)
    
    by zakkie ( 170306 ) writes: on Thursday January 23, 2014 @07:14PM (#46051753) Homepage
    
    See my earlier contrivution: geeqie. It will even scan for image similarity not just rudimentary hashing. Someone else mentioned gqview & that it was out of date - geeqie is what gqview became.
    
    Parent Share
    twitter facebook
- Re: (Score:3)
  
  by Cley Faye ( 1123605 ) writes:
  
  When you're talking about duplicate content, you can't limit yourself to "just hashes".
  In this case, with pictures, just opening one and saving it again might produce a different hash, just by recompression or changing the file format. How does all these "just check the hashes" solution works for that?
  Finding duplicates image is not that easy.
- Re: (Score:2)
  
  by bluefoxlucid ( 723572 ) writes:
  
  There have been interesting responses. Tools that find substantially-similar (read: the same image lossy encoded, resized, and rotated) images, produce hashes that can be compared to find out "How similar" two images are, and so on.
- - - Re: (Score:2)
      
      by jedidiah ( 1196 ) writes:
      
      When you are destroying data it is far better to err on the side of caution. In this case the solution that "sucks" is more appropriate because it's safer.
      Manually manipulating all the data first kind of totally negates the ponit of trying to automate it.
- - Re:You don't need software for this (Score:4, Informative)
    
    by unrtst ( 777550 ) writes: on Thursday January 23, 2014 @07:15PM (#46051761)
    
    Adjust as needed:
    find ./ -type f -iname '*.jpg' -exec md5sum {} \; > image_md5.txt
    cat image_md5.txt | cut -d" " -f1 | sort | uniq -d | while read md5; do grep $md5 image_md5.txt; done
    ...though I think something more sophisticated than an md5sum would be wise (exif data could have been changed but nothing else, and you'd miss that dupe).
    
    Parent Share
    twitter facebook
    - Re: (Score:2)
      
      by fisted ( 2295862 ) writes:
      
      pff, two commands. amateur... [slashdot.org]
    - Re: (Score:2)
      
      by TCM ( 130219 ) writes:
      
      How about only hashing files with identical file sizes?
- That's not a backup (Score:2)
  
  by dbIII ( 701233 ) writes:
  
  If it's attached to a live system and is writeable then it's not a backup yet, it's just a copy.
  A web hosting business near me went under because they made that mistake and lost all of their hosted data in a single incident.
  Copies on instantly available disk are often a lot more convenient than detached disks, tapes or whatever, but if that's all you've got there are plenty of ways to lose the lot.
- Re: (Score:2)
  
  by HiThere ( 15173 ) writes:
  
  You can't really depend on hashed to not put different keys into the same bin. Given md5sum, or some such, collisions won't be frequent, but they will happen.
  This may not matter. What's the cost of missing an image or two? If it's not large, then the small probability of a collision may be good enough.
  Exif is based on metadata, so the probability of an improper collision is probably less than for, say, md5sum. It's also mor e likely to recognize slightly different images as being the same. This is pro
  - Re:Hashes should be relatively easy (Score:4, Informative)
    
    by TsuruchiBrian ( 2731979 ) writes: on Thursday January 23, 2014 @10:25PM (#46053113)
    
    md5 is a 128bit hash. Assuming your not trying to create collisions, the odds of you getting a collision in n files is:
    p = 1 - (2^128)! / ((2^128 - n)! * (2^128)^n)
    This is an expression that starts at 0 and gradually goes to 1 as n goes to infinity.
    These numbers are so big, I have no idea how to even solve for n to get something like p = 0.0001%, without using a bignumber package, but I imagine n would have to be *REALLY* big in order to get a p significantly above 0
    
    Parent Share
    twitter facebook
    - Re: (Score:2)
      
      by wonkey_monkey ( 2592601 ) writes:
      
      And the probability of a collision and the files being the same size (the first thing to check when looking for dupes) is even smaller.
      And then you could pick a random section from both files and run an md5sum on that, squaring your probability of a collision. Probably. I'm just guessing.
    - Re: (Score:2)
      
      by DMUTPeregrine ( 612791 ) writes:
      
      An "easy" way to find a guaranteed collision is to simply create more files than 2^(bits in hash). So a bit over 3.4 x 10^38 files for MD5 and you'll get collisions on all subsequent files.
      This should be obvious, but just in case:
      If you can ask an oracle for a file with a hash not in a list of hashes, then you can keep adding the new files to the list. An n-bit hash can have 2^n unique values, so after 2^n files created no new value can possibly be added to the list.
      - Re: (Score:2)
        
        by TsuruchiBrian ( 2731979 ) writes:
        
        I wouldn't call creating 2^128 files easy. Also, you are likely to get collisions way before you get close to reaching 2^128 files.
  - Re: (Score:3)
    
    by TsuruchiBrian ( 2731979 ) writes:
    
    OK so I wrote a quick little python script (I just remember python has bignumber support) to do it on a smaller numbers.
    If we assume md5 was only 64 bits, even with 100 million files, your chances fo hitting an md5 collision are 0.03% (i.e. a 0.0003 chance).
    When you bump up the md5 to 128 bits 100 million files has a 0.000000 (rounded to 6 decimal places) chance of happening.
    maybe I will let my program run overnight and see how far it gets. It's programmed to count how many files it will take before the pr
  - Re: (Score:2)
    
    by TsuruchiBrian ( 2731979 ) writes:
    
    I should also point out that the convention for UUIDs (universally unique identifiers) is also 128bits. Meaning that the chances of randomly getting the same 128 bit number is so low that experts have determined it's ok to just assume it never happens for purposes of computing.
    http://en.wikipedia.org/wiki/Universally_unique_identifier
    BTW I am at 500 million files, and the odds of getting a 128bit md5 collision are still 0.000000
  - Re: (Score:2)
    
    by TsuruchiBrian ( 2731979 ) writes:
    
    I should also point out that when I said: " Assuming your not trying to create collisions...", I was referring to the fact that md5 has been compromised. My point is that 128 bits is enough bits to ensure you will not get a random collision due to chance if you are using a good hashing algorithm.

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

write it yourself (Score:2, Insightful)

Re:write it yourself (Score:5, Informative)

Re:write it yourself (Score:5, Informative)

Re: (Score:3, Informative)

Visipics is excellent. (Score:3, Informative)

Re: (Score:3)

Re:write it yourself (Score:4, Interesting)

Re: (Score:2)

General case (Score:5, Informative)

Re:General case (Score:4, Informative)

Re:General case (Score:4, Funny)

Re: (Score:2)

Re: (Score:2)

I wrote one myself (Score:4, Insightful)

Write a quick script. (Score:5, Informative)

Re: (Score:2)

findimagedupes in Debian (Score:5, Interesting)

Re:findimagedupes in Debian (Score:4, Interesting)

Re: (Score:3)

SIFT is patented (Score:3)

Re: (Score:3, Informative)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Why use perl? (Score:2)

Re: (Score:2)

One liner (Score:2, Offtopic)

fdupes -rd (Score:5, Informative)

Re: (Score:2)

fslint (Score:3, Informative)

Fuzzy Hashing (Score:3)

Re: (Score:2)

I think I wrote one of these. (Score:2)

Re: (Score:3, Insightful)

Re: (Score:2, Informative)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Geeqie (Score:5, Informative)

Re: (Score:3)

Don't reinvent the wheel: fdupes, md5deep, gqview (Score:3)

Re:Don't reinvent the wheel: fdupes, md5deep, gqvi (Score:5, Informative)

Re: (Score:2)

Re: (Score:2)

Anti-Twin (Score:2)

fdupes (Score:2)

Photo managers (Score:3)

Re: (Score:2)

Consider git-annex (Score:2)

DigicaMerge (Score:2)

Quick shell script using exiftool (Score:5, Interesting)

Use fslint or fslint-gui (Score:2)

File system lint FSlint (Score:2)

sigh.... (Score:2)

http://en.wikipedia.org/wiki/List_of_duplicate_fil (Score:2, Informative)

Follow the FBI's lead (Score:2)

Re: (Score:2)

Did this years ago (Score:2)

My solution (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

digiKam is what you want. (Score:2)

Obligatory (Score:2)

Sorry, no app for that (Score:2)

If all else fails.... (Score:2)

fdupes (Score:2)

Re: (Score:3, Informative)

Count hard links (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)