
Open Source Deduplication For Linux With Opendedup

Posted by timothy
from the its-missing-apostrophes dept.
tazzbit writes "The storage vendors have been crowing about data deduplication technology for some time now, but a new open source project, Opendedup, brings it to Linux and its hypervisors — KVM, Xen and VMware. The new deduplication-based file system called SDFS (GPL v2) is scalable to eight petabytes of capacity with 256 storage engines, which can each store up to 32TB of deduplicated data. Each volume can be up to 8 exabytes and the number of files is limited by the underlying file system. Opendedup runs in user space, making it platform independent, easier to scale and cluster, and it can integrate with other user space services like Amazon S3."
  • by stoolpigeon (454276) * <bittercode@gmail> on Saturday March 27, 2010 @11:32PM (#31644772) Homepage Journal

    Data deduplication [wikipedia.org]
    ( I don't )

  • by MyLongNickName (822545) on Saturday March 27, 2010 @11:52PM (#31644882) Journal

    Data deduplication is huge in virtualized environments. Four virtual servers with identical OS's running on one host server? Deduplicate the data and save a lot of space.

    This is even bigger in the virtualized desktop environment, where you could literally have hundreds of PCs virtualized on the same physical box.

  • Offtopic? (Score:4, Informative)

    by SanityInAnarchy (655584) <ninja@slaphack.com> on Sunday March 28, 2010 @12:18AM (#31645024) Journal

    If you'd mentioned the fact that this appears to be written in Java, you might have a point. But despite this, and the fact that it's in userland, they seem to be getting pretty decent performance out of it.

    And keep in mind, all of this is to support reducing the amount of storage required on a hard disk, and it's a fairly large programming effort to do so. Seems like this entire project is just the opposite of what you claim -- it's software types doing extra work so they can spend less on storage.

  • by dlgeek (1065796) on Sunday March 28, 2010 @12:23AM (#31645058)
    You could easily write a script to do that using find, sha1sum or md5sum, sort and link. It would probably only take about 5-10 minutes to write but you most likely don't want to do that. When you modify one item in a hard linked pair, the other one is edited as well, whereas a copy doesn't do this. Unless you are sure your data is immutable, this will lead to problems down the road.

    Deduplication systems pay attention to this and maintain independent indexes to do copy-on-write and the like to preserve the independence of each reference.
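
    The "5-10 minute script" described above might look like the following sketch (the function name and demo files are hypothetical, and it assumes a filesystem that supports hard links). The comment marks the mutation hazard the parent warns about:

```python
import hashlib
import os
import tempfile

def naive_hardlink_dedup(root):
    """Hard-link byte-identical files under root (the '5-minute script').
    Hazard, as noted above: editing one hard link in place edits them all."""
    seen = {}       # content digest -> first path seen with that content
    relinked = 0
    for dirpath, _, names in os.walk(root):
        for name in sorted(names):
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path) or os.path.islink(path):
                continue
            with open(path, 'rb') as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            first = seen.setdefault(digest, path)
            if first != path and not os.path.samefile(first, path):
                os.unlink(path)
                os.link(first, path)   # duplicate becomes a hard link
                relinked += 1
    return relinked

# quick demo in a throwaway directory
demo = tempfile.mkdtemp()
for name, text in [('a.txt', 'same'), ('b.txt', 'same'), ('c.txt', 'other')]:
    with open(os.path.join(demo, name), 'w') as f:
        f.write(text)
relinked = naive_hardlink_dedup(demo)
```

    A real deduplication system would instead keep an independent index per reference and copy-on-write, so that modifying `a.txt` afterward would not silently change `b.txt`.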
  • by rubycodez (864176) on Sunday March 28, 2010 @12:33AM (#31645110)

    hundreds of virtualized desktops per physical server does happen, my employer sells such solutions from several vendors.

  • by MyLongNickName (822545) on Sunday March 28, 2010 @12:51AM (#31645170) Journal

    If you have a couple hundred people running business apps, it ain't all that difficult. Generally you will get spikes of CPU utilization that last a few seconds, sandwiched between many minutes or even hours of very low CPU utilization. A powerful server can handle dozens or even hundreds of virtual desktops in this type of environment.

  • by Hooya (518216) on Sunday March 28, 2010 @01:06AM (#31645230) Homepage

    try this:

    mv backup.0 backup.1
    rsync -a --delete --link-dest=../backup.1 source_directory/ backup.0/

    see this [mikerubel.org]

  • by zappepcs (820751) on Sunday March 28, 2010 @01:12AM (#31645250) Journal

    In a word, No. There are many types of 'virtualization' and more than one approach to de-duplication. In a system as engineered as one with de-duplication, you should have replication as part of the data integrity processes. If the file is corrupted in all the main copies (everywhere it exists, including backups) then the scenario you describe would be correct. This is true for any individual file that exists on computer systems today. De-duplication strives to reduce the number of copies needed across some defined data 'space' whether that is user space, or server space, or storage space etc.

    This is a problem in many aspects of computing. Imagine you have a business with 50 users. Each must use a web application which has many graphics. Each user's browser cache has copies of those graphics images. When the cache is backed up, the backup is much larger than it needs to be. You can do several things to reduce backup times, storage space, and user quality of service:

    1 - disable caching for that site in the browser and cache them on a single server locally located
    2 - disable backing up the browser caches, or back up only one
    3 - enable deduplication in the backup and storage processes
    4 - implement all or several of the above

    The problems are not single-ended, and neither will the answers or solutions be. That is, no one solution is the answer to all possible problems. This one has some aspects that are appealing to certain groups of people. Your average home user might not be able to take advantage of this yet. Small businesses, though, might need to start looking at this type of solution. Think how many people got the same group email message with a 12MB attachment. How many times do all those copies get archived? In just that example you see the waste that duplicated data represents. Solutions such as this offer an affordable way to positively affect bottom lines by fighting those kinds of problems.
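
    The group-mail example above is easy to put numbers on. A minimal back-of-envelope sketch, assuming the 50-user business mentioned earlier and an assumed number of backup cycles:

```python
def attachment_waste_mb(recipients=50, attachment_mb=12, backup_cycles=4):
    """One attachment fanned out to every mailbox, then swept into
    backups repeatedly. All counts here are illustrative assumptions."""
    stored = recipients * attachment_mb    # live copies on the server
    archived = stored * backup_cycles      # copies accumulated in backups
    deduped = attachment_mb                # one stored copy, many pointers
    return stored, archived, deduped
```

    With these assumed numbers, one 12MB attachment turns into 600MB of live mail-store data and 2.4GB of archive data, versus 12MB after deduplication.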

  • by QuantumRiff (120817) on Sunday March 28, 2010 @01:19AM (#31645282)

    Cut a large file into lots of chunks of whatever size, let's say 64KB each. Then look at the chunks: if you have two chunks that are the same, you remove the second one and just place a pointer to the first one. Data deduplication is much more complicated than that in real life, but basically, the more data you have, or the smaller the chunks you look at, the more likely you are to have duplication, or collisions. (How many Word documents have the same few words in a row? Remove every repeat of the phrase "and then the" and replace it with a pointer, if you will.)

    This is also similar to WAN acceleration, which at a high enough level, is just deduplicating traffic that the network would have to transmit.

    It is amazing how much space you can free up when you're not just looking at the file level. This has become very big in recent years because storage has exploded, and processors are finally fast enough to do this in real time.
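
    The fixed-size chunking scheme described above can be sketched in a few lines (function names are hypothetical; real systems use cleverer chunk boundaries and on-disk indexes):

```python
import hashlib

def dedup_split(data, block_size=64 * 1024):
    """Cut data into fixed-size chunks and keep one copy of each unique
    chunk. The recipe of hashes is the list of 'pointers' described above."""
    store, recipe = {}, []
    for i in range(0, len(data), block_size):
        chunk = data[i:i + block_size]
        key = hashlib.sha256(chunk).hexdigest()
        store.setdefault(key, chunk)   # a repeated chunk costs only its hash
        recipe.append(key)
    return store, recipe

def reassemble(store, recipe):
    """Rebuild the original bytes by following the pointers in order."""
    return b''.join(store[key] for key in recipe)
```

    Three identical 64KB chunks followed by one different chunk would store only two unique chunks while the recipe still reconstructs all four.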

  • by mysidia (191772) on Sunday March 28, 2010 @02:03AM (#31645490)

    First of all, one of the most commonly duplicated blocks is the NUL block: a block of data where all bits are 0, corresponding to unused space, or space that was used and then zeroed.

    If you have a virtual machine on a fresh 30GB disk with 10GB actually in use, you have at least 20GB that could be freed up by dedup.

    Second, if you have multiple VMs on a dedup store, many of the OS files will be duplicates.

    Even on a single system, many system binaries and libraries, will contain duplicate blocks.

    Of course multiple binaries statically linked against the same libraries will have dups.

    But also, there is a common structure to certain files in the OS, similarities between files so great, that they will contain duplicate blocks.

    Then if the system actually contains user data, there is probably duplication within the data.

    For example, mail stores... will commonly have many duplicates.

    One user sent an e-mail message to 300 people in your organization -- guess what, that message is going to be in 300 mailboxes.

    If users store files on the system, they will commonly make multiple copies of their own files..

    Ex... mydocument-draft1.doc, mydocument-draft2.doc, mydocument-draft3.doc

    Can MS Word files be large enough to matter? Yes.. if you get enough of them.

    Besides, they have a common structure that is the same for almost all MS Word files. Even documents whose text is not at all similar are likely to have some duplicate blocks, which you have just accepted in the past -- it's supposed to be a very small amount of space per file, but in reality, a small amount of waste multiplied by thousands of files adds up.

    Just because data seems to be all different doesn't mean dedup won't help with storage usage.
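
    The cheapest win mentioned above, the all-zero block, can be measured directly on a raw disk image. A minimal sketch (function name hypothetical):

```python
def zero_block_count(image, block_size=4096):
    """Count all-zero blocks in a raw disk image -- the unused space
    that dedup collapses into a single stored block."""
    zero = bytes(block_size)
    total = (len(image) + block_size - 1) // block_size
    n_zero = sum(
        image[i:i + block_size] == zero
        for i in range(0, len(image), block_size)
    )
    return n_zero, total
```

    On the 30GB-disk example, roughly two thirds of the blocks would come back as zero blocks, all deduplicated down to one.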

  • by drsmithy (35869) <(moc.liamg) (ta) (yhtimsrd)> on Sunday March 28, 2010 @03:13AM (#31645736)

    I wonder how much this approach really buys you in "normal" scenarios, especially given the CPU and disk I/O cost involved in finding and maintaining the de-duplicated blocks. There may be a few very specific examples where this could really make a difference, but can someone enlighten me how this is useful on, say, a physical system with 10 CentOS VMs running different apps, or similar apps with different data? You might save a few blocks because of the shared OS files, but if you did a proper minimal OS install then the gain hardly seems to be worth the effort.

    Assume 200 VMs at, say, 2GB per OS install. Allowing for some uniqueness, you'll probably end up using something in the ballpark of 20-30GB of "real" space to store 400GB of "virtual" data. That's a *massive* saving, not only in disk space but also in IOPS, since any well-engineered system will carry that deduplication through to the cache layer as well.

    Deduplication is *huge* in virtual environments. The other big place it provides benefits, of course, is D2D backups.
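
    The ballpark above works out to a simple model: one shared base image plus a small unique slice per VM. A sketch, where the unique fraction is an assumed guess, not a measured figure:

```python
def vm_dedup_estimate(n_vms=200, gb_per_os=2.0, unique_fraction=0.06):
    """Back-of-envelope for the parent's figures: 200 VMs x 2GB each,
    with ~6% of each image assumed unique after deduplication."""
    virtual_gb = n_vms * gb_per_os                    # what the guests see
    real_gb = gb_per_os + virtual_gb * unique_fraction  # one base + deltas
    return virtual_gb, real_gb
```

    With these assumptions the 400GB of "virtual" data lands at about 26GB of "real" storage, inside the parent's 20-30GB ballpark.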

  • by DarkOx (621550) on Sunday March 28, 2010 @07:10AM (#31646372) Journal

    It really is hundreds, on a modern Nehalem-core system with 64 gigs of memory or so. We used to do dozens on each node in a Citrix farm back in the PIII days.

  • by Z8 (1602647) on Sunday March 28, 2010 @12:25PM (#31647976)
    Yep, and then you don't have to worry about
    • Changes in permissions/mtimes/atimes corrupting all your old backups because all of them are hard linked, or alternatively
    • Changes in permissions/mtimes/atimes causing an entire file to get copied

    There are also other things to worry about. To be fair, the guy who invented --link-dest wrote a backup program called Dirvish [dirvish.org] so that is a better comparison to rdiff-backup [nongnu.org].
