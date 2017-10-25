Oracle Engineer Talks of ZFS File System Possibly Still Being Upstreamed On Linux (phoronix.com) 71
New submitter fstack writes: Senior software architect Mark Maybee who has been working at Oracle/Sun since '98 says maybe we "could" still see ZFS be a first-class upstream Linux file-system. He spoke at the annual OpenZFS Developer Summit about how Oracle's focus has shifted to the cloud and how they have reduced investment in Solaris. He admits that Linux rules the cloud. Among the Oracle engineer's hopes is that ZFS needs to become a "first class citizen in Linux," and to do so Oracle should port their ZFS code to Oracle Linux and then upstream the file-system to the Linux kernel, which would involve relicensing the ZFS code.
One nice thing about ZFS not being in upstream is that it is currently maintained and updated separate from the Linux kernel.
Now, it would be nice to relicense ZFS under GPL so that it can be included in the kernel. But this should wait until the port is a bit more mature. Right now development is very active on ZFS and we have new versions coming out every few weeks; having to coordinate this with kernel releases will complicate things.
All this said, relicensing ZFS would definitely help Oracle redeem themselves a bit. After mercilessly slaughtering Sun after acquiring them, they have a long way to go to get from the "evil" side back to the forces of good.
Funny, I thought ZFS was very mature by now.
Getting it open and into Linux would result in perhaps some cross-pollination between OpenZFS and Oracle's official ZFS.
It's very mature, on Solaris. Linux has a different ABI to the storage layer, and different requirements on how filesystems are supposed to behave. So it's not so much a port as a re-implementation.
Likewise, does not the maintainer of, say, the TTY subsystem (just a random pick...) make active changes *between* release cycles, submitting their LAG to the various RCs?
Not to RCs. As I understand it the kernel is on a three month cycle, one month merge window and roughly two months of weekly RCs that are only supposed to be bug fixes. Otherwise you might get an undiplomatic response from Mr. Torvalds. Worse yet, many distros ship kernels much older than that and despite having "proper channels" bugs often go directly upstream with a resolution of "we fixed that two years ago, update... sigh, waste of time". So if you're not really ready for production use, being in the ke
Oracle is evil
... period. There is no going back.
More like "completely ambivalent", not really the same as "evil".
At least that's what I was going to say until I remembered the click-through mess they put in front of downloading the jre and jdk. Pure malice.
I don't believe this is Oracle's better nature or whatever; ZFS has to transition from Solaris to Linux because Solaris is dead.
It's really that simple. If Oracle can gin up a little excitement and maybe score some kudos then great, why not? But ultimately this has to happen or the official Oracle developed ZFS will die with its only official platform.
But this is Oracle we're talking about. I doubt they would GPL something because in their minds they'd lose control of it and allow the competition to exploit their code. After all, that's what Oracle has done itself to competitors like Red Hat. Aside from that, assuming they did GPL it, then it would immediately fork b
Good: it's about time. (Score:2)
Careful there (Score:2)
ZFS wants to live in a fairly specific configuration. It wants a bunch of drives, a bunch of memory, and not much competition for system resources. It's really a NAS filesystem, which is why there are no recovery utilities for it. If your filesystem takes a dump, you're SOL, hope you have a backup.
You can run it on a single drive on a desktop machine, but you are incurring a bunch of overhead and not getting the benefits of a properly set up ZFS configuration.
Re:Careful there (Score:5, Insightful)
> ZFS wants to live in a fairly specific configuration. It wants a bunch of drives, a bunch of memory, and not much competition for system resources.
Except for the part where it works with 2 drives, on a system with 4GB of RAM and under constant heavy load just fine.
Precisely, a bunch of drives, or a RAID, starts at two drives.
Generally in computers it is best to go from "only 1 device" directly to "n devices" and not to waste time special-casing 2 devices, 3 devices, 4 devices.
Being pedantic here, but you are wrong, and there are circumstances where this matters.
You can make a RAID1 array with one drive plus a failed (non-existent) drive. Hence the minimum is actually 1 drive, not two.
RAID, as defined in the original paper, involves data striping and striping cannot be implemented with less than 2 drives.
If you desire redundancy, RAID requires a minimum of 3 drives. A mirrored drive pair is not RAID, it is just mirroring.
Depends on what you mean by a drive. I have a horrible hard drive which was declared almost in its grave by SMART long ago. I made 2 partitions, run "software RAID1" across the 2 partitions, and store one final backup on it.
If it dies, nothing is lost.
> Precisely, a bunch of drives, or a RAID, starts at two drives.
Actually you're more than happy to run it on 1 drive as well. There's nothing "precise" about the GP's assertion that ZFS wants a fairly specific configuration.
Config? (Score:2)
Are you doing Z+1? Or just striping with an L2ARC, which is nearly pointless? What's the areal density of the drives? 'Cause if you are using anything above 2TB the odds of getting uncorrectable errors on both drives become non-trivial.
At this point you are better off using XFS with a really good backup strategy.
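The "odds" mentioned above can be put into rough numbers. This is only a back-of-the-envelope sketch: it assumes an unrecoverable read error (URE) rate of one per 1e14 bits read, a figure often quoted on consumer drive spec sheets and not stated anywhere in this thread, and it assumes errors are independent. The function name is made up for illustration.

```python
import math

def p_at_least_one_ure(capacity_bytes: float, ure_per_bit: float = 1e-14) -> float:
    """Probability of at least one URE while reading the whole drive once.

    ure_per_bit is an ASSUMED spec-sheet value, not a measured one.
    """
    bits = capacity_bytes * 8
    # 1 - (1 - p)^bits, written with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits * math.log1p(-ure_per_bit))

TB = 1e12  # decimal terabyte, as drive vendors count
for size_tb in (2, 4, 8):
    one = p_at_least_one_ure(size_tb * TB)
    both = one * one  # both drives of a pair each hit an error somewhere
    print(f"{size_tb} TB: one drive {one:.1%}, both drives {both:.1%}")
```

Note that "both drives hit an error somewhere" is much more likely than both drives losing the same block, so the second number overstates the chance of actual data loss on a mirror.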
So they say. Don't you find it odd that a drive can't possibly correct for errors but a filesystem can?
I wonder if drive vendors acknowledge that 100% of their high capacity drives are incapable of functioning without uncorrectable errors. Perhaps they should implement ZFS internally and all problems would be solved.
> So they say. Don't you find it odd that a drive can't possibly correct for errors but a filesystem can?
That's because the filesystem can just write to a different spot on the device, but if a specific spot on the physical device goes bad it's bad. In fact, almost all drives automatically error correct, you can see the stats through utils like "smartctl". A drive generally has +10%-+20% of advertised capacity, and exports a virtual mapping of the drive. As sectors start to show signs of failing, the address is transparently mapped to some of this "extra" space and things continue as normal. It's only a drive-
Not the best fit since it's schizophrenic (Score:3)
> The problem with ZFS on Linux is that some aspects of it are redundant with the kernel.
Probably ALL aspects of it. Linux already has a raid implementation in-kernel. It already has filesystems. It already has multiple volume managers, which handle whichever type of snapshots you prefer. It already has IO schedulers. ZFS, or rather something that looks just like it, can be implemented as a few configuration lines for pre-existing Linux components.
Because Linux normally lets you use your choice of fil
That last bit is important. If ZFS doesn't have a way to put its hands into the RAID, it can't attempt to rebuild known corrupted data. Until mdadm and hardware RAID controllers allow you to issue a "read, but try to give a different result" operation you can't do this. (Said operation would attempt to use parity even on a healthy array in an attempt to give a diffe
"If ZFS doesn't have a way to put its hands into the RAID, it can't attempt to rebuild known corrupted data."
Nonsense.
"Until mdadm and hardware RAID controllers allow you to issue a "read, but try to give a different result" operation you can't do this."
More nonsense.
"(Said operation would attempt to use parity even on a healthy array in an attempt to give a different block content by pretending a disk is dead)."
Apparently you believe that redundancy information can't be checked unless hardware provides an
Oh it can be "checked" by RAID controllers. The question is, how do you know which copy is correct? In the case of a RAID-1, if the 2 disks don't have identical data, which do you assume is the right data? ZFS has checksums to figure out which is right. MDADM doesn't.
And if there is an API to allow you to ask for data from a specific disk rather than letting the RAID driver pick one, I'm interested.
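The point about telling which mirror copy is correct can be illustrated with a toy model. This is a sketch under stated assumptions, not ZFS or mdadm code: the "disks" are byte strings, the checksum is SHA-256 kept apart from the data it covers, and `read_mirrored` is a hypothetical helper.

```python
import hashlib
from typing import Sequence

def checksum(block: bytes) -> bytes:
    """SHA-256 digest of a block (stand-in for a filesystem block checksum)."""
    return hashlib.sha256(block).digest()

def read_mirrored(copies: Sequence[bytes], expected: bytes) -> bytes:
    """Return the first copy whose checksum matches the separately stored one."""
    for data in copies:
        if checksum(data) == expected:
            return data
    raise IOError("no copy matches the stored checksum")

good = b"important data"
stored_sum = checksum(good)    # kept apart from the data block itself
disk_a = b"important dat\x00"  # silently corrupted copy, no I/O error reported
disk_b = good                  # intact copy

# A plain two-way mirror can only observe that the copies differ.
print("copies differ:", disk_a != disk_b)
# With an independent checksum the reader can tell which copy to trust.
print("recovered:", read_mirrored([disk_a, disk_b], stored_sum))
```

Without `stored_sum`, nothing in the two byte strings says which one is the original.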
Heard of RAID levels 2 through 6? (Score:2)
> ZFS has checksums to figure out which is right. MDADM doesn't.
You have no idea how RAID works, do you? Neither through the mdadm UI nor any other.
RAID level 2 uses Hamming error correction codes.
Levels 3 through 5 use checksums much like ZFS does. Level 6 uses two independent sets of checksums, so even if you lose half your checksums, you're still okay.
> if there is an API to allow you to ask for data from a specific disk rather than letting the RAID driver pick one, I'm interested.
An API to r
Again, wrong. RAID-2 might have ECC, but mdadm doesn't support it. You get RAID-1, 5 and 6 (4 is identical to 5, except that 5 distributes the parity rather than keeping it on a single disk). But that's not a checksum, it's parity. It recovers from a drive outright failing, not from a drive returning incorrect data that is not detected as bad. I have seen it happen. RAID-5 can only tell you there's an inconsistency, not which disk has the bad data. The RAID controller's consistency check usually just updates the parity under the a
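The parity-versus-checksum distinction argued above can be sketched with single XOR parity, as in a RAID-5 style stripe. This is a toy model with made-up block contents, not mdadm behaviour: it only shows that parity rebuilds a block known to be missing, while a silently wrong block yields an inconsistency that cannot be attributed to any one member.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d0, d1, d2])

# Case 1: a drive fails outright, so the array knows which block is missing.
rebuilt_d1 = xor_blocks([d0, d2, parity])
print("rebuilt after known failure:", rebuilt_d1 == d1)

# Case 2: a drive silently returns wrong data; the stripe no longer adds up.
bad_d1 = b"BBXB"
stripe = [d0, bad_d1, d2]
print("stripe consistent:", xor_blocks(stripe) == parity)

# "Rebuilding" any one member from the rest makes the stripe consistent again,
# so parity alone gives no way to decide which member was actually wrong.
candidates = []
for i in range(len(stripe)):
    others = stripe[:i] + stripe[i + 1:]
    candidates.append(xor_blocks(others + [parity]))
print("candidate repairs:", candidates)
```

Only the second candidate restores the original data; the other two overwrite a good block with a wrong one, and each leaves the stripe looking equally consistent.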
Re: (Score:2)
It's called scrubbing, and RAID has always done it (Score:2)
> Until mdadm and hardware RAID controllers allow you to issue a "read, but try to give a different result" operation you can't do this. (Said operation would attempt to use parity even on a healthy array in an attempt to give a different block content by pretending a disk is dead).
So until the late 1980s? That's called RAID scrubbing and I believe it was mentioned toward the end of the original RAID paper in 1987 or 1988. Certainly 10 years ago I had a "mdadm check" command in my crontab. I know this
The ZFS architecture is a very well-factored, layered design with clear abstraction barriers between the different pieces, and it solves a single well-defined problem: managing your storage. Comparing it to systemd is a gross disservice to ZFS.
Sometimes it really works better to have a monolithic, well thought out abstraction providing some service as opposed to the kernel giving you a handful of ill-fitting subcomponents that you're responsible for gluing together. This is clear from the example of Linux
Re: (Score:3)
> Because Linux normally lets you use your choice of file system on top of your choice of volume manager, on top of whichever RAID implementation you choose, with your choice of IO scheduling options, ZFS isn't exactly the best fit. ZFS mashes all those different things into one big blob. That's not really how Linux is designed.
Criticizing ZFS for "rampant layering violation" has been discussed to death before [archive.org]
"Dumb" API's, such as the ones implemented in Linux, have a STRICT layered approach like this:
*
> ZFS mashes all those different things into one big blob. That's not really how Linux is designed.
That's because Linux isn't designed, it's grown organically in a hodgepodge fashion. Some people think this is a good thing. Others do not.
A weblog post by Jeff Bonwick, one of the creators of ZFS, from ten years ago:
Andrew Morton has famously called ZFS a "rampant layering violation" because it combines the functionality of a filesystem, volume manager, and RAID controller. I suppose it depends what the meaning of the word violate is. While designing ZFS we observed that the standard layering of the storage stack induces a surprising amount of unnecessary complexity and duplicated logic. We found that by refactoring the problem a bit -- that is, changing where the boundaries are between layers -- we could make the whole thing much simpler.
https://blogs.oracle.com/bonwick/rampant-layering-violation
He gives a reasonable answer as to why glomming all that together has its advantages. Good intro slide deck:
https://wiki.illumos.org/download/attachments/1146951/zfs_last.pdf
Note that "ZFS" is actually made up of three layers: the SPA
> Because Linux normally lets you use your choice of file system on top of your choice of volume manager,
The problem is: btrfs, exfat, ext3, ext4, fat, jfs, reiserfs, and xfs ALL SUCK -- they all propagated write errors [danluu.com]
FS / read / write
btrfs.. | prop prop prop
exfat.. | prop prop ignore
ext3... | prop prop ignore
ext4... | prop prop ignore
fat.... | prop prop ignore
jfs.... | prop ignore ignore
reiserfs | prop prop ignore
xfs.... | prop prop ignore
As you may know, RedHat has deprecated BTRFS in RHEL7.4 [redhat.com] whereas many distributions like Ubuntu fully support ZFS [ubuntu.com].
I would say that the status of BTRFS is worse [kernel.org] than that of OpenZFS on Linux. See also here [ixsystems.com] for an interesting article.
I would say you are wrong.
That RH has not retained qualified Btrfs programmers is their business decision and has little to nothing to do with Btrfs or its usability.
https://www.itwire.com/open-sa... [itwire.com]
KDE Neon User Edition has zfs-fuse and a version of OpenZFS in its repository. I've played with the fuse version and was unimpressed.
After I tried zfs-fuse I tried Btrfs. I've been using it without a single fault or problem for 2 1/2 years.
Whatever the reason, btrfs is not supported in production on RHEL. It has never been, it's always been in "preview" and will soon be out of the picture completely.
It's been going on for years so I would agree with the above that OpenZFS would have a brighter future.
Yes, there is a lot of duplicated code in ZFS for Linux, such as an SHA256 implementation, RAID parity, compression, and lately a whole crypto library.
The reason is either the kernel doesn't reliably support this natively or the implementation isn't usable. Linux doesn't allow non-GPL modules to access a lot of features (eg: the crypto library) or some features are version-specific (eg: LZ4 (de)compressor). The simplest solution is to import the Solaris versions.
But they've improved. SSE and AVX instruction
Re: (Score:2)
Holy shit are you serious? Like SERIOUS? OMG why don't we all switch to BSD! Everyone stop! I know Linux is *everywhere* but BSD has ZFS! Did you guyz know this????
The version in BSD is an older version derived from when Solaris was open-source, in 2007. It is independently maintained and a part of OpenZFS. In fact the ZFS stacks in IllumOS (a fork of open-source Solaris), FreeBSD, Linux and OS X share a lot of code and are compatible, in the sense that if you create a ZFS filesystem on one of these OSes, it will work on the others.
OpenZFS has made enormous progress. I have been using it on my FreeBSD, Linux and OS X (macOS) boxes for over 3 years now.
And once it's in the kernel, Oracle will sue... (Score:2)
*cough*Java*cough*
Btrfs (Score:2)
I played with zfs-fuse on KDE Neon a couple years ago after reading from its acolytes that it was "more advanced" and "better" than EXT4 or Btrfs. It wasn't. A lot of it is missing in the fuse rendition.
I switched to Btrfs. I have three 750Gb HD's in my laptop. I use one as a receiver of @ and @home backup snapshots. I've configured the other two as a 2 HD pool and then as a RAID1, and then back to a pool again. In 2 1/2 years of using Btrfs I've never had a single hiccup with it.
There are some exce
ZFS fuse is not ZFS on Linux. Not sure why you'd pass judgement on ZFS having only used it years ago with the fuse version. If you want a real test, try the latest ZFS on Linux releases. They are kernel modules not fuse drivers.
I have run BtrFS for about 5 years now, and I must say it works well on my Laptop with SSD. However on my desktop with spinning disk, it completely falls over. It started out pretty fast for the first few years, but now it's horrible. The slightest disk I/O can freeze my system for
Quick question for you: do you have quotas enabled? Updating qgroups takes an enormous amount of time. I had the same symptoms on my laptop on a 1T drive, and turning off quotas and removing qgroups solved it.
New to ZFS (Score:3)
Just as this article popped up I was assembling a JBOD array (twelve 4TB drives) for a new data center project, my first in quite a while. Also self funded so I don't have to defer to anyone in decisions.
When I started I did a bit of reading trying to decide what RAID hardware to get. To make a long story short once I read the architecture of ZFS and several somewhat-polemic-but-well-reasoned blog entries I decided that is what I wanted.
Only two months ago I had an aged Dell RAID array let me down. I have no idea what actually happened, but it appears some error crept in one of the drives and it got faithfully spread across the array and there was just no recovering it. If I didn't have good backups that would have been about 12 years of the company's IP up in smoke. I just thought I'd share.
So I ended up as a prime candidate (with new found distrust for hardware RAID) to be a new ZFS-as-my-main-storage user. I've just recently learned stuff that was well established five years ago [pthree.org] and I can't understand why everybody doesn't do it this way.
Wow. Snapshots? I can do routine low-cost snapshots? Data compression? Sane volume management? (I consider LVM to be the crazy aunt in the attic. Part of the family but ...) Old Solaris hands are probably rolling their eyes but this is like manna from heaven to me.
Given the plethora of benefits I am sure the incentive is high enough to keep ZFS on Linux going onward. ZFS root file system would be nice but I am more than willing to work around that now.
> Only two months ago I had an aged Dell RAID array let me down. I have no idea what actually happened, but it appears some error crept in one of the drives and it got faithfully spread across the array and there was just no recovering it. If I didn't have good backups that would have been about 12 years of the company's IP up in smoke. I just thought I'd share.
It may have been the RAID write hole?
See Page 17 [illumos.org]
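For readers unfamiliar with the term, the write hole can be sketched with a two-data-block stripe and XOR parity. This toy model only illustrates the mechanism; it says nothing about what actually happened to the Dell array above, and all block contents are made up.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# A consistent stripe: two data blocks plus their parity.
d0, d1 = b"old0", b"old1"
parity = xor_blocks(d0, d1)
print("rebuild works before the crash:", xor_blocks(d0, parity) == d1)

# The array rewrites d0, then loses power before the matching parity write.
d0 = b"new0"
# parity is now stale: it still describes old0 and old1.

# Later the disk holding d1 dies and is rebuilt from d0 and the stale parity.
rebuilt_d1 = xor_blocks(d0, parity)
print("rebuild works after the crash:", rebuilt_d1 == b"old1")
print("rebuilt block:", rebuilt_d1)
```

The damage lands on `d1`, a block that was never being written, and the array reports no error while producing it.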
I have a similar configuration at home. zfs send/recv is a godsend for backups: you can send all of the old snapshots as well as the current top level, and it ships only the data that has changed, not everything.
I have run this configuration through failures of controllers, power supplies, and multiple drives (more than 2 at the same time), and it still kept on chugging with no errors and full confidence in the data.
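The incremental send/recv backup described above might be driven from a script along these lines. This is a minimal sketch, assuming the `zfs` command is installed and the caller has the needed privileges; the dataset and snapshot names (`tank/data`, `backup/data`, `monday`, `tuesday`) are hypothetical, as are the helper function names.

```python
import subprocess

def incremental_send_argv(dataset: str, prev_snap: str, new_snap: str) -> list:
    """argv for `zfs send -i`, which emits only the changes between two snapshots."""
    return ["zfs", "send", "-i", f"{dataset}@{prev_snap}", f"{dataset}@{new_snap}"]

def receive_argv(target: str) -> list:
    """argv for `zfs receive` into the target dataset."""
    return ["zfs", "receive", target]

def replicate(dataset: str, prev_snap: str, new_snap: str, target: str) -> bool:
    """Pipe the incremental stream into the receiver; True if both commands succeed."""
    sender = subprocess.Popen(
        incremental_send_argv(dataset, prev_snap, new_snap),
        stdout=subprocess.PIPE,
    )
    receiver = subprocess.Popen(receive_argv(target), stdin=sender.stdout)
    sender.stdout.close()  # let the sender see a broken pipe if the receiver exits
    receiver.wait()
    sender.wait()
    return sender.returncode == 0 and receiver.returncode == 0

# Example call (hypothetical names; requires existing snapshots on a real pool):
# replicate("tank/data", "monday", "tuesday", "backup/data")
print(" ".join(incremental_send_argv("tank/data", "monday", "tuesday")))
```

To carry every intermediate snapshot between the two endpoints rather than just the difference, `zfs send` also accepts `-I` in place of `-i`.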