The Linux Filesystem Challenge
Joe Barr writes "Mark Stone has thrown down the gauntlet for Linux filesystem developers in his thoughtful essay on Linux.com. The basic premise is that Linux must find a next-generation filesystem to keep pace with Microsoft and Apple, both of whom are promising new filesystems in a year or two. Never mind that Microsoft has been promising its "innovative" native database/filesystem (copying an idea from IBM's hugely successful OS/400) for more than ten years now. Anybody remember Cairo?"
Hans Reiser's vision of the future (Score:5, Informative)
Recall that Mark Stone... (Score:4, Informative)
ReiserFS is pretty damn good (Score:5, Informative)
Re:Is it? (Score:3, Informative)
Why????? (Score:3, Informative)
Any good filesystem (XFS, JFS, ext3) now has a nice feature called Extended Attributes, which is intended for STORING such data (previews etc.). And with a user-space server it's much easier to add plug-ins for various file formats, "search" plugins etc.
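For illustration, here is a minimal Python sketch of attaching metadata to a file through extended attributes on Linux. The attribute name "preview" and both helper names are invented for this example. `os.setxattr` and `os.getxattr` are Linux-only and fail on filesystems without user xattr support, so the helpers report that case instead of raising.

```python
import os


def set_user_attr(path, name, value):
    """Try to store `value` (a str) in the user.<name> extended attribute.

    Returns True on success, False when xattrs are unavailable
    (non-Linux platform, or a filesystem without user xattr support).
    """
    try:
        os.setxattr(path, "user." + name, value.encode("utf-8"))
        return True
    except (AttributeError, OSError):
        return False


def get_user_attr(path, name):
    """Return the attribute as a str, or None if it is missing or unsupported."""
    try:
        return os.getxattr(path, "user." + name).decode("utf-8")
    except (AttributeError, OSError):
        return None
```

A user-space indexing server could call something like `get_user_attr(path, "preview")` for each file it visits, without the filesystem itself knowing anything about previews.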
Re:I want a transparent filesystem/VM (Score:3, Informative)
PR & tech journalists to the contrary, that is all that is involved in Spotlight & WinFS. Spotlight runs on HFS+. WinFS runs on NTFS. Both are databases stored as files on existing filesystems. The only difference between those databases & updatedb is that they may be using better database design (dunno) and that they update in real time via background processes.
I wrote a journal entry [slashdot.org] guessing as much about Spotlight, but since then more evidence has arrived, and I'm convinced that both WinFS & Spotlight are implemented that way. The features & implementation details are quite different, but not the filesystem.
We'll probably eventually start calling these databases a part of the "filesystem" much like right now some people will call mspaint.exe & bash a part of the operating system.
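To make the "database stored as a file on an existing filesystem" idea concrete, here is a rough updatedb-style sketch in Python using SQLite. It is not how Spotlight or WinFS work internally. The table layout and the function names `build_index` and `search` are assumptions made up for this example.

```python
import os
import sqlite3


def build_index(root, db_path):
    """Walk `root` and record path, name, size and mtime of every file found."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, name TEXT, size INTEGER, mtime REAL)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            try:
                st = os.stat(full)
            except OSError:
                continue  # file vanished or is a broken symlink; skip it
            con.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)",
                (full, filename, st.st_size, st.st_mtime),
            )
    con.commit()
    return con


def search(con, pattern):
    """Return sorted paths whose file name matches a SQL LIKE pattern."""
    rows = con.execute(
        "SELECT path FROM files WHERE name LIKE ? ORDER BY path", (pattern,)
    )
    return [row[0] for row in rows]
```

The real-time part would come from a background process re-running the insert for each changed file; the filesystem underneath stays exactly as it was.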
Um, Reiser anyone? (Score:2, Informative)
But seriously, even though he mentions Reiser, he doesn't seem to consider its future [namesys.com] direction, which is to allow varying degrees of structure, that could include attributes, as the user sees fit. At least that's how I understand it.
dtrace (Score:5, Informative)
It is, however, expert-driven, unlike top, which is simple to use. Still, I think that dtrace shows the future of performance monitoring apps.
Note that dtrace lives partially in the kernel - it's not portable to Linux.
Re:Next premise, please (Score:3, Informative)
hfs+ [apple.com] supports a journal (starting with macos 10.2.2 server and 10.3 panther), and ntfs5 [microsoft.com] supports a journal (starting with win2k)
Re:Why not use... (Score:1, Informative)
BeOS was a great technology demo, but it had a huge way to go to become suitable for general-purpose, everyday use. NeXT had Be beaten in many of its strong spots (real-time scheduling? Mach already does that. well-designed, object-oriented system APIs? Openstep^WCocoa creams Be's APIs.), and was already a mature, field-tested operating system to boot. If Apple had bought Be, (a) they wouldn't have got Steve Jobs back to save the company, and (b) they would have a *lot* more "reinventing the wheel" to do to the BeOS base than they've had to do to Nextstep.
Re:New FS (Score:5, Informative)
Notice the plugin feature. This will create endless possibilities for what you can do with the file system. Want to tie a DB/SQL search function into it? Write a plugin. Want special security? Write a plugin. Tons of possibilities with ReiserFS4 and it is _very_ fast. This is hands down better than the MS "a filesystem as a DB" approach. ReiserFS4 will be like Firebird, lean-n-mean-n-fast. Want more features, grab _your_ favorite plugins!
Re:Encrypted filesystems? (Score:2, Informative)
http://bob.plankers.com/other/linux/loopback_ef
The Linux Doc Project also has a HOWTO in their archive:
http://www.tldp.org/HOWTO/Loopback-Encrypted-Fi
You will want to check around though, a lot of the information appears to be very old. Also, the 2.6 kernel has a lot more encryption routines built into it, so using 2.6 changes how it's done (but it is still basically mounting an encrypted file using a loop-back mount point).
Re:Encrypted filesystems? (Score:3, Informative)
Re:Hans Reiser's vision of the future (Score:5, Informative)
HFS+ is the current OS X file system, and that of Tiger (next revision of OS X) as well. Spotlight uses HFS+'s built-in metadata support to enhance its search capabilities. What Tiger additionally offers application developers is an API to add metadata to documents, something that was limited until now.
Re:Next premise, please (Score:5, Informative)
What are you talking about? NTFS has had journalling for over a decade. And Unicode. And ACLs. And streams. And reparse points (these are amazingly cool). And compression. And encryption. And
Now, MS doesn't use most of this good stuff, but it's all in there. Even three-letter file extensions on Windows are obsolete, since everything on NTFS can be an OLE server. There's nothing on Linux that comes close to the capabilities of NTFS. About the only major thing NTFS is missing is versioning, which VMS has.
been there, had that (Score:3, Informative)
The seamless filesystem-in-a-database was created in the Multi-Valued DB structure [multivaluedatabases.com] in the mid-60s and released as the Pick OS [wikipedia.org]. It is still sold by Raining Data [rainingdata.com] and runs on Windows, Unix, and Linux.
Re:why not improved ramdisk? (Score:5, Informative)
The solution would be to load things "on demand," as you've suggested.
Linux already does this, and it does more.
If you've ever looked at the output of free(1) after your system has been running for an hour or so, it will appear as if almost all your memory is in use. See those last two columns, "buffers" and "cached"? That's your "on-demand ramdisk" at work.
Linux will use memory that applications aren't using to cache filesystem data (including executables and metadata) to speed future accesses. If your applications need more memory than is currently free, the kernel will drop cached data rather than swap out application memory to disk. That way, you get the benefits of having your executables on a ramdisk, with the flexibility of not having to sacrifice running application performance in the process.
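If you'd rather read those numbers programmatically than eyeball free(1), a small Python sketch follows. It parses text in the /proc/meminfo format; the function names are invented for this example, and adding "Buffers" and "Cached" is only a rough stand-in for what free(1) shows in those two columns (not all of it can be dropped at any given moment).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}.

    Most fields are reported in kB; a few (such as HugePages_Total)
    are plain counts. The unit suffix is ignored here.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def cache_kib(info):
    """Approximate filesystem cache: the Buffers and Cached fields added up."""
    return info.get("Buffers", 0) + info.get("Cached", 0)
```

On a Linux box you would feed it the contents of /proc/meminfo, for example `cache_kib(parse_meminfo(open("/proc/meminfo").read()))`.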
Linux encrypted filesystems not really up to snuff (Score:3, Informative)
First, it's minimally supported by distros. I can't just set up a Fedora system out of the box, check "use encryption", and have it do an NTFS-style decryption of the file encryption key using the password entered at login for each user to decrypt that user's files. It requires hacking around pam and maybe initscripts.
Second, if that *was* done, it would take a different filesystem per user (per key), which is a pain to maintain.
Third, it can't be enabled by users (would require root dicking around with pam and filesystems) as NTFS encryption can be.
Fourth, it can't be enabled on-the-fly (requires creating new filesystems and copying the contents over, unlike NTFS).
Fifth, it's a pain to maintain -- on NTFS, it's easy for a user to just say "I want the contents of this directory and below to be encrypted" and choose to have things encrypted on a per-directory basis. The equivalent on Linux would be having the root user be creating new filesystems (knowing the appropriate sizes in advance and wasting any excess space allocated) copying over the contents and adding mount points for every filesystem mounted.
Sixth, NTFS supports key recovery with a backup, emergency passphrase (it can maintain two copies of the encryption key, one encrypted with, say, the administrator's password). Dunno about the Linux status of this.
Having an encryption layer above the block layer is a nice idea, but it's not a drop-in substitute for encryption support in the filesystem.
It would be possible to add a layer in which an encryption layer could be *added* (preprocess file/directory contents -- if one *only* wanted encrypted files and not directories, this could already be done with an LUFS or fuse module). Space for such a layer does not currently exist in Linux.
Re:easy answer (Score:4, Informative)
This is very much like saying "the future of filesystems is apache2, local filesystems are already good, now we have to concentrate on apache2".
Re:Hans Reiser's vision of the future (Score:3, Informative)
What Tiger offers is a way for application developers to DECLARE metadata in their document formats... most formats have metadata of some kind already (in an mp3, id3 tags; in a image, resolution etc.; in a source file, dependencies and exported symbols); what Tiger lets application developers do is tell spotlight how to find the information that's already there.
Now, this may lead to future formats that have more comprehensive metadata, since there's now more power to that metadata... but that's not the direct idea.
Re:Indestructable is the killer app (Score:4, Informative)
Offsite backups are your friend. No matter what your filesystem's software, or the coolness of your RAID array, or your battery-backed redo-logs; if a fire or a burglar takes the disks holding your filesystem you're hosed.
Personally, instead of a RAID, I do a nightly "rsync" to a "yesterday" drive on a separate server (hence protecting myself from stupid-user failures as well as filesystem/disk failures); an "every time I did something significant" rsync to an encrypted filesystem on a removable drive kept in my car; and a "once in a blue moon" copy to DVDs in a safe.
An added benefit - upgrading an OS, or a computer is trivial, because the live backups are just that - live, and tested every night.
(Back to the filesystem topic, Reiser's whole naming idea is so much cooler than a hierarchy or a relational system I really hope this is the next big advance for Linux).
Re:is this sarcasm (Score:3, Informative)
Re:Next generation? (Score:1, Informative)
Where I've worked for 15 years, I solved that problem the first day at work. It's called yellow pages from Sun. I've also used it on Linux since 1995 without problem.
Re:New FS (Score:3, Informative)
http://sourceforge.net/projects/e2compr/ (Ext2 Compression)
http://squashfs.sourceforge.net/ (Squashed - Read Only, don't know what that means)
Re:New FS (Score:2, Informative)
On the flip side, I've had multiple such snafus with XFS, but no filesystem failure. I've never even had to approach having to deal with trying to fix the system, as there's been no events which have resulted in fs corruption. Sure, power has gone out, but the machines have come back up again without a hitch.
Apple does NOT have a new FS coming out. (Score:4, Informative)
Their solution is to build a service that can interact with individual files, including their native metadata (ID3 tags, pdf metadata, MS Office metadata, email headers, etc.) through metadata importers and to store the metadata indexes in a separate database. This is relatively similar to how iTunes does its thing. The services will have lots of APIs open to apps to incorporate the functionality locally.
The obvious clue that HFS+ isn't going away is that Apple is finally pushing full HFS+ support back up to the command line utils like cp to support resource forks and whatnot in 10.4, so hopefully we can stop needing OS X specific tools like ditto.
They've been adding improvements steadily over the years, such as journaling and most recently case sensitivity. The more obvious question to me is why doesn't the Linux community just jump all over HFS+ and build off of Apple's work since they seem more than willing to give the HFS+ support back anyway?
Re:New FS (Score:2, Informative)
It doesn't sound as though "compressed streaming format[s]" are what you're really looking for, and AVI isn't a streaming format in any case. However, there are archival-type video codecs that may suit your needs:
In a perfect world, you'd have one of these working behind the scenes in some sort of network storage device in a manner similar to the dpsVelocity [leitch.com] VTFS [creativecow.net]. If you haven't worked with an editing system that uses VTFS, I recommend getting a demo.
Re:Next generation? (Score:3, Informative)
UNIX has traditionally been about big systems with multiple users. Networks have been a standard feature for decades. In this sort of environment, you'd naturally use some network-oriented naming service, be it NIS or LDAP.
Windows has grown from a PC background where everything is traditionally local. In a networked environment there is little need for the MACHINEA/user when there is a DOMAIN/user (some exceptions obviously exist).
I am responsible for a network consisting of an NT domain and a number of Solaris, AIX, Linux and Unixware servers. All the *NIX boxes have the same UID/GID schema because we use NIS; not the most secure solution, but suitable for our environment. We interface those users easily with Windows (and Samba) because we only administer two sets of login credentials - NT and NIS (we could do just one using winbind, but that doesn't seem right...).
The UNIX filesystem permissions schema is easy to understand and it works extremely well. Commercial UNIX has had access control lists for years (part of the POSIX standard), but I'm not aware of anyone who uses them in the real world. They are potentially useful, but most people find the UNIX UID/GID does the job well enough for 99% of the time.
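For anyone who hasn't looked at the classic scheme from code, here is a tiny Python sketch that reads a file's owner, group and mode bits. The helper name `describe_permissions` is made up for this example; `os.stat` and `stat.filemode` are standard library calls.

```python
import os
import stat


def describe_permissions(path):
    """Return the ls-style mode string plus the numeric owner and group IDs."""
    st = os.stat(path)
    return {
        "mode": stat.filemode(st.st_mode),  # e.g. '-rw-r-----'
        "uid": st.st_uid,                   # numeric user ID of the owner
        "gid": st.st_gid,                   # numeric group ID
    }
```

The numeric uid/gid values are what a naming service such as NIS keeps consistent across machines; the filesystem itself only ever stores the numbers.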
Re:New FS (Score:1, Informative)
Re:New FS (Score:3, Informative)
If you consider "streaming" to mean something like RealMedia or other web-based streaming codecs, you are correct. However, working in the DVD/Digital Video/Multimedia fields, we do refer to MPEG-2, AVI, and so forth as "streaming" formats because they are composed of one or more "streams" of content. Basically, the difference between what we have now (tens of thousands of individual files, each one representing a single frame of video) and a "streaming" file is that it compresses all those individual files into one big file.
However, there are archival-type video codecs that may suit your needs:
Thanks for the listing! I will check into these.
Re:Hans Reiser's vision of the future (Score:4, Informative)
Re:Keep it all modular, please (Score:2, Informative)
Well, filesystems are more or less some kind of database.
Especially the Reiser (3/4) filesystems come very close to being a database.
The database is one big tree. You can see it as (in SQL view) like a single table, where the primary key is indexed and the actual data (the objects) can be of different types.
These types are:
The root directory (/) has a known key, where it can be looked up. There you'll find a "directory" item. It contains a list of names, each name also has a key. Using this key you can find the stat data for that file or directory list or the actual file data.
This data can be located anywhere in the tree, even small parts of file content (like the end of files that don't fill up a block so it would be a waste of space to store it in a full block).
Using this approach everything becomes dynamic. It is also very fast, because if you have a lot of files you can write all the data into a contiguous region on the disk and don't have to update some fixed positions on the disk.
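The "one big keyed tree" picture above can be sketched as a toy model in Python. This is NOT the ReiserFS on-disk format: the key layout (object id, item kind), the item kind numbers, and all names here are assumptions invented for illustration, and a sorted list stands in for the real balanced tree.

```python
import bisect

ROOT_ID = 1                   # assumed well-known id of the root directory
STAT, DIR, DATA = 0, 1, 2     # hypothetical item kinds


class ToyTree:
    """One ordered key -> (item_type, payload) store, like a single indexed table."""

    def __init__(self):
        self._keys = []       # kept sorted, standing in for the tree order
        self._items = {}

    def insert(self, key, item_type, payload):
        if key not in self._items:
            bisect.insort(self._keys, key)
        self._items[key] = (item_type, payload)

    def lookup(self, key):
        return self._items[key]

    def items_of(self, obj):
        """All items of one object; their keys are neighbours in key order."""
        lo = bisect.bisect_left(self._keys, (obj,))
        hi = bisect.bisect_left(self._keys, (obj + 1,))
        return [self._items[k] for k in self._keys[lo:hi]]


def resolve(tree, path):
    """Follow directory items from the root's known key to an object id."""
    obj = ROOT_ID
    for part in [p for p in path.split("/") if p]:
        _item_type, entries = tree.lookup((obj, DIR))
        obj = entries[part]   # each name maps to the key of the next object
    return obj


def read_file(tree, path):
    """Look up the data item of the object that `path` names."""
    return tree.lookup((resolve(tree, path), DATA))[1]
```

Because everything is found by key lookup rather than at a fixed disk position, the payloads are free to live anywhere, which is the property the comment is describing.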
Now, reiser4 takes this approach to an extreme:
The clue is:
These default plugins make a filesystem from the database. It's just like reiserfs3 now, just faster.
BUT: You can now add plugins if you want. Plugins to store compressed or encrypted files. Plugins to store additional metadata alongside the files. It's basically the file system of the future. Because it's extensible without changing the disk format.
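As a very loose analogy for what "plugins without changing the disk format" could mean, here is a toy Python store where each file records which plugin encoded it. None of this is the reiser4 plugin API; `ToyStore`, `ZlibPlugin` and the "plain" default are hypothetical names for this sketch, with zlib standing in for a compression plugin.

```python
import zlib


class PassThrough:
    """Default plugin: stores bytes unchanged."""

    def encode(self, data):
        return data

    def decode(self, data):
        return data


class ZlibPlugin:
    """Example add-on plugin: compresses on write, decompresses on read."""

    def encode(self, data):
        return zlib.compress(data)

    def decode(self, data):
        return zlib.decompress(data)


class ToyStore:
    def __init__(self):
        self._plugins = {"plain": PassThrough()}
        self._blobs = {}      # path -> (plugin name, stored bytes)

    def register(self, name, plugin):
        self._plugins[name] = plugin

    def write(self, path, data, plugin="plain"):
        self._blobs[path] = (plugin, self._plugins[plugin].encode(data))

    def read(self, path):
        name, raw = self._blobs[path]
        return self._plugins[name].decode(raw)

    def stored_size(self, path):
        return len(self._blobs[path][1])
```

Callers of `read` never see which plugin was used, which is the sense in which such a store stays extensible while the interface stays the same.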
Re:not so fast ... (Score:5, Informative)
Sorry, but you are wrong here. Reiser4 is atomic and you can pack as many operations into one transaction as you like; you just have to use the reiser4 system call. This is because there is no standard system call for atomic filesystem transactions. Modern filesystems are databases, built to store files and query them through filenames. Reiser4 is the first filesystem where the search path can be handled through plugins, therefore you can index everything you want.
Re:New FS (Score:5, Informative)
Re:New FS (Score:1, Informative)
Re:New FS (Reiser4 has a compression plugin coming (Score:5, Informative)
Hans
(You can email edward@namesys.com for details).
Re:not so fast ... (Score:2, Informative)
Re:Apple does NOT have a new FS coming out. (Score:3, Informative)
Because the features they're adding to HFS+ are already available in other filesystems? There's nothing in HFS+ that would make linux users want to use it, and some compelling reasons why they would not. (Performance, size limits, lack of an online resizer, etc.)
Re:Trust the Kernel team (Score:2, Informative)
I didn't, I looked around until I found something more stable. It turned out to be Slackware using ReiserFS on an AMD900 with "cheap RAM" and I have had 0 problems in the year and a half it's been running (and it's my main desktop PC/workhorse).
Sometimes it is just the hardware.
Re:New FS (Score:4, Informative)
Re:New FS (Score:3, Informative)
>event I edit
Relax, this is old news! The Novell Netware filesystem did this 10 years ago, they called it "sub-block allocation". I never had a problem with it on my servers. I've never heard of anyone else having problems with it, either.
Re:New FS (Reiser4 has a compression plugin coming (Score:4, Informative)
if the filesystem does the compression, the apps (or you) can't see it happen. that's the POINT. your suggestion, above, is ridiculous. If you had a tar.gz file, you could extract it to the FS, but it would actually be equally compressed (cause it's a gzip compressed FS), and then you could play with the files to your heart's content, without worrying about the compression, cause it's transparent. You wouldn't need or want some kinda plugin or something...
Unless the FS wasn't compressed, and you wanted a transparent way to access tar.gz files. That idea would make sense.
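That second case, reading files inside a tar.gz without unpacking it first, can be approximated in user space today. A minimal Python sketch using the standard `tarfile` module follows; the helper names are invented for this example, and this is ordinary library code, not a filesystem plugin.

```python
import tarfile


def list_members(archive_path):
    """Names of the regular files stored inside a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return [m.name for m in tar.getmembers() if m.isfile()]


def read_member(archive_path, member_name):
    """Return one member's bytes without extracting anything to disk."""
    with tarfile.open(archive_path, "r:gz") as tar:
        fobj = tar.extractfile(member_name)   # raises KeyError if absent
        if fobj is None:                      # directories, links, devices
            raise IsADirectoryError(member_name)
        return fobj.read()
```

A filesystem-level version would do the same thing behind open(2), so that applications would not need to know the archive was there.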
Re:not so fast ... (Score:3, Informative)
Re:Trust the Kernel team (Score:1, Informative)
If you've had no problems, you're either using a version of ReiserFS which has had the problems fixed, or you're lucky.
And it's not just hardware that changes the picture. It can be delicate kernel interactions. It can be the actual data of what's on disk exposing bugs in the code that handles it. Such bugs aren't necessarily exposed on ALL running instances of the code, but for some users, it might just be triggered.