Tux2: The Filesystem That Would Be King
Phillips, an expatriate Canadian now employed by Berlin-based Innominate AG, claims 25 years of computer programming experience. He's had stints in everything from database design and game programming to embedded controller system development, and in a dual life which may sound familiar to many computer programmers, Phillips worked through music school by hacking Fortran code. With that background, perhaps it's unsurprising that just a few years after first encountering Linux, and a year after joining the ranks of the kernel hackers (there's a +5 informative thread in Zack Brown's excellent Kernel Traffic), he's come up with what could be a sea change in Linux filesystems.
A Filesystem You Can Live With And Pull The Plug On
The central point of a journaling file system is that in exchange for a small hit in performance, file integrity is assured by an ingenious mechanism: rather than being written directly (and riskily), filesystem changes are instead first recorded sequentially in a running list -- the journal -- the contents of which are then acted upon in turn. If the system should crash for any reason while a change is not yet accomplished, the recovery time upon reboot is greatly abbreviated, as long as this "edit decision list" remains intact. Journaling file systems are on the way from multiple projects, and rather than being theoretical, wouldn't-it-be-nice daydreaming, at least one is available right now: the ReiserFS developed by Hans Reiser is even an option at install on some recent Linux distributions.
Why another, then? Wrong question: Tux2 is not a journaling filesystem. Phillips says that Tux2 offers Linux users the chief advantage of a journaling filesystem (namely, keeping files safe in the event of a system crash) but without a journal, and does so more efficiently.
"The big deal is when you compare it to journaling, which is a popular solution, and you see that it's just plain writing less blocks. That's a big savings. It's also not constantly going back to wherever the journal is on the disk to write to the journal, so there's a lot less seeking involved. So those two things together means that it should significantly outperform journaling." Perhaps more importantly, Tux2 is not actually a wholly new filesystem per se; it shares so much in common with ext2 that it is built as a patch to ext2, with the filesystem converted at runtime. How does Tux2 get around keeping a journal to do the things that a journaling filesystem does? Atomic updates are the key. (See also: soft updates) Instead of a journal, Tux2 uses what Phillips terms a "Phase Tree algorithm."
"I originally called it Tree Phase," he says, "and then Alan Cox mentioned it on the Linux kernel list. He called it Phase Tree on the Linux kernel list, and I decided I liked that better." The Phase Tree algorithm is simple at heart, but takes a little while to grasp -- at least it did for me. Happily, Phillips has written a lucid tutorial on his own site. Probably the best explanation is the one found on Phillips' project site: the excerpts which I found most illuminating are these:
All accesses to filesystem data are performed by descending through a filesystem tree starting at its metaroot. Normally, three filesystem trees exist simultaneously, each with its own metaroot. One is recorded on disk with a complete, consistent tree descending from it. A consistent second tree, the 'recording' tree, in the process of being recorded to disk, descends from a metaroot in memory, and some of its blocks are in dirty buffers. A third tree, the 'branching' tree, is in the process of being accessed and updated by filesystem operations, also with its metaroot in memory. The branching tree is not required to be internally consistent at all times. In particular, some blocks that are free in the branching tree may not be marked as free in its block allocation maps but held on a 'deferred free' ('defree') list instead.
At some point the recording tree will be fully recorded on disk and its metaroot can be written to disk so that it replaces the metaroot of the recorded tree. This causes the filesystem to move atomically between states, as desired. At this point, the recording tree becomes the recorded tree, and the branching tree's metaroot is copied to become the new recording tree. This event is called a 'phase transition' and the interval between two such events is called a 'phase'.
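In rough C, the rotation at a phase transition might look like the sketch below. Every name here is invented for illustration -- this is not Tux2's actual code, just the excerpt's three-tree scheme restated:

    /* Hypothetical sketch of a Phase Tree phase transition. */
    struct metaroot {
        unsigned long phase;      /* monotonically increasing phase number */
        struct block *root;       /* top of the filesystem tree */
        struct block *alloc_map;  /* block allocation maps */
    };

    struct metaroot recorded;     /* on disk, complete and consistent */
    struct metaroot recording;    /* in memory, being flushed to disk */
    struct metaroot branching;    /* in memory, absorbing new updates */

    void write_metaroot(struct metaroot *m);  /* one atomic block write */

    /* Called once every dirty block of the recording tree is on disk. */
    void phase_transition(void)
    {
        write_metaroot(&recording);            /* atomically supersedes 'recorded' */
        recorded = recording;                  /* recording tree becomes recorded */
        recording = branching;                 /* branching tree starts recording */
        branching = recording;                 /* new branching tree: metaroot copy */
        branching.phase = recording.phase + 1; /* next phase */
    }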
"The problem is, it's not nice to block filesystem transactions. If you're using a KDE desktop or similar, you find your desktop moving in a very jerky way while the blocks are getting written -- no good. That's why we make another tree by copying the metaroot -- that's how we always start, we never start one by going up the tree -- meanwhile this second tree is undisturbed by that and can be written to the disk in peace."
This additional copy allows the user to keep working without noticing a system slowdown while the intermediate branch is copied. Thus, there are always three "trees," and in the event of a system crash, recreating the system's correct state is as easy as identifying the latest successfully written tree. "Each new tree is always incremented higher, so this is easy," Phillips says.
"There are a couple of other places where [Phase Tree] is obviously better than journaling. For instance, removable media -- your removable media is usually slowest, and you don't put a journal on it, because if you did, it would be really, really slow. So you put the journal on your hard disk, and the data on the removable media. As soon as you pull your removable media out, you have instant corruption, because you've removed yourself from your backup. Phase Tree doesn't do that -- you can just pull out your removable media and you have something current up to the last tenth of a second, quarter of a second."
Sleepless nights and database integrity
Phillips' work with Phase Trees began a decade ago, when he implemented a system with similar functionality for a specialized database called Nirvana which he had developed on his own. "I would have implemented this on a Unix filesystem at the time as well, except I didn't have one available." Was there a Eureka moment in 1989? "Oh yeah. I dimly recall having a week of sleepless nights, tossing and turning, trying to figure out if it was even possible to do something about this, and eventually convinced myself that it was. And as I recall, it was quite tricky to get it to a hundred percent state, not 99.99. I could smell the idea in there, but I couldn't find its actual realization for some time. After that, the generalization of its application to a general file system is pretty obvious."
Still, the idea stayed with him until he realized it would be an interesting way to improve the performance of Linux systems.
Like the puzzle with square pieces sliding around a single missing square, the extra data movement consumes only scant disk resources because the information is moved incrementally -- in blocks rather than all at once. That means, says Phillips, that "It really adds very little [disk] overhead. Something on the order of 1 percent."
Additionally, it has one more feature which may appeal to the fsck-hater in you: "Really, it's nearly a defragmenter already," Phillips says. "It would be trivial to add that functionality."
The dual advantages of lower overhead and -- most importantly -- a close relation to the ext2 file system should make it an easier transition for most users. Tux2 is actually built as a patch to the ext2 filesystem; standard ext2 filesystems are converted to Tux2 at mount time. According to Phillips, that conversion takes on the order of a tenth of a second per gigabyte on a typical system.
Fly In The Ointment
Though Phillips downplays their significance, patent difficulties may lie ahead for Tux2 as well. Network Appliance applied for a patent in the early 90s which covers similar ground -- a few years after Phillips had implemented it in his database. "What really steams me in this is that their [patent] application came three years after my invention," says Phillips. "I hate to use the word infringe, because that makes me sound like the bad guy -- but it seems as though my [method] doesn't infringe because it uses a different algorithm. In fact," he says, "I've got two things: I've got prior art, and I've got a better algorithm ... We can fence them in [legally], so their best strategy is to be nice, but they haven't figured that out yet."
"I don't want to suggest that NetApp got the idea from me -- I don't think they did, I think they developed it independently. The only little problem is the chronology of it. I conceived the whole thing, essentially everything that they've written in their patent, so I was kind of upset when I saw it. I would have gone on to do it on a Unix file system at the time, if I'd only had one available. We know it's stupid, but you see people patenting things all the time on the web -- just because it is a business idea that is now being done on the web." Phillips' approach to the dispute is to simply keep working. "I don't want it to become a distraction, I just keep doing what I'm doing."
Do penguins have calendars?
Phillips says that Hans Reiser has approached him regarding integrating the file protection capabilities of Tux2 with the additional features of ReiserFS. "But it's pretty obvious where the priority has to be," he says, noting that ext2 is the default file system, and isn't going away any time soon. "Ext2 is what everyone has by default, and that's too big to ignore." Does Phillips anticipate Tux2 becoming the default file system in Linux systems? "Well, who knows what's going to happen?" he laughs. "It could. But you can be sure of one thing, Tux2 will live a fairly long life as an independent patch that people apply, and I will be the first to apply it. But sure, of course I'd like that."
With a caution that fits someone whose last job was in embedded controls, Phillips warns against putting Tux2 in too soon: "It has to be proven, it has to be 100 percent. Because that's the whole point of this: to be 100 percent. So I think any bug which is not an ext2 bug already is just not acceptable."
And ultimately, like any other possible low-level change, "It's up to his high penguiness." Besides which, "it's quite clear what the next Linux filesystem standard is going to be. Well, it's my opinion that ext3 is going to be the most popular standard Linux filesystem next year. And a couple years after that, well, I certainly will be using Tux2 all the time, and we'll see where it goes."
The current status is heavy development: "I want to give it as a Christmas present to myself and start using it in my root system for my own development," says Phillips, "as soon as I port it to [the 2.4 kernel]." Soon after that, the code will be released to the developers on the Tux2 mailing list which Phillips has been assembling, who will work to make a public release in the months that follow, a process which Phillips says will likely take six months to a year.
"There is a prototype for kernel 2.2.13. I'm not going to release it -- I have my reasons for that, and the main reason is that the amount of cleanup to make it presentable to the public is roughly the same as the amount of work I have to do to bring it to [a newer kernel]. Probably if I'd done nothing else but worked on it for a couple of months, I'd be using it now, but I've done a few other things [in those months], like change from an industrial control systems job where they wanted me to do the next version of the control system in Windows NT to a nice Linux job where I can hack the kernel."
Does this have anyone else itching for 2.5?
Ordered writes possible with normal HDs? (Score:1)
How, in general, can the ordering of the atomic operations (block writes to the HD) be guaranteed? The caching strategy of most HDs is probably completely invisible to the OS. The only possibilities I see right now would be to disable caching on writes, or to flush the drive's cache between operations that need to remain ordered. Wouldn't this imply a significant performance penalty? Or do IDE/SCSI HDs actually provide a good mechanism for ordering write operations (other than disabling write-back cache)?
I'm aware that journaled filesystems have to cope with the same problem. I'm just wondering.
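(For what it's worth, the closest user-space approximation I know of is an fsync() between the ordered writes, as in the sketch below. But fsync() only flushes OS buffers; whether the drive's own write-back cache preserves the ordering is exactly the open question, so treat this as an assumption, not an answer:)

    #include <sys/types.h>
    #include <unistd.h>

    /* Sketch: try to make write A durable before issuing write B. */
    int ordered_pair(int fd, const void *a, size_t alen, off_t aoff,
                     const void *b, size_t blen, off_t boff)
    {
        if (pwrite(fd, a, alen, aoff) < 0) return -1;
        if (fsync(fd) < 0) return -1;   /* barrier between A and B */
        if (pwrite(fd, b, blen, boff) < 0) return -1;
        return fsync(fd);               /* make B durable too */
    }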
READ THE ARTICLE (Score:1)
"Unfair" to the idiots who modded you up.
Thank you for your time.
Routine FUD (Score:1)
Re:Pulling the plug (Score:1)
Re:Is this filesystem immune to the "rhnsd factor" (Score:1)
A new spiffy filesystem can't change the fact that your hard drive is partitioned. Within each partition lives a filesystem. You can't just simply change partition sizes on the fly. I don't know how Partition Magic works, and half the people I've heard from who've used it said it wiped their hard drive.
--
Re:Nice algo, but.. (Score:1)
So what makes you feel that a journal necessarily has any more current and consistent information than a tree that's 2 phases out at the point in time of a disaster?
-Peter
Re:Pulling the plug (Score:1)
Using a data and metadata log can make some operations faster, but with Sun's ufs+logging you do suffer the problem of having to write to the log, then copy the log into the filesystem. This can be an issue in an environment where the filesystem is used for a lot of transactions.
IMO Tux2 can do better in situations like this because it eliminates that unneeded copy.
Also note that Sun's ufs filesystem has long been horribly, horribly slow relative to its cousin, the BSD FFS. I'm going to guess that when adding the logging feature, Sun's developers decided it could be seen as an opportunity to add other speed enhancements besides just logging.
-Peter
Even more surprising omission! (Score:1)
Re:MIME types... (Score:1)
Oh, and HPFS fragmentation is negligible.
By the way, did anyone hear of HPUFS? I hear the FreeBSD guys are working on it, seems to be an interesting project too.
Re:One evil due to the Linux infrastructure. (Score:1)
Actually, using anything on the same disk is a generally poor idea. When you're low on memory, it's generally because something is putting a lot of stuff there. Where does this stuff come from? Generally, disk. So you're pulling stuff in from disk and writing other stuff out to disk. This creates a major performance bottleneck. Whenever possible, I use a swap disk (or even disk chain, if there's one not in use), which is clearly going to be a fixed size.
You are wrong. You are ignorant of the issues. (Score:2)
First of all: your swap file in win2k does not magically change size. In fact, what actually happens is very similar to the mechanism of making a new swap file. When Windows is running out of memory it allocates a large contiguous block and adds it to the VM. So it DOES seriously fragment the memory, since you have what really amounts to perhaps a dozen different swap files.
Second of all, what the above poster described would not require any down time at all. The data from the "old" swap file does not have to be copied into the "new" one. The new one simply has to be created and added to VM. The kernel can certainly handle more than one being active at a time.
Third of all: this has nothing whatsoever to do with the filesystem, be it FAT, NTFS, or ext2. This is a direct vm->disk interaction.
Thank you for your time.
Re:Version control system -- CVS/Podfuk (Score:2)
After you have a CVS GMC VFS library (Go-Go-Gadget-TLAs!) you can use the excellent podfuk [mff.cuni.cz] to instantly allow you to use the CVS archive as a filesystem!
No, magic numbers are the way. (Score:2)
To associate a program with a filename (why would you want to do this? It's backwards), you can do it at the filemanager level. And I believe that you're wrong in believing that you want to open up files with the same program that made them... I want to make images with the gimp, but I want to view them with xv or xloadimage or ee.
What happens on a Mac when your little four-letter-codes have a collision? What happens if two programs have the same app code?
AFAIK, on the mainstream unix filemanagers you can configure what program opens what kind of file, but there is a default for each file that is supported.
The magic database is ubercool. Learn it. Love it. Use it. MacOS- and Windows-style file-type resolution sucks. As you said, extension-based types suck. But keeping creator info / 4-letter file codes (the mac way) sucks, too.
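(To make the comparison concrete, a magic check is just a few byte comparisons. A toy sketch, nothing like the real magic database's coverage:)

    #include <stdio.h>
    #include <string.h>

    /* Toy file-type sniffing by magic bytes, in the spirit of file(1). */
    static const char *sniff(const unsigned char *b, size_t n)
    {
        if (n >= 3 && !memcmp(b, "\xff\xd8\xff", 3))        return "JPEG";
        if (n >= 8 && !memcmp(b, "\x89PNG\r\n\x1a\n", 8))   return "PNG";
        if (n >= 6 && (!memcmp(b, "GIF87a", 6) ||
                       !memcmp(b, "GIF89a", 6)))            return "GIF";
        if (n >= 2 && !memcmp(b, "#!", 2))                  return "script";
        return "unknown";
    }

    int main(int argc, char **argv)
    {
        unsigned char buf[8];
        FILE *f = argc > 1 ? fopen(argv[1], "rb") : NULL;
        if (!f) return 1;
        size_t n = fread(buf, 1, sizeof buf, f);
        fclose(f);
        puts(sniff(buf, n));
        return 0;
    }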
Re:One evil due to the Linux infrastructure. (Score:2)
Also, dynamically creating and removing swap files (or extending and shrinking them) is going to cause your filesystem to become massively fragmented very quickly, causing many multiples worse performance than the already horrendously nasty and unreasonable performance of having to page out 600MB, then do something, then page it back in.
What kind of glutton for punishment are you that you want to do this to yourself?
Actually, can you give an example of an application that really does require this kind of swap? When you run a large database you try to pin its memory (shm, cache, and table buffers) into active memory so it can't be paged out. Graphics rendering systems would be slowed down many orders of magnitude if this much data had to be swapped per frame.
The only reason I can think of for needing 600 MB of swap in most systems is an application leaking a lot of memory. Please let slashdot know if you've got another reason.
-Peter
Re:What is wrong with Reiserfs? (Score:2)
That being said, the day that 2.5.0 gets released, there will doubtless be a flurry of activity to get ReiserFS in there, as well as to backport it to 2.4.1 or 2.4.2. If there be further political disputes at that time, there will doubtless be considerable flaming. There have been some pretty dramatic flames surrounding ReiserFS already...
As for the focus, or lack thereof, resulting from introducing Tux2 as an additional option, I think this is entirely a healthy thing.
I doubt that all of ext3, XFS, JFS, ReiserFS, and Tux2 will prove "totally successful." On the one hand, if one of them became dominant, that would effectively "shut out" the others. On the other hand, it's not likely that all of them will be considered equal, at the end of the process.
Reality is that a couple of them are likely to become very popular, and the others are likely to eventually languish unmaintained.
At first blush, that sounds wasteful. I don't think it is. I think it a very good thing that a bunch of groups are independently trying out some differing approaches to filesystems. This allows any to individually "succeed" or fail without resulting in Disaster For Linux.
As with Gnome versus KDE versus GnuStep versus Berlin, the different systems can learn both from each others' successes and from each others' mistakes.
As with many projects, there would not necessarily be benefit to trying to conglomerate these all into One Big Project; that certainly can lead to unworkable bureaucracy.
I'd rather see five attempts that try radically different approaches to "reliable fast FSes," and see a couple provide tangibly useful results than for them to try to cooperate more than they successfully can, and risk having NO journalling filesystem at all.
Pulling the plug (Score:2)
Re:Don't forget the cache (Score:2)
Or something. I sure am making this up as I go, without knowing much at all.
Re:can user processes schedule phase transitions? (Score:2)
If you did this, you wouldn't save your
What you want is not to write your
Norton Filesave (Score:2)
A TSR intercepted the "delete-file" and "how much space free" calls.
If the really free space went too low, the program would really delete files. So it was transparent.
Something similar was put into MS-DOS 6.0 and in OS/2. If Unix doesn't have it, I call it a shortcoming. But Unix was never designed for fallible beings (hence, case-sensitive filenames).
__
Just at the command prompt (Score:2)
What's needed is a program intercepting every call to the "delete file" system call. It has been done on DOS.
__
Re:TYPE & CREATOR CODES (Score:2)
Perhaps you've been wandering in the Unix world too much. Can magic-based systems distinguish English plain-text from German plain-text? Somebody could find it useful.
A possible solution without external metadata would be an in-file header like XML and HTML, but I find it cumbersome.
__
file and OS/2 (Score:2)
OS/2 Rexx scripts must start with a Rexx comment.
Like:
/* REXX */
a = 1
[...]
In spite of these "magic bytes", file can't distinguish them from C or C++ header and code files (.h,
And file can call "English text" things that are not text and are not English.
IBM OS/2 does implement (on FAT and HPFS filesystems) Extended Attributes. You have up to 64 kB associated with any file, where you can store attributes (official or your own), for example the URL you downloaded it from, date of creation, date of last read, date of last update,...
One of the official attributes is type, you can label a file as "text", "OS/2 command file", "DOS command file". You can even assign your own type.
Some OS/2 programs (not those ported from Unix) can use them to ignore extensions.
OS/2's Workplace Shell works both with extensions and file types. You can assign
It is not perfect, because many programs ignore extended attributes. But I think it's a good idea.
BeOS did it better because it has no limit to the size of the attributes.
__
Re:A bit OT question from non-hacker (Score:2)
Some file systems that support access control lists (NTFS and the Solaris version of the BSD file system, I think) give directories a second access control list which is the ACL to give to files created in that directory. (I seem to remember that Multics had this - I think it originally had "common ACLs" for directories, which were combined with the ACLs of files in the directories to give the ACL that gives permissions for access to those files, and that those were replaced with an "initial ACL" of that sort.)
Re:Don't forget the cache (Score:2)
Probably a wise decision - I'm not sure I'd trust write-caching disk drives not to lose data on power failures. I suspect many OSes simply tell the drives not to do their own write caching.
You can get crash-proofness from ext3 now (Score:2)
Phase tree filesystems sound like a better way to do this, but you don't have to wait. Get crash-proof today.
Re:There are serious problems with this idea (Score:2)
I would like to see a system where the file permissions, the file name, the date, everything is stored in the data in the file. In your attempt to disagree with me I think you reinforced my position.
I think there is some work being done on this. Permissions are controlled by the parent directories as well as the file, since you are allowed to set the permission and user of your file to anything you want with this.
There are serious problems with this idea (Score:2)
In fact there is absolutely no reason for this information to be stored in any way that the OS sees. The data is only used by user-level programs (for instance a file browser that selects what program to launch).
Another problem is that the id space gets used up quickly and then only commercial software vendors who talk to the official Linux ID assignment consortium can make new IDs. With magic bytes in the file, if there is a collision, you just make a more complex test for distinguishing files that looks at more bytes.
The biggest problem is that there is zero chance that once you add this database feature that there will not be dozens or even thousands of new id/value pairs added to the system, and dozens of standards for encoding these so that files can be copied. I would much rather force everybody to use simple files and thus get all this mess into user space.
Personally I feel that everything about the file, even its name, could be stored in the data somehow, though I'm not sure how. Some of the ReiserFS stuff is looking at this, I think, since the Unix overhead of name/permission/date is larger than most of the files they want to use.
Re:There are serious problems with this idea (Score:2)
In most cases I expect the format to be flexible enough that the data can be hidden in a comment in existing file formats. A good example is the "#!" syntax used by executable scripts in Unix.
Re:Patents? This algorithm was published in 1977 (Score:2)
Recursion could also be reduced by maintaining allocation information near the data in the inode/file tree. Since a transaction only involves cloning blocks and saving the resultant allocation changes, it would be possible to localize an entire transaction at a node fairly far down the tree. The only reason that the metaroot has to be cloned and updated now is that the allocation table is linked there.
i.e.
The current tree structure:
metaroot
    allocation table
        subtable 1
        subtable 2*
    inodes
        inodes 2
        inodes 3
            index block
            index 2
                data2*
Changing data2 above would require changing something in, say, subtable 2 above. Since changing = moving, we recurse up (from the nodes marked with "*" above), generating cloned blocks with new pointers, until we identify a common parent node which can be updated in *one* atomic operation. In the above picture, that one block is the metaroot.
If instead we have something like this:
metaroot
    inode1
        allocation1*
        index1
        index2
            data1*
    inode2
The "*" nodes have a common parent at the inode1 block. We clone data1 into a new block listed in allocation1, and clone parent nodes until we find ourselves at inode1. At this point, we can flush the cloned blocks, and then overwrite inode1 with pointers to the new subtrees in one atomic operation.
The idea in doing this is that free space tracked at each node would be close to the data at the node, so that data locality would be maintained, or at least helped. I'm not sure how well space could be managed in such a framework, however.
Re:Sounds great, but BFS... (Score:2)
This was all rewritten, I believe, before any public (non DR) releases got to the general public...
Sorry I can't remember much off the top of my head, but it's been a while since I wrote an app for a Be release before the FS rewrite.
Here's some more info:
http://www.68k.org/mirror/BeBook_DR7/StorageKit
Re:Sounds great, but BFS... (Score:2)
anyway, this page has a short description of the API side of it at least:
http://www.68k.org/mirror/BeBook_DR7/StorageKit
in any case, the current BeFS still has some DBesque traits, in that anything in the FS can have attributes, and the filesystem supports queries for those attributes. check the new stuff out here:
http://www-classic.be.com/documentation/be_book
WOW (Score:2)
Re:Not Comfortable... (Score:2)
E.g. journalling filesystems such as XFS perform the same constant-time check routine every time they start, to inspect and clean up the journal (and commit transactions, if necessary) in case we crashed last time. The journal may not be complete, and some modest amount of data may have been lost, but the filesystem is not corrupt.
Re:TYPE & CREATOR CODES (Score:2)
Magic Numbers are a good idea, but they are far from perfect. Like it or not, every once in a while, the magic number will be the same for multiple types of file. Or a single type of file will have many different magic numbers (as they start differently).
Another problem with magic numbers is that apps can't own a certain type of file. For example, if you created a file with GIFconverter, you would probably want that app to open it, and not another one (that might be selected as the default for that file type).
That said, Magic Numbers do work fairly well, but they aren't a be-all end-all solution to file-to-app matching.
Re:this is not going to make microsoft happy... (Score:2)
OK, it is technically a journalling filesystem, but when an NTFS partition corrupts so badly that a journal replay won't fix it and you have no option but to reinstall... I would hardly call that commercial quality. Or even a real journalling filesystem.
"Free your mind and your ass will follow"
Group commits (Score:2)
The point is that you can group writes based on transactions and performance, or based purely on performance - Oracle and some journalling filesystems do the former, Linux and others do the latter.
In both cases you end up with a long sequential write to the journal file - certainly Oracle claimed a big speedup in transactions per second when this was introduced.
Re:No, magic numbers are the way. (Score:2)
"the magic database". A central repository of every last file format known to computing. How charmingly quaint.
Re:WOW (Score:2)
MP3.com effectively entrusts its entire business to it. And backups of course. Once you can verify a backup, you should be able to restore at least the data (if not fs-specific metadata) to another filesystem in case it goes tits-up somehow. And the fact that it's open source means that while you personally might not have the skill to fix any such bugs (I sure don't), you can at least try to hire someone outside the original manufacturer who can.
I would trust my data to ReiserFS more than any new version of NTFS or FAT.
Re:Volume size limitations? (Score:2)
I sure hope you meant filesize. Otherwise you could have a 2 gig drive with a real big superblock and nothing else on it.
Re:TYPE & CREATOR CODES (Score:2)
--
Americans are bred for stupidity.
Re:Version control system (Score:2)
I think in general the conclusion was that it's not worth the trouble. It's not much harder for a user to run checkout.
I don't think it's ever been done to the extent you describe (although as other people have mentioned, VMS apparently stored old versions of files) because while reading could be done reasonably quickly, writing is slow (you have to run diff every time), a nuisance to implement, and uses a lot of space. Imagine the added cost of version control on (say) your mail spool. Or if you overwrite a large file.
Wouldn't it be nice.... (Score:2)
Re:Version control system (Score:2)
What you could do is create manual 'check points' or snapshots. By default all disk writes go to an 'undo' log. The 'real' data is elsewhere on disk. When a file is requested the OS first looks for writes to that sector in the undo log and returns that data if present.
At any point you could blow away the undo log and go back to the previous state. You could also 'commit' the undo log, writing it to the 'real' data area of the disk and starting with a fresh undo log.
One could also imagine keeping more than one undo log. But this might get space prohibitive.
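(A toy sketch of the idea, all names invented:)

    #include <string.h>

    /* Writes land in a side log; reads prefer the log; commit folds the
     * log into the real data; rollback just empties it. */
    #define NSECT 1024
    #define SSIZE 512

    static unsigned char disk[NSECT][SSIZE];   /* 'real' data area */
    static unsigned char ulog[NSECT][SSIZE];   /* undo log */
    static int logged[NSECT];                  /* sector has a pending write? */

    void write_sector(int s, const unsigned char *d)
    {
        memcpy(ulog[s], d, SSIZE);             /* never touch the real data */
        logged[s] = 1;
    }

    void read_sector(int s, unsigned char *out)
    {
        memcpy(out, logged[s] ? ulog[s] : disk[s], SSIZE);
    }

    void commit(void)                          /* keep changes since checkpoint */
    {
        int s;
        for (s = 0; s < NSECT; s++)
            if (logged[s]) { memcpy(disk[s], ulog[s], SSIZE); logged[s] = 0; }
    }

    void rollback(void)                        /* discard changes since checkpoint */
    {
        memset(logged, 0, sizeof logged);
    }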
AFAIK, vmware supports something like this with its virtual disks, you can chose to rollback all disk transactions when you shut down the virtual machine.
-josh
Re:Volume size limitations? (Score:2)
I sure hope tux2 is capable of at least 2^31 or 2^32 * blocksize. I don't see why it wouldn't be. OTOH, if it's cleanly enough written, you should be able to redefine a few macros and have a version capable of 2^63 or 2^64 * blocksize, and with larger blocks, too.
Other limitations on size, such as limiting a single file to 2GB, tend to be more a problem with APIs trying to conform to standards (you have to be able to address a location in a file via the API to the byte level, including negative values for relative seeking), and with the variable type called off_t, which probably could not have been equated to the type long long (though the new C99 [ipal.org] now makes that a standard type).
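(To put numbers on that: 2^32 blocks of 4K each would already allow a 16 terabyte volume, and 2^64 blocks is astronomically more than any disk is likely to see.)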
From what I recall, ext, which came before ext2, meant "extended", and probably refers to going upwards from the rather limited minix. The filesystem in use by BSD may have at that time still been a licensing issue.
Re:Version control system (Score:2)
Eric
Re:*BSD SoftUpdates provide crash resistance NOW (Score:2)
Re:That command would be 'purge' I believe... (Score:2)
How about making "gb" (garbage) or something a short script that moves files to a trashcan location, say
It would be fairly trivial to extend this to
$ echo "Jeans" >/home/david/pants
$ gb
pants trashed.
$ more
Jeans
$ echo "Baggy" >/home/david/pants
$ gb
pants trashed.
$ more
Baggy
$ more
Jeans
Have a ugb command to reverse the delete.
Yes, the idea needs work, and sounds space wasting, but HD space is cheap these days.
Re:I am not wrong. You didn't read close enough. (Score:2)
2) If you know the difference between windows & linux virtual memory, why do you think that Tux2 has *anything whatsoever* to do with swap partitions? It's a filesystem... not a partition.
3) One of the first 'recommendations' for tuning Windows servers is to lock the size of the swap file so it *doesn't* resize, as resizing causes fragmentation and hence, slowdown.
4) Adding additional swap files to linux has no worse an effect than enlarging the swap file on windows.
5) I said nothing about *partitions* I said swap *FILE*. Linux can do BOTH.
Downtime? Creating a new swap file takes seconds, and causes *NO* downtime whatsoever; activating it is virtually instant, with a single command.
You are confusing two issues. (Score:2)
1) Windows does not use a swap 'partition', it uses a swap 'file'. Linux can use either. And if you use a swapfile, you cannot necessarily resize it on the fly, but you can make another one and add it in... effectively the exact same thing.
2) How linux deals with swap (be it file or partition) has *nothing* to do with Tux2, Ext2fs, NTFS, or any other filesystem. They are not related in any way whatsoever.
3) if you find your swap partition is too small, simply make a swap file and mount it, on the fly, to add additional swap space.
Re:Sounds great, but BFS... (Score:2)
What did you mean by that? How was it a database? Inquiring minds want to know
Re:Version control system (Score:2)
Re:Not Comfortable... (Score:2)
For an array where you've promised 99.99% uptime (about an hour of downtime a year), you simply can't check it like that. You wait until you can upgrade the array to new hardware that you can start with a fresh filesystem on.
For the less extreme circumstances, it's still nice to be able to plan downtime for this. That way you can schedule it to automatically happen Thanksgiving day instead of when someone trips over the power cord.
And yes, you are correct that having filesystem integrity does not necessarily mean you also have file integrity. You can't do much about that unless you go the VMS route of keeping versions of files around.
off_t (Score:2)
In order to enable fseeko/ftello, you should compile your code with -D_LARGEFILE_SOURCE, which will give you a default (currently 32-bit) off_t. If you add -D_FILE_OFFSET_BITS=64, then off_t will be 64-bit, and fseeko/ftello will be redefined as their 64-bit cousins. These definitions are part of the LFS standard.
glibc6 already has 64bit support, but of course you also need a new kernel (2.4) to get the >2GB support. AFAIK there's no 2.2.x backport.
BTW Mandrake 7.1 has a buggy stdio.h that doesn't support _LARGEFILE_SOURCE (I believe it's been fixed in a more recent version). You can use -D__USE_UNIX98 instead to enable fseeko/ftello support.
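(For illustration, a minimal large-file-aware program under those macros; 'big.dat' and the 3GB offset are just placeholders:)

    /* cc -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 seek.c */
    #include <sys/types.h>
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("big.dat", "rb");
        if (!f) return 1;
        if (fseeko(f, (off_t)3 << 30, SEEK_SET) != 0)   /* seek to 3GB */
            return 1;
        printf("offset: %lld\n", (long long)ftello(f)); /* needs 64-bit off_t */
        fclose(f);
        return 0;
    }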
Re:Version control system (Score:2)
It is a nice feature, and it is only as space intensive as you want it to be. By default, there is no limit to the number of backups, but using set file/version_limit=2 foo.bar you can limit it to (automatically) 2 versions on the disk. The count is always incremented...so you can have foo.bar;32 and foo.bar;33 and when a change is made, foo.bar;34 is created and foo.bar;32 is erased.
There have been times where this would be nice on unix. Didn't RMS put VMS-style versioning on the list of reasons why a new OS was needed when the HURD first appeared?
Re:TYPE & CREATOR CODES (Score:2)
> I don't want to open it up with whatever application
> created it to start with?
it's easy to change the file extension by renaming the file. likewise, it's easy to change the file TYPE & CREATOR with a command-line or gui 'Get File Info' type of command.
> Try renaming, for instance, a JPEG file to have a
> extension -- and xv handles it fine.
>
> Why?
>
> Because the first few bytes in the file conform to what is
> expected of a JPEG. Open one up -- and there's a header
> inside. It really DOES NOT CARE about the extension.
>
> And, this is much saner than altering the filesystem...
this just formalises the process and makes it more reliable than depending on the not always reliable scanning of the first few bytes of a file. it's faster to read the type & creator off the directory than to scan the first few bytes of the file itself - you don't have to open it for a read then.
this technique has been used successfully for over ten years in the mac's HFS and HFS+ filing systems - so its reliability (of this one technique - not the whole OS itself!!) has been proven to be effective in eliminating the need for a registry.
regards,
john.
Patents? This guy works in Berlin. (Score:2)
That command would be 'purge' I believe... (Score:2)
I got my introduction to the Internet -- email, Usenet, FTP, you name it -- on a VAX running VMS. I dearly missed the versioning filesystem when I moved to Unix, especially when I discovered that when you type 'rm' you had better by damn mean it.
Then I discovered that Unix actively encourages exploration and experimentation, whereas VMS seems to place many obstacles in the way. I never looked back. :-)
Re:No, magic numbers are the way. (Score:2)
Apple maintains a database of all creator codes. Each creator code is a 32 bit number. You can search their database to see if the one you want is taken. If it's not, you simply request it and it is given to you.
Re:Version control system (Score:2)
This is what NetApp has, I think.
Here at CCS, I believe our NFS needs are served by a NetApp server. Whatever it is serving them, it does automatic snapshots every hour, so that at any point you can access the
So not quite a version system, but mighty cool.
And really useful. It lets you dive in with exploratory programming (as long as you wait for the top-of-the-hour so the stable code is snapshotted), cause restoring your files is as easy as cp.
I agree it would be cooler if you could request a snapshot of this tree now please, but systems informs me that this is not possible/too much hassle.
Re:*BSD SoftUpdates provide crash resistance NOW (Score:2)
Furthermore, should others report a bug before I suffer data loss, I can revert to plain old boring ext2 by just editing my fstab. Now that is a feature you don't get with other journaling fs (ext3? I'm not sure).
Re:WOW (Score:2)
How are the version updates to the fs code? Do you just recompile the kernel, or do you have to buy a new disk and copy the data from one format to the other?
I'm constantly stepping on the powerstrip and flicking the main cutoff switch (don't ask -- small apartment) and killing my fileserver. Fsck of 25 gigs is no fun. It's getting to the point where I'm about to repartition and use apmd just to avoid rechecking all of it.
yet to have ext2 lose me any info, tho.
Re:Version control system (Score:2)
I had completely forgotten about that.
Reminds me of the standard "blue sky" storage solutions where there is no distinction between cache/volatile/persistent storage; each level is merely a faster cache for the next level down.
Has anyone made this work, on a performance basis, or is it inherently blue sky?
BTW, go ahead and mod up this parent (post #57).
Re:Version control system (Score:2)
Re:*BSD SoftUpdates provide crash resistance NOW (Score:2)
this is not going to make microsoft happy... (Score:2)
From the Microsoft website: [microsoft.com]
"Linux lacks a commercial quality Journaling File System. This means that in the event of a system failure (such as a power outage) data loss or corruption is possible. In any event, the system must check the integrity of the file system during system restart, a process that will likely consume an extended amount of time, especially on large volumes and may require manual intervention to reconstruct the file system."
I wonder what Microsoft will say if Tux2 takes off?
*slower* than ext2? (Score:2)
Re:Version control system (Score:2)
First, however, we are going to do the data integrity.
VMWare works at the block device level for its rollback system--all dirty blocks on a device are stored in memory while VMWare is running, then at shutdown you can either discard them or flush them. While it allows you to bail out of a horked session, it makes no guarantees about data integrity while the block device is actually being written to.
wheel reinvention (Score:2)
At the risk of waking the Linux and BSD zealots and trolls, why do this at all? The view from above has always made ext2 look very much like a middling attempt to reproduce what UFS already did very well, while at the same time creating a lot of installed ELFish Unix boxes that are annoyingly incompatible with it. I should be able to boot Linux, FreeBSD, or even Solaris from the *same* filesystem, just as I used to boot different Suns from the same disk.
Why not just finish porting a reliable UFS implementation, incorporate the (now firmly BSD-licensed) softupdates code into it, and make that the default filesystem for new Linux installations? You can still have XFS or some other journalling system where it makes sense, but let ext2 die in peace. The only dubious benefit it ever offered over UFS was the dangerous performance increase from willy-nilly asynchronous writes, and now that's not an issue (both softupdates and phase tree can be nearly as fast while also being safe).
You might be able to teach an old dog some new tricks, but it's just sick to try and do the same with a dead one.
Re:Wrong: Desktop Database (Score:2)
The Mac's four-letter types and creators work very well, but they're unnecessarily cryptic by current standards of disk space, memory, and CPU register width. The BeFS uses MIME types for filetypes, but doesn't record creators in the filesystem AFAIK (it uses an external database like the Windows registry - correct me if I'm wrong!). What I'd really like to see in some upcoming filesystem is a more flexible scheme that can store an arbitrary number of tags, so you could flexibly encapsulate whatever metadata came with the file and is relevant to the OS - including all of the filesystems discussed above (none of which can completely describe a file from the others).
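(Purely speculatively, the on-disk records for such a scheme could be as simple as a counted list of name/value pairs per file -- this is not any real filesystem's layout:)

    /* Speculative extensible per-file metadata record. */
    struct meta_tag {
        char           name[16];   /* e.g. "mime-type", "mac-creator" */
        unsigned short len;        /* length of the value bytes that follow */
        unsigned char  value[1];   /* 'len' bytes, variable length in practice */
    };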
Re:There are serious problems with this idea (Score:2)
Actually, permissions are necessary for insecurity - if you didn't have them, people would just be limited to working with their own files and never be aware of any others. I've used systems that worked this way (some VMS installations, the ancient THEOS). Permissions, as the name implies, allow you to relax security below that unsharing level and give other people a peek.
But your main point is entirely correct and well-made. See some earlier spiel [slashdot.org] by me for one suggested solution I've been mulling over, an extensible tagged metadata format within the filesystem (actually HFS+ could be theoretically capable of this, but no one has ever used the third file fork AFAIK).
Re:Version control system (Score:2)
Uh, don't you have to do the data integrity first? If you didn't have a guarantee that the only things that could be wrong were unallocated blocks and inodes, working from a metadata snapshot could be painful (or just a waste of time).
Re:Is this filesystem immune to the "rhnsd factor" (Score:2)
Labelled a troll for pointing out accurate areas in which RH and Mandrake distributions look like they were rolled by rookies.
Re:Use a LRU replacement scheme... (Score:2)
Re:Version control system (Score:2)
1) You select which files are version-controlled. Most of the files on a fs shouldn't be.
2) Your history is compatible with other version control systems. (and can be remote,
Re:Version control system (Score:2)
IMHO, it's a very bad idea. I don't know about speed, the problem is about capacity loss. Imagine, it basically means that you cannot delete files from the drive. Simple operation: download source code, untar, compile and install, delete source code. That (now useless) source code will live on your drive forever. Also, when your disk is full, it's full, and it'll stay full until you buy a larger one. Anyway, you get the idea.
You could have a RC filesystem that has a "real delete" option, but then why not just use CVS, as for most of the files, you don't want revision control.
(OT)Directory permissions (Score:2)
So that if I put a file in my www_docs, it'll be 644, if I put it in a directory where several people help editing web pages, then it gets 664, my personal stuff is 600, and so on.
It's possible on any filesystem that supports POSIX permissions (not FAT32). All you have to do is write a shell script to do chmod -R on the directories in question.
e2fsck good (Score:2)
Mike
"I would kill everyone in this room for a drop of sweet beer."
Re:Routine FUD (Score:2)
Re:Pulling the plug (Score:2)
Linux already does this. It's called the "elevator" mechanism. It does exactly what you say, and it has nothing to do with journalling.
A bit OT question from non-hacker (Score:2)
LOL (Score:2)
"Linux lacks a commercial quality Journaling File System."
The nick is a joke! Really!
Nice sentiments, but... (Score:2)
One demonstration that we used to do on a regular basis to show the power of our crash recovery in a Progress application was to pull the plug on a Xenix machine, mid-transaction! In hundreds of demos, the worst issue we had was a power supply that started to make "odd" noises.
Now if you backup your system whenever you make changes, and you distribute your file systems over multiple platters, and ensure that the crash recovery processes are in place, you will be fine.
I welcome crash recovery tools, and even file systems that do not shit themselves if you "pull the wrong plug", but simple things like labels that say "Do not pull this plug", and UPS devices, even battery-backed caches on disk controllers, Veritas file systems, RAID 10 mirrors etc. all help, and negate the need to develop this kind of stuff.
FWIW my Linux boxes have never screwed their filesystems; they have many of the above precautions implemented, but even then, there are no issues.
Now if you want to invent a new filesystem, look at change control, look at saving OS files that have changed and easy go-backs, look at mirrors. Oh most of that can be done already.....
./nf
Re:Don't forget the cache (Score:2)
Re:this is not going to make microsoft happy... (Score:3)
That's a pretty funny quote, especially since NTFS is not a journalling filesystem.
What a pack of liars. I don't see how those Marketing guys can look at themselves in the mirror.
"Free your mind and your ass will follow"
Not Comfortable... (Score:3)
The Tux2 filesystem project has the following goals:
[SNIP]
Eliminate the need to perform fsck after an interruption
[SNIP]
If I was saving a file, and my computer decided to take a shit and die on me, I'd want to run an integrity check on the file system whether it's stable or not. If not for anything but my own sanity. I mean, you were in the middle of saving a file. If that was a large file, and the computer died.....well, logically, the saved data should be recoverable. However, experience says that the file would most likely be corrupted.
Stable filesystem or not, I'd still be running a filesystem check. (When Windows 95 died on me, I ran scandisk as soon as it was finished booting - even before OSR2. Just to be SURE everything was cool.)
-- Give him Head? Be a Beacon?
You still need an fsck program. (Score:3)
1) Bad block takes out part of your disk unexpectedly.
2) Your OS screws up and spews a mess onto your filesystem before it crashes. (there ARE bugs in the kernel!)
3) You have a minor headcrash which takes out one of your tracks, but the disk is still functional.
What're you gonna do? Tux2 isn't gonna help you.
You could restore your latest dump. You could also attempt to repair the filesystem.
You need fsck or some other means of filesystem repair.
Re:Not Comfortable... (Score:3)
All the complex mechanisms behind the filesystem ensure that, if the FS thinks a file is there, then it *IS* there, period. If the power was yanked halfway through writing a file, it simply won't be there.
In the case of a Journalling system, this works because, instead of a fsck, you simply look at the journal. If there is stuff there, you know what hasn't been written (and now can't be, cause you crashed) and you can make the appropriate adjustments.
In the case of phase tree, it's even simpler to check: it appears to work something like... the new trees are written backwards, root last... so if the root is there, the write is complete. If it's not, you don't see it anyway!
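(And recovery would then amount to something like the sketch below: read the candidate metaroots and keep the newest one that checks out. Names invented, and the two-slot layout is my assumption, not the article's:)

    struct metaroot { unsigned long phase; /* ... */ };

    int valid(const struct metaroot *m);   /* checksum or magic test */

    /* Keep the newest fully written metaroot; at most one can be torn. */
    struct metaroot *pick_root(struct metaroot *a, struct metaroot *b)
    {
        if (valid(a) && valid(b))
            return a->phase > b->phase ? a : b;  /* newer phase wins */
        return valid(a) ? a : b;
    }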
TYPE & CREATOR CODES (Score:3)
TYPE & CREATOR CODES
i really hope they use this excellent opportunity to be able to get rid of REGISTRY TYPE TRACKING once and for all.
basically all those little three letter extensions that are used to keep track of the file type like .txt
if you simply make one extra entry in the file directory system (in addition to filename, date, block pointers) itself: TYPE & CREATOR -- then you will never again need to keep track of file types externally by a sort of 'Registry' file.
so, if you have a text file, you don't need to put .txt on the end of it; or, for a photoshop image, you would have the type and creator of the file set to: 'TIFF' and '8BIM', which would mean that it's a TIFF file, and it should be opened by photoshop if in a GUI you go and double-click it.
this approach makes it much more difficult for any accidental SEPARATING of the file type info from the info that determines which app should open it - and thus makes the user-experience and OS less prone to error and frustration.
it would be simple to add - if only someone bothered to put it in now - while the system is being determined.
please consider this.
regards,
johnrpenner@earthlink-NOSPAM-.net
Re:Don't forget the cache (Score:3)
This is terrible! (Score:3)
Sounds great, but BFS... (Score:4)
Still, this beats the pants off of FAT
Re:Version control system (Score:4)
ex: README.TXT;4 would be version 4 of README.TXT
There's a command you can type to purge all but the 'x' most recent versions, but I don't remember what it is, as I'm actively trying to forget I ever even used VMS. Anyway, you could really eat up some disk space if you didn't run this command every so often.
I always found the versioning to be a pain in the ass to deal with, but I guess it did come in handy occasionally. I think the negatives outweigh the benefits though.
--
Re:Patents? This algorithm was published in 1977 (Score:4)
Put your hand in the puppethead
Correct Link to "Tux2" (Score:4)
Appears to be this:
http://innominate.org/~phillips/tux2/ [innominate.org]
Two els
Another case... (Score:4)
Of course there should always be system integrity checks available to the user for the paranoid among us (scandisk, fsck, etc)...
But one would imagine a properly designed computer system has the capability of *never* having corrupted data! The machine would be pointing out to the user that FileA.ext was lost due to problems, and that the user needs to check on the integrity of the data, or that the data seems to be okay, does the user want to double check, or that nothing seems to be wrong.
It's like... driving your car to the grocery, and then checking the oil, air, gas, transmission fluid, and brake fluid. The analogy is broken because the car didn't die, a la Windows, but the machine should be smart enough to tell you when something is wrong. I think.
The nick is a joke! Really!
Patents? This algorithm was published in 1977 (Score:5)
Tux2's reliability algorithm essentially goes as follows:
1. At the beginning of a transaction, the "metablock" (including the block allocation table) at the root of the filesystem tree is copied into a buffer.
2. Whenever a block in a file is updated, the updated image of the block is written to a newly allocated block, and the "new" metablock is updated with the new allocation. Blocks pointing to the old block may also be updated, in recursive fashion, eventually copying and updating an entire subtree from the original. The blocks in the "old" subtree are marked as free in the new metablock. The newly allocated blocks can live in memory, but must be written to disk before commit.
3. At commit time, the new subtree replaces the old one. This operation simply involves overwriting the original metablock with a new one, which contains pointers to the new subtree as well as to the other subtrees which have not changed. If this operation does not complete, the complete picture of the old metablock, the old subtree linked to it, and free blocks where the new subtree was written, is maintained. If the operation does complete, the new image of the filesystem with the new, updated subtree, and free blocks where the old subtree used to be, is obtained.
This is a good algorithm, and it's the only way to achieve atomicity and reliability without any logging, but it does have a few tradeoffs. Each update necessitates allocating a new block, so, for instance, changing one byte in the middle of a 2G, contiguous file will require allocating a block at least 1G away (and putting a hole where the old block was). There is also a ripple effect as pointers are updated up the tree, so changing one byte of data may well mean cloning a block, then cloning the blocks that point to the block, and so on up to the root.
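(Step 2 sketched in C, names invented and helpers left as prototypes -- an illustration of the copy-on-write ripple, not anyone's real code:)

    struct alloc_map;
    struct node {
        struct node *parent;
        struct node *child[16];
        /* ... block contents ... */
    };

    struct node *alloc_block(struct alloc_map *m);        /* new allocation */
    void defer_free(struct alloc_map *m, struct node *n); /* freed only after commit */
    int child_index(struct node *p, struct node *c);

    /* Clone 'n' with child slot 'i' redirected to 'newchild', then ripple
     * the change upward; returns the new metablock image at the root. */
    struct node *cow_update(struct node *n, int i, struct node *newchild,
                            struct alloc_map *map)
    {
        struct node *clone = alloc_block(map);
        *clone = *n;
        clone->child[i] = newchild;
        defer_free(map, n);          /* old block stays intact until commit */
        if (n->parent == NULL)
            return clone;            /* new metablock, ready for step 3 */
        return cow_update(n->parent, child_index(n->parent, n), clone, map);
    }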
Re:can user processes schedule phase transitions? (Score:5)
While that would be nice, using it to create a system snapshot for backup would be even nicer. You could tell all of your applications to write everything they need to write and freeze temporarily. Then you start a backup application as a transaction to your filesystem and it gets the frozen snapshot while your apps are unfrozen and work merrily away. The unfrozen applications see all their updates, and the backup sees the frozen filesystem.
can user processes schedule phase transitions? (Score:5)
# tux2_transaction && make && make test && make install && tux2_commit
and then if there was a power failure in the middle of the build, I wouldn't have a build directory half-full of compiled files.
On the other hand, I'm not sure how useful this would be; it would be easy (I assume) to defer phase transitions for an entire file system until a moment convenient for the superuser, but it could degrade performance for all other users on the system, and to get around that problem, you'd have to do all the grunt work of implementing a multi-user relational database within your file system.
--
*BSD SoftUpdates provide crash resistance NOW (Score:5)
(78 sec total vs 125 sec Linux 2.2.14) to make sure that data was going to disk.
In all four cases I ran, the fsck upon repowering was fast, minor and automatic, mostly freeing unattached blocks whose metadata presumably wasn't fully written at poweroff. More surprising, in three of the four trials, `make -j 4` _resumed_ the compile and as best as I could tell completed the interrupted kernel compiles without error. (Same ksize. md5 doesn't work because of timestamp) About 30-45 seconds worth of data was lost in dirty buffers at poweroff. In the fourth case, I got compile errors, but only had to `make clean`.
I am seriously impressed. I've had poweroffs during Linux kernel compiles and had manual fsck work to do. There's some info at Kirk's site http://www.mckusick.com/softdep/index.html and there's a very interesting paper whose URL I don't have handy.
Version control system (Score:5)
Do you follow me? Is this a good idea? Has it been done? Too slow? Etc..
a.