Ask Slashdot: How Reliable are Enormous Filesystems in Linux?
Josh Beck submitted this interesting question: "Hello. We're currently using a Mylex PG card and a pile of disks to run a 120 GB RAID5 under Linux. After some minor firmware issues with the Mylex (which their tech support acknowledged and fixed right away!), we've got a very stable filesystem with a good amount of storage. My question, though, is how far will Linux and e2fs go before something breaks? Is anyone currently using e2fs and Linux to run a 500+ GB filesystem?"
Josh continues...
"I have plenty of faith
in Linux (over half our servers are Linux, most of the rest are
FreeBSD), but am concerned that few people have likely
attempted to use such a large FS under Linux...the fact
that our 120 GB FS takes something like 3 minutes to
mount is a bit curious as well, but hey, how often
do you reboot a Linux box?"
Files over 4 GBs (Score:1)
When will we (or do we already) have support for files larger than 4 gigabytes?
When do we get a fully journaled file system? ... (Score:1)
The only way journalling can work reasonably well is if you have battery-backed RAM to hold the journal.
3.5 MB/s is really slow. (Score:1)
At any rate, 3 MB/s seems too slow, by a factor of 3 or so.
- A.P.
--
"One World, One Web, One Program" - Microsoft Promotional Ad
My God.. (Score:1)
- A.P. (Yes, I've had several bad experiences with Exabytes.)
--
"One World, One Web, One Program" - Microsoft Promotional Ad
SW Striped 100GB + FS at VA.... (Score:1)
There are a number of well-known websites which utilize Linux, including Deja News [dejanews.com]. Not sure what kind of partition sizes they're using, but it would be fun to know.
FWIW, you can modify the reserved % parameter using tune2fs rather than mke2fs and save scads of time. You can also force an fsck (man fsck) to time the operation if you want.
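For example, something along these lines should work (the device name is just a placeholder; check man tune2fs for your version's flags):

    # See the current reserved-block percentage
    tune2fs -l /dev/sda1 | grep -i 'reserved block'
    # Change the reserved percentage without reformatting
    tune2fs -m 5 /dev/sda1
    # Force a full check of the (unmounted) filesystem and time it
    time fsck -f /dev/sda1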
Logging Filesystem (Score:1)
There is a log-structured filesystem for linux called "dtfs" available at their home page. [tuwien.ac.at] The author tells me he will be shooting for inclusion in 2.3.0 and that the bulk of it is working just fine.
LHS sells big'uns too... (Score:1)
I prefer the ICP-Vortex GDT line of RAID controllers -- there's even a fibre channel model that works fine with Linux. Leonard Z is a great guy, but I like supporting vendors who support Linux -- ICP-Vortex wrote their Linux driver back in 1.3 days, supports it, and even all of their utilities run native under Linux (none of that bull about booting to DOS to configure the RAID array).
Interesting thing about ext2: mounting a 120 GB partition takes about 3 minutes if you mount it read/write, but it's almost instant if you mount it read-only. Apparently it has to pre-load all that metadata only if you intend to write to the filesystem.
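You can time the difference yourself; something like this (the DAC960-style device name and mount point are examples):

    time mount -o ro /dev/rd/c0d0p1 /bigfs    # nearly instant
    umount /bigfs
    time mount /dev/rd/c0d0p1 /bigfs          # minutes on a 120 GB ext2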
e2fsck'ing that beast took over ten minutes (I don't know how much over 'cause I gave up). Formatting it in the first place took about five to eight minutes, so I aborted my e2fsck and reformatted the partition (this is while I was doing system setup and configuration, so there wasn't any data on it).
We can go up to half a terabyte without going to an external cabinet, using a solid heavy-duty steel California PC Products case and CRU LVD hot-swap backplanes rather than that effete gee-whiz stuff that's flimsy and breaks easily. This is in a dual Xeon 450 configuration. LHS also has a quad Xeon setup with the horsepower to break the terabyte mark (dual PCI buses, etc.); it's pretty much the same thing VA Research sells (after all, there aren't many providers of quad Xeon motherboards for system integrators: Intel and AMI). With commonly available 18 GB drives this would require an external RAID cabinet or two. 36 GB drives should be available shortly, and those will solve some of the space and heat problems (you'd better have a big server room for a terabyte of storage using 18 GB drives!).
Blatant commercialism. Yetch.
-- E
Large filesystems mount almost instantly... (Score:1)
-- Eric
Nothing much (Score:1)
If the glitch occurs prior to journal recording then there is nothing to fix.
If the power outage or problem occurs after the journal entry has been made but prior to the commencement of writing then the changes can be rolled forward or back - posted or rejected.
If the problem occurs after journaling but while writing is in progress then the changes can be rolled back and then possibly reposted.
If the problem occurs after journaling and after writing but prior to reconciling the journal then the changes can be rolled back or the journal updated to match the filesystem.
Journaling is good for systems that require very very high reliability - such as banking systems. There is obvious overhead involved in journaling.
An optional journaling filesystem for Linux would be a nice addition - hey, NTFS for Linux isn't far from being read/write, is it?
VA Sells RAID machines up to 1.1TB (Score:1)
We are currently selling 1.1 TB (that's terabyte) machines using the Dac960 and external drive enclosures. You can check out our systems at the following URL:
http://www.varesearch.com/products/vs4100.html
They are quite reliable, mostly due to the fact that the author of the Dac960 Linux drivers, Leonard Zubkoff, works for us.
VA Sells RAID machines up to 1.1TB (Score:1)
Yes.
EXT2 and Raid (Score:1)
The company I work for has been selling Linux-based servers running Linux software RAID. It works great in the 2.1 kernel; the largest RAID we built in software was 150 gig. We did this to back up a client's old RAID system. One of our clients has a server running a 100 gig RAID (Linux software) that is moving huge databases on it daily without fail and has been up over 6 months, running 24/7 crunching data.
Some issues... (Score:1)
You want to turn on "sparse superblocks" support when doing a mke2fs; it reduces the number of duplicate superblocks. The catch is that it is really only supported in the late 2.1.x/2.2.x kernels. 2.0.x will bawk like a dying chicken.
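For reference, something like this should do it (the device name is an example; how the option is spelled depends on your e2fsprogs version):

    # Older e2fsprogs:
    mke2fs -s 1 /dev/sda1
    # Newer e2fsprogs:
    mke2fs -O sparse_super /dev/sda1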
Backup and Restore? (Score:1)
See IBM's web page, which shows how to download an unsupported ADSM client for Linux:
http://www.storage.ibm.com/software/adsm/adserc
reducing reserve space will reduce performance (Score:1)
Reducing the amount of reserved space may save you some space, but it can cost you a *lot* of time. Having 5% reserved space means there will almost always be a free block within about 20 blocks of the end of the file.
Unless you want your expensive RAID system to spend lots of time seeking, you should keep the 5% min-free value, or even increase it to 10% like the BSD folks use. You certainly don't want to run the filesystem close to full most of the time.
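If you do want to raise it, tune2fs can do that on an existing filesystem (device name is an example):

    tune2fs -m 10 /dev/sda1    # reserve 10%, like the BSD default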
IDE Filesystems' (Un)Reliability (Score:1)
2. Do lots of checksums using sum or md5sum.
Q: Do you get the same checksum each time?
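A minimal sketch of that test, assuming a scratch file created with dd (the size and run count are arbitrary; use a file bigger than your RAM so it isn't served entirely from cache):

    # Create a throwaway 100 MB test file
    dd if=/dev/urandom of=bigfile bs=1024k count=100
    # Read and checksum it repeatedly; every line should be identical
    for i in $(seq 30); do md5sum bigfile; done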
I once had a problem like this where I would get a corrupt byte from the disk about once every 50-100 MB of data read. It happened on two different disks; I tried three different IDE controllers, I swapped RAM, I ran RAM tests, I made sure my bus speeds were within spec. One byte every 50-100 MB might not sound like much, but it was enough to crash my system once or twice a week.
It turned out that I needed to set my RAM speed in the BIOS to the slowest setting, and then everything worked. The RAM test programs didn't detect anything, I think because the errors only occurred when I was doing heavy disk access at the same time as very CPU-intensive things.
Moral of the story: PC hardware is crap, and Unix tends to push the hardware much further than other systems.
Set your BIOS settings down to the very slowest settings and see if the problem goes away. Try swapping components, and try a *good* memory tester (memtest86, for instance).
Good luck
ext3? (Score:1)
From what I understand, ext3 would be better suited to giant partitions.
Bad memory? Overclocked system? (Score:1)
Some issues... (Score:1)
If you think 3 minutes is long to mount a 120 GB filesystem, you should have seen a Netware server with 13 GB; that took at least 5-15 minutes to mount the filesystem...
Netware (Score:1)
A well-configured Netware server with updates is very stable. Well-written third-party NLMs shouldn't crash the server, either. (I'm an NLM developer as well as a Unix programmer. One of my NLMs hasn't crashed, and the server's uptime is about 2 months; it's used fairly often, mainly converting print jobs from Epson FX-80 format to other printer formats. NLMs are a pain to develop; the best way is to develop and test most of the NLM under another OS, like Linux, and port only the last few lines under Netware, with a set of routines for easy porting between Unix, Netware, and Win32 using the same code.)
What about reiserfs? (Score:1)
There is work in progress on a tree-based filesystem for Linux, currently in its second beta. The paper and source files are located at http://idiom.com/~beverly/reiserfs.html [idiom.com]. It is supposedly faster than ext2, and might be better suited for gigantic partitions, although I cannot attest to that, as I have no experience with it. Does anyone here know anything about this?
No prob here (Score:1)
Journaling FS (Score:1)
Doing writes in this way makes writes go MUCH faster. I read a review by one journalist (no pun intended) who didn't believe Sun's claims that it made long sequential writes go 3x faster or more... It did. Unfortunately, Sun hasn't (yet) put full journaled FS support into standard Solaris, though there is an option to turn on "UFS logging" - it can even be done on the fly. Still, deleting files and creating lots of small ones goes about 5-10x faster with logging on.
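For the curious, on Solaris 7 the on-the-fly version looks something like this (the filesystem is an example):

    mount -F ufs -o remount,logging /var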
Backup and Restore? (Score:1)
The moral of this story: DLTs are a perfectly feasible backup medium. You can get 17GB on one tape.
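A bare-bones tape backup along those lines (the device name is an example; SCSI tape drives usually show up as /dev/st0):

    tar cf /dev/st0 /bigfs         # write the archive to tape
    mt -f /dev/st0 rewind          # rewind
    tar tf /dev/st0 > /dev/null    # verify the archive is readable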
Definition of Journaled Filesystem (Score:2)
A journalled file system writes all of the proposed changes to control structures (superblock, directories, inodes) into a journalling area before making those writes to the actual filesystem, then removes them from the journal after they have been committed to disk. Thus if the system goes down, you can get the disk into a sane state by replaying/executing the intention journal instead of checking every structure; thus an fsck can take seconds instead of minutes (or hours).
For example, if you're going to unlink the last link to a file (aka delete the file), that involves an update to the directory, inode, and free list. If you're on a non-journalled system and update the directory only, you have a file with no link (see /lost+found); if you update the directory and inode only, you have blocks missing from your free list. Both of these require scanning the whole disk in order to fix; but a journalled system would just update the directory, inode, and free list from the journal and then it would be sane.
Problems with journalled filesystems include conflicts with caching systems (e.g., DPT controllers, RAID subsystems with cache) where the intention journal is not committed to physical disk before the writes to the filesystem commence.
fsck time? (Score:1)
-Jake
anyone know about the BeOS FS? Name the book. (Score:1)
Practical File System Design with the Be File System
by Dominic Giampaolo
From this book, you could literally write your own compatible implementation of BFS for Linux. The question is: would BeOS compatibility be worth missing the opportunity to create a new filesystem tuned for what Linux is used for? The nice thing about dbg's book is that he covers the reasoning behind every decision that he made when developing BFS. Clearly, some of these decisions are closely tied to what BeOS is being targeted for (a single-user power desktop for media professionals), rather than what Linux is most often used for (a multi-user Internet server).
-Jake
IDE Filesystems' (Un)Reliability (Score:1)
For giggles I decided to run the test on my 34 GB IDE stripeset (2 17 GB drives). I used a 100 MB file instead of 10 MB since I have 64 MB of RAM.
Everything checked out 100% okay after 40 sums.
I would say change BIOS settings / CPU clock speed to something very conservative and re-run your reliability test. Once everything checks out, you've fixed your problem. Then you can start bringing speeds, etc. back up to find out what breaks what.
I've seen problems like this before, a few years ago: intermittent failures on a news server I ran. Started with Linux, went to FreeBSD, and finally built a brand-new box, and the problems went away. I would say that your problems are *definitely* hardware: overclocked or failing CPU/memory/drives.
Also IMHO, using Windows 95 as a test of hardware is like using an 85-year-old lady to test-drive an Indy car. Of course nothing's going to go wrong at that speed.
-Jerry (jasegler@gerf.org)
Journaling FS != Log Structured FS (Score:1)
A journaling filesystem is any filesystem that keeps a meta data transaction log so that it can be restored to a consistent state quickly by replaying the log instead of checking every file in the filesystem.
A log structured filesystem, on the other hand, is a filesystem that places all disk writes on the disk sequentially in a log structure, which drastically improves file I/O performance when you have total rewrites of a large number of small files. The log is written into garbage collected segments that gradually free up the room taken by old file versions.
goodbye to ext2? huh? (Score:1)
Of course, people with big-ass disks and high uptime requirements need that journalling crap. And they'll have it. So don't be dissin' ext2!
OK ... (Score:1)
How does a filesystem of this type allow for a 3x sequential write speed improvement?
I understand what the journaling part is describing, but don't understand how this would be that much faster. Especially under a really heavily loaded server.
/dev
anyone know about the BeOS FS? (Score:1)
Misc problems (can be tuned around) (Score:1)
If you run this using the default parameters and get an unplanned shutdown (crash, power outage, whatever), you are likely to get minor file corruption. To get correct behaviour, you should mount the filesystem in sync mode, and rely on the underlying RAID setup to handle write caching if you need it (as this removes one failure layer).
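For example (the device and mount point are examples):

    mount -o sync /dev/sda1 /data
    # or in /etc/fstab:
    # /dev/sda1  /data  ext2  defaults,sync  0  2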
You will also want to modify e2fsck to avoid silent data corruption. e2fsck will (or would, the last time I was in a discussion with the author on these issues) handle a block that is shared between two files by duplicating the block.
This silently corrupts at least one file. You will probably want to change it to delete both files, possibly making that an interactive question. (Deleting is the default action on the *BSD fsck, BTW).
Eivind.
That fucking **SUCKS** dude! (Score:1)
Linux and traditional *BSD have chosen different policies here; Linux somewhat gambles with the user's data and security setup as a tradeoff for faster metadata updates and lower code complexity. This tradeoff is probably OK if you're only going to use it on a normal workstation without any critical data on it; I guess it can be OK in some server apps too (though I wouldn't do it). *BSD does things "by the book", guaranteeing metadata integrity (and thus avoiding data leaks, and keeping POSIX semantics for e.g. rename). Note that the traditional BSD tradeoff is NOT the same as Linux 'sync'.
The latest development on the BSD side of this is "soft updates", which is safe without the speed penalty.
Now, back to the original poster:
"Deleting is the default action on the *BSD fsck" Oh yeah, I didn't really want those files. They take up too much space anyway. After fsck destroys my data I will have room for more!
I'll take "maybe corrupt" over "kiss your files goodbye" for sure.
We're not talking "Maybe corrupt". We're talking of at least one file being corrupt, and we're talking of the possibility of private data crossing the protection domain between user IDs, and of wrong data or code migrating into setuid programs.
For some applications, it might be an OK tradeoff to silently corrupt one file to potentially make another file OK. However, it is not OK for any of my applications - I need to know that the security policies for files are held; if I can't know this, I want to restore from backup, rather than keep running with corrupt files.
Eivind.
Netware 5 (Score:1)
It's called NSS. (Score:1)
AIX and Files over 2 GBs (Score:1)
A year ago when I was working on an AIX system and investigating their new support for file sizes over 2Gig (not 4Gig), I remember it was a bloody pain to switch over. Not only did you have to rebuild your file system and recompile ALL of your applications with the new libraries to get them to support the greater file sizes (I don't remember if you had to recompile apps that didn't care about large files), but once re-compiled, you couldn't use the same binaries with older file systems. On top of all that, there was a significant performance hit (10% to 20%) on file I/O.
Again, I don't remember all the details, but in the end, we decided it was far too painful to implement the changes in our application. YMMV.
18 exabytes should be enough for anyone! (Score:1)
--------------------
Endless Loop ; see Loop, Endless
Loop, Endless; see Endless Loop
Journaling OS! (Score:1)
That is, if you turn the power off and turn it on, the entire OS comes back on to a state within a few minutes of where it was. One example that looks interesting is EROS [upenn.edu].
I have not seen this one in operation, but there are theoretical arguments for their speed claims, and (as they say) it is theoretically impossible for *any* OS based on access lists (such as Unix) to achieve the same level of security that a capability based system can. (Note, I said "can", not "does".)
Regards,
Ben Tilly
Netware (Score:1)
I routinely see Netware servers that have uptimes of 400-600 days. The record is 900 days so far (took a Polaroid of that one).
If you want some help with your system, I would be happy to help you with your problem for free. You can contact me at dminderh@baynetworks.com if you'd like.
The new file system in Netware 5 will mount & vrepair 1.1 TB in 15 seconds (that's the largest I have seen; I'm sure it will do more).
And your mount time isn't that bad. Chrysler has a 500 GB volume that takes 22 hours to mount.
Global File System (Score:1)
GFS is a 64-bit filesystem. It supports files up to 2^64 bytes in size (on the alpha).
It is much faster than ext2 for moving around big files.
GFS will support journaling by the fall.
http://gfs.lcse.umn.edu
fsck time? can't be worse than NT! (Score:1)
NT may have its faults, but NTFS is not bad in this respect - Linux does not yet have a widely used journalling filesystem that I'm aware of.
Disk Space is the key (Score:1)
Just on the brink of installing one (Score:1)
The machine runs NFS, sendmail, DNS, NIS, httpd for internal use, and gnats for around 60 users. Here is the plan: two identical machines with 512 MB RAM and 9.0 GB disks with the OS installed. One machine would run as the NFS server, and the other would run all the services: sendmail, DNS, NIS, etc. The NFS server is connected to a disk array with 7 18.0 GB disks and a backup tape autochanger. I want to leave one of the disks as a hot spare. I would like to write scripts such that if one machine fails, the other can take over by just running a script (a rough sketch follows).
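A deliberately naive sketch of such a takeover script (every name and address here is a made-up example, and it assumes the disk array is physically reachable from both machines):

    #!/bin/sh
    # Run on the survivor when its twin dies (all names are examples)
    /sbin/ifconfig eth0:0 192.168.1.10 netmask 255.255.255.0 up  # adopt the failed machine's IP
    mount /dev/sda1 /export                                      # take over the shared array
    /usr/sbin/rpc.nfsd                                           # restart NFS service
    /usr/sbin/rpc.mountd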
It is the RAID part that is not clear to me. The last RAID I checked was Veritas on Solaris, which was a major pain in the neck to manage. I don't know if managing RAID on Linux is any simpler. I am inclined to wait till RAID becomes a standard part of Red Hat. Until then, I would rather depend on the tape backups than on Linux RAID support.
I am curious to hear any experiences from people managing large file systems, 100 GB+.
BTW, I still haven't figured out how to use our Exabyte autochanger effectively with GPLed backup software. Exabyte tech support wasn't very useful.
Ramana
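On the autochanger question above, one possibility (untested here, and the generic SCSI device names are examples): the GPLed mtx utility can drive many SCSI changers.

    mtx -f /dev/sg1 status      # show slots and drive
    mtx -f /dev/sg1 load 1      # load the tape from slot 1
    tar cf /dev/st0 /home       # back up to the loaded tape
    mt -f /dev/st0 offline      # rewind and eject
    mtx -f /dev/sg1 unload 1    # put the tape back in slot 1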
Exabyte autoloaders under Linux. (Score:1)
Drop me a note: johnbar@exabyte.com
Exabyte autoloaders under Linux. (Score:1)
- jmb. / exabyte corp.
When the system fails during a journal write.. (Score:1)
Files over 4 GBs (Score:1)
I kinda have to suggest this (shrug), but why couldn't we get the NTFS driver bulletproofed (r&w)?? Other than the anti-MS reason, NTFS isn't a bad FS (and is proven), and there is already substantial work done with it... It'd be great for saying, "Hey, NT admins, come to Linux!"
But, then again, if people like Tweedie from RH are working on designing ext3, why bother with NTFS??
L8r,
Justin
Andrew File System (Score:1)
AFS Not the answer!! (Score:1)
Tried 200 Gig once. (Score:1)
We originally used this as a Usenet news server. We tried 24 separate volumes to have the maximum number of spindles, but Linux has a limit of 16 SCSI drives in the 2.0 kernels. We ended up creating 12 2-drive stripe sets (no redundancy). We then created 6 partitions: 5 that were 2 gigs in length, and one with the remainder. We used a patch to allow the partitions to be handled as 2 gig files. This was very fast, and had no fsck issues, as there were no file systems. If a few articles were mangled because of a crash, no big loss.
We ended up outsourcing our usenet service, and had this server to reuse. We created 3 volumes of 7 drives each, along with 3 hot spares. (One hot spare in each external drive chassis) Each volume is ~50 Gigs in size. One thing we have found is that if we HAVE to fsck the whole thing, (150 Gigs) you need about 4 hours. The PCI bus just doesn't have the bandwidth to check huge volumes in a reasonable time. We end up checking "/" and mounting it rw. We then mount the rest of the volumes "ro". We can then restart basic services (mail, web) and continue the fsck on the read-only volumes.
It's a balance you have to strike. If you really need that large of a file system, understand the time to restart. For us, just a basic reboot takes 12 minutes. With FSCK, it's ~4-5 hours of time to babysit. If you don't need that much space, look at setting up several individual file servers. It will help spread the load.
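The staged restart described above might look roughly like this (device names and mount points are examples):

    mount -o ro /dev/sdb1 /vol1    # bring the big volumes up read-only
    mount -o ro /dev/sdc1 /vol2
    # ...restart mail and web here, then check and flip each volume...
    fsck -f /dev/sdb1 && mount -o remount,rw /vol1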
Files over 2GB? (Score:1)
Andrew File System-It Rocks! (Score:1)
directories, including the home directories of all new accounts, as well as the bulk of system programs run by users.
Just wanted to let you know. I'm at Carnegie Mellon University, which developed AFS. We are currently using it with 1 terabyte of space just at our university. MIT and the University of Michigan (Ann Arbor) are among the other colleges that also use it.
When do we get a fully journaled file system? ... (Score:1)
talking windows users is where I draw the line
RAID perfectly reliable? (Score:1)
I'm not saying RAID is a waste of time. It improves reliability a great deal, and the better designs make things go faster. They aren't perfect, though.
Backing up a monster partition is a pain in the neck. If you have a monster database you have little choice, but smaller partitions make life easier.
IDE Filesystems' (Un)Reliability (Score:1)
10485760 Jan 18 22:00 bigfile
After 30 runs of sum, _ALL_ checksums are the same: 41865 10240.
(My setup is an ASUS TXP4 with a Maxtor 1.2 gig EIDE. True, it's not Ultra DMA, but as you can see, it checks out fine here.)
IDE Filesystems' (Un)Reliability (Score:1)
I'm using an AMD K6-2 on an Asus P5A-B motherboard (ALi Aladdin chipset) with a Quantum Fireball ST3.2A (UDMA 2). Don't know the transfer rates.
Greetz, Takis
Tape Changers Under Linux (Score:1)
The machine I'm sitting at has an APS Technologies changer attached to it, of unknown model. The tape changer says "DATLoader600", but that is not an APS name.
Here is a link to the APS website. [apstech.com]
Are Adaptec RAID cards supported yet? (Score:1)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Erik Norvelle
Please try the IDE test above! (Score:1)
Intel P255 MMX (yeah, it's overclocked; the bus is running at 75 MHz)
128 MB EDO, BIOS default for 60 ns in my Award BIOS
(SIMMs are 4x 32 MB, 2x TI and 2x Panasonic)
Mobo: Asus T2P4, Cache: 512 KB
HDDs:
1.0 GB Samsung, PIO-4
2.5 GB Bigfoot, PIO-4
4.3 GB Bigfoot, PIO-4
On all the discs the outcome was the same; I "summed" 30 times on each disc.
I also tried it on my nameserver:
Linux 2.0.36 + egcs patch
AMD 386DX40
Motherboard: unknown
8 MB "topless" (8x 1 MB)
420 MB Seagate
BIOS memory settings: as conservative as you can get
I tried it 20 times here; also no difference in the sums.
Weird shit happenin' in yer machine...
Try other IDE cables. I had problems with that in the past: my HDDs used to spin down (a bit), click, and then come back up to normal speed. Bad connectors caused the HDDs to reset once in a while, which cost me some write and read errors, including some bad blocks! (My system tried to read/write while the heads were resetting, hence the "click" sound.)
Anyone here who has/had this problem too?
Mounting large volumes with Netware... (Score:1)
/z80
Just on the brink of installing one (Score:1)
Cheers,
Cam.
Cameron_Hart@design.wnp.ac.nz