
Optimizing Linux Systems For Solid State Disks (207 comments)

Posted by Soulskill
from the bit-by-bit dept.
tytso writes "I've recently started exploring ways of configuring Solid State Disks (SSDs) so they work most efficiently in Linux. In particular, Intel's new 80GB X25-M, which has fallen to a street price of around $400 and thus within my toy budget. It turns out that the Linux Storage Stack isn't set up well to align partitions and filesystems for use with SSDs, RAID systems, and 4k sector disks. There is also some interesting configuration and tuning that we need to do to avoid potential fragmentation problems with the current generation of Intel SSDs. I've figured out ways of addressing some of these issues, but it's clear that more work is needed to make this easy for mere mortals to efficiently use next generation storage devices with Linux."
This discussion has been archived. No new comments can be posted.

  • by ultrabot (200914) on Saturday February 21, 2009 @11:37AM (#26940977)

    However, for many of us who require better-than-average data security, the read/write behaviour of SSDs makes the devices extremely vulnerable to analysis and recovery of data which the owner believes to be inaccessible: 'secure wiping', or the lack thereof, is the issue.

    Obviously you should be encrypting your sensitive data.

    Also, it should be no problem to write a bootable cd/usb that does a complete wipe. Just write over the whole disk, erase, repeat. No wear leveling will get around that.
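A sketch of that wipe loop in shell. The device name is an assumption, and it deliberately defaults to a scratch image so the loop can be dry-run safely; note that spare flash capacity the controller hides for wear leveling is not reachable this way, so encryption remains the safer bet:

```shell
# Sketch only; irreversible if pointed at a real device.
DEV=${DEV:-/tmp/wipe-demo.img}              # assumption: your SSD, e.g. /dev/sdX
[ -e "$DEV" ] || truncate -s 16M "$DEV"     # scratch image for the dry run

# Size in bytes: blockdev for real devices, stat for the scratch image.
SIZE=$(blockdev --getsize64 "$DEV" 2>/dev/null || stat -c %s "$DEV")

for pass in 1 2 3; do
    # Overwrite every user-visible sector; conv=fsync flushes the cache,
    # notrunc keeps dd from truncating a regular file between passes.
    dd if=/dev/zero of="$DEV" bs=1M count=$((SIZE / 1048576)) \
       conv=notrunc,fsync 2>/dev/null
done
```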

  • If I mount /home on a separate drive, (good to do when upgrading) the rest of the Linux file system fits nicely on a small SSD.

  • Re:Is it only linux? (Score:3, Informative)

    by Jurily (900488) <jurily@noSpAM.gmail.com> on Saturday February 21, 2009 @11:49AM (#26941071)

    unfortunately the default 255 heads and 63 sectors is hard coded in many places in the kernel, in the SCSI stack, and in various partitioning programs; so fixing this will require changes in many places.

    Looks like someone broke the SPOT rule.

    As for other OSes:

    Vista has already started working around this problem, since it uses a default partitioning geometry of 240 heads and 63 sectors/track. This results in a cylinder boundary which is divisible by 8, and so the partitions (with the exception of the first, which is still misaligned unless you play some additional tricks) are 4k aligned.
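The divisibility argument can be sketched in a few lines of shell; the geometries are the ones quoted above, and nothing here touches a real disk:

```shell
# A partition whose starting LBA (in 512-byte sectors) is divisible by 8
# begins on a 4k boundary; divisibility by 256 also matches a 128k erase block.
is_aligned() {  # usage: is_aligned <start_sector> <alignment_in_sectors>
    [ $(( $1 % $2 )) -eq 0 ]
}

# Vista-style geometry: 240 heads x 63 sectors = 15120 sectors/cylinder,
# divisible by 8, so cylinder-aligned partitions are 4k-aligned.
is_aligned 15120 8 && echo "240x63 cylinder boundary: 4k aligned"

# Traditional 255x63 geometry gives 16065 sectors/cylinder -- an odd
# number, so partitions on those cylinder boundaries are misaligned.
is_aligned 16065 8 || echo "255x63 cylinder boundary: misaligned"
```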

  • by piripiri (1476949) on Saturday February 21, 2009 @11:50AM (#26941073) Journal

    Sure. There are *lots* of considerations beyond speed to want SSDs

    And SSD drives are also shock-resistant.

  • by Kjella (173770) on Saturday February 21, 2009 @12:11PM (#26941211) Homepage

    Also, it should be no problem to write a bootable cd/usb that does a complete wipe. Just write over the whole disk, erase, repeat. No wear leveling will get around that.

    At least for OCZ drives, the user capacity is several gigs lower than the raw flash capacity, like 120GB user to 128GB raw. I don't know about your data, but pretty much anything can be left in those 8GB. The only real solution is to not let sensitive data touch the disk unencrypted.

  • Re:Is it only linux? (Score:3, Informative)

    by mxs (42717) on Saturday February 21, 2009 @12:17PM (#26941245)

    Of course it goes beyond just Linux. Microsoft is aware of the problem and working on improving its SSD performance (they already did some things in Vista as the article states, and Windows 7 has more in store; google around to find a few slides from WinHEC on the topic).

    The problem with Windows w.r.t. optimizing for SSDs is that it LOVES to do lots and lots of tiny writes all the time, even when the system is idle (and more so when it is not). Try moving the "prefetch" folder to a different drive. Try moving the system event log files to a different drive. And try to keep an eye out for applications that use the system drive for small writes extensively (or muck about in the registry a lot). These are the hard parts. The easier parts would be to make sure hibernation is disabled, pagefiles are not on the SSD (good luck getting Windows to not use pagefiles at all; possible, but painful even if you have a dozen gigs of memory), prefetching is disabled, the filesystem is properly aligned, printer spools are moved elsewhere, etc. With only the tools Windows provides, it is painful to attempt to prolong your SSD's life (this is not just about performance; remember that you only have a limited number of erases until the drive becomes toast).

    There are some solutions; MFT for Windows (http://www.easyco.com/) provides a block device that consolidates many small writes into larger ones and does not overwrite anything unless absolutely necessary (i.e. changes are written onto the disk sequentially; overwriting only takes place once you run out of space). It is very, very costly, but it does its job well. Performance skyrockets, drive longevity improves by an order of magnitude.

    You can also use hacks such as Windows SteadyState; this also streamlines writes (but adds another layer of indirection). Performance improves, but you get to deal with SteadyState issues. EWF (the Enhanced Write Filter) also works (and is less of a GUI-y system, though it provides largely the same services even on Windows 2000/XP); you have to be careful though: if your system tends to lose power or crash, all the changes since the last boot will be lost. EWF can be made to write out all the changes it has accumulated -- but after that, the only way to re-enable it is to restart the system.

    Windows is not particularly nice to SSDs when used as a system disk. For a data partition it is not quite as bad (although if you deal with many small writes, you might still run into heaps of trouble). The optimizations described here for Linux are applicable to Windows as well (aligning filesystem blocks to erase blocks and 4k NAND sectors). You would also want to move stuff that does lots of small writes to a different (spinning) disk -- system logs, for instance, and most spool directories. You'd also want to make absolutely sure that you do not have access-time updates enabled; each of those is, essentially, a write (even if ultimately consolidated).
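On the Linux side, the access-time point is easy to check; the fstab line in the comment below is only an illustration (device name assumed):

```shell
# /etc/fstab illustration (assumed device):  /dev/sdX1  /  ext3  noatime  0  1
# See what the running system actually does for the root filesystem:
if grep -E '^[^ ]+ / [^ ]+ [^ ]*(noatime|relatime)' /proc/mounts >/dev/null; then
    echo "atime updates already suppressed on /"
else
    echo "every read triggers a metadata write -- consider noatime"
fi
```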

  • by NekoXP (67564) on Saturday February 21, 2009 @12:22PM (#26941283) Homepage

    > So why should I get a SSD vs. a CF card?

    10 times better performance and wear-leveling worth a crap.

  • by nedlohs (1335013) on Saturday February 21, 2009 @12:53PM (#26941529)

    It will outlast a standard hard drive by orders of magnitude so it's completely not an issue.

    With wear leveling and the technology now supporting millions of writes it just doesn't matter. Here's a random data sheet: http://mtron.net/Upload_Data/Spec/ASIC/MOBI/PATA/MSD-PATA3035_rev0.3.pdf [mtron.net]

    "Write endurance: >140 years @ 50GB write/day at 32GB SSD"

    Basically the device will fail before it runs out of write cycles. You can overwrite the entire device twice a day and it will last longer than your lifetime. Of course, it will fail due to other issues before then anyway.

    Can there be a mention of SSDs without this out-dated garbage being brought up?

  • by raynet (51803) on Saturday February 21, 2009 @12:54PM (#26941543) Homepage

    Unfortunately flash SSDs usually have some percentage of sectors you cannot directly access; these are used for wear leveling and bad-sector remapping. So when you dd from /dev/zero, it is quite possible that some part of the original data is left intact. And there can be quite a lot of those sectors: I recall reading about one SSD that had 32GiB of flash in it but only 32GB available to the user, so 2250MiB was reserved for wear leveling and bad sectors (it helps to get better yields if you can tolerate several bad 512KiB cells).

  • by tinkerghost (944862) on Saturday February 21, 2009 @01:12PM (#26941705) Homepage

    So why should I get a SSD vs. a CF card?

    Your CF card is going to use the USB interface which maxes out at about 40Mbps as opposed to using an internal SSD's SATAII interface which maxes at 300Mbps. Not quite an order of magnitude, but close.

    On the other hand, if you're going to use an external SSD connected to the USB port, then you wouldn't see any difference between the 2 in terms of speed. Lifespan might be longer w/ the SSD due to better wear leveling, but in either case you're probably going to lose or break it before you get to the fail point.

  • by Anonymous Coward on Saturday February 21, 2009 @01:31PM (#26941857)

    A real SSD has several advantages over using CF cards, but not for the reasons you state.

    With a simple plug adapter, CF cards can be connected to an IDE interface, so speeds won't be limited by interface speed. The most recent revision of the CF spec adds support for IDE Ultra DMA 133 (133 MB/s)

    A couple of additional points, just because I love nitpicking:
    - A USB 2.0 mass storage device has a practical maximum speed of around 25 MB/s, not 40 Mb/s.
    - The so-called SATA II interface (that name is actually incorrect and is not sanctioned by the standardization body) has a maximum speed of 300 MB/s, not Mb/s.

  • chs no longer used (Score:1, Informative)

    by Anonymous Coward on Saturday February 21, 2009 @01:46PM (#26941959)

    i haven't yet found a sata device (even doms) that requires chs addressing.

    clearly it was a mistake to use hardware quirks to address sectors, but then again, ata became a de facto standard before anyone realized it might become one.

  • by A beautiful mind (821714) on Saturday February 21, 2009 @02:00PM (#26942047)
    There are a few tricks up the manufacturer's sleeve to make this slightly better than it really is:

    1. large block size (120k-200k?) means that even if you write 20 bytes, the disk physically writes a lot more. For logfiles and databases (quite common on desktops too, think of index dbs and sqlite in firefox for storing the search history...) where tiny amounts of data are modified, this can add up rapidly. Something writes to the disk once every second? That's 16.5GB / day, even if you're only changing a single byte over and over.

    2. Even if the memory cells do not die, due to the large block size, fragmentation will occur (most of the cells will have a small amount of space used in them). There have been a few articles noting that even devices with advanced wear-leveling technology like Intel's exhibit a large performance drop (to less than half the read/write performance of a new drive of the same kind) after a few months of normal usage.

    3. According to Tom's Hardware [tomshardware.com], unnamed OEMs told them that all the SSD drives they tested under simulated server workloads got toasted after a few months of testing. Now, I wouldn't necessarily consider this accurate or true, but I sure as hell would not use SSDs in a serious environment until it is proven false.
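The arithmetic in point 1 checks out as an order of magnitude; here it is as a one-liner over the assumed 120k-200k block-size range:

```shell
# One erase-block-sized physical write per second, around the clock:
awk 'BEGIN {
    for (kib = 128; kib <= 192; kib += 64)
        printf "%3d KiB blocks: %4.1f GB/day\n", kib, kib * 1024 * 86400 / 1e9
}'
```

Depending on the block size assumed, a single-byte write per second amplifies to roughly 11-17 GB/day, bracketing the figure quoted above.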
  • by berend botje (1401731) on Saturday February 21, 2009 @02:07PM (#26942105)
    All nice and dandy, but these figures aren't exactly honest. In a normal scenario your filesystem consists in large part of static data. Those blocks/cells are never rewritten, so the writes (for logfiles etc.) are concentrated on a small part of the disk, wearing it out rather more quickly.

    Having a few Compact Flash disks wear out in the recent past, I'm not exactly anxious to replace my server disks with SSD.
  • by berend botje (1401731) on Saturday February 21, 2009 @02:11PM (#26942135)
    Say you have 100 cells and can write 10 times to each cell.

    Having every cell written to nine times: 100 * 9 = 900 writes and you still have a completely working disk.

    Writing 900 writes to the first couple of cells: you now have 90 defective cells. In fact, as you still have to rewrite the data to working cells, you have lost your data as there aren't enough working cells.
  • by karnal (22275) on Saturday February 21, 2009 @02:38PM (#26942331)

    Why is this informative? CF with an adapter is NOT USB.

    From my experience, using an adapter puts it on the native interface - notably, with CF, it's easiest to put the device into a machine that has a native IDE (not SATA) interface. CF is pin compatible with IDE.

    Now, in the current offering of SLC/MLC "drives" you can actually get better read/write speeds since they "raid", for lack of a better term, the internal chips. I'm using a Transcend ATA-4 CF device that gets around 30MB/sec read/write in a machine in my garage; it's an SLC device that isn't their top of the line, but it was more cost-effective.

    So, using the IDE/ATA-4 interface on the CF card, it gets lower CPU utilization than a USB device. Still doesn't hit the 40MB/sec you quoted, but 40MB/sec is a pipe dream on USB in my experience.

  • by Dr. Ion (169741) on Saturday February 21, 2009 @03:19PM (#26942637)

    Your CF card is going to use the USB interface

    This is Informative?

    CF cards are actually IDE devices. The adapters that plug CF into your IDE bus are just passive wiring.. no protocol adapter needed.

    It's trivial to replace a laptop drive with a modern high-density CF card, and sometimes a great thing to do.

    The highest-performance CF cards today use UDMA for even higher bandwidth.

    HighSpeed USB can't reasonably get over 25MB/sec from the cards using a USB-CF adapter, but you can do better by using its native bus.

  • ZFS L2ARC (Score:1, Informative)

    by Anonymous Coward on Saturday February 21, 2009 @03:34PM (#26942769)

    I think Theodore should look into technologies like the ZFS L2ARC (i.e. using an SSD as an additional cache to supplement disks based on rotating rust). The L2ARC stores recently evicted pages from the primary ARC (the Adaptive Replacement Cache) of ZFS on SSD. From my view this is a more reasonable use of an SSD than as just another primary storage medium.

    I recently wrote an article about the mechanism of the ARC and L2ARC in conjunction with SSDs on my blog, but I don't want to slashdot my own site ;)
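For reference, attaching an SSD to an existing pool as L2ARC (or as a separate intent log) is one command per device; the pool name and Solaris-style device paths here are assumptions:

```shell
zpool add tank cache c2t0d0    # SSD as L2ARC read cache
zpool add tank log c2t1d0      # SSD (or a slice) as separate ZIL
zpool status tank              # cache and log vdevs appear in the layout
```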

  • by pla (258480) on Saturday February 21, 2009 @03:36PM (#26942793) Journal
    So why should I get a SSD vs. a CF card?

    CF works passably in WORM-like scenarios, where you basically use it in read-only mode and update it rarely and in big chunks. For random R/W access, CF lacks wear leveling to give it a tolerable life expectancy... Thus you commonly see it used in embedded devices such as routers and dumbterms where you may update the firmware or OS every few months; You don't see it used much in real, live writable FSs.

    It also tends to have rather poor performance, with reads in the sub-5MB/s range and writes taking forever. So again, using a 32MB CF to boot a router, works great; Using a 32GB CF as the system partition for a modern desktop PC (even with some solution to the limited erase lifetime, such as a UnionFS against a ramdisk with commit-on-shutdown), you can expect 10+ minute boot times.
  • by MoonBuggy (611105) on Saturday February 21, 2009 @03:44PM (#26942851) Journal

    So in effect, instead of 'burning' out a specific section of an SDD, they will simply burn out the entire disk at once due to wear leveling?

    Technically speaking, yes, the drive is more likely to go from 'all cells functioning' to 'many cells dead' in a relatively short amount of time due to wear levelling, whereas without it the mode of failure would be a more gradual reduction in functioning cells.

    Practically speaking, however, these things support an awful lot of read/write cycles. On the order of a million or more, according to the data I could find. Unfortunately the Intel datasheet for the drive mentioned in the summary doesn't actually include write-cycle data, though.

    A quick and dirty calculation (not taking into account block size, etc.) for drive lifetime is simply (capacity)*(write cycles)/(write speed).

    Imagine a drive with no wear levelling. Say you have a 1GB file, the entirety of which is being continually rewritten to the same 1GB section of the drive. A million read/write cycles means you need to write approximately 1,000,000 GB (that's 1000TB!) to that 1GB section of drive to kill it. Again, somewhat inaccurate in the real world, but good enough for a back of the envelope estimate. Allowing a fairly generous write speed of 100MB/s, writing to that same 1GB area of disk 24/7, would burn it out in around 115 days - about 4 months. In that time, remember, you'll have generated 1000TB of data - that's certainly not insignificant, even for fairly major applications, but it could be done, and you're left with a drive that's got 1GB less capacity than it started with.

    Now consider the same case with wear levelling. Assume for the sake of simplicity it functions perfectly, and ignore block size. On an 80GB drive, continually overwriting that same 1GB file will simply cycle through the entire 80GB capacity of the drive repeatedly rather than just hammering the same 1GB section. This means that you have suddenly increased the effective lifespan by a factor of 80 (again, not entirely real-world due to the fact that the drive would normally have data filling some of the rest of that 80GB, but sufficient to get the point across). You're now looking at over 25 years of continuous writing, by which time you will have generated 80 petabytes of data.

    That's why wear levelling is a good thing. Even on a disk that's completely full (not something that happens particularly often, but still worth thinking about) the drive itself has some built in excess capacity to use for wear reduction.
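Running the stated formula, lifetime = capacity x write cycles / write speed, with the comment's assumed figures (10^6 cycles, 100MB/s, a 1GB hot spot vs. an 80GB drive):

```shell
awk 'BEGIN {
    cycles = 1e6; speed = 100e6                   # write cycles, bytes/sec
    hammer = 1e9  * cycles / speed                # same 1 GB rewritten forever
    spread = 80e9 * cycles / speed                # leveled across all 80 GB
    printf "no wear levelling: %d days\n",   hammer / 86400
    printf "wear levelling:    %.1f years\n", spread / (86400 * 365)
}'
```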

  • Re:Is it only linux? (Score:5, Informative)

    by tonyr60 (32153) on Saturday February 21, 2009 @04:00PM (#26942973)

    Sun's new 7000 series storage arrays use them, and that series runs OpenSolaris. So I guess Solaris has at least some SSD optimisations... http://www.infostor.com/article_display.content.global.en-us.articles.infostor.top-news.sun_s-ssd_arrays_hit.1.html [infostor.com]

  • by steveha (103154) on Saturday February 21, 2009 @04:33PM (#26943281) Homepage

    Why not functionally group files to decrease or eliminate fragmentation? Or maybe this is already done.

    In a Linux system, this is easily done, but few people bother.

    Most of the write activity in Linux is in /tmp, and also in /var (for example, log files live in /var/log). User files go in /home.

    So, you can use different partitions, each with its own file system, for /, /tmp, /home, and /var.

    The major problem with this is that, if you guess wrong about how big a partition should be, it's a pain to resize things. So my usual thing is just to put /tmp on its own partition, and have a separate partition for / and for /home.

    The /tmp partition and swap partition are put at the beginning of the disc, in hopes that seek penalties might be a little lower there. Then / has a generous amount of space, and /home has everything left over.

    When a *NIX system runs out of disk space in /tmp, Very Bad Things happen. Far too much software was written in C by people who didn't bother to check error codes; things like disk writes don't fail often, but when /tmp is 100% full, every write fails. A system may act oddly when /tmp is full, without actually crashing or giving you a warning. So, the moral of the story is: disk is cheap, so if you give /tmp its own partition, make it pretty big; I usually use 4 GB now. However, if you run out of disk space in /var, it is not quite as serious. Your system logs stop logging. And, many databases are in /var so you may not be able to insert into your database anymore.

    The main Ubuntu installer is fast, because it wipes out the / partition and puts in all new stuff. So, if you have separate partitions for / and /home, life is good: you just let the installer wipe /, and your /home is safely untouched. It's annoying when you have /home as just a subdirectory on / and you want to run the installer. But, by default, the Ubuntu installer will make one big partition for everything; if you want to organize by partitions, you will need to set things up by hand.

    steveha
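An /etc/fstab sketch of that layout; the device names are assumptions, and the sizes are whatever your partitioner assigned:

```
/dev/sda1  none   swap  sw       0  0
/dev/sda2  /tmp   ext3  noatime  0  2
/dev/sda3  /      ext3  noatime  0  1
/dev/sda4  /home  ext3  noatime  0  2
```

Swap and /tmp sit first on the disc, matching the seek argument above.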

  • by Mattsson (105422) on Saturday February 21, 2009 @04:46PM (#26943401) Homepage Journal

    Your CF card is going to use the USB interface which maxes out at about 40Mbps as opposed to using an internal SSD's SATAII interface which maxes at 300Mbps. Not quite an order of magnitude, but close.

    There are three factual errors in that statement.
    1. CF-cards can be connected directly to the ATA-port via a simple passive connector-adapter and therefore have a theoretical maximum transfer speed of 133MB/s, which roughly translates to 1300Mbps. There are even adapters with room for both a master and a slave CF-card in the same shape, size and connector position as a 2.5" ATA drive, specifically made to use CF-cards in laptops.
    2. USB is 480Mbps.
    3. SATA is 3000Mbps

    The big speed-difference between SSD and CF is due to the construction of the devices themselves, not the interface that connects them to the computer.
    A fast CF-card can get you around 40MB/s and at the moment they also top out at 32GB sizes and they're not made to handle long term random write operations.
    A fast SSD can get you all the way to the theoretical maximum of SATA, around 300MB/s, and are available in much bigger sizes.

  • by Hal_Porter (817932) on Saturday February 21, 2009 @04:54PM (#26943481)

    CHS disappeared ages ago. The maximum device supported was ~8 Gbyte (1023 cylinders * 255 heads * 63 sectors * 512 bytes)

  • by tytso (63275) * on Saturday February 21, 2009 @05:03PM (#26943575) Homepage

    Because of this, I imagine that the author would like Linux devs to better support SSD's by getting non-flash file systems to support SSD better than they are today.

    Heh. The author is a Linux dev; I'm the ext4 maintainer, and if you read my actual blog posting, you'll see that I gave some practical things that can be done to support SSDs today just by better tuning the parameters given to tools like fdisk, pvcreate, mke2fs, etc., and I talked about some of the things I'm thinking about to make ext4 support SSDs better than it does today.

  • by tytso (63275) * on Saturday February 21, 2009 @05:49PM (#26943967) Homepage

    Flash using MLC cells has 10,000 write cycles; flash using SLC cells has 100,000 write cycles and is much faster from a write perspective. The key is write amplification: if you have a flash device with a 128k erase block size, in the worst case, assuming the dumbest possible SSD controller, each 4k singleton write might require erasing and rewriting a 128k erase block. In that case, you would have a write amplification factor of 32. Intel claims that with their advanced LBA redirection table technology, they have a write amplification of 1.1, with a wear-leveling overhead of 1.4. So if these numbers are to be believed, on average, over time, a 4k write might actually cost a little over 6k of flash writes. That is astonishingly good.

    The X25-M uses MLC technology, and is rated for a life of 5 years writing 100GB a day. In fact, if you have 80GB worth of flash and you write 100GB a day with a write amplification and wear-leveling overhead of 1.1 and 1.4, respectively, then over 5 years you will have used approximately 3200 write cycles. Given that MLC technology is good for 10,000 write cycles, that means Intel's specification has a factor-of-three safety margin built into it. (Or put another way, the claimed write amplification factors could be three times worse and they would still meet their 100GB/day, 5-year specification.)
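Redoing that arithmetic with the quoted inputs (100GB/day logical, 1.1 write amplification, 1.4 wear-leveling overhead, 80GB of flash) lands a little above the approximate figure but tells the same story -- comfortably under a 10,000-cycle MLC budget:

```shell
awk 'BEGIN {
    per_day = 100 * 1.1 * 1.4            # GB of physical flash writes per day
    cycles  = per_day * 365 * 5 / 80     # full-device write cycles in 5 years
    printf "%.0f GB/day physical, ~%.0f cycles in 5 years\n", per_day, cycles
}'
```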

    And 100GB a day is a lot. Based on my personal usage of web browsing, e-mail and kernel development (multiple kernel compiles a day), I tend to average between 6 and 10GB a day. When Intel surveyed system integrators (e.g., Dell, HP, et al.), the number they came up with as the maximum amount a "reasonable" user would tend to write in a day was 20GB. 100GB is 10 times my maximum observed write, and 5 times the maximum estimated amount that a typical user might write in a day.

    For those of you who are Linux users, you can measure this number yourselves. Just use the iostat command, which will return the number of 512 byte sectors written since the system was booted. Take that number, and divide it by 2097152 (2*1024*1024) to get gigabytes. Then take that number and divide it by the number of days since your system was booted to get your GB/day figure.
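A dependency-free equivalent of that measurement, reading /proc/diskstats directly (the same counters iostat reports); "sda" is an assumption -- substitute your SSD's device name:

```shell
# Days since boot, from /proc/uptime (seconds in field 1):
days=$(awk '{ print $1 / 86400 }' /proc/uptime)

# Field 10 of /proc/diskstats is total sectors written (512 bytes each);
# divide by 2*1024*1024 sectors per GiB, then by uptime in days.
awk -v days="$days" '$3 == "sda" {
    printf "%.2f GB/day\n", $10 / 2097152 / days
}' /proc/diskstats
```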

  • tytso (Score:3, Informative)

    by r00t (33219) on Saturday February 21, 2009 @08:52PM (#26945227) Journal

    "tytso" is Theodore Ts'o.

    He and Remy Card wrote ext2. He and Stephen Tweedie wrote ext3. He and Mingming Cao wrote ext4.

    He maintains the filesystem repair tool (e2fsck) and resizing tool for those filesystems.

    He also created the world's first /dev/random device, maintained the tsx-11.mit.edu Linux archive site for many years, and wrote a chunk of Kerberos. He's been the technical chairman for many Linux-related conferences. He pretty much runs the kernel summit.

    He's certainly not a kid. I think he's about to turn 40.

    Really, Intel ought to give tytso piles of free SSD hardware before it goes on sale. This would help Intel by encouraging tytso to optimize Linux for Intel's SSD hardware.

  • Re:Is it only linux? (Score:2, Informative)

    by c0t0d0s0.org (1483833) on Sunday February 22, 2009 @03:07AM (#26946939)
    You should look at the L2ARC and separate ZIL features of ZFS in Solaris and OpenSolaris. They use the SSD in the way you want.
