

Linux Databases with Huge Tables? 220
Eugene writes "I am working on a project to create an expansive network monitoring and all-encompassing database solution. With archival and ageing of all data we are looking at a database that grows by 40Gb a year, with some tables being well over the 2Gb file limit in Linux/x86.
We have narrowed ourselves down to Oracle 8i (for its stalwart referential integrity checking, and PL/SQL for tracing across aged route information) and PostgreSQL (for its object polymorphism and CIDR data types for IP addresses). We are concerned with the robustness of Postgres compared to "grown up" Oracle, as well as file size limitations. Can anyone give us any advice on these issues? "
Re:What about sybase RAW IO ? (Score:1)
You don't need the patch for "raw devices" to use
Sybase on Linux. You just use partitions as block devices. The mailing list archives have more on the issues involved in choosing whether to use a file system or "raw device" ( archive [swarthmore.edu] ), i.e. recoverability.
We are using Sybase on Linux for several web-based services. We have chosen it for speed, its backup and recovery features, its support for referential integrity, stored procedures, and all the other things one would expect from a SQL-89/92 compliant database engine. MySQL just doesn't offer those features, although there are ways to imitate them.
At a fraction of the cost of Oracle ( and DB2 ), it's hard not to take a good look at Sybase. At least if money is an object. Oh and by the way....
ASE 11.0.3 is FREE! I don't think there is another database engine with as many features, as much robustness, and as much speed that can make that claim.
Fire your architect (Score:1)
Re:What about sybase RAW IO ? (Score:1)
Um, while you can use partitions w/o filesystems, it's still not raw I/O.
The benefit is really that, since you're not running a filesystem, you won't sit through endless fsck activity after a crash. Your database server will need to perform its own recovery, though.
And the 2GB limit to a "database device" applies. Sybase's "database device" seems similar to DB2's "tablespace". Probably goes for Oracle, as well.
Since a database (and even a table) may span multiple "database devices", this doesn't present a problem.
I've run multiple 2.5 GB databases with heavy loading using Sybase ASE 11.0.3 on RH 6.0 with excellent performance and reliability.
IMHO, Sybase delivers the power and reliability, without the extra complexity of Oracle and DB2.
Just my $0.02, spend it wisely... :-)
Gordon.
Re:IBM DB2 for Linux? (Score:1)
I completely understand the position of this guy. We have several TB of old data, some of which we are having to rent MVS time to process, a lot of which comes from companies that we have taken over, and all of which really needs to be sorted and folded into our setup (AIX, DB2, SAP) within the next few years while people still remember what they were doing and what the *gotchas* were with their accounting and all. I think that Linux/DB2 will do the job as long as people are really using it lightly (relatively) and Linux DB2 is "real" DB2 -- from Linux to AIX without a hiccup.
Of course, with a less insane load (or a nice four-way box), I am sure that things would run better, but they aren't doing too badly now for me.
If I can get another 1GB RAM, I will be able to add more users, but for the time being Linux DB2 is running like a typical IBM product -- slowly, but with the reliability of a Cummins diesel.
Re:Choose Life, Choose Oracle? (Score:1)
Re:better get a mainframe (Score:1)
If you needed to access all of that data at once, day in and day out, while running multiple jobs sorting that same data with maximum uptime, that would be ideal. But if you needed the uptime, UNIX would be cheaper (HACMP with two S80s at $350,000 each with disk and 8CPUs and 4+GB RAM)(and a RAMAC or a 7133 stack on two SSA loops) for that application.
Both UNIX and mainframes have their place (even AS400s have their place in large environments)(not to say that they don't do fine in small ones either), but I would say that this is a toss-up between a commercial or a free UNIX and a commercial or a free database. I would vote for Linux and DB2, probably, as long as the load wasn't too heavy.
And don't assume that you still need to spend $50,000,000 on a mainframe. You can get really nice mainframe clusters (!) for less than $15,000,000 these days!
Oracle, MySQL and PostgreSQL? (Score:1)
Due to Oracle's architecture, hardware resources had better be substantial in terms of CPU, RAM etc. However, it is hard to go wrong with either Oracle or DB2 or Informix or Sybase ASE. My suggestion is to either use Oracle 8i with data cartridges Or Informix Universal server or DB2 with Data Extenders to give you 1 DB with the features that you are looking for in PostgreSQL. Simplifies things and makes it more reliable.
On the other hand, if politically things are viable, Versant and Objectivity (at a minimum) have Object DBMSs on Linux already. A much better OO paradigm, and faster and less resource intensive. This is especially suitable if you use C++ or Java. If you're a SQL-head, forget I mentioned ODBMS.
MySQL does not have transactions, so unless you can guarantee that your app doesn't want rollback, you shouldn't consider it for heavyweight stuff.
Make no mistake, PostgreSQL is good but it obviously does not have the huge testing base that Linux has. The question is, does it have the testing base that Oracle, DB2, Informix or Sybase have? You decide.
size > 2GB NOT important (Score:1)
Re:Database choices... (Score:1)
Re: Linux Databases with Huge Tables? (Score:2)
I have used Oracle 7 (not under Linux) in a banking environment for mission critial apps, with large amounts of data (100s Gb's). Performance is excellent (with a good DBA), scalability is excellent, robustness is excellent, architecture is showing its age.....
I'm sure MySQL is excellent for what it was designed: free, well understood, reference SQL
platform. However it's my contention that for the sorts of data volumes you are looking at Oracle/Informix/Sybase/DB-2 are most likely to be the way to go.
Choose Life, Choose Oracle? (Score:2)
This has nothing to do with Linux, as all of these (fine) products are fully supported on Linux & other fine Unices. Personally, though, I'd choose Linux as my platform for the future, whichever database product you eventually use...
As for the 2GB file limit thingish, worry not, because Solaris had the same problem until a patch was released for 2.5.1, and I still recall running a 70GB Oracle 7.3 (!) DB on that configuration, you just use as many 2GB files as you wish.
Oracle 8i has all the features you should require for what you seem to need:
On the database side
Apart from features & performance there's one thing you want to keep in mind:
When the sh*t hits the fan, you'll want Oracle's data-recovery features.
Sorry if this looks like an Oracle ad. Oracle does have its ugly sides, I'll be the first to admit that, but there are no real grounds for comparing Oracle 8i with MySQL & PGSQL, or any of the other currently GPL'ed or free DBs I've seen.
Noooo Oracle!!! (Score:1)
Re:Some good information... (Score:2)
at my site. 10 million records. No problem, speed is adequate (some 0.01 seconds for _complicated_ queries). Also, since it's running on Alpha it won't have problems with the 2Gig limit.
- Speed isn't a problem.
- Large files aren't a problem.
But:
- MySQL on Alpha isn't very stable. I had problems for some months until a version came out which would even compile properly. It compiles out of the box now, but it still isn't stable.
- LOCKING. Locking a table with 10M records sucks, because nobody will even be able to read it while it's locked.
In the end: I wouldn't do it again this way.
Re: Linux Databases with Huge Tables? (Score:2)
PostGreSQL has the features he needs, and the latest version is very stable (as stable as Oracle). Its big problem: PERFORMANCE. While its performance is acceptable on a 2gb database, you can forget it on a 40gb database.
I would say that if he can afford Oracle, to go Oracle. Oracle is proven in large-database environments. Informix, SyBase, and Adabas-D would also be good choices. If he's interested in more obscure databases, Empress has a nice 4GL. SolidTech's stuff looks pretty good too.
Doing a search for "linux database" is instructive....
-E
Why 'troll'? (Score:2)
SGI Iris and Solaris are excellent choices for jobs that have outgrown Linux. That's the beauty of Linux -- once you have outgrown Linux, you can move up to bigger computers. And with 40gb per year of data, we're talking 400gb within a few years.
I have personally fsck'ed a 40gb ext2 partition. It is *NOT* fun, it takes over 20 minutes even on the latest/fastest RAID subsystems! So it is indeed appropriate to question whether Linux is the proper solution for this problem. After all, this isn't Windows NT -- we don't have to say "Linux everywhere" to win (just "Windows NT nowhere", grin).
-E
You're out of date, charlie. (Score:2)
PostGreSQL's only problem nowadays is performance on huge databases. Well, replication is still a problem too, as are hot backups, but performance is the killer here. Oracle is proven at running huge databases, PostGreSQL is not. If I were thinking 40gb of data, I'd look harder at Oracle than at PostGreSQL.
-E
Re:Oracle 8i on Linux (Score:3)
OCI programming? Uhhhhgghhh. It's cool, but very different/difficult for the uninitiated. The Oracle distribution comes with some good OCI sample programs, but I wish the interface were like mysql's C interface.
With OCI, you have to bind every one of your columns before doing an insert,update, or query.
I suppose that one could write a C wrapper library around OCI which would do that binding for you.
Essentially, DBD::Oracle is such a beast, except it being for the use of perl.
I was at one company where the Oracle consultants suggested using a perl cartridge with Oracle's web server because DBD::Oracle wasn't "supported". I went on to explain that DBD::Oracle is really just an OCI application and that it indeed is something Oracle should look at as being "supported".
By the way, DBD::Oracle (along with DBI) is an excellent piece of code that powers many a site.
Re:SQL 7 (Score:1)
Linux is not a primary dev platform for Oracle (Score:2)
Oracle, on the other hand, is developed on Solaris and then ported to other platforms. So if you're going to run a huge Oracle installation, do it on Sun hardware. Use raw I/O (can't do this yet under Linux) and stick the (sievelike) Solaris boxes behind a firewall, for god's sake. But please don't be a martyr and run a big Oracle installation on Linux... yet.
Don't forget that Oracle is and will remain a complex beast. My personal viewpoint is that Oracle is the C++ of databases -- highly flexible, can be outrageously fast, but you MUST know what you are doing. Hire a professional DBA if you don't... a big installation like you describe will rapidly spin out of control unless someone skilled is around to manage it.
Re:What about sybase RAW IO ? (Score:1)
ASE 11.9.2 is a fully supported product from Sybase and is priced along the same lines as ASE on NT.
Just clearing the air...
Another option is Solid (Score:1)
Alas, large files don't work on 32-bit Linux (Score:2)
In any event, it doesn't matter with Oracle, or Solid, or any other halfway rationally designed database. They all allow multiple tablespaces, or datafiles, or whatever you want to call it. It isn't simply a matter of 32-bit compatibility either; there's the little matter of disk/filesystem limits and load sharing.
Postgresql is good (Score:4)
If your data is critical and 100% availability is essential, I would probably still go for oracle, as it has a far more mature background and, whilst closed source, is very stable. Otherwise, postgresql is a very good choice. It handles tables of effectively infinite size (it just breaks them into 1GB chunks to avoid the filesystem limit). And it has the bonus that it is open source. If you need a feature, you can add it. I am sure that if you are part of an organisation generating 40GB of data a year, you probably have a few programmers floating around to add enhancements.
As an addition, I have found the postgres development team to be VERY responsive and helpful. Personally I would say they are probably the most active and helpful open source dev team I have seen.
If you want something out of the box, with 100% availability, go oracle. If you want something that you can control, that will grow with you, and will eventually get 100% reliability (it is very close right now), go postgres.
Fast disk access, other database engines (Score:2)
Disk journaling and a fast filing system are also very important. It looks like Reiserfs is going to beat SGI to having a journal, and it's already faster than e2fs. I'd say give that a look, too.
Re:Be aware of the complexity of Oracle (Score:1)
I've played about with Postgres and, while it's more 'professional' than MySQL, I still wouldn't like to bet a large amount of data and my company on it. I'd be quite happy putting that volume of data on Oracle, though.
Suspect an Intel-based PC, even a big server, is going to have a bit of difficulty with that much data. Have you thought of getting a Sun or an Alpha?
Re:What about sybase RAW IO ? (Score:1)
It does work, though, although the performance gain isn't huge.
Why not Informix? (Score:1)
Re:Another option is Solid (Score:2)
Re:DBD::Oracle Linux (Score:1)
PLEASE tell me you added the oracle library dir to /etc/ld.so.conf?? That's generally the problem. (Oracle installer doesn't do that for you)
--Dan
Re:Perl and Monitoring Oracle (Score:1)
With that in mind, all the C database code I write is through wrappers designed to work the same way as DBI, but I only write it for the databases I actually work with. (Postgresql and Oracle)
Am I re-inventing the wheel or is this something that's only been done for perl?
--Dan
Route tracing? (Score:1)
Re:mysql? (Score:1)
i.e.: pg_dump | gzip | split -b 670m
----------------------------------------------
Matt on IRC, Nick: Tuttle
PostgreSQL and Large Tables on Linux (Score:5)
When they do become huge, you can then move the segments onto different filesystems to ease the load (although currently this is not a simple task). This also applies to indices.
I know of a few large databases out there using PostgreSQL, but one I have access to is the TASS project (http://www.tass-survey.org [tass-survey.org]), which currently has about 28Gb of astronomical data.
The largest database I have is currently 5Gb, but that's primarily limited by disk space and not by postgresql.
Re:What about sybase period? (Score:1)
For that matter, are there compelling reasons not to use Sybase for anything? just curious...
--Siva
Keyboard not found.
sybase free? (Score:1)
Particularly relevant is this line:
5. SUPPORT. These download Programs are provided on an "as is" basis and are unsupported. This Agreement does not entitle you to any maintenance or other services or any updates or new versions of the Programs.
--Siva
Keyboard not found.
The specification was Linux (Score:1)
But if you drop that specification, then I agree. Besides which I don't know if you would *want* to run a big database on Linux today.
Cheers,
Ben
Are you considering an Alpha? (Score:2)
Unfortunately none of the "big boys" have released for the Alpha so you would need to use something like PostGres.
Another tip. If you use ext2, increase the block-size. By default it is 1k, if you bump that to 4k you may see a performance increase and you will see a big improvement in fsck. Of course long-term the right solution to fsck problems is to use a journalled file-system. And so at the moment you may not want to use Linux for this...
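As a minimal sketch of the block-size tip above (assuming e2fsprogs' mke2fs and dumpe2fs are available; a scratch file image stands in for a real partition, so no root is needed):

```shell
# Hypothetical demo: format a scratch image with 4k blocks instead of
# the 1k default, then confirm the block size that was actually used.
dd if=/dev/zero of=/tmp/fs.img bs=1M count=8 2>/dev/null
mke2fs -F -q -b 4096 /tmp/fs.img
dumpe2fs -h /tmp/fs.img 2>/dev/null | grep '^Block size'   # reports 4096
```

On a real partition you would pass the device name instead of the image file.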
(Give the penguin time. It is still growing up.)
Cheers,
Ben
user defined functions (Score:2)
--
two other alternatives (Score:1)
DISCLAIMER: I work for IBM, but not on Linux or DB2.
Re:Two Comments... (Score:2)
PostgreSQL and Oracle -- the only two choices. (Score:1)
So, let's look at the pros and cons:
Oracle:
Pro: This is THE RDBMS. Really. There is no RDBMS better than Oracle. Period.
Con: Expensive, like any other commercial RDBMS.
PostgreSQL:
Pro: Open Source.
Pro: Widely respected -- only Oracle has a more experienced codebase.
Pro: Actively supported.
Pro: Free.
Pro: In the words of Philip Greenspun, who is an unabashed Oracle fan:
"> The open source purist's only realistic choice for an RDBMS
> is PostgreSQL (see Resources for the URL). In some ways,
> PostgreSQL has more advanced features than any commercial
> RDBMS. Most important, the loosely organized, unpaid
> developers of PostgreSQL were able to convert to an
> Oracle-style multiversion concurrency system (see below),
> leaving all the rest of the commercial competition
> deadlocked in the dust. If you've decided to accept John
> Ousterhout as your personal savior, you'll be delighted
> to learn that you can run Tcl procedures inside the
> PostgreSQL database. And if your business can't live without
> commercial support for your RDBMS, you can buy it
> (see Resources for a link). (From a LinuxWorld article on AOLserver)
Con: not as SQL92 or SQL3 compliant as Oracle
Con: no referential integrity (yet)
Con: absence of outer joins (right now)
Con: recovery is not quite there (look for the next version...)
PostgreSQL whips up on MySQL once the word 'transaction' enters the picture.
PostgreSQL's multiversion concurrency control (MVCC) ensures that readers never block writers, which makes deadlock far less likely even in the presence of multiple concurrent transactions. With large databases processing large inserts, the deadlock issue is a very real one.
And, most importantly, unlike many other so-called RDBMS's, PostgreSQL passes the RDBMS ACID test.
Lamar Owen
MVCC (Score:1)
Sure it is
MVCC is a Very Good Thing, and is really the right way to implement multi-user concurrency; remarkably the high-dollar databases don't do it, probably because doing so would be a major change from their locking-based legacy designs. (Interbase does it, BTW)
Re:MVCC (Score:1)
I read somewhere that Oracle can do a form of MVCC, but in that mode of operation it is drastically slower due to implementing MVCC in an inefficient manner, something having to do with transaction logs.
(Of course, I don't recall the source, and it could be completely inaccurate.)
Can an Oracle guru comment on this?
Re:Let the database do it! (Score:1)
It seems to me that doing a big Oracle system (with the dozen or more recommended separate drives) without RAID would be a bad idea from an uptime point of view. Even if you can recover to the last millisecond, the odds of one of those drives dying (and requiring that the system be down to rebuild) are just too high.
Use Oracle. (Score:1)
When 8.05 came out for Linux I installed it on my Linux server and I am very pleased with the product. Admittedly I am biased because I use Oracle on HP/UX and NT on a daily basis.
I have used Oracle for about 6 years now and I am more and more impressed by this product every day.
It is a complex system, but you can get classes from Oracle, and from many other companies. If you don't like the cost/time involved in classes then just go to any book store and you will find tons of books on all the various aspects of using and administrating Oracle.
Oracle has many things going for it. Point-in-time recovery, massive scalability, and it supports most hardware/software platforms.
It has years of real life usage behind it. I think the product has been around for 10-20 years now.
Another option is DB2. I don't have any personal experience with this product but I have heard nothing but good things about it.
Chris Kraft (krafter@zilla.net)
Oracle 8i on Linux (Score:3)
I would say that table size going over 2Gb is not a problem. The tables go in tablespaces, which can be made up of many datafiles.
PL/SQL is useful for quick and dirty code, but lags a little behind C in performance. If you can master it, the Oracle Call Interface with C will produce very fast code.
I'd also recommend reading a good book both on Oracle Design and on Tuning. There's a couple by O'Reilly, oddly enough.
Re:Some good information... (Score:2)
As far as I know, MySQL only supports table locking, rather than the row locking of Oracle (and perhaps Postgres). We started using database logging for our Apache server, so that we don't eat so much CPU running analog - Statistics would just be SQL calls to the database.
We started with MySQL and ended up postponing the project - MySQL just couldn't keep up. The hardware isn't the problem, it was a DS/20 w/ 1 GB of RAM, running Digital UNIX 4.0F.
Since we have other projects, this one has been on hold for some time, waiting for one of us to port our logger to Oracle.
Just thought I'd mention my experiences with it. I'm not knocking MySQL - We still use it for a decent number of mostly reader databases, but for intensive writers - no way.
Disclaimer: I'm not a DBA, nor do I pretend to be. If anyone has information that contradicts this, it's entirely possible that I'm wrong.
Solid (Score:2)
My experience is that Solid is a very robust, easy to set up, low maintenance RDBMS (with referential integrity checks, transactions, etc.) that isn't open source. They seem to now be fancying themselves a new Oracle, so I'd just stick with Oracle if you need something out of this class. Kinda sad, because Solid has been around with Linux since the early days as a low cost solution, but new management has come in determined to tear the company down, it would appear...
DB2 Pricing (Score:1)
Stonebraker (Score:1)
Re:Linux Disk Management/Journalling (Score:2)
Using a journalling file system for the partition holding database tables will accomplish pretty much nothing (but it won't do much harm either).
Journalling file systems in general don't help much with data integrity either; the kinds of guarantees they usually make are minimal, they don't protect you against many sources of data loss, and they don't even guarantee fast system startups.
Re:Are you considering an Alpha? (Score:1)
We manage shitloads of data running on Oracle 7.3.3, 7.3.4 and 8i all on DEC/Compaq machines.
Runs fine. I second the call to bring AdvFS and LSM to Linux. Vinum (for FreeBSD) is good, but AdvFS is the real deal.
Re:Why not Informix? (Score:1)
Transactions with rollback (we mainly do SELECTs, and because we don't do transactions, we can be much quicker on everything else). We will support some kind of atomic operations on multiple tables, though. Currently atomic operations can be done with LOCK TABLES/UNLOCK TABLES but we will make this more automatic in the future.
On the other hand, at the O'Reilly open source conference David Axmark told me they eventually plan to support the complete ANSI92 SQL spec. So in this case 'not now' may not be 'never.'
To beat on a drum I have already worn out: not everyone understands what database people mean when we say transactions. PHBs and product specifiers, particularly. It really just means support of the SQL-standard COMMIT/ROLLBACK feature. Operationally, this means a lot of heavy hauling for the database; just ask an Oracle engineer about how they do it (as I did once).
If you don't really really need "transactions," it's worth thinking carefully about how to turn off the feature in the DBMS you're using (some do allow that), or use one like MySQL that doesn't.
MySQL's developers made a deliberate choice here; it optimizes the program for some applications and rules it out for others.
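As a sketch of what COMMIT/ROLLBACK actually buys you, here is the feature in miniature, using the sqlite3 command-line shell as a stand-in engine (an assumption for illustration only; the semantics are the same in any transactional DBMS):

```shell
# Hypothetical demo: an overdraft attempted inside a transaction is
# undone by ROLLBACK, so the stored balance never changes.
sqlite3 /tmp/txn-demo.db <<'EOF'
CREATE TABLE accounts (name TEXT, balance INTEGER);
INSERT INTO accounts VALUES ('alice', 100);
BEGIN;
UPDATE accounts SET balance = balance - 500 WHERE name = 'alice';
ROLLBACK;
SELECT balance FROM accounts WHERE name = 'alice';
EOF
# prints 100: the UPDATE was discarded along with the transaction
```

An engine without transactions would have applied the UPDATE immediately, with no way to take it back.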
--------
Re:Some good information... (Score:1)
No, the page is still there (Just reloaded...)
And No, it isn't the same page...
With http://www.mysql.com/crash-me-choose.htmy you can choose which databases to compare, thus reducing the size of the result page and enhancing the readability drastically.
Some good information... (Score:2)
http://www.mysql.com/crash-me-choose.htmy
Take a close look at the MySQL solution; it really rocks at large databases!
Re:4D (Score:1)
Re:better get a mainframe (Score:2)
For that size of file growing that fast, I doubt that an I*86 would be the right answer. But it depends... given low use and tolerance for waits, it could be.
Backups could be a real drag though!
Let the database do it! (Score:2)
Last I heard, maybe it's changed, Linux didn't support raw partitions.
Re:Perhaps Terradata (Score:1)
largest Teradata database, and write code against it, so I'm a little biased here), but unless you are hitting 60 TBs or so, I'd look at a more user-friendly package (Oracle 8i, DB/2 or Informix, though Informix's drivers bite). I would recommend taking a serious look at Oracle 8i (forget the stuff about it being so new that it's buggy; take a look at the high-end customers using it, and the comments they have about it). Quite a few of the online stores and data warehouses I've seen have used 8i with super success. Be aware, however, that you may have to balance cost vs. performance unless you happen to have an extra couple (or more) grand to toss around. I'd try a demo of all the products and pick which one seems right to you. I would, however, stay away from MySQL (you *MUST* have good transaction behavior and good cursor support for multi-gigabyte databases, unless you happen to enjoy downtime...). I've never toyed with any of the other freebies, so I can't comment on them...
Tadghe
Re:What about sybase RAW IO ? (Score:1)
Not quite. You can download RPMs of ASE 11.0.3 for development and evaluation purposes, but when you actually implement an application with it the license fees are the same as the NT version - about $300/user or a flat $50K if the app touches the web.
Oracle's fees are similar but they are pushing use of "application service providers". Basically web hosts that have licensed Oracle instances that you can use for a premium.
BTW, if you want a free set of Oracle Development CDs, check out:
http://www.oracleanswers.com [oracleanswers.com]
Register for the Atlanta or Birmingham seminars. They are fully booked -- you won't get to go. But Oracle will send you a complimentary set of CDs.
Re:Oracle 8.0.5 not 8i! (Score:1)
And I think I should try to find another forum for this discussion. This is a VERY important topic for many Oracle folks...
My big question about this post is:
When you say it blows chunks, are you saying SQL*Plus blows chunks or does the server instance actually give up the ghost? I'll have to try it here at work (I think we've got an 8i instance around here someplace...)
Re:What about sybase RAW IO ? (Score:2)
Oracle 8.0.5 vs. 8i? (Score:2)
We have been going around in circles trying to decide whether to use Oracle 8.0.5 or 8i. As far as I can tell, the only significant advantage to 8i is the JVM. But 8i requires a minimum of 128MB RAM and 256MB swap without the JVM, and 256MB RAM and 512MB swap with the JVM. Note: those are the minimums. And Oracle's not kidding. Don't even try firing up an 8i instance with less than 128MB RAM. Ultimately, the memory requirements will be satisfied. I suspect we'll spec out a Sun Ultra20 with about 1GB of RAM. But we don't want to invest a fortune in the development hardware.
Oracle 8.0.5, on the other hand, runs nicely with as little as 32MB of RAM. Plus, the install is a CUI, whereas 8i requires X (I usually don't install X on my servers).
Besides a few cool new features in PL/SQL, does anyone else have any experience with 8.0.5 vs. 8i that would help sway me one way or the other? Right now we are pretty much set on using 8.0.5...
Re:Two Comments... (Score:2)
Re:IBM DB2 for Linux? (Score:2)
Large Databases on Linux! (Score:1)
When it comes to very large databases, Oracle is a good choice. I have run databases of about 5GB on NT without much problem! And still larger ones on Solaris. It is very rugged. The only thing is that as the database grows you will have to do a lot of tuning to keep the performance within acceptable limits.
Postgresql should be a good choice for a secondary database, i.e., if you can use it for day-to-day data collection and put the collected data periodically into Oracle. Postgres also supports integrity constraints and quick scripting now.
Oracle v. PostgreSQL? No contest. (Score:1)
I have used both, and I have lots of problems with PostgreSQL. Large Object handling in Psql is bad bad bad. I get errors from the database all the time. On the other hand I have dozens of Oracle 8 databases running night and day with no problem.
Oracle has a huge learning curve though. Consider Interbase; it runs on Linux and is very good. It is so simple to admin, no real maintenance required.
Re:Let the database do it! (Score:1)
Oracle does not recommend cooked file db's AFAIK, unless used in conjunction with Veritas vxfs.
Re:Let the database do it! (Score:2)
Reliability in database consistency is crucial. Put a RDBMS on a cached filesystem and you are asking for trouble. When writing to a raw partition, a write is actually committed to disk. When writing to a cached partition that guarantee cannot be made. While there are some filesystems that support specific database engines (vxfs from Veritas comes to mind - it supports Oracle and Sybase), putting any RDBMS on any OS's native filesystem is a Bad Thing(TM).
Database choices... (Score:3)
Even if you limit your choices to one OS, you still have at least five or six great DB platforms.
The choices you've given are by no means the only ones you have, and the reasons you give for narrowing down your choice of engines seem to be pretty sparse.
For instance, Sybase ASE and ASA both support referential integrity constraints. ASE's are more limited than ASA's, but ASA has RI checks that are comparable to what I've seen of Oracle 8i.
All of Sybase's products have T-SQL, which are their programmatic extensions to SQL92. Sybase has a robust C API for writing server extensions.
Sybase ASE is fully capable of handling the data load you've specified. So is Oracle's enterprise-scale product. So are some of Informix's products.
Interesting tidbit about Postgres - parts of it found their way directly into a product called Illustra. Illustra was bought lock, stock and barrel by Informix. Informix has in one of their products the object-polymorphism you crave.
From the PostgreSQL Guide:
Michael Stonebraker spun Postgres into Illustra. For a while after the Informix buyout he was grafted into their executive management. I dunno if he's still there, since Informix doesn't seem to list that information on their site.
Informix still has grants at UCB as well as many, many other colleges. (source) [informix.com] Some of those grants sound interesting...
Now, I realize that the subject was "Linux Databases", but there are *solid* RDBMSs that run on the various BSD's and commercial *nix, too. I mention this because of the large scale data needs. I'm not sure that Linux running on any PC hardware is capable of supporting the amount of data you are looking at. (Notice I said PC hardware...)
Because of that, I'd suggest a commercial *nix, running on a server-class machine and an engine from one of the "big three" database vendors: Sybase, Oracle or Informix.
All of these vendors offer their enterprise-level engines at very reasonable prices, since you are buying at the per-seat level in most cases.
Of course, I'll likely get flamed to hell and back for the above statements because I'm suggesting commercial, non-GPL engines running on commercial, non-GPL OS's on top of hardware that wasn't built by stuffing armfuls of parts from Fry's into a whitebox ATX case.
If I were stuck cobbling things together because of budgetary constraints, I'd go with Linux running on a whitebox and use the free Sybase ASE installation that's bundled with RedHat.
Re:Linux Disk Management/Journalling (Score:1)
Unfortunately, there are not any journalling options available for Linux yet. Stephen Tweedie's ext3fs stuff is only barely in alpha, and SGI hasn't gotten very far through porting XFS. For now, if you're using a database that goes through the filesystem, you're stuck with fsck.
Re:mysql? - have had problems... (Score:1)
Temper this with the fact that we are using a JDBC driver to access it that might be doing "bad things", and the fact that it seemed to occur during development. I'm guessing it was related to killing processes (during debugging) and MySQL not cleaning up dropped connections properly. (Lack of transactions, maybe?)
Re:Use Oracle. (Score:1)
Re:Some good information... (Score:1)
http://www.postgresql.org/docs/postgres/mvcc.htm
Re:Let the database do it! (Score:1)
DB2 all the way (Score:1)
If you want to do big database stuff, you should look at who does it well. The US Patent Database [164.195.100.11] runs on DB2 and is reputedly the largest public database in the world. (I think the numbers were 15 terabytes of compressed data last time I read up on it.)
Free 60-day trial downloads can be snatched up at this fun address [ibm.com].
My main feeling is that this is a better option than something like MySQL because if you are really shuffling around that much data, you need a really robust transaction control engine. Last time I worked with MySQL, it was fast but lacked real transaction control. This may have been added by now, as I haven't checked of late.
Re:Um, is Linux really appropriate for such a (Score:1)
So yes, journalling would be a good idea, but it would also be very sensible to invest in a regular (weekly? daily?) archival solution.
He could always use PostgreSQL, run pg_dump | gzip, and then drop the tables every night.
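A minimal sketch of that nightly routine. The dump stream is stood in by a plain iterable of bytes so the compress-and-date-stamp logic is self-contained; the `netmon` name and directory layout are invented, and in practice the lines would come from `pg_dump dbname` via a pipe:

```python
import datetime
import gzip
import pathlib

def archive_dump(dump_lines, backup_dir):
    """Compress an iterable of dump lines into a date-stamped .gz file.

    In real use, dump_lines would be the stdout of `pg_dump dbname`
    (e.g. obtained via subprocess.Popen); here it is just an iterable
    of bytes so the rotation logic stands on its own.
    """
    backup_dir = pathlib.Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.date.today().strftime("%Y%m%d")
    target = backup_dir / ("netmon-%s.dump.gz" % stamp)
    with gzip.open(str(target), "wb") as out:
        for line in dump_lines:
            out.write(line)
    return target
```

Since pg_dump writes to stdout, compressing on the fly like this also sidesteps the 2GB file limit mentioned in the question, as long as the compressed result stays under it.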
Re:mysql? (Score:2)
From the PostgreSQL Admins' guide (http://www.postgresql.org/docs/admin/manage-ag17
"Since Postgres allows tables larger than the maximum file size on your system, it can be problematic to dump the table to
a file, since the resulting file will likely be larger than the maximum size allowed by your system.
As pg_dump writes to stdout, you can just use standard *nix tools to work around this possible problem:
Use compressed dumps:
% pg_dump dbname | gzip > filename.dump.gz"
etc.
As regards postgresql being "as grown up as Oracle", erm, it's certainly a viable solution, and doesn't suffer from the bloatware problems of Oracle.
However, you might well find that it's not as fast as Oracle or MySQL for adding lots of events as rows in the table quickly - it all depends.
Try it out and see
Re:Some good information... (Score:2)
Rather long-winded but it sums up feature support in ODBC fairly well, I think.
What I'm saying is, there are one or two bits & pieces in the results that I believe are artifacts of the chosen ODBC driver, not of postgresql (which is the one I was looking at most).
And yes, I'd expect some confusion over ODBC's DATETIME type being merely 'DATE' to Oracle, especially if the script was written to look for only one-to-many relationships the wrong way round, as it were.
Re:Be aware of the complexity of Oracle (Score:1)
If you have access, the direct link is http://support.oracle.com/ml/plsql/knowledgebase.
If you can't get access to it I'll be summarising something similar on a new web page about Oracle that I'll be creating in the next couple of weeks. Send me your email and I'll drop you a note when it's ready.
Basically it discusses keeping objects grouped into tablespaces by size and enforcing the same initial and next extent sizes for all objects in a tablespace. It also goes some way toward pointing out why the 'my tables are in 100 extents' panic is a myth.
ZamZ@mzm1.demon.nl is my email BTW
Be aware of the complexity of Oracle (Score:5)
As well as knowledge of tuning the database, you'll have very specific Oracle SQL tuning to do. I've been doing this type of thing for a number of years now and I still don't know 100% of what is needed - there are always new features to get acquainted with.
That said, there are some good resources to get you started. Have a look at http://www.orapub.com for white papers, and before you even start to lay the database down, see the Oracle white paper (from their website) 'How to Stop Defragmenting and Start Living'.
You should also take into account that 8i is a new-ish release of Oracle. Keep an eye on the bug lists for it and make sure you won't be impacted by anything.
Apart from the warnings, I've run around 60G Oracle db's on Linux and as long as your server is beefy enough you should have no problem.
One big thing: if you do go down the Oracle route, have t-shirts made up with 'I've looked at v$system_event today, have you?' written on them, and wear them. If there is one way of finding out what your performance problems are or will be, it's this view. The only real exception is bad SQL, which you need to trap in other ways.
Re:MVCC (Score:1)
Intersystems' Caché (Score:4)
Most people aren't familiar with Caché from Intersystems, but if you're building a very large, sparsely populated transactional database (like an electronic medical record, for instance), it's at least an order of magnitude faster than Oracle, even after doing unnatural things with Oracle. (This info came directly from a major EMR vendor that benchmarked Caché and all the usual suspects and found nothing in the same league for this application. According to his tests, Caché was 30x faster than Oracle out of the box - Intersystems' website claims Caché is 20x faster than RDBMS competitors.)
The technology is quite interesting in some regards: It is very mature, having its roots in the old MUMPS or "M" system (about as old as Unix), but has been updated quite nicely as time has passed to provide quite capable SQL and object-oriented interfaces, even though it is neither an RDBMS nor an OODB internally.
Another nice feature is its distributed caching protocol, which allows you to build a logically huge database server piecewise from a number of distributed servers. A number of large installations, including Boston Children's Hospital (IIRC), use this for their EMR - last I heard, BCH's database was spread over 100-150 servers. This would seem to fit nicely with the Linux way of doing things.
It runs on Linux and a free download is even available on their website, or they'll send you a CD for the asking if you're one of the unfortunates stuck on the end of a phone line.
You can find Intersystems at http://www.intersys.com [intersys.com], not exactly the world's most intuitive URL, so I thought I should include it here...
Re:Linux Disk Management/Journalling (Score:2)
Oracle tablespaces are, in effect, a lightweight, special purpose file system. By keeping the transaction logs and data tables on different tablespaces on different devices, Oracle can gracefully handle the failure of any one hardware device, allowing you to move the affected data to a different device, with NO DOWN TIME. Oracle does a better job of handling this than any operating system filesystem, so running the tablespaces on top of a filesystem just adds bloat.
Speaking of bloat, people who call Oracle "Bloated" need to examine their definition of "bloat". Let's say you have requirements x and y, and piece of software has features X and Y that satisfy them. It also has feature Z that satisfies some other requirement z that you don't have. That doesn't make it bloated, it only means that it might be overkill for your application.
Re:Choose Life, Choose Oracle? (Score:2)
Not that Java doesn't have a place in the database engine, but I'd be uncomfortable using it for everything. Isn't the VM going to be a problem for performance-critical uses like triggers?
Re:Choose Life, Choose Oracle? (Score:2)
What's wrong with PL/SQL is that it doesn't provide anything in the way of object abstraction or a whole bunch of other things you want in a full-blown application development context. But of course, that's a can-o-worms.
It sounds like you're using PL/SQL outside its reasonable problem domain. Web integration? If God wanted us to use PL/SQL for this, he wouldn't have invented Larry Wall.
I like Java too, but I don't want to have to use it everywhere. Do you really want to have every statement in all of your triggers run through the Java VM? What happens when you update a million rows in your database, and get several million trigger executions? Think of all the memory allocation and garbage collections.
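The multiplication effect described above - one trigger execution per touched row - is easy to see in any engine with triggers. A purely illustrative sketch using SQLite from the Python standard library (table and trigger names are made up; the point applies equally to Oracle's PL/SQL or Java triggers):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE routes (id INTEGER PRIMARY KEY, hops INTEGER);
    CREATE TABLE audit (fired INTEGER);
    -- The trigger body runs once for EVERY row the UPDATE touches:
    CREATE TRIGGER trg_routes AFTER UPDATE ON routes
    BEGIN
        INSERT INTO audit VALUES (1);
    END;
""")
con.executemany("INSERT INTO routes (hops) VALUES (?)",
                [(i,) for i in range(10000)])
con.execute("UPDATE routes SET hops = hops + 1")  # touches every row
fired = con.execute("SELECT COUNT(*) FROM audit").fetchone()[0]
print(fired)  # 10000 -- one trigger execution per updated row
```

Whatever the trigger body costs (VM startup, allocation, GC), it is paid that many times, which is the concern with routing every trigger through the Java VM.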
That said, having a language like Java when you do need it is great. Choice is good.
Re:SQL 7 (Score:3)
Re:Be aware of the complexity of Oracle (Score:3)
But... if you aren't constantly under the gun from cranky PHBs, tuning an Oracle database can be _fun_. If you've never worked on a system this configurable before, it's like you were a model rocketer and suddenly were given a chance to fire off a Saturn V. Chances are it will blow up on the launchpad, but if you get everything right the results are spectacular.
Re:4D (Score:3)
4D has a number of major problems.
First, it is not relational in a technical sense; that is, it does not allow for non-procedural programming. It uses a kind of foreign key structure to "relate" (ugh) one table to another, but the programming style is based on traversing links, and it implements that badly (imagine programming in a system which only allowed you one cursor). The programming style this enforces is much more like a hierarchical or network database. The lack of cursors means that 4D programs show poor modularity; whether any piece of code works depends on the global state of the system. This results in code that behaves inconsistently. Furthermore, the weak object model tends to encourage the use of global variables to control system behavior.
4D was a database designed by people with only a vague understanding of database technology, and this shows in their confusing tendency to use technical terms like "project" (not as in project management but as in "project an image") in ways completely unrelated to their accepted meaning.
The 4D file format is at once highly fragile and stunningly slow. Past work I've done recovering crashed 4D databases indicates that it depends critically on internal file pointers, which tend to get corrupted. The internals are not documented, and the tools ACI gives you are utterly inadequate. You absolutely _must_ buy third-party tools from people who have reverse-engineered the structure, and you absolutely must run them any time the database has terminated abnormally. These tools are very slow, so after any abnormal termination you can count on much down time. What's worse is that the combination of slowness and the lack of anything like SQL means that designs almost always have to carry accumulator fields in tables that must be updated (to keep monthly totals, etc.). So in addition to running the third-party tools, you have to gin up your own tools to fix these fields.
The 4D developer environment is one of the most poorly thought out I've ever seen. You MUST lay out your database in a graphical layout window, and "relate" your tables using a connection tool in this window. This works fine for three or four tables, but in a system with twenty, thirty or more tables it often takes minutes of painstaking mousing to figure out the "relationship" between two or three tables.
This points out 4D's real weakness. Things like the "current record" model of programming and the graphical table layout make it fairly easy to get started with a small system, but unless that system stays tiny (in complexity AND number of records) you are bound for trouble.
I could go on and on about the shortcomings of 4D; but I know some developers like it. Different strokes, I guess, but one thing is absolutely certain: 4D is a very BAD choice for large, mission critical applications.
Re: Linux Databases with Huge Tables? (Score:2)
No matter what you choose, you will be spending a good chunk of money on hardware and maintenance staff. A good DBA with experience tuning the database you choose will be critical to success, as will an architect who can address the design issues *before* you invest in the development.
Most of my large RDBMS experience has been with Oracle 7.x and DB/2 on commercial SMP Unix systems. Both have performed quite well when properly tuned and administered. I've also used Sybase for mission-critical apps, but the ones I've worked on weren't that big (1-2GB total vs. a few hundred GB.) I do know of companies in the banking industry who use Sybase for large databases, so I expect it could handle the load as well.
As others have pointed out, your key issues are likely to be backup and restore capabilities rather than a simple "which is better" issue.
I guess it's obvious that I wouldn't recommend an open source solution (Postgres, MySQL, et al.) to a client in this situation. There are a lot of situations where I'd recommend open source, but large mission-critical data repositories isn't one of them.
To me it's not an issue of whether an open source solution *could* handle the job -- it's that I don't have experience with deployments that *prove* it will work. I wouldn't consider products like MS SQL Server for the same reason.
Re:mysql? (Score:2)
The lack of referential integrity has already been mentioned. Far more critical is the lack of transactions! The ability to roll back is indispensable in a sensitive application. Also, Oracle has vastly more sophisticated tools for backup than MySQL has -- another essential element of mission-critical DB administration.
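Both points - referential integrity and rollback - can be sketched with any transactional engine. A toy illustration using SQLite via the Python standard library (the schema and values are invented; the behavior is what MySQL of this era can't give you):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity
con.executescript("""
    CREATE TABLE hosts (ip TEXT PRIMARY KEY);
    CREATE TABLE events (ip TEXT REFERENCES hosts(ip), msg TEXT);
""")
con.execute("INSERT INTO hosts VALUES ('10.0.0.1')")
con.commit()

# A child row pointing at a missing parent is rejected outright:
try:
    con.execute("INSERT INTO events VALUES ('10.9.9.9', 'bogus')")
except sqlite3.IntegrityError:
    pass  # the engine refused the orphan row for us

# And a half-finished batch can be undone as a unit:
con.execute("INSERT INTO events VALUES ('10.0.0.1', 'link up')")
con.rollback()
count = con.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 0 -- the insert was rolled back
```

Without transactions, the only recourse after a partial failure is hand-written cleanup code, which is exactly what you don't want in a sensitive application.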
Oracle is bloated and expensive, and I certainly don't recommend it unless you really do need it. But sometimes you do.
Re:Be aware of the complexity of Oracle (Score:3)
That's quite true (but see my other post about shortcomings of MySQL). I can't help the original poster much because I have no experience with PostGres, but I can tell you that while Oracle is very powerful, the learning curve is very steep, and you'll need a skilled, experienced DBA to really get all of its benefits to work for you.
I have found few things in the world more frustrating than Oracle documentation, not because there's not enough of it, but because you can't see the forest for the trees. The first time I had to administer an Oracle DB I was overwhelmed, and there was almost nothing there to help you learn how to just get started. By now, I feel relatively comfortable with Oracle, but it's taken months of practice.
Linux Disk Management/Journalling (Score:2)
On databases this large it might still take an age for Linux to start up again if it has to fsck the disks.
There are a few journalling options out there, and this is something that is *very* important to look at if you want to ensure data integrity.
Currently other Unix systems do have support for very large disk arrays.
The use of databases on Linux is pretty much a cert, but your big problem will be in disk handling.
Re:Some good information... (Score:2)
is found here... http://www.mysql.com/crash-me-choose.htmy
That page doesn't seem to be available. The closest match is:
http://www.mysql.com/crash-me.html
Which seems to contain the information.
Perl and Monitoring Oracle (Score:3)
There are a number of GPL products which use Perl + DBI. One of the best, IMHO, is Orac which also uses Perl/Tk and so provides GUI access from multiple platforms (Solaris, Linux, NT). Orac offers loads of SQL scripts to help with tuning, or just seeing the layout, of databases. It also provides realtime database monitoring, which is the current thrust for improvement of the tool. You can find it on CPAN, e.g. here [funet.fi].
Another monitoring tool, which is capable of emailing you when it's unhappy as well as putting up current status on a web page is Karma. This is still developing rapidly and is intended for Oracle on Linux. This can be found here [freshmeat.net] on freshmeat.
Hope this is of some use.
Re:Some good information... (Score:2)
But I find the results suspect (NB, I work with Oracle databases, so my opinions are biased too.)
Consider the date & time datatypes:
date, time, timestamp, timestamp with timezone, datetime, abstime, datetime (other version), abstime, reltime, smalldatetime, timespan, year.
Allegedly, MySQL supports 6 of these, PostgreSQL 8, and Oracle only 1.
But that one, despite being called DATE, also includes the time down to the second. If you don't want the time, ignore it.
Similarly, the function list seems to have been needlessly extended to showcase the extra functions in some databases (not necessarily MySQL, but that's where this example comes from):
Oracle doesn't support FROM_DAYS or TO_DAYS (correct; it uses TO_DATE or TO_CHAR respectively, but those aren't listed). It doesn't include WEEKDAY either, but uses TO_CHAR again.
There are more mistakes like this which are not in Oracle's favour. But I don't care enough to rant.
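For reference, FROM_DAYS/TO_DAYS just map between a date and a running day count. Python's `date.toordinal`/`date.fromordinal` make the same kind of lossless round trip (with a different epoch than MySQL uses), which is also all the TO_CHAR/TO_DATE pairing achieves in Oracle:

```python
from datetime import date

d = date(1999, 6, 12)
n = d.toordinal()                 # day count since 0001-01-01
assert date.fromordinal(n) == d  # the round trip is lossless
delta = n - date(1999, 6, 5).toordinal()
print(delta)  # plain integer date arithmetic: 7
```

So "supports FROM_DAYS" vs. "doesn't" is a naming difference, not a capability difference, which is the kind of thing the crash-me comparison obscures.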
--
Too stupid to live.
History (was:Database choices...) (Score:2)
Oracle 8.0.5 (Score:2)
Good install procedure for 8.0.5 on RH6 is on http://joradn.fortwayne.com/oracle/
We currently have around 3 tables with over 2 million rows on each table, and Oracle has so far performed to expectations...
Re:Some good information... (Score:3)
MySQL does not support transactions. The good point is that there is NO way to make a transactional database as efficient as a non-transactional one, so MySQL is faster than PostgreSQL in many benchmarks.
The bad point is that when you need a transaction, you have to perform table-level locks, which are far LESS efficient than the row-level locks of PostgreSQL.
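The overhead in question is largely the per-commit cost. A rough, purely illustrative comparison using SQLite (not MySQL or PostgreSQL - just the nearest transactional engine in the Python standard library) shows why batching work into one transaction matters; actual timings will vary by machine and are not the point:

```python
import os
import sqlite3
import tempfile
import time

db = os.path.join(tempfile.mkdtemp(), "bench.db")
con = sqlite3.connect(db)
con.execute("CREATE TABLE t (n INTEGER)")
con.commit()

# One durable commit per row: every statement pays the transaction toll.
t0 = time.time()
for i in range(200):
    con.execute("INSERT INTO t VALUES (?)", (i,))
    con.commit()
per_row = time.time() - t0

# One commit for the whole batch: the toll is paid once.
t0 = time.time()
for i in range(200):
    con.execute("INSERT INTO t VALUES (?)", (i,))
con.commit()
batched = time.time() - t0

total = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(total)  # 400 -- both loops inserted all their rows
```

The per-row-commit loop is typically much slower because each commit must reach stable storage, which is the "toll" a non-transactional engine like MySQL simply never pays.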
Raw Disks are for speed not size (Score:2)
Cliff, one thing you did not mention was in which dimension the table is "big". If you are talking about many rows, Oracle has table partitioning that allows for very fast indexed searches on "billions and billions" of rows. If you have tables that exceed 2GB, Oracle can handle that quite nicely using standard file systems on any platform. If you have data points that exceed 2GB, you may have a problem that only a change in platform can address. This would require a blob data type, which has limitations (2GB, I believe) when stored in the database. There is a bfile subtype of the blob data type that allows for any size data point, but it is then limited to the file system's max file size, because Oracle stores the data as an actual discrete file.

Yes, Oracle is not the easiest product to administer. But it is intended for use with difficult-to-manage databases, so it has all the bells and whistles necessary to manage large databases and unruly or unwieldy data. The first time you restore the database to 16:42:13 June 12, 1999 (or any other point in time) you will realize the power of the restore features too.
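Oracle's partitioning is built into the engine, but the idea itself can be sketched by hand: route rows into per-month tables, and let a date-bounded query touch only the partitions it needs. Everything here - names, schema, the SQLite stand-in - is invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")

def partition_for(day):
    """Name of the per-month partition table for an ISO date string."""
    return "events_" + day[:7].replace("-", "_")  # e.g. events_1999_06

def insert(day, msg):
    part = partition_for(day)
    # Table names can't be bound parameters, hence the string build;
    # partition_for() only ever yields names of the fixed form above.
    con.execute("CREATE TABLE IF NOT EXISTS %s (day TEXT, msg TEXT)" % part)
    con.execute("INSERT INTO %s VALUES (?, ?)" % part, (day, msg))

insert("1999-06-12", "link down")
insert("1999-06-13", "link up")
insert("1999-07-01", "reboot")

# A query bounded to June scans only the June partition:
rows = con.execute("SELECT COUNT(*) FROM events_1999_06").fetchone()[0]
print(rows)  # 2
```

With native partitioning the optimizer does this pruning for you transparently, which is why indexed searches over "billions and billions" of rows stay fast: most partitions are never touched.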
PostgreSQL is a fine product, but I do not believe it is ready for this size of database in a mission-critical application. (I am assuming you would not be considering collecting this amount of data for a non-mission-critical app.)
I am really surprised we didn't get the "my database is bigger than yours" statements that are typical of the Oracle newsgroups. But then again, we at /. are a finer breed (cough).