Linux Kernel Archives Struggles With Git
NewsFiend writes "In May, Slashdot discussed KernelTrap's interesting feature about the Linux Kernel Archives, which had recently upgraded to multiple 4-way dual-core Opterons with 24 gigabytes of RAM and 10 terabytes of disk space. KernelTrap has now followed up with kernel.org to learn how the new hardware has been working out. Evidently the new servers have been performing flawlessly, but the addition of Linus Torvalds' new source control system, git, is causing some heartache: it has increased the number of files being archived sevenfold."
This is normal. (Score:5, Insightful)
Re:This is normal. (Score:2)
Re:This is normal. (Score:1)
Re:This is normal. (Score:2)
Answering your question: kernel.org holds a lot of stuff, not only kernel-related things but everything from distributions to various utilities, so yes.
Re:why blame git? (Score:1)
Their two problems are:
(1) rsync takes a long-ass time to run when it has to compare a crapload of files. The solution they're working on is to build a better rsync that saves its state; see the sketch just after this list for one stopgap along those lines.
(2) The i386 architecture sucks. FTFA: "master.kernel.org is still an i386 machine. It's constantly hurting for lowmem since the dentry and inode caches can only live in lowmem." The solution for that is to upgrade master.kernel.org to a 64-bit machine.
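For what it's worth, stock rsync can already skip the full tree walk if something else hands it the file list. A rough sketch of a stopgap along those lines (paths and hostnames are invented, and the find-based change detection is my assumption, not what kernel.org actually does):

cd /pub/scm
find . -type f -newer /var/run/last-mirror > /tmp/changed.list    # files touched since the last pass
rsync -a --files-from=/tmp/changed.list . mirror.example.org:/pub/scm/
touch /var/run/last-mirror                                        # remember this pass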
Re:why blame git? (Score:1)
This remark 'i386 sucks because there is little lowmem' is stupid ... let them improve Linux to need less lowmem. Why would you always want dentries and inodes in lowmem anyway? I think they could be swapped into highmem just as well. That would still be faster than not having them in memory at all...
Using a 64-bit arch is a workaround; a solution would be to fix their memory management. If you don't need huge user processes, just increase the kernel address range to 3 GB and userspace to 1 GB. No more problems with lowmem.
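For the curious: on x86 kernels that expose the user/kernel split as a build-time choice (an assumption about the kernel at hand, not something from the article), that 3 GB kernel / 1 GB userspace layout shows up in the config like this:

grep -E 'VMSPLIT|PAGE_OFFSET' .config
# CONFIG_VMSPLIT_1G=y            -> 1 GB of user address space, 3 GB for the kernel
# CONFIG_PAGE_OFFSET=0x40000000  -> the kernel mapping starts at the 1 GB mark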
Re:why blame git? (Score:2)
The whole point of a database is to isolate you from the actual representation of the data on disk and to make querying for data easy, so you don't have to parse those files at all! For disaster recovery, I pity you if you prefer to try and extract the data manually from the files on disk.
Re:why blame git? (Score:5, Insightful)
*snicker*
*laugh*
*great rolling peals of laughter*
*sigh*
*wipes tear from eye*
You haven't done much work that actually required databases (or that would massively benefit from a relational programming model). The whole point of moving from flat files to a database is so that the data is stored already parsed, recovery is done by a tool provided by the db vendor, and manipulation is done within rules (constraints) that prevent "programming accidents" (bugs) or "pilot error" (users) from breaking relationships between parts of your data. That eliminates most of the need for recovery right there.
CM systems get much more powerful and, IMHO, simpler when you start using a decent database as the backend. As for distributed work, there are plenty of good databases that inexpensively and easily fit onto any modern workstation (PostgreSQL is my personal favorite) that can act as a local backing store, giving you fully detached functionality and the benefits of a relationally organized system.
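To make the constraints point concrete, here's a minimal sketch against a local PostgreSQL (database, table, and column names are all invented for illustration):

psql scmdb <<'EOF'
-- A revision must always refer to an existing file; the foreign key
-- makes it impossible for a buggy tool to orphan a revision.
CREATE TABLE files     (id serial PRIMARY KEY, path text NOT NULL UNIQUE);
CREATE TABLE revisions (id serial PRIMARY KEY,
                        file_id integer NOT NULL REFERENCES files(id),
                        contents bytea NOT NULL);
-- This fails with a foreign-key violation instead of silently
-- corrupting the repository:
INSERT INTO revisions (file_id, contents) VALUES (9999, 'junk');
EOF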
Regards,
Ross
Re:why blame git? (Score:1)
(that's developer-speech for a boring but feasible project that would make you shove your buzzword db-admin-speech up your arse)
Point #1:
A quick google search yielded a [territory.ru] few [sourceforge.net] links for "sqlfs".
Now, are you really talking about a filesystem implemented in a relational database? You're pretty confused if you think you contradicted what I wrote. That's exactly what I'm advocating, except that I'm advocating that this database-backed filesystem also be CM-aware.
For this particular file-centric appl
Re:why blame git? (Score:1)
Re:why blame git? (Score:2)
As for your assertion that tree-like data is a poor fit for relational programming, it's an issue of having a deeper understanding of relational programming (a "kind" of programming parallel to procedural or object-oriented programming). Trees fit perfectly.
Linus needs to add 2 more programs to Git (Score:2)
Re:Linus needs to add 2 more programs to Git (Score:1)
Re:Linus needs to add 2 more programs to Git (Score:2)
same reason I dislike Subversion (Score:2, Interesting)
Re:same reason I dislike Subversion (Score:1, Informative)
Re:same reason I dislike Subversion (Score:1)
reiser4 + VCS? (Score:3, Interesting)
touch bar                                  # create the file (revision 0)
echo 'foo' > bar                           # modify it (revision 1)
revisions bar                              # hypothetical command: prints the revision history
cp bar/revision/1 bar-version-1.0.backup   # pull an old revision back out via the file-as-directory interface
Granted, yes, the storage requirements and CPU usage might be horrific, but I think something like this is inevitable in file systems, and I certainly welcome the day it becomes a reality.
Re:reiser4 + VCS? (Score:2)
Really? (Score:2, Funny)
Re:Really? (Score:2, Funny)
File System Scalabilty? (Score:2)
Sounds like a software engineering issue.
Re:File System Scalabilty? (Score:2)
Re:File System Scalabilty? (Score:2)
Re:File System Scalabilty? (Score:3, Insightful)
Actually I spent hours trying to grasp his ideas about git, and it clearly shows that he gave it a lot of thought. I think another SCM has already started integrating git code into their own system.
Re:File System Scalabilty? (Score:3, Interesting)
Next, I'm not ignoring speed: you can scale a database system up arbitrarily far. Since database systems support ACID transactions (i.e. line/file source-code locking for the duration of a transaction), you can have multiple merges going on at once, and thus effective speed is much, much better. For example, Amazon.com uses Oracle as their backend. Think about the number of users
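The locking half of that claim is easy to sketch against, say, PostgreSQL (the schema is invented for illustration):

psql scmdb <<'EOF'
BEGIN;
-- Lock only the rows this merge touches; a second merge working on a
-- different subtree proceeds concurrently instead of queueing up.
SELECT id FROM files WHERE path LIKE 'drivers/net/%' FOR UPDATE;
-- ... apply the merge against the locked rows ...
COMMIT;
EOF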
Well, what filesystem are they using? (Score:2)
On the flip-side, if kernel.org is using XFS, JFS, Reiserfs (I doubt they'd risk Reiser4 yet) or any other very high-performance filesystem, then maybe the problem is one of organization.
It is rare that you actually need large numbers of files holding very small amounts of data or metadata. What is probably wanted is a virtual layer that all
perhaps this might help (Score:1)
This interview with the maintainers has a comment from somebody who claims he asked by email and got the reply that ext3 is used.
If that's not good enough, there's an educated guess: since "At this time, the servers run Fedora Core and use the 2.6 kernel provided by RedHat," they might well be using ext3, which is the default there.
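(If you would rather check than guess, on a box you do have access to, the filesystem type is one command away; /home is just an example mount point:)

df -T /home               # prints the filesystem type backing /home
grep ' / ' /proc/mounts   # or read the root fs type straight from the kernel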
Re:perhaps this might help (Score:3, Interesting)
Since the "smart" way to run such a server is to have the main FS on one disk and the data on another (this avoids seeking the head back and forth), the data partition can be just about anything.
Now, the fact that the maintainers have said they are using Ext3 is rather more convincing to me. Foolish beyond belief, but convincing. I would rather use a "less reliable
Re:Well, what filesystem are they using? ext3 OK (Score:3, Interesting)
We did some tests comparing reiser3, xfs, and ext3 with the dir_index option on 2.6 kernels. We were writing thousands (OK, tens of thousands) of small files into a couple of directories (specialized app, you don't want to know).
When directories got large, ext3 with the hashed lookups (between 800 and 1500 creations per second on newish hardware) ran much faster than xfs, oh, and several orders of magnitude faster than ext3 without the directory hashing.
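(If anyone wants to reproduce this: dir_index can be switched on after the fact, though only directories created afterwards get hashed until the old ones are rebuilt offline. The device name is a placeholder.)

tune2fs -O dir_index /dev/hda2   # enable hashed (HTree) directories on an existing ext3 fs
umount /dev/hda2                 # e2fsck -D needs the fs offline
e2fsck -fD /dev/hda2             # -D rebuilds/optimizes the existing directories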
See? I Told You So! (Score:3, Funny)
Filesystem? (Score:5, Interesting)
Re:Filesystem? (Score:5, Informative)
Ext3 vs. Reiser is not an issue here. FWIW, I use XFS on my mirror volume, and I have also noticed how the git repository increases load on my server. See the CPU usage graph [linux.cz] of ftp.linux.cz - look especially at the yearly graph and see how the CPU system time has been increasing for the last two months.
The problem is in rsync - when mirroring the remote repository it has to stat(2) every local and remote file, so the directory trees have to be read into RAM. Hashed or tree-based directories (reiserfs or xfs) can even be slower here than plain linear ext3 directories: since you have to read the whole directory anyway, a linear read is faster.
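You can actually watch rsync do this; a quick sketch (the paths are made up):

# Syscall summary for a dry-run mirror pass; lstat()/stat() calls dominate:
strace -c -f rsync -an /pub/scm/ /mirror/scm/ 2>&1 | tail -20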
10 TB (Score:3, Funny)
Kernel sources take up, what, only a handful of gigabytes?
seven fold = 2^7? (Score:2)
Re:seven fold = 2^7? (Score:1)
http://dictionary.reference.com/search?q=sevenfol
Same reason a trifold wallet has three sections, not four.
I blame Linus (Score:1)
You're responsible for all the world's problems! The Linux kernel, bitrot on my CDs, war in Iraq, Guantanamo Bay, and now git!
Come on Linus, clean up your act!
(Sorry if this offends *anyone*)