Question gzip Maven Jean-loup Gailly

Jean-loup Gailly is the author of gzip and, now, CTO for Mandrakesoft, purveyors of Linux-Mandrake. Jean-loup's home page tells you quite a bit about him, including some interesting peeks into his life beyond Linux and open source software. Please try to keep it down to one question per post. Submitted questions will be chosen from the highest-moderated. Answers will appear within the next week.
  • by Anonymous Coward

    Since so many things use zlib, and the algorithm and format hasn't changed (and won't change, thanks to RFCs 1950 & 1951), do you think it would make sense to have a hardware implementation of deflate/inflate, with hooks added to zlib to use it?
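For readers unfamiliar with the two RFCs mentioned: RFC 1950 is the zlib-wrapped stream (a 2-byte header plus an Adler-32 trailer around the compressed payload) and RFC 1951 is the raw deflate bitstream itself. A minimal sketch of the difference using Python's zlib bindings (an illustration added here, not part of the original question):

```python
import zlib

data = b"the quick brown fox jumps over the lazy dog" * 20

# RFC 1950: zlib-wrapped deflate (2-byte header + Adler-32 trailer).
zlib_stream = zlib.compress(data)

# RFC 1951: raw deflate, no header or checksum (negative wbits selects it).
comp = zlib.compressobj(wbits=-15)
raw_stream = comp.compress(data) + comp.flush()

# Either stream decompresses back to the original bytes.
assert zlib.decompress(zlib_stream) == data
assert zlib.decompress(raw_stream, wbits=-15) == data

# The raw stream omits only the 6 bytes of zlib wrapper overhead.
assert len(raw_stream) < len(zlib_stream)
```

A hardware deflate engine would only need to produce the RFC 1951 payload; the RFC 1950 wrapper is cheap to add in software.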

  • Whatever happened with LinuxOne? The last I read, their IPO was due on January 31. Was it postponed, and if so, why?
  • I think that the RPM package format relies upon cpio, as evidenced by the "rpm2cpio" utility.

    I think I've seen .cgz files lying around Red Hat's FTP server. I think they are in the install images as well.

    cpio is much more awkward to use. Perhaps this poster should be sentenced to it for a few months?

  • xplod() { zcat "$1" | tar -xvf -; }   # an alias can't take $1; a shell function can
  • I like Mandrake. But, one of the reasons I have had a hard time recommending Mandrake as a first time Linux distribution to friends, family, and co-workers... is concerns about quality.

    With RedHat, Caldera, and SUSE I have the impression that a distribution undergoes rigorous quality assurance testing before being released. Mandrake seems to put more emphasis on user interface issues, eye candy, and being there first as an early adopter of new features, functionality, and code.

    In the past, I've assisted with testing and debugging packages in the Cooker development distribution. The process by which Cooker was transformed into a release distribution happened behind the scenes and off the mailing lists.

    I'm curious as to why the process of creating a release version of a distribution doesn't happen in the same open development fishbowl in which development takes place? I've been somewhat surprised by failures of simple things like:

    • Bad or Missing package dependencies
    • Failure to completely uninstall packages
    • The large footprint of the smallest installable set of packages for a Mandrake Distribution
    Things which I would expect to have been covered in regression tests by a quality assurance group.

    Could you talk a little bit about the process of quality assurance and testing a new distribution at Mandrake? How many people on staff does Mandrake have working full-time in this capacity? Where do Mandrake's strengths and weaknesses lie in providing a high quality distribution (in the context of the common problems all distributions face)?

  • Mandrake hasn't been a repackaged version of RedHat since at least 6.0. Sure, there is cross-pollination, but that's different. And since RedHat is the de facto standard to the pointy-haired people, they work hard to make it 99.999% RedHat-compatible.

    Also, in 7.0 they have 3 prepackaged installs (paraphrased): desktop, development, and server.

  • by Denny ( 2963 )

    Yeah, the joke's really on me there, I thought othello and go were the same thing... guess I should look go up online some more then, figure out what the diff is!

    Thanks for the enlightenment people :)

    # Using Linux in the UK? Check out Linux UK []

  • To save you asking questions that other KDE developers have already answered this week, you could try reading this story [] on Linux UK [] which is an interview with Mosfet (a KDE developer). Mosfet also works at Mandrakesoft...


    # Using Linux in the UK? Check out Linux UK []

  • Um, Iagno is the GNOME version of Othello (a.k.a. Reversi), not Go. The name comes from Shakespeare's play by way of GNOME name-mangling. :-)

    I agree, the pieces in Iagno are very nicely designed: it'd be nice to see them in a GNOME Go game.

  • Will we ever see a zlib 2.0? I'll buy you dinner if it happens this year. :-)

    (And for those who don't track Freshmeat, the zlib home page is now b/infozip/zlib/zlib.html []. Please check that first before reporting bad links on other copies.)

  • Gzip here (as at every Unix shop) is a great tool ... used daily by almost everyone.

    My only complaint is its single-threadedness.
    As multiprocessor systems have become commonplace and data sets have grown with cheap hardware, a lot of folks find themselves gzipping enormous files ... and waiting ...

    Are you still active in gzip development?
    I haven't been able to find any info online about a multithreaded version of gzip, so I'd like to take this up as a personal project.
    Is anyone (you?) working on this? How would you prefer people to contribute to gzip?

    -rob zXXt@jXbtrak.cXm (s/X/o for non-spam)

  • I think the next big breakthrough will be a compressor that can take a file with not much repetition of data (therefore hard to compress using current algorithms) and create a file with much more repetition in it (and perhaps larger) and then compress that down.

    Have a look at the paper describing the bzip algorithm; this is pretty much exactly what it does. The idea is that, with a bit of care and twiddling, you can partially sort the file so that similar bits go together, but in such a way that the sorting can be undone to get the original file back.

  • I don't think Ada is anything like as hideous as it's made out to be; Ada95 feels to me rather like a friendlier C++ (at last! interface and implementation files with dependencies handled by the compiler, rather than explicit #include commands), though (at least in the version I've played with, the GNAT Win32 release) the libraries available don't seem as good as the STL.

    It's possible that the Jargon File is referring to Ada83, which was apparently a good deal more hideous.
  • I haven't been able to find any info online about a multithreaded version of gzip, so I'd like to take this up as a personal project.

    I have a feeling that the gzip algorithm is rather thoroughly serial and so would work very badly if multi-threaded at a fine-grain level; I suspect you won't get much advantage over just

    tar cf file.tar [files]
    Split file.tar into N pieces, where N is twice the number of CPUs you have
    gzip all N pieces in parallel
    tar up the gzipped bits

    Of course, this isn't going to work very well on streams; you'll have to construct the full tar file first. If you want to work with streams, you could do something hideous like sending block M to file M%N before doing the gzip-in-parallel - a sort of N-way version of tee - but this'll disrupt locality horribly.

    Neither method will produce files compatible with normal gzip, which is another teeny little problem.
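One mitigating detail worth noting: the gzip format allows multiple members to be concatenated, and standard gzip -d decompresses them back-to-back. So compressing fixed-size chunks in parallel and joining the results does produce a file ordinary gzip can read, at a small cost in ratio since matches can't cross chunk boundaries. A rough Python sketch of the idea (an illustration, not anyone's actual tool; zlib releases the GIL, so threads genuinely overlap):

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

CHUNK = 128 * 1024  # compress the input as independent 128 KiB chunks

def compress_parallel(data: bytes, workers: int = 4) -> bytes:
    """Compress chunks concurrently and join them into one gzip stream."""
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        members = pool.map(gzip.compress, chunks)
    # Concatenated gzip members form one valid multi-member gzip stream.
    return b"".join(members)

data = b"some highly repetitive payload " * 100000
out = compress_parallel(data)
# Standard gzip -d (and Python's gzip module) reads all members in turn.
assert gzip.decompress(out) == data
```

The per-chunk ratio loss is usually small for chunk sizes well above gzip's 32 KB window.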

  • As someone who shifted to Linux-Mandrake from RedHat due to compilation optimizations and the various improvements made by the Linux-Mandrake team, I was flabbergasted by the atrocious installer program in 7.0. Apart from the dreadful choice of colors, 'expert' mode is the most labour-intensive install configurator I have ever seen. At this time I have ceased recommending Linux-Mandrake to any of my customers.

    Do you think the low quality of this install program is due to the fact that its developers lacked the feedback typical of OSS development, and how quickly is this application going to be overhauled?
  • Maybe you should use a font in which 1 and l look different.
  • The Unix philosophy is: "make things do one thing, and let them do it right". IMO this is one of the most important causes of the power of Unix.

    And what's so difficult about using the 'z' switch to tar? Better to go back to Windows, then.
  • Hi!

    I would like to ask: what made you write the gzip utility?
  • I think IBM is actually the major patent holder on arithmetic coding ( backs me up on that one).

    Also, although it can certainly vary, I wouldn't say that arithmetic coding is faster than Huffman. It has traditionally been considered slower (although hardware advances may have negated this).
  • Linux-Mandrake has been quite good on GNOME support in the past (not great, but good), but it has always promoted itself as a KDE-based distro. I think that this was a wise choice in the past, as GNOME was not very stable until October GNOME. But now that a "real" stable version has been released, will Linux-Mandrake consider replacing KDE with GNOME as the default desktop environment in the future? Considering how wonderful the Helix GNOME release is, and also the work that Eazel is putting into the GNOME shell, I would be curious to see if this is even being considered.
  • Jean-Loup's homepage says: From 1981 to 1989 I participated in the construction of code generators and real-time systems for the Ada language within Alsys (renamed first Thomson Software Products then Aonix). I was a member of the Ada language design team.

    Ok, before anybody goes "eeeew, Ada!", let me say that it's impressive to see that Jean-Loup has a long history of doing Important Things. Designing a language is certainly cool.

    However, the infamous Jargon File has this to say about Ada:
    A Pascal-descended language that has been made mandatory for Department of Defense software projects by the Pentagon. Hackers are nearly unanimous in observing that, technically, it is precisely what one might expect given that kind of endorsement by fiat; designed by committee, crockish, difficult to use, and overall a disastrous, multi-billion-dollar boondoggle (one common description was "The PL/I of the 1980s"). Hackers find Ada's exception-handling and inter-process communication features particularly hilarious. Ada Lovelace (the daughter of Lord Byron who became the world's first programmer while cooperating with Charles Babbage on the design of his mechanical computing engines in the mid-1800s) would almost certainly blanch at the use to which her name has latterly been put; the kindest thing that has been said about it is that there is probably a good small language screaming to get out from inside its vast, elephantine bulk.

    I'm curious about Ada, yet completely ignorant (and thus neutral) regarding Ada. However there seem to be quite a few people out there who absolutely hate it. Could you enlighten us as to how you feel personally about the Ada programming language, or perhaps say a few words on behalf of Ada?

  • I thought gzip only handles .gz files... zcat and compress/uncompress are supposed to handle .Z... and is .Z LZW-compressed, or only LZ-compressed?
  • Whatever your 'magical' method or algorithm would be, it is *mathematically impossible* to have an algorithm that works on arbitrary files of say N bits and compresses each of them to N-1 or fewer bits. The mathematical proof is obtained by counting the number of files you consider as possible input (2^N) and the number of possible output files (1+2+...+2^(N-1) = 2^N - 1). There are fewer target files than source files, so whatever algorithm you come up with, there will be two files that are 'compressed' to the same file, so you wouldn't be able to properly decompress.

    Of course you could put the original file into the decompression binary and simply compress to a single 1 bit, or to the empty file for that matter, but in that case you need a different decompressor for each compressed file, and hence you should add the size of the decompressor to the compressed size.

    Btw, Signail11 already did a very good job at explaining all of this, and he certainly understands 'the most basic tenets of data compression'. And nobody said you're an imbecile. It is very natural to think that all files must be compressible somehow, since in your and everybody's experience all files are compressible by at least a factor of 2. But those are text files and binaries and contain repetitive patterns. In general it is not possible to mangle data to create repetition, as mathematics shows.
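The counting argument above can be checked exhaustively for a small N: there are 2^N inputs of N bits but only 2^N - 1 strings of fewer than N bits, so any "always shrinks" compressor must map two inputs to the same output. A quick sketch (an editor's illustration of the argument):

```python
from itertools import product

N = 8
# All 2^N possible N-bit inputs.
inputs = [''.join(bits) for bits in product('01', repeat=N)]
# All strings strictly shorter than N bits (including the empty string).
shorter = [''.join(bits) for k in range(N) for bits in product('01', repeat=k)]

assert len(inputs) == 2 ** N          # 256 possible inputs
assert len(shorter) == 2 ** N - 1     # only 255 possible shorter outputs

# By the pigeonhole principle, any map from inputs to strictly shorter
# strings must send at least two inputs to the same output, so it cannot
# be losslessly inverted.
assert len(inputs) > len(shorter)
```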

  • Simply because each unix command should do one thing and do it well.

    If it really annoys you, it is trivial to write aliases:

    alias mygzip='tar cvfz'
    alias mygunzip='tar xvfz'

    alias mybzip2='tar cvfI'
    alias mybunzip2='tar xvfI'

    And while you're at it, you can call your files .tgz


  • Sparc and Alpha versions are already available AFAIK. Check their website.
  • Use the text install next time sir.
  • How do you feel about closed-source software incorporating support for zip/gzip files via ZLib? Is this what you originally intended? Is there any way closed-source developers can give back to the ZLib project (other than buying copies of Linux Mandrake? ;)
  • Shouldn't that be Jean-Stéphane?
  • I think you'd better do a checkup on GNU tar... You can filter input/output through gzip with the -z option. To decompress a tar.gz file, you type
    tar -zxf arch.tar.gz
    Not very hard, is it?
    You can actually use aliases or scripts to filter through an arbitrary compression/decompression program, as well as automagically invoke the correct decompressor; I believe that is covered in the BZIP2 howto.

  • Hehe, I'm sitting here laughing my ass off imagining someone playing Othello and wondering why Go is known as the most complicated game in the history of mankind...

  • You seem to believe that data compression in itself is impossible... a great new algorithm could definitely be on the horizon. Not a takeoff on the Huffman algorithm like most lossless ones are, but a completely new way of doing it; maybe someday someone will make a breakthrough with fractal compression (it's in use now, but it's not too fantastic and is usually used in lossy compression like RealVideo) or something along those lines. It's all a matter of being able to find patterns in the data and reference those patterns instead of reproducing them.

    There are limits, and no compressor can be universal (in the sense that it can't have universal results on radically differing file contents). However, your assumption that a 4-bit file can only represent 16 possibilities is quite wrong. The world's most limited but most efficient compressor is one that takes a 1-bit file and reproduces one entire file if it's a 0 and another if it's a 1; the whole files are just stored in the program, which checks for a 1 or a 0.

    I think the next big breakthrough will be a compressor that can take a file with not much repetition of data (therefore hard to compress using current algorithms) and create a file with much more repetition in it (and perhaps larger) and then compress that down.

  • Doesn't Jean-loup translate to John-wolf? How did you get this name? Is there a story behind it? Is it because you're a hit with the ladies? Or (and I hope not) because you're a wolf-in-sheep's-clothing, i.e. a M$ spy?
  • When will Mandrakesoft be on the Nasdaq and the French Nouveau Marche?
  • s/corel/mandrake

    Corel is great, Corel is not only a hype, Corel is easy to install, Corel install has a great partitioning tool, the download version of Corel is really a full package, after Corel had finished I wondered: Okay and when are we going to start installing hardware? Turned out it had already done that part...

    rm s/corel/mandrake

    While you pronounce Mandrake as "Correll", you write Mandrake7, capisce? And all the media hype about "Correll" is true, besides the tiny mistake that they keep saying it's a Canadian company, while it is French, of course...

  • And what's so difficult in using the 'z' switch to tar?

    I use SunOS 5.6 for coursework at school, and one thing that really annoys me is that the version of tar on the school's system doesn't have the z switch, which forces me to use two commands when only one should be necessary. In cases like that, it would be better to have a program that does archiving and compression in a single step.

  • Jean-Loup is a quite common French first name. As everyone knows since the masterpiece "Godzilla", we French-speaking people all have a first name beginning with "Jean" ;))

    Loup doesn't exist as a first name by itself, only with the "Jean". And yes, loup means wolf in French.

  • LOL !

    I was expecting this question ;)

    Actually I don't live in France, but in Belgium. That's probably what saved me ! Jean-Stéphane... It sounds horrible !

    Uh-Uh... Yeah! (Puff Daddy ruining Led Zeppelin, Godzilla soundtrack.)

  • Neither gzip nor bzip2 uses arithmetic coding. Why?
  • Trying to run KDE on a 486 is a recipe for a very slow computer. I don't like saying this, but Windows95 actually runs much faster than KDE/GNOME on older computers.

    Does anyone know why Linux desktop systems are astoundingly slow on a 486, yet run faster than Windows on modern machines?

  • A good coverage of the use of wavelets for still-image compression and video compression is chapter 11 of Mallat's book. To avoid confusion, for those of you who don't know about this: lossy compression (usually the case for image, sound, video, and signal compression in general; the JPEG and MPEG norms are two examples) is a completely different thing from lossless compression (as provided by gzip or bzip2). Wavelets are mostly used for lossy compression; some research is currently being conducted on lossless compression with wavelets, but don't expect a drop-in replacement for gzip or bzip2. It may just be useful for compression of transient signals such as natural images. There are still a few areas in which photo-interpreters refuse lossy compression (though this is less and less the case; even medical imaging is now opening its arms to lossy compression for image storage).
  • I really like Ada and I consider myself to be a hacker. It does a good job of allowing you to represent the problem you are solving unlike C where people tend to employ lots of cunningly efficient tricks which don't add much to the readability of the code. I believe in using the best tool for the task and in many cases that is Ada while often it is something else like C, Perl or whatever.

    One particular area where I think Ada 95 really beats any other OO language I've seen is the way that separate constructs are used for encapsulation and objects. This means that objects appear as normal parameters to their despatching operations - none of this object.function(params) of other languages.

  • I have the "Walmart" (MacMillan) distro of Mandrake 6.5 and think it represents a wonderful value in software, but the clerk's (a Linux user) response was "Wow! Somebody actually bought it!" Will Linux ever become "Everyman's" OS?
  • Mandrake 7.0 was nice to install... some confusion on the package selection... that is another issue though. I started with an oxygen install of mandrake... pre release beta... I thought to myself.. "THIS IS THE ONE!" The distro that will put a large dent in the MS dominated market. Oxygen seemd fairly stable. I was in the middle of downloading AIR..... Bam.. Mandrake 7.0 is released... So I download the mandrake 7.0 ISO. Ok... here is the problem. I installed it.. Got connected to the internet... was crusing.. thinking to myself.. "I can FINALLY Eliminate Windows !" then Bam.. KERNEL panic... everything went crazy.. video went bonkers... (is that really a word?) So I tried reinstalling.. thinking it wa ssomething I did...same thing... happened agin.. I tested my hard drive. I checked my other computer. Installed it on it.. Same thing. Everything was great then everything went to heck. after 12 tries... many hard disk tests. I gave up. I posted messages in the cooker message board. Seemed Nobody had an answer. I posted to other places. I got a response. Seems the fact I have an AMD K6 -2 400 and 128 MB of ram on my main computer and an AMD K6 233 with 64 MB of ram is the problem. My question is ... Is this truly the problem? Also.. Does mandrake 7.0-2 ISO fix this? I do have one thought that makes me cringe. Mandrake 7.0 was rushed to production so it could be released at the upcoming Linux conferences in February. Pleas say it isn't so. I downloaded mandrake thru a 56 k Modem 3 times. so don't say I haven't really tried. I also had a friend send my a CD he got thru a dsl connection. I thought maybe my versions were corrupt. Take Care Christopher De Long
  • Just finished reading my article. I did not do very well on my spelling and punctuation did I?
  • Unisys has contacted several open source package maintainers about the use of LZW. They have claimed that the patent covers not only compression but also decompression. As a result, authors of software that may only do decompression of LZW get harassed by Unisys for expensive licensing, or lose the ability to decompress LZW information. One popular case of this is the Xpdf package [] being forced into doing LZW indirectly. How exactly does gzip handle decompression of ".Z" files in a way that avoids the wording of the patent?
  • I have lost a number of files to bzip2, around the 0.9.0 mark, though I can't say for sure if it was a problem with bzip2, a disk problem, or something else entirely. (I was experimenting, as per usual, with numerous experimental and beta versions of critical packages.)

    I would say the warnings, these days, are probably more in the line of disclaimers than real warnings. (ie: Don't blame me if your machine turns into a mushroom.)

  • I was reading through the docs that come with the latest bzip2 source recently. They still have all of the "WARNING: BETA SOFTWARE ALERT" messages all over the docs talking about how the software *should* work, but has not been thoroughly tested and may cause data corruption or loss.

    Does anybody know how long bzip2 has been out? (Honestly curious.) It's a damn good compression algorithm, even if it is slow. I've never had a single problem with data corruption or loss (with the one exception of a file that got trashed because the disk was bad, not the compression algorithm). I would think that if the bzip2 format is good enough to use to distribute the Linux kernel to the world, it's probably good enough for everyday files.

    Any other people with comments/thoughts on the stability of bzip2? Why all those warning messages - just because it isn't v1.0 yet?

  • When I was at the NYC Linux Expo last month, I spoke with one of Mandrake's head honchos. I don't remember his name, but he was a rather skinny guy who looked like he was in his 50's, with a mustache and a very thick French accent. Really nice guy; he kinda lamented that he didn't have as much time to code recently.

    He told me that LinuxOne had done a lot of things such as printing up stickers and promotional items with the Mandrake character on them, and also other stuff with the tophat and magic wand on it to promote LinuxOne-type things. For those of you who were at the Expo, you'd probably think it was pretty brazen, since LinuxOne's booth was right behind Mandrake's! It was, and it was true; they were handing some of that stuff out on Friday (the last day of the show).

    He said, (as I remember) "We would have given them permission to use them if they had just asked, but they never did". What he didn't say, (but that I got out of the conversation) was "Those guys have got a lotta fuckin' nerve to be doing that."

  • I was curious as to how you perceive big money and the competition vs. cooperation debate in the open source community. In other words, how much should be cooperative between you and your fellow Linux distributors, and how much should be competition?

    And I know, one per post, but I've always wondered: should big money support developers directly by hiring them, or by donating to development groups? I sure wish a big company would pay me to work on open-source projects from home; I could do a lot more work if I didn't have to go to my job every day ;)
  • That's easy: there's a fairly exhaustive patent space of arithmetic encoders. A quick search of [] on "arithmetic encoder" listed seven patents, including 4488143, which featured a convenient link to how to license the patent from IBM. The other 6 were owned by Canon (2), Mitsubishi (2), Lucent, and Sharp. IIRC, there were a couple others (I used to work in compression) owned by Sony, etc., that didn't turn up in this rudimentary (30-second) snapshot.

    This is my opinion and my opinion only. Incidentally, IANAL.

  • the bwt (burrows wheeler transform---the algorithm that gave us bzip/bzip2) is now about 5-6 years old, and appears to be heading for the big time, especially for large files (i.e. >2M tarballs), where it's very strong. in the areas where it's weak (small files, compressing real-time streamed data, etc) it appears to be making few inroads.

    what do you see its future as being, in relation to gzip?

    do you see any wide-spread use of large-model compression schemes (for example PPMD) ?

  • If I recall correctly, AT&T holds a patent on that. The original bzip used arithmetic encoding, but then the patent came to the attention of the author. He pulled the program, redesigned the algorithm, and released it as bzip2.

    Arithmetic encoding is no big deal, however. The author has stated (I forgot where) that it only gives you a 1-2% improvement in compression ratio. I think it also runs faster, but only by a similarly small margin.
  • When will a version of Mandrake for i386 be released?

    Our user group has a number of 486 machines we'd like to put Mandrake on. Mandrake has the best 'out of box' desktop right now.
  • Is compression software in a category that inherently plateaus quickly, so that significant further work simply isn't possible? Or is there some other reason, such as Real Life(tm) intruding and preventing any substantial development?

    Algorithms differ in what they offer at what price:
    • there are very fast compressors out there that result in average compression (lzop),
    • compressors with good ratios at high CPU and memory costs (bzip2)
    • specialized compressors that work well only on one class of data (e.g. XMill for XML files)

    Gzip is a good general-purpose compressor with the additional quality of being written in C and available under the GPL.

    So you have to choose according to your needs.
  • > I think the next big breakthrough will be a compressor that
    > can take a file with not much repetition of data (therefore hard
    > to compress using current algorithms) and create a file with
    > much more repetition in it (and perhaps larger) and then compress
    > that down.

    That's precisely what the filters in the PNG graphics format do. By calculating differences between adjacent pixels, a new datastream is created that has more repetition in it, and that new datastream is handed to Jean-loup's DEFLATE engine. Because the PNG filters are reversible, the original datastream can be recovered after decompression.
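The filtering idea is easy to demonstrate: a smooth gradient has little literal repetition, but its byte-to-byte differences are extremely repetitive, and the filter is exactly reversible. A rough sketch of a PNG-style Sub filter (an editor's illustration, not libpng code):

```python
import zlib

# A smooth ramp: few long literal repeats for DEFLATE to find.
raw = bytes((i * 3) % 256 for i in range(4096))

# PNG-style Sub filter: replace each byte with its difference from the
# previous byte (mod 256).  The ramp becomes a long run of the constant 3.
filtered = bytes((raw[i] - (raw[i - 1] if i else 0)) % 256
                 for i in range(len(raw)))

# The filtered stream is far more repetitive, so it compresses much better.
assert len(zlib.compress(filtered)) < len(zlib.compress(raw))

# And the filter is reversible: a running sum mod 256 recovers the original.
restored, prev = bytearray(), 0
for b in filtered:
    prev = (prev + b) % 256
    restored.append(prev)
assert bytes(restored) == raw
```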

  • The way you phrase your reply makes it sound like you still don't even understand the most basic tenets of data compression. The idea is that if you have a body of data that holds, say, 16 separate bit-pair combinations, and they are repeated over and over, you only need 5 bits in the compressed file to represent the 16 which are produced for output. The idea of increasing the repetition in some reliable way was just an idea I had a few days ago; I'm not a charlatan or an imbecile, I've simply never read (and I've read a lot about data compression) anything that even thought about artificially creating repetition. In a simple run-length algorithm this could work. Just keep track of the position where you replaced a bit, and instead of having
    (9)(B)(1)(Y)(9)(B) you'd have (19)(B)(10)(Y); there, compression. 6 bytes down to 4. That's just a general idea and not the idea I had at all (mine involved bitwise operations on a couple of bytes, producing 3 bytes from which the original 2 could be reconstructed, and I think the 3 produced would have more repetition in a large random set than the 2 originals alone; I might be completely wrong about that though).

  • "Whatever your 'magical' method or algorithm would be, it is *mathematicallly impossible* to have an algorithm that works on arbitrary files of say N bits and compresses each of them to N-1 or less bits. "

    I'm going to have to spell this out very carefully, seeing as it's the third time I have said it and no one seems to comprehend it. NO DUH! I have never claimed that any technique I may come up with, or anyone else might come up with, would be such a universal compressor. I didn't even say in my original response that the jumps in compression would be in universal compressors. You are right that no compression algorithm can be guaranteed to, say, reduce a file to exactly half of its original size. I've never thought, stated, or insinuated anything even remotely along those lines.

    I'm surprised you claim that it's mathematically impossible to create artificial repetition in order to improve compression, considering several things. Number one, I gave an example that proves that it is possible, at least in a simple way. Number two, as per a couple of replies to an earlier post of mine in this thread (which you must not have read), PNG and bzip2 do exactly this. So apparently they're using a different set of mathematics than you.

    While it may be true that the average person sees that text files get real small with zipping, and that it takes about an hour to convince them that they can't recursively zip a file into 1 byte, I would think from my previous discussions mentioning various algorithms that it would be obvious that I'm quite well-versed in data compression (not in theoretical entropy-sustaining junk, but in really taking X number of bits and figuring a way to reconstruct them from a smaller set of bits).

    The advances in data compression will come like MP3 did (already a disproof that there will be no great advances; audio compression before MP3 was either inefficient or extremely lossy): taking advantage of the peculiarities of the data being compressed. Video compression still has a lot of room for improvement, and that improvement will be developed by streaming media companies. 3D point data, texturing files, text, etc... there are thousands of kinds of files. In the future there'll be one compression program that uses hundreds or thousands of algorithms to deal with them. It might even use the same extension for its files to further confuse people...

    The future is going to be fun. Everything is always getting better. Anyone who tells you the past was better is wrong.

  • Everyone knows MandrakeSoft was started as a French company, but I was wondering what kind of problems you encountered while trying to start a Linux based company in France.

  • Data compression IS impossible in the most general case, as paradoxical as that may seem. The best we can do are some heuristic algorithms that happen to reduce in size most of the inputs that such algorithms encounter in day to day usage. Lossless compression will and always must be limited by information theory; you cannot get more information out of a given output to a compression algorithm than can be represented by the set of all states of the output: a n-bit output to ANY reversible compression algorithm can have no more than 2^n preimages that when compressed produced such an output. This is not an empirical observation based on a specific compression algorithm now in use, but a theoretical one based on simple information entropy considerations.

    A lossless reversible compression algorithm that significantly (indeed at all) reduces in length more than a minute proportion of all possible inputs (please contact me if you want a clarification of this statement; I'm handwaving the precise definition of the phrase "minute proportion") is impossible. It will simply not happen. The pigeonhole principle is the most basic argument against universal compressors; a given string of bits cannot represent more *distinct* other strings of bits than the total number of *distinct* states of that string of bits.

    Your "big breakthrough" is a standard claim made by a) well-meaning people who think they have a great new algorithm that revolutionizes data compression or b) con-artists who want to persuade people to part with their money for the next best thing. The key difference is that the first category is usually willing to learn.

    For more information, please read the comp.compression FAQ.
  • The Burrows-Wheeler block-sorting compression algorithm implemented in bzip2 does not "create a file with much more repetition in it." Rather, it takes advantage of the *heuristic* observation that in *most* of the files people compress, adjacent bit strings tend to clump with certain bit strings more than with others. The block-sorting stage itself produces an output with exactly the same number of each byte as its input; the next stage uses a simple move-to-front filter to turn those clumps into runs of small symbol values. A simple Huffman code is then used to compress this highly redundant output.
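    The two front-end stages described above can be sketched as follows (a toy illustration only, not bzip2's actual code: real bzip2 sorts large blocks with a far more sophisticated suffix sort, and the sentinel byte here is my own simplification):

```c
#include <stdlib.h>
#include <string.h>

static const char *rot_src;   /* input shared with the qsort comparator */
static size_t rot_len;

static int cmp_rot(const void *a, const void *b)
{
    size_t i = *(const size_t *)a, j = *(const size_t *)b;
    for (size_t k = 0; k < rot_len; k++) {
        unsigned char ci = rot_src[(i + k) % rot_len];
        unsigned char cj = rot_src[(j + k) % rot_len];
        if (ci != cj)
            return ci < cj ? -1 : 1;
    }
    return 0;
}

/* Burrows-Wheeler transform: sort all rotations of the input and keep
 * the last column.  The input must end in a unique smallest sentinel
 * byte (here '\1') so the transform is invertible. */
void bwt(const char *in, size_t n, char *out)
{
    size_t *idx = malloc(n * sizeof *idx);
    rot_src = in;
    rot_len = n;
    for (size_t i = 0; i < n; i++)
        idx[i] = i;
    qsort(idx, n, sizeof *idx, cmp_rot);
    for (size_t i = 0; i < n; i++)
        out[i] = in[(idx[i] + n - 1) % n];   /* last column */
    free(idx);
}

/* Move-to-front: runs of equal bytes become runs of zeros, which the
 * final entropy coder then compresses very well. */
void mtf(const unsigned char *in, size_t n, unsigned char *out)
{
    unsigned char table[256];
    for (int i = 0; i < 256; i++)
        table[i] = (unsigned char)i;
    for (size_t i = 0; i < n; i++) {
        int pos = 0;
        while (table[pos] != in[i])
            pos++;
        out[i] = (unsigned char)pos;
        memmove(table + 1, table, pos);
        table[0] = in[i];
    }
}
```

    On "banana" the transform clusters the equal bytes together ("annb" followed by the sentinel and "aa"), and move-to-front turns each repeated byte into a zero, which is exactly the redundancy the Huffman stage exploits.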

    This notwithstanding, the simple fact remains that no algorithm can compress all possible inputs, regardless of what transformations are performed on them, out of simple uniqueness and distinguishability considerations.
  • I accidentally bought Mandrake Linux v6.0 a few months back (it was advertised as Red Hat in the paper), and was an instant convert. By far the easiest Linux to install. Now I want it on ALL my computers. So is there ever going to be a version for other than PC hardware? Sparc in particular would make my day.

    Keep up the good work.

  • I've used Mandrake as my only distro ever since I got into using Linux. I really like the i586-enhanced compiled binaries, but I really miss the ability to use Netware servers with an easy client/server application.

    Is Mandrake planning on including such apps the way Debian is? This would be another big plus for Mandrake, IMHO.

    Keep Mandrake coming!
  • In the near future, do you see a breakthrough in the efficiency of data compression... a new algorithm or technique that could improve current compression rates by orders of magnitude?
  • As far as modern compression algorithms go, gzip is one of the fastest and most efficient around. Since increases in memory and processor speed seem to be coming along so often these days, do you think the algorithm could be modified to wring out a little extra compression, or would we be better served by using considerably more resource-intensive compression schemes such as Burrows-Wheeler [bzip2] or PPM?
  • by Anonymous Coward on Monday March 06, 2000 @09:31AM (#1222742)
    I don't use gzip because of what I see as a major annoyance - it only gzips single files, so you have to use the ancient tar.

    Using tar makes things unnecessarily complicated. There is less support for tar around on non-UNIX platforms, and 'embedded compression/archiving' seems to cause great trouble for newbies who can just about handle WinZip and nothing more.

    If gzip is to become a truly viable alternative to patented zip, I think the .tar.gz should become a thing of the past.

    Remove the old legacy tape archive!
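    For what it's worth, GNU tar already hides the two-step dance behind a single flag, so a newcomer never has to invoke gzip by hand (a minimal sketch; the directory and file names are my own throwaway choices):

```shell
# Pack a directory into a .tar.gz and round-trip it in one step each,
# using tar's -z flag (which drives gzip internally).
mkdir -p demo && echo "hello" > demo/file.txt
tar -czf demo.tar.gz demo
rm -r demo
tar -xzf demo.tar.gz
cat demo/file.txt    # the original contents come back intact
```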

  • by emil ( 695 ) on Monday March 06, 2000 @07:15AM (#1222743)

    Would you think it wise to roll alternatives to the Lempel-Ziv algorithms into gzip to make other compression utilities less attractive?

    It seems that this approach is adopted by other applications (ssh uses multiple encryption engines, and TIFF has allowed several compression techniques for quite a long time).

    Would you support an effort to implement bzip2 within gzip? Do you think such a thing could be done while maintaining gzip's stability?

  • by Denny ( 2963 ) <slashdot AT denny DOT me> on Monday March 06, 2000 @08:01AM (#1222744) Homepage Journal

    I notice you are a keen Go player... the GNOME version of Go (Iagno) seems much more attractive to me than the KDE version (kgo). I was wondering what software you use to play games, or are you not really interested in the interface at your level of play?


    # Using Linux in the UK? Check out Linux UK

  • by Denny ( 2963 ) <slashdot AT denny DOT me> on Monday March 06, 2000 @08:08AM (#1222745) Homepage Journal

    On your website, in the history section, you have a link to some information about pulsars...

    Were you an astronomy student, and if so how did you go from studying pulsars to CTO of a major Linux distributor?!?


    # Using Linux in the UK? Check out Linux UK

  • by fReNeTiK ( 31070 ) on Monday March 06, 2000 @07:31AM (#1222746)
    I have read about an agreement between Mandrake and LinuxOne to create a Chinese Linux development center. Did any good come out of this? Couldn't Mandrake's otherwise excellent reputation be damaged by such relationships?

  • by FigBug ( 69370 ) on Monday March 06, 2000 @10:07AM (#1222747) Homepage
    Why do you write code like this:

    z = (z = g - w) > (unsigned)l ? l : z;

    It makes your code almost impossible to read. Do you even know what this line does anymore?

  • by Signail11 ( 123143 ) on Monday March 06, 2000 @11:27AM (#1222748)
    z = (z = g - w) > (unsigned)l ? l : z;
    I hate to sound like I'm flaming you, but this is a standard C idiom for clamping a value to an upper bound: it saturates the difference g - w at l. When (g - w) is larger than the constant l, z is assigned l; otherwise z keeps the value g - w that the inner assignment just gave it.
    This code can also be written less efficiently (well, at least if your compiler doesn't have common sub-expression elimination) as:
    if ((g - w) > (unsigned)l) {
        z = l;
    } else {
        z = g - w;
    }
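    Wrapped in a function, the idiom can be checked directly (a small self-contained sketch; the function name is mine, not gzip's):

```c
/* The gzip one-liner  z = (z = g - w) > (unsigned)l ? l : z;
 * clamps the difference g - w to the upper bound l. */
unsigned clamp_diff(unsigned g, unsigned w, unsigned l)
{
    unsigned z;
    z = (z = g - w) > l ? l : z;
    return z;
}
```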
  • by Signail11 ( 123143 ) on Monday March 06, 2000 @12:59PM (#1222749)
    The answer to your question is no. Very briefly, it is not possible to build a universal compressor that can reduce the size of all possible inputs, nor is it possible for a compressor to emit an output with less information content (i.e. Shannon entropy) than the input. It is not possible to have a compressor that takes in, say, all 1 MB text files and always outputs 10k compressed files, simply because most 1 MB text files contain more information than can be expressed in 10k.

    A much more intuitive argument is the "pigeonhole principle." Let's assume that there are 16 holes in a wall, each of which is associated with a message. It is impossible for 17 messages to each be uniquely associated with a hole because there are not enough holes available. A 4-bit file can only represent 16 different messages, regardless of what algorithm is used to compress the message... unless, that is, you don't care about the compression being reversible!
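    The 16-holes picture can even be checked mechanically (a toy sketch; the particular 3-bit "compressor" msg % 8 is an arbitrary choice of mine, but the averaging argument applies to any such map):

```c
/* Map 16 four-bit messages into 8 three-bit codes.  By the pigeonhole
 * principle some code must receive at least two messages, so the map
 * cannot be reversed. */
int must_collide(void)
{
    int count[8] = {0};
    for (int msg = 0; msg < 16; msg++)
        count[msg % 8]++;          /* one arbitrary 3-bit "compressor" */
    for (int c = 0; c < 8; c++)
        if (count[c] > 1)
            return 1;              /* collision found */
    return 0;
}
```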
  • If you could compress anything and put it in your pocket what would you choose and why?
  • by sparkes ( 125299 ) on Monday March 06, 2000 @07:27AM (#1222751) Homepage Journal
    I see from your homepage you avoided patents when writing gzip, how do you feel about the current explosion of software related patents?
  • by emil ( 695 ) on Monday March 06, 2000 @07:34AM (#1222752)

    I guess that you have at least a little something to say about this.

    Is the 586 optimization enough to justify Mandrake's position? Are you especially proud of any of the architectural differences between the distributions (from what I have been told, the Apache-PHP layout is quite a bit different).

    How do you feel about the steps that Red Hat has taken to change their distribution in reaction to yours?

  • by Tom Womack ( 8005 ) <> on Monday March 06, 2000 @08:09AM (#1222753) Homepage

    The Data Compression Book was an excellent reference when it came out, but there are some hot topics in compression that it doesn't cover - frequency-domain lossy audio techniques (MP3), video techniques (MPEG2 and especially MPEG4), wavelets (Sorenson video uses these, I believe, and JPEG2000 will), and the Burrows-Wheeler transform from bzip.

    Do you have any plans for a new edition of the book, or good Web references for these techniques? BZip is covered well by a Digital research note, but documentation for MPEG2 seems only to exist as source code and I can't find anything concrete about using wavelets for compression. The data is all there on the comp.compression FAQ, but the excellent exposition of the book is sorely lacking.
  • by Inquisiter ( 155042 ) on Monday March 06, 2000 @08:59AM (#1222754)
    When I think of a game like Go or chess, I think that each player develops their own algorithm to beat their opponent. If you agree, what relationships or similarities do you see between your interest in Go and your interest in compression?

    Inquiring minds want to know.
  • by Aaron M. Renn ( 539 ) <> on Monday March 06, 2000 @07:09AM (#1222755) Homepage
    When is gzip going to provide (transparent) support for bzip2 files and the Burrows-Wheeler algorithm?

    Will BW be an algorithm option within the gzip file format itself ever?

  • It is a "truism" in the Free Software community that code should be released early and released often.

    However, much of the software you've written has started gathering a few grey hairs. Gzip, for example, has been at 1.2.4 for many, many moons, and looks about ready to collect its gold watch.

    Is compression software in a category that inherently plateaus quickly, so that significant further work simply isn't possible? Or is there some other reason, such as Real Life(tm) intruding and preventing any substantial development?

    (I noticed, for example, a patch for >4Gb files for gzip, which could have been rolled into the master sources to make a 1.2.5. This hasn't happened.)

  • by Uruk ( 4907 ) on Monday March 06, 2000 @07:55AM (#1222757)
    I noticed that you allowed the people who make the Winzip product to incorporate code written for Gzip. I think it's cool that you did that, because it would be horrible if winzip couldn't handle the gzip format, but at the same time, what are your thoughts about allowing free software code to be included in closed-source products?

    Just out of curiosity (tell me it's none of my business if you want to, and I'll be OK with that), did you receive a license fee from the company that makes Winzip for the code?

  • by Tom Womack ( 8005 ) <> on Monday March 06, 2000 @08:15AM (#1222758) Homepage
    The field of compression has been thronged with patents for a long time - but patents at least reveal the algorithm.

    What do you think of the expansion of trade-secret algorithms (MP3 quantisation tables, Sorenson, RealAudio and RealVideo, Microsoft Streaming Media) where the format of the data stream is not documented anywhere?

  • by Stephen ( 20676 ) on Monday March 06, 2000 @07:26AM (#1222759) Homepage
    The compression world has many patents, notably for Lempel-Ziv compression as used in GIF. What is your view on companies patenting non-obvious algorithms for such processes as data compression?
  • by drudd ( 43032 ) on Monday March 06, 2000 @07:16AM (#1222760)
    I am a happy owner of The Data Compression Book (2nd Ed). With the increasing availability of compression routines within libraries (Java's GZIP streams spring to mind), does this make your book a little unnecessary?

    Should software authors continue to write their own compression routines, or simply trust the versions available to them in library form?

    I can see some definite advantages to library code, i.e. the ability to upgrade routines, and having standardized algorithms which can be read by any program which utilizes the library.

  • As we all know, at first Mandrake was little more than a repackaged version of RedHat. That's changed a bit with the newer versions. My question is this: to what degree will Mandrake continue to differ from RedHat, and will there ever be a "developer" version (i.e. one that is geared toward those who are a bit more technically competent)?

    Brad Johnson
    --We are the Music Makers, and we
    are the Dreamers of Dreams
