Question gzip Maven Jean-loup Gailly 95
Jean-loup Gailly is the author of gzip and, now, CTO for Mandrakesoft, purveyors of Linux-Mandrake. Jean-loup's home page tells you quite a bit about him, including some interesting peeks into his life beyond Linux and open source software. Please try to keep it down to one question per post. Submitted questions will be chosen from the highest-moderated. Answers will appear within the next week.
Hardware deflation? (Score:1)
Since so many things use zlib, and the algorithm and format hasn't changed (and won't change, thanks to RFCs 1950 & 1951), do you think it would make sense to have a hardware implementation of deflate/inflate, with hooks added to zlib to use it?
Re:LinuxOne (Score:1)
CPIO is used by lots of applications. (Score:1)
I think that the RPM package format relies upon cpio, as evidenced by the "rpm2cpio" utility.
I think I've seen .cgz files lying around Red Hat's FTP server. I think they are in the install images as well.
cpio is much more awkward to use. Perhaps this poster should be sentenced to it for a few months?
Re:Why do you force the use of TAR? (Score:1)
Mandrake in a Fishbowl (Score:1)
I like Mandrake. But one of the reasons I have had a hard time recommending Mandrake as a first-time Linux distribution to friends, family, and co-workers is concern about quality.
With RedHat, Caldera, and SUSE I have the impression that a distribution undergoes rigorous quality assurance testing before being released. Mandrake seems to put more emphasis on user interface issues, eye candy, and being there first as an early adopter of new features, functionality, and code.
In the past, I've assisted with testing and debugging packages in the Cooker development distribution. The process by which Cooker was transformed into a release distribution happened behind the scenes and off the mailing lists.
I'm curious as to why the process of creating a release version of a distribution doesn't happen in the same open development fishbowl in which development takes place? I've been somewhat surprised by failures of simple things like:
Could you talk a little bit about the process of quality assurance and testing a new distribution at Mandrake? How many people on staff does Mandrake have working full-time in this capacity? Where do Mandrake's strengths and weaknesses lie in providing a high quality distribution (in the context of the common problems all distributions face)?
Hrumph... (Score:1)
Mandrake hasn't been a repackaged version of RedHat since at least 6.0. Sure, there is cross-pollination, but that's different. And since RedHat is the de facto standard to the pointy-haired people, they work hard to make it 99.999% RedHat compatible.
Also, in 7.0 they have 3 prepackaged installs (paraphrased): desktop, development, and server.
Re:LMAO (Score:1)
Yeah, the joke's really on me there, I thought othello and go were the same thing... guess I should look go up online some more then, figure out what the diff is!
Thanks for the enlightenment people :)
Denny
# Using Linux in the UK? Check out Linux UK [linuxuk.co.uk]
Here's one we made earlier! (Score:1)
To save you asking questions that other KDE developers have already answered this week, you could try reading this story [linuxuk.co.uk] on Linux UK [linuxuk.co.uk] which is an interview with Mosfet (a KDE developer). Mosfet also works at Mandrakesoft...
Regards,
Denny
# Using Linux in the UK? Check out Linux UK [linuxuk.co.uk]
Iagno is not Go... (Score:1)
I agree, the pieces in Iagno are very nicely designed: it'd be nice to see them in a GNOME Go game.
next-generation zlib? (Score:1)
(And for those who don't track Freshmeat, the zlib home page is now ftp://ftp.freesoftware.com/pub/infozip/zlib/zlib.html [freesoftware.com]. Please check that first before reporting bad links on other copies.)
multithreaded gzip (Score:1)
My only complaint is its single-threadedness.
As multiprocessor systems have become commonplace and data sets have grown larger thanks to cheap hardware, a lot of folks find themselves gzipping enormous files ... and waiting ...
Are you still active in gzip development?
I haven't been able to find any info online about a multithreaded version of gzip, so I'd like to take this up as a personal project.
Is anyone (you?) working on this? How would you prefer people to contribute to gzip?
-rob zXXt@jXbtrak.cXm (s/X/o for non-spam)
Re:Improved compression? (Score:1)
Have a look at the paper describing the bzip algorithm; this is pretty much exactly what it does. The idea is that, with a bit of care and twiddling, you can partially sort the file so that similar bits go together, but in such a way that the sorting can be undone to get the original file back.
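The reversible partial sort described above is the Burrows-Wheeler transform. A minimal sketch in Python follows, assuming a naive sort of all rotations and a NUL sentinel that does not occur in the input; the function names are made up for illustration, and real bzip code uses far more efficient sorting.

```python
def bwt(text: str, eos: str = "\0") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    assert eos not in text, "sentinel must not occur in the input"
    s = text + eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, eos: str = "\0") -> str:
    """Undo bwt() by repeatedly prepending the last column and re-sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The original string is the one row that ends with the sentinel.
    original = next(row for row in table if row.endswith(eos))
    return original[:-1]


transformed = bwt("banana")
restored = inverse_bwt(transformed)
```

The transform only rearranges characters, so it does not shrink anything by itself; it groups similar characters into runs that a later stage can compress more easily.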
Re:What About Ada? (Score:1)
I don't think Ada is anything like as hideous as it's made out to be; Ada95 feels to me rather like a more friendly C++ (at last! Interface and Implementation files with dependencies handled by the compiler, rather than having to do explicit #include commands), though (at least in the version I've played with - the GNAT Win32 release at www.gnat.org) the libraries available don't seem as good as STL.
It's possible that the Jargon File is referring to Ada83, which was apparently a good deal more hideous.
Re:multithreaded gzip (Score:1)
I have a feeling that the gzip algorithm is rather thoroughly serial and so would work very badly if multi-threaded at a fine-grain level; I suspect you won't get much advantage over just
tar cf file.tar [files]
Split file.tar into N pieces, where N is twice the number of CPUs you have
gzip all N pieces in parallel
tar up the gzipped bits
Of course, this isn't going to work very well on streams; you'll have to construct the full tar file first. If you want to work with streams, you could do something hideous like sending the Mth block to file M%N before doing the gzip-in-parallel - a sort of N-way version of tee - but this'll disrupt locality horribly.
Neither method will produce files compatible with normal gzip, which is another teeny little problem.
Tom
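A variation on the split-and-compress idea, sketched in Python. This is not the tar-of-pieces scheme described above: it relies on the gzip file format allowing several members to be concatenated, so the result should still be readable by an ordinary gzip decompressor. The function name, chunk size, and worker count are illustrative assumptions, and whether the threads really run in parallel depends on the interpreter.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def parallel_gzip(data: bytes, chunk_size: int = 1 << 20, workers: int = 4) -> bytes:
    """Compress fixed-size chunks independently and concatenate the gzip members."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the members line up with the chunks.
        members = list(pool.map(gzip.compress, chunks))
    return b"".join(members)


sample = b"the quick brown fox jumps over the lazy dog\n" * 5000
packed = parallel_gzip(sample, chunk_size=16384)
unpacked = gzip.decompress(packed)  # reads all concatenated members
```

The cost is compression ratio: each chunk starts with an empty history window, so matches that cross a chunk boundary are lost.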
Mandrake 7.0 Installer (Score:1)
Do you think the low quality of this install program is due to the fact that its developers lacked the feedback typical of OSS development, and how quickly is this application going to be overhauled?
Re:Nasty Code (Score:1)
Re:Why do you force the use of TAR? (Score:1)
And what's so difficult in using the 'z' switch to tar? Better to go back to Windows, then.
Question (Score:1)
I would like to ask what made you write the gzip utility?
Re:Patent issue (Score:1)
Also, although it can certainly vary, I wouldn't say that arithmetic coding is faster than Huffman. It has traditionally been considered slower (although hardware advances may have negated this).
Linux-Mandrake and GNOME (Score:1)
What About Ada? (Score:1)
Ok, before anybody goes "eeeew, Ada!", let me say that it's impressive to see that Jean-loup has a long history of doing Important Things. Designing a language is certainly cool.
However, the infamous Jargon File has this to say about Ada:
A Pascal-descended language that has been made mandatory for Department of Defense software projects by the Pentagon. Hackers are nearly unanimous in observing that, technically, it is precisely what one might expect given that kind of endorsement by fiat; designed by committee, crockish, difficult to use, and overall a disastrous, multi-billion-dollar boondoggle (one common description was "The PL/I of the 1980s"). Hackers find Ada's exception-handling and inter-process communication features particularly hilarious. Ada Lovelace (the daughter of Lord Byron who became the world's first programmer while cooperating with Charles Babbage on the design of his mechanical computing engines in the mid-1800s) would almost certainly blanch at the use to which her name has latterly been put; the kindest thing that has been said about it is that there is probably a good small language screaming to get out from inside its vast, elephantine bulk.
I'm curious about Ada, yet completely ignorant (and thus neutral) regarding Ada. However there seem to be quite a few people out there who absolutely hate it. Could you enlighten us as to how you feel personally about the Ada programming language, or perhaps say a few words on behalf of Ada?
Re:LZW/".Z" decompression not covered by patent? (Score:1)
Re:Improved compression? (Score:1)
Of course you could put the original file into the decompression binary and simply compress to a single 1 bit, or to the empty file for that matter, but in that case you need a different decompressor for each compressed file, and hence you should add the size of the decompressor to the compressed size.
Btw, Signail11 already did a very good job at explaining all of this, and he certainly understands 'the most basic tenets of data compression'. And nobody said you're an imbecile. It is very natural to think that all files must be compressible somehow, since in your and everybody's experience all files are compressible by at least a factor of 2. But those are text files and binaries and contain repetitive patterns. In general it is not possible to mangle data to create repetition, as mathematics shows.
Re:Why do you force the use of TAR? (Score:1)
Simply because each Unix command should do
one thing and do it well.
If it really annoys you it is trivial to
write aliases:
alias mygzip='tar cvfz'
alias mygunzip='tar xvfz'
alias mybzip2='tar cvfI'
alias mybunzip2='tar xvfI'
And while you're at it you can call
your files
Steve.
Re:Cross platform Mandrake (Score:1)
Re:Mandrake 7.0 Installer (Score:1)
zLib/zip/gzip in Closed Source software (Score:1)
Re:This is actually a serious question! (Score:1)
Re:Why do you force the use of TAR? (Score:1)
tar -zxf arch.tar.gz
Not very hard, is it?
You can actually use aliases or scripts to filter through an arbitrary compression/decompression program, as well as automagically invoke the correct decompressor. I believe that is covered in the BZIP2 howto.
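One way such a wrapper could pick the right decompressor is by looking at the leading magic bytes of the data. The sketch below is in Python rather than shell, and the function name is hypothetical; it is not taken from the howto. It uses the fact that gzip data starts with the bytes 1f 8b and bzip2 data starts with "BZh".

```python
import bz2
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member
BZIP2_MAGIC = b"BZh"      # first three bytes of a bzip2 stream


def decompress_auto(blob: bytes) -> bytes:
    """Choose a decompressor from the magic bytes; pass unknown data through."""
    if blob.startswith(GZIP_MAGIC):
        return gzip.decompress(blob)
    if blob.startswith(BZIP2_MAGIC):
        return bz2.decompress(blob)
    return blob  # unrecognized: assume it was not compressed


payload = b"hello, archive\n" * 100
from_gzip = decompress_auto(gzip.compress(payload))
from_bzip2 = decompress_auto(bz2.compress(payload))
```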
LMAO (Score:1)
Esperandi
Re:Improved compression? (Score:1)
There are limits, and no compressor can be universal (in the sense that it can't achieve the same results on radically differing file contents). However, your assumption that a 4-bit file can only represent 16 possibilities is quite wrong. The world's most limited but most efficient compressor is one that takes a 1-bit file and reproduces one of two entire files from it, depending on whether the bit is a 0 or a 1; the whole files are just stored in the program, and the program checks for a 1 or a 0.
I think the next big breakthrough will be a compressor that can take a file with not much repetition of data (therefore hard to compress using current algorithms) and create a file with much more repetition in it (and perhaps larger) and then compress that down.
Esperandi
This is actually a serious question! (Score:1)
MandrakeSoft and the Nasdaq (Score:1)
Corel vs Mandrake (Score:1)
Corel is great, Corel is not only a hype, Corel is easy to install, Corel install has a great partitioning tool, the download version of Corel is really a full package, after Corel had finished I wondered: Okay and when are we going to start installing hardware? Turned out it had already done that part...
rm s/corel/mandrake
p.s.
While you pronounce Mandrake as "Correll", you write Mandrake7, capisce? And all the media hype
about "Correll" is true, besides the tiny mistake
that they keep saying it's a Canadian company, while it is French of course.
Re:Why do you force the use of TAR? (Score:1)
I use SunOS 5.6 for coursework at school, and one thing that really annoys me is that the version of tar on the school's system doesn't have the z switch, which forces me to use two commands when only one should be necessary. In cases like that, it would be better to have a program that does archiving and compression in a single step.
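Where a scripting language is available, archiving and compression can be combined in one step without relying on tar's z switch. This is a minimal in-memory sketch using Python's standard tarfile module; the helper names and file contents are made up for illustration.

```python
import io
import tarfile


def make_tar_gz(files: dict) -> bytes:
    """Build a gzip-compressed tar archive in memory from {name: bytes}."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_tar_gz(blob: bytes) -> dict:
    """Read every regular file back out of a .tar.gz held in memory."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as archive:
        return {
            member.name: archive.extractfile(member).read()
            for member in archive.getmembers()
            if member.isfile()
        }


coursework = {"notes.txt": b"lecture notes\n", "src/main.c": b"int main(void) { return 0; }\n"}
blob = make_tar_gz(coursework)
recovered = read_tar_gz(blob)
```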
Re:This is actually a serious question! (Score:1)
Loup doesn't exist as a first name by itself, only with the "Jean". And yes, loup means wolf in French.
Stéphane
Re:This is actually a serious question! (Score:1)
I was expecting this question
Actually I don't live in France, but in Belgium. That's probably what saved me ! Jean-Stéphane... It sounds horrible !
Uh-Uh... Yeah ! (Puff Daddy ruining Led Zeppelin, Godzilla Soundtrack.)
Stéphane
arithmetic coding (Score:1)
Re:Mandrake for i386? [off-topic] (Score:1)
Does anyone know why Linux desktop systems are astoundingly slow on a 486 but run faster than Windows on modern machines?
Re:What about wavelets? (Score:1)
Re:What About Ada? (Score:1)
One particular area where I think Ada 95 really beats any other OO language I've seen is the way that separate constructs are used for encapsulation and objects. This means that objects appear as normal parameters to their despatching operations - none of this object.function(params) of other languages.
Linux for Everyman (Score:1)
AMD K6 and AMD K6-2 Memory B Stepping issue (Score:1)
Re:AMD K6 and AMD K6-2 Memory B Stepping issue (Score:1)
LZW/".Z" decompression not covered by patent? (Score:2)
Re:Bzip2 stability (Score:2)
I would say the warnings, these days, are probably more in the line of disclaimers than real warnings. (ie: Don't blame me if your machine turns into a mushroom.)
Bzip2 stability (Score:2)
Does anybody know how long bzip2 has been out? (Honestly curious) It's a damn good compression algorithm, even if it is slow. I've never had a single problem with data corruption or loss (with the one exception of a file that got trashed because the disk was bad, not the compression algorithm). I would think that if the bzip2 format is good enough to use to distribute the linux kernel to the world, it's probably good enough for every day files.
Any other people with comments/thoughts on the stability of bzip2? Why all those warning messages - just because it isn't v1.0 yet?
LinuxOne/Mandrake (Score:2)
He told me that LinuxOne had done a lot of things such as printed up stickers and promotional items with the Mandrake character on them, and also other stuff with the tophat and magic wand on it to promote LinuxOne type things. For those of you who were at the Expo, you'd probably think it was pretty brazen since LinuxOne's booth was right behind Mandrake's! It was, and it was true, they were handing some of that stuff out on Friday (the last day of the show).
He said, (as I remember) "We would have given them permission to use them if they had just asked, but they never did". What he didn't say, (but that I got out of the conversation) was "Those guys have got a lotta fuckin' nerve to be doing that."
Big Money, Big Ideas. (Score:2)
And I know, one per post, but I've always wondered: should big money support developers directly by hiring them, or by donating to development groups? I sure wish a big company would pay me to work on open source projects from home; I could do a lot more work if I didn't have to go to my job every day.
Re:arithmetic coding (Score:2)
That's easy: there's a fairly exhaustive patent space of arithmetic encoders. A quick search of www.patents.ibm.com [ibm.com] on "arithmetic encoder" listed seven patents, including 4488143, which featured a convenient link to how to license the patent from IBM. The other 6 were owned by Canon (2), Mitsubishi (2), Lucent, and Sharp. IIRC, there were a couple others (I used to work in compression) owned by Sony, etc., that didn't turn up in this rudimentary (30-second) snapshot.
This is my opinion and my opinion only. Incidentally, IANAL.
compression algorithms (Score:2)
What do you see its future as being, in relation to gzip?
Do you see any widespread use of large-model compression schemes (for example PPMD)?
Patent issue (Score:2)
Arithmetic encoding is no big deal, however. The author has stated (I forgot where) that it only gives you a 1-2% improvement in compression ratio. I think it also runs faster, but only by a similarly small margin.
Mandrake for i386? (Score:2)
be released?
Our user group has a number of 486 machines
we'd like to put Mandrake on. Mandrake has the
best 'out of box' desktop right now.
Re:Compression software (Score:2)
Algorithms differ in what they offer at what price:
Gzip is a good general-purpose compressor with the additional quality of being written in C and available under the GPL.
So you have to choose according to your needs.
Re:Improved compression? (Score:2)
> can take a file with not much repetition of data (therefore hard
> to compress using current algorithms) and create a file with
> much more repetition in it (and perhaps larger) and then compress
> that down.
That's precisely what the filters in the PNG graphics format do. By calculating differences between adjacent pixels, a new datastream is created that has more repetition in it, and that new datastream is handed to Jean-loup's DEFLATE engine. Because the PNG filters are reversible, the original datastream can be recovered after decompression.
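A rough illustration of the filtering idea in Python, assuming a single row of 8-bit grayscale pixels and a simplified version of PNG's "Sub" filter (each byte replaced by its difference from the byte to its left). The helper names are hypothetical, and zlib stands in for the DEFLATE engine.

```python
import zlib


def sub_filter(row: bytes) -> bytes:
    """Replace each byte with its difference from the previous byte, modulo 256."""
    return bytes((row[i] - (row[i - 1] if i else 0)) & 0xFF for i in range(len(row)))


def sub_unfilter(filtered: bytes) -> bytes:
    """Reverse sub_filter() by accumulating the differences, modulo 256."""
    out = bytearray()
    previous = 0
    for delta in filtered:
        previous = (previous + delta) & 0xFF
        out.append(previous)
    return bytes(out)


# A smooth gradient: few exact repeats nearby, but the differences are constant.
ramp = bytes(i % 256 for i in range(4096))
raw_size = len(zlib.compress(ramp))
filtered_size = len(zlib.compress(sub_filter(ramp)))
round_trip = sub_unfilter(sub_filter(ramp))
```

On this gradient the filtered stream is almost entirely the byte 1, so it compresses to fewer bytes than the unfiltered ramp, and the unfilter step recovers the original exactly.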
Re:Improved compression? (Score:2)
(9)(B)(1)(Y)(9)(B) you'd have (19)(B)(10)Y, there, compression. 6 bytes down to 4. That's just a general idea and not the idea I had at all (mine involved bitwise operations on a couple of bytes, producing 3 bytes from which the original 2 could be reconstructed, and I think the 3 produced would have more repetition in a large random set than the 2 originals alone, I might be completely wrong about that though).
Esperandi
Re:Improved compression? (Score:2)
I'm going to have to spell this out very carefully, being as it's the third time I have said it and no one seems to comprehend it. NO DUH! I have never claimed that any technique I may come up with or anyone else might come up with would be such a universal compressor. I didn't even say in my original response that the jumps in compression would be in universal compressors. You are right that no compression algorithm can be guaranteed to, say, reduce a file to exactly half of its original size. I've never thought, stated, or insinuated anything even remotely along those lines.
I'm surprised you claim that it's mathematically impossible to create artificial repetition in order to improve compression, considering several things. Number one, I gave an example that proves that it is possible, at least in a simple way. Number two, as per a couple of replies to an earlier post of mine in this thread (which you must not have read), PNG and bzip2 do exactly this. So apparently they're using a different set of mathematics than you.
While it may be true that the average person sees that text files get real small with zipping and the average person takes about an hour to convince them that they can't recursively zip a file into 1 byte, I would think from my previous discussions mentioning various algorithms that it would become obvious that I'm quite well-versed in data compression (not in theoretical entropy sustaining junk, in real taking X number of bits and figuring a way to reconstruct them from a smaller set of bits).
The advances in data compression will come like MP3 did (already a disproof that there will be no great advances; audio compression before MP3 was either inefficient or extremely lossy), taking advantage of the peculiarities of the data being compressed. Video compression still has a lot of room for improvement, and that improvement will be developed by streaming media companies. 3D point data, texturing files, text, etc... there are thousands of kinds of files. In the future there'll be one compression program that uses hundreds or thousands of algorithms to deal with them. It might even use the same extension for its files to further confuse people...
Esperandi
The future is going to be fun. Everything is always getting better. Anyone who tells you the past was better is wrong.
Linux Start-Ups in France (Score:2)
Everyone knows MandrakeSoft was started as a French company, but I was wondering what kind of problems you encountered while trying to start a Linux based company in France.
Re:Improved compression? (Score:2)
A lossless reversible compression algorithm that significantly (indeed at all) reduces in length more than a minute proportion of all possible inputs (please contact me if you want a clarification of this statement; I'm handwaving the precise definition of the phrase minute proportion) is impossible. It will simply not happen. The pigeonhole principle is the most basic argument against universal compressors; a given string of bits cannot represent more *distinct* other strings of bits than the total number of *distinct* states of that string of bits.
Your "big breakthrough" is a standard claim made by a) well-meaning people who think they have a great new algorithm that revolutionizes data compression or b) con-artists who want to persuade people to part with their money for the next best thing. The key difference is that the first category is usually willing to learn.
For more information, please read the comp.compression FAQ.
Re:Improved compression? (Score:2)
This notwithstanding, the simple fact remains that no algorithm can compress all possible inputs, regardless of what transformations are performed on them, out of simple uniqueness and distinguishability considerations.
Cross platform Mandrake (Score:2)
Keep up the good work.
Mandrake and Netware (Score:2)
Is Mandrake planning on including such apps the way Debian is? This would be another big plus for Mandrake, IMHO.
Keep Mandrake coming!
mad-cat
Improved compression? (Score:2)
gzip in a resource-rich environment (Score:2)
Why do you force the use of TAR? (Score:3)
Using tar makes things unnecessarily complicated. There is less support for tar around on non-UNIX platforms, and 'embedded compression/archiving' seems to cause great trouble for newbies who can just about handle WinZip and nothing more.
If gzip is to become a truly viable alternative to patented zip, I think the .tar.gz should become a thing of the past.
Remove the old legacy tape archive!
Alternate Algorithms within GZIP (Score:3)
Would you think it wise to roll alternatives to the Lempel-Ziv algorithms into gzip to make other compression utilities less attractive?
It seems that this approach is adopted by other applications (ssh uses multiple encryption engines, and TIFF has allowed several compression techniques for quite a long time).
Would you support an effort to implement bzip2 within gzip? Do you think such a thing could be done while maintaining gzip's stability?
Go! (Score:3)
I notice you are a keen Go player... the GNOME version of Go (Iagno) seems much more attractive to me than the KDE version (kgo). I was wondering what software you use to play games, or are you not really interested in the interface at your level of play?
Regards,
Denny
# Using Linux in the UK? Check out Linux UK [linuxuk.co.uk]
Astronomical! :) (Score:3)
On your website, in the history section, you have a link to some information about pulsars...
Were you an astronomy student, and if so how did you go from studying pulsars to CTO of a major Linux distributor?!?
Regards,
Denny
# Using Linux in the UK? Check out Linux UK [linuxuk.co.uk]
LinuxOne (Score:3)
Nasty Code (Score:3)
z = (z = g - w) > (unsigned)l ? l : z;
It makes your code almost impossible to read. Do you even know what this line does anymore?
Re:Nasty Code (Score:3)
I hate to sound like I'm flaming you, but this is a standard idiom in C for saturation, i.e. clamping a value to an upper bound. When (g-w) is larger than the limit l, z is assigned l; otherwise, z keeps the value (g-w).
This code can also be written less efficiently (well, at least if your compiler doesn't have common sub-expression elimination) as:
if ((g - w) > (unsigned)l) {
    z = l;
} else {
    z = g - w;
}
Re:Improved compression? (Score:3)
A much more intuitive argument is the "pigeonhole principle." Let's assume that there are 16 holes in a wall, each of which is associated with a message. It is impossible for 17 messages to each be uniquely associated with a hole because there are not enough holes available. A 4-bit file can only represent 16 different messages, regardless of what algorithm is used to compress the message... unless, that is, you don't care about the compression being reversible!
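The counting behind the pigeonhole argument can be checked directly. The Python sketch below, with made-up helper names, enumerates every 4-bit input and counts how many distinct outputs strictly shorter than 4 bits exist (including the empty string).

```python
from itertools import product


def bit_strings(length: int) -> list:
    """All bit strings of exactly the given length."""
    return ["".join(bits) for bits in product("01", repeat=length)]


def shorter_outputs(length: int) -> int:
    """Number of distinct bit strings strictly shorter than `length`."""
    return sum(2 ** k for k in range(length))


inputs = len(bit_strings(4))   # 16 possible 4-bit messages
outputs = shorter_outputs(4)   # 1 + 2 + 4 + 8 = 15 shorter strings
```

With 16 inputs and only 15 shorter outputs, at least two inputs would have to share an output if every input were shortened, and then the decompressor could not tell them apart.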
a new question - the guy above asked mine :-( (Score:3)
patents in software design (Score:3)
Why is Mandrake better than Redhat? (Score:4)
I guess that you have at least a little something to say about this.
Is the 586 optimization enough to justify Mandrake's position? Are you especially proud of any of the architectural differences between the distributions (from what I have been told, the Apache-PHP layout is quite a bit different).
How do you feel about the steps that Red Hat has taken to change their distribution in reaction to yours?
What about wavelets? (Score:4)
The Data Compression Book was an excellent reference when it came out, but there are some hot topics in compression that it doesn't cover - frequency-domain lossy audio techniques (MP3), video techniques (MPEG2 and especially MPEG4), wavelets (Sorenson video uses these, I believe, and JPEG2000 will), and the Burrows-Wheeler transform from bzip.
Do you have any plans for a new edition of the book, or good Web references for these techniques? BZip is covered well by a Digital research note, but documentation for MPEG2 seems only to exist as source code and I can't find anything concrete about using wavelets for compression. The data is all there on the comp.compression FAQ, but the excellent exposition of the book is sorely lacking.
Go and Compression (Score:4)
Inquiring minds want to know.
bzip2 Support (Score:5)
Will BW be an algorithm option within the gzip file format itself ever?
Compression software (Score:5)
However, much of the software you've written has started gathering a few grey hairs. Gzip, for example, has been at 1.2.4 for many, many moons, and looks about ready to collect its gold watch.
Is compression software in a category that inherently plateaus quickly, so that significant further work simply isn't possible? Or is there some other reason, such as Real Life(tm) intruding and preventing any substantial development?
(I noticed, for example, a patch for >4Gb files for gzip, which could have been rolled into the master sources to make a 1.2.5. This hasn't happened.)
Winzip (Score:5)
Just out of curiosity (tell me it's none of my business if you want to and I'll be OK with that), did you receive a licensing fee from the company that makes Winzip for the code?
Proprietary algorithms (Score:5)
What do you think of the expansion of trade-secret algorithms (MP3 quantisation tables, Sorenson, RealAudio and RealVideo, Microsoft Streaming Media) where the format of the data stream is not documented anywhere?
Tom
Compression patents (Score:5)
The Data Compression Book (Score:5)
Should software authors continue to write their own compression routines, or simply trust the versions available to them in library form?
I can see some definite advantages to library code, i.e. the ability to upgrade routines, and having standardized algorithms which can be read by any program which utilizes the library.
Doug
A question about Mandrake... (Score:5)
Brad Johnson
--We are the Music Makers, and we
are the Dreamers of Dreams