
Linguistics Meets Linux: A Review of Morphix-NLP

Emre Sevinc writes "Zhang Le, a Chinese scientist working on Natural Language Processing, has decided to pack the most important language analysis and processing applications onto a single bootable CD: Morphix-NLP. More than 640 MB of NLP-specific software is included, and there's still a lot of place on the CD, which uses a compressed filesystem to bring us the best of both worlds."
  • Ironic.. (Score:5, Funny)

    by grub ( 11606 ) <slashdot@grub.net> on Thursday December 11, 2003 @10:26PM (#7697017) Homepage Journal

    All this language processing packed onto a single CD yet /. can't run a spellchecker... :)
  • Noooo (Score:5, Funny)

    by lakeland ( 218447 ) <lakeland@acm.org> on Thursday December 11, 2003 @10:27PM (#7697031) Homepage
    I was in the process of downloading this already. Damn you slashdot!
    • Re:Noooo (Score:2, Informative)

      by FooAtWFU ( 699187 )
      Should have used BitTorrent. Then it'd be "I was in the process of downloading this already. Yay for Slashdot!!!"
      • Indeed. When I get it down I promise to put up a .torrent. Unfortunately I'm only getting 15Kb/s currently (9 hours remaining)
  • that's pretty cool (Score:3, Insightful)

    by homerjs42 ( 568844 ) * on Thursday December 11, 2003 @10:28PM (#7697033)
    This is a pretty cool thing. It seems like the kind of thing that would be of great use to anthropologists or others translating from a language that is more or less unknown. By unknown, I mean not used commonly outside of its people group, and probably unwritten.
    Neat.

    --dw

    • by scheme ( 19778 ) on Thursday December 11, 2003 @10:55PM (#7697224)
      This is a pretty cool thing. It seems like the kind of thing that would be of great use to anthropologists or others translating from a language that is more or less unknown. By unknown, I mean not used commonly outside of its people group, and probably unwritten. Neat.

      Actually, this software seems like it would be totally useless for that purpose. The software was developed with a bunch of heuristics and domain knowledge put in by experts in English or whatever the relevant language is. Without similar expertise, the software can't be adapted to a new language. The software isn't a universal translator.

      So your hypothetical anthropologists or translators would still need to spend time and learn the language in question.

      • Actually, this software seems like it would be totally useless for that purpose. The software was developed with a bunch of heuristics and domain knowledge put in by experts in English or whatever the relevant language is. Without similar expertise, the software can't be adapted to a new language. The software isn't a universal translator.

        So your hypothetical anthropologists or translators would still need to spend time and learn the language in question.

        Well, yeah. I _know_ that. I was just speculating that suc

      • by Anonymous Coward on Friday December 12, 2003 @12:33AM (#7697836)
        While you're right that this probably won't be of much help to the typical anthropologist, it's not at all true that most of the software has lots of built-in domain knowledge.

        At least half the tools are general purpose applications for constructing various kinds of models, whether they be trees or HMMs or n-gram models or entropy models.

        Believe it or not, a lot of NLP work gets done on understanding algorithms that apply broadly across languages.

        There is some English specific stuff on the CD, but most of it isn't.

        The only software
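        To make the "n-gram models" item concrete: at its simplest, such a model is nothing but conditional counts over adjacent words. A minimal bigram sketch in Python (the training sentence is invented; real toolkits add smoothing and work in log-probabilities):

          from collections import Counter, defaultdict

          def train_bigrams(tokens):
              # Count how often each word follows each other word.
              follows = defaultdict(Counter)
              for prev, cur in zip(tokens, tokens[1:]):
                  follows[prev][cur] += 1
              return follows

          def prob(follows, prev, cur):
              # P(cur | prev) as a raw relative frequency; 0.0 if prev is unseen.
              total = sum(follows[prev].values())
              return follows[prev][cur] / total if total else 0.0

          tokens = "the dog saw the cat and the cat saw the dog".split()
          follows = train_bigrams(tokens)
          print(prob(follows, "the", "cat"))  # 2 of the 4 words after "the" -> 0.5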
      • Eh, there is a universal translator: the idea that all ideas are experienced in the same way by humans. Ergo Universal Grammar.

        So what are you waiting for... linguists are waiting for the geeks to make data gathering easier, to give us more grist for the microscope.

        And besides - the more language data we get, the more complex mindlike matter we can incorporate into games and sims... so hop to it, people. You've got that girlfriend or nemesis to animate, har.

    • by belmolis ( 702863 ) <billposerNO@SPAMalum.mit.edu> on Friday December 12, 2003 @04:42AM (#7698832) Homepage

      Actually, not very many anthropologists these days do much linguistic work. That's partly because linguistics has developed as a separate field and partly because cultural anthropology was largely taken over by Postmodernists, as a result of which it has nearly died. Most research on "exotic" languages these days is done either by linguists or by missionaries (who want to translate the New Testament).

      I am a linguist and have done extensive fieldwork, mostly on Carrier [ydli.org], the native language of a large region of northern British Columbia. (I also hack a little. Once upon a time I wrote the head-final shell mentioned in Charles Dodgson's comment [slashdot.org].) Software is increasingly used for this kind of work, but for the most part it is not the sort of NLP software provided on the Morphix-NLP CD. A lot of that software is useful primarily if you've got a large corpus to work with, and it often presupposes that some basic resources exist, such as a lexicon, or at least a wordlist with part of speech information. For many languages even basic resources such as a lexicon don't exist or aren't available in electronic form, and when you're dealing with really small languages, there aren't any ready-made corpora, such as news text. If you want a text corpus, you've got to make it yourself, usually by recording people telling stories or whatever, and transcribing it. This is an important part of fieldwork, but it's incredibly slow and tedious.

      There are some tools designed specifically for this kind of linguistic research. One is Transcriber [upenn.edu], a tool that assists a human being in transcribing audio recordings. One of the older tools is Shoebox [sil.org], a dictionary database program for field linguists, originally written to run under DOS.

      Some of us have used Unix tools to extract and process information, e.g. grep to do regular expression searches. Ken Church at Bell Labs used to give a tutorial "Unix for Poets" on how to use Unix tools for linguistics. Here is his handout [att.com]. For example, I've produced dictionaries of several dialects of Carrier using scripts written mostly in AWK plus the usual Unix tools, controlled by elaborate Makefiles. Some of us also use emacs a lot, not only as an editor but for doing searches. If you're interested in what kinds of software are of interest to linguists, you might check out the Computational Resources for Linguistic Research [upenn.edu] page.
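      To give a flavour of that handout: most of the "Unix for Poets" exercises boil down to the classic word-frequency pipeline (tr ... | sort | uniq -c | sort -rn). A rough Python equivalent, with the corpus filename invented for the example:

        import re
        from collections import Counter

        # Lowercase the text, split it into words, and count them --
        # the same job as the tr/sort/uniq -c/sort -rn pipeline.
        with open("carrier_stories.txt", encoding="utf-8") as f:
            words = re.findall(r"[a-z']+", f.read().lower())

        for word, n in Counter(words).most_common(20):
            print(f"{n:6d} {word}")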

      It is worth mentioning that the spread of the internet has made available a lot of useful material for linguistic research. There are now quite a few languages for which you can obtain a good chunk of text (say at least 100K words), and often you can find parallel text (that is, the language you're interested in plus a translation into English or another language that is useful to you). But this works mostly for relatively big languages, that is, say, languages with a million or more speakers. There are around 340 such languages, depending on how you count, about 2% of the world's oral languages.

      One topic that concerns some of us is how software and other technology can speed up the process of documenting dying languages. Languages are rapidly becoming extinct - some experts estimate that as many as 90% of the languages currently spoken will be extinct in 100 years. [Computer languages may be proliferating at the same rate. :)] The late Ken Hale [anu.edu.au] had seven languages die on him. If we don't find a way to speed up the documentation, or slow down the rate of extinction, most of those languages are going to die without very much being known about them.

      • As both a partly self-labeled linguistic anthropologist and a cultural anthropologist, I would like to respectfully qualify the parent's statements on the state of the field. This really isn't meant as a flame but I do enjoy discussions on the difficult relationship between linguistics and anthropology.
        First, while anthropology seems to emphasize linguistics to a much lesser degree than in Boas' era, a large number of anthropologists do work on language, in one way or another. Granted, the groundwork of dec
  • Great... (Score:2, Funny)

    by Anonymous Coward
    This means that GCC will have to be expanded to support all human languages as well as programming languages...
    • Re:Great... (Score:5, Funny)

      by lakeland ( 218447 ) <lakeland@acm.org> on Thursday December 11, 2003 @10:31PM (#7697058) Homepage
      Actually, I saw someone working on something like parsing English as a programming language; try a Google for 'controlled English' sometime. The general idea is that management may not be able to write the specifications, but they can read them and tell you it isn't what they're really after _before_ you code the thing.
      • I wondered (for about 5 seconds, once) about writing a doctype for English, similar to those for HTML.
        • I once considered trying to write out a rough BNF definition of English ... I gave up when I realized it'd be largely useless without a way to differentiate between different parts of speech, which I was too lazy to try to figure out how to do better than just a massive hand-entered database :-D
          • Re:Great... (Score:3, Interesting)

            by lakeland ( 218447 )
            You can get such lists pretty easily without having to type them in. Just looking up the most frequently used POS for that word gives almost 90% accuracy. Alternatively I wrote a program that automatically predicts the POS for new words.

            However, your BNF grammar is likely to come unstuck as soon as you try to parse either casual English or moderately complex English. Either one very quickly leads to adding lots of infrequently used grammar rules, and hence lots of ambiguity in even simple sentences.

            The
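            For the curious, the "most frequent POS" baseline is only a few lines once you have tagged text to count over. A toy sketch (the tagged pairs are invented; real work would count over a large tagged corpus):

              from collections import Counter, defaultdict

              # Count how often each word carries each tag in some tagged corpus.
              tagged = [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB"),
                        ("runs", "VERB"), ("runs", "NOUN"), ("the", "DET")]
              tag_counts = defaultdict(Counter)
              for word, tag in tagged:
                  tag_counts[word][tag] += 1

              def best_tag(word):
                  # Unseen words fall back to NOUN, a common crude default.
                  if word in tag_counts:
                      return tag_counts[word].most_common(1)[0][0]
                  return "NOUN"

              print([best_tag(w) for w in "the dog runs".split()])
              # ['DET', 'NOUN', 'VERB']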
            • Given a few years, I wouldn't be surprised to see a program like that be the basis of the next big thing in programming languages.

              Why is that? I generally believe in using the right tool for the job... and controlled or not, the human languages that I'm familiar with do not seem to have many benefits over existing programming languages?

      • I saw someone working on something like parsing english as a programming language

        I thought English was already a programming language [catb.org], designed for querying PICK databases.

        But seriously, don't patents try to describe a process in a limited subset of the English language?

        • I saw someone working on something like parsing english as a programming language

          I thought English was already a programming language, designed for querying PICK databases.

          But seriously, don't patents try to describe a process in a limited subset of the English language?


          Seriously, no, patents don't have any linguistic axe to grind. The function of a patent specification is to tell the world, in language that the ordinary specialist in the field will be able to understand, that here is a new and useful
      • Re:Great... (Score:4, Interesting)

        by millette ( 56354 ) <robin@@@millette...info> on Friday December 12, 2003 @12:48AM (#7697918) Homepage Journal

        I guess this would interest you too. BTW, have you read "Le Ton Beau de Marot" by Hofstadter?

        In 1977, Xerox adopted Systran for internal translations by creating a Multinational Customized English that's easier to translate. [1]

        In 1930, C.K. Ogden proposed a tiny version of English: just 850 words that could be learned in a few months and used to say anything. He called it Basic English (BE). [2] [3]

        1. basic english [diac.com]
        2. machine translation [nbrigham.org]
        3. xerox systran [compuserve.com]
      • Didn't they call that COBOL?
      • Controlled language is the conscious decision of an organisation to use only a subset of what a natural language like English offers in technical documentation (medical leaflets, submarine documentation, maintenance manuals, software documentation) in order to avoid confusion.

        (1) Insert the knob behind the lever.

        In (1) you could perhaps use a handful of terms instead of "knob" -- controlled language enforces only certain licensed terms, thus increasing overall consistency (same terms for same thing).
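        Mechanically, enforcing the licensed-terms part is just a lexicon lookup. A toy checker in Python (the term list and suggested replacements are invented for the example):

          import re

          # The licensed vocabulary, plus preferred replacements for known offenders.
          LICENSED = {"insert", "the", "knob", "behind", "lever"}
          PREFERRED = {"handle": "knob", "dial": "knob", "switch": "lever"}

          def check(sentence):
              for word in re.findall(r"[a-z]+", sentence.lower()):
                  if word not in LICENSED:
                      hint = PREFERRED.get(word)
                      note = f"use '{hint}'" if hint else "no licensed replacement"
                      print(f"unlicensed term '{word}': {note}")

          check("Insert the dial behind the lever.")  # flags 'dial', suggests 'knob'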

    • Yep, like HTML with no doctype... No one knows if it's supposed to be human- or machine-readable.
    • spreadsheet.eng:
      ---
      Write a spreadsheet that's Excel-compatible.
      ---

      gcc -o spreadsheet spreadsheet.eng
  • by YoungBonzi ( 692874 ) on Thursday December 11, 2003 @10:32PM (#7697066) Journal
    Maxis will have The Sims actually talking, instead of looking "special".
  • by Stile 65 ( 722451 ) on Thursday December 11, 2003 @10:36PM (#7697098) Homepage Journal
    Does anyone remember Forum 2000 [forum2000.org] (link does not actually work)? It's got some neat technology [andrej.com] behind it. And the conversations between surfers and the SOMADs were hilarious. When I first saw the site, I thought it was actual people imitating the different characters. Does anyone know what happened to the site and why it no longer functions? I miss it.
  • by dark-br ( 473115 ) on Thursday December 11, 2003 @10:43PM (#7697145) Homepage

    This page [bigpond.com] has some reasons.

  • Ia oundfa aa anguagela ita antca igurefa outa!
  • I was JUST googling for stuff about grammar and sentence diagramming on computers when I saw this story! Anyways, hopefully this will encourage people trying to make AI (AI capable of passing the Turing test) to use true grammatical parsing/analyzing (a non-open-source unsuccessful attempt is http://www.brainhat.com/). Also, perhaps this will encourage the development of an open-source grammar checker for OpenOffice.org or KOffice.
    • My Turing test questions: 1. Describe an orgasm. 2. What does very cold ice cream taste like? 3. Describe your worst experience with anxiety. My full Turing test question: 1. Would you like to go for a swim?
  • Download Link (Score:4, Informative)

    by Hal The Computer ( 674045 ) on Thursday December 11, 2003 @11:16PM (#7697345)
    Here is where you can go to download the .iso image [nlplab.cn].
    Try not to kill their site. If someone has downloaded it, it would be nice of them to post a .torrent on Slashdot.
  • Chomsky and stuff (Score:2, Interesting)

    This article is about linguistics, and he said "go read Chomsky", so I went and read Chomsky's bibliography. What I'm about to say applies to all modern philosophers and mathematicians:

    God damn, them are some fancy-schmancy sounding titles! Does anybody ever get the feeling sometimes that maybe things are simpler than our smartest people currently make them out to be? If you can't talk as simple as I'm talking now, you ain't really "nailed it."

    The reason I think this is true: back when all mathematicia
    • I both agree and disagree: life *is* that complicated; we just haven't yet come up with workable abstractions for a lot of things that allow us to handle them in the simplified manner you're asking for.

      What you're seeing here is the process by which that happens. Chomsky especially is someone whom I don't consider to want to "make [things] out" to be more complicated than they are; on the contrary, he seems to be more about wanting to understand the *true* process that is at work, not the pre-accepted soci
    • the math they never teach anymore (compound arithmetic, like pounds/shillings/pence, comes to mind)

      Like days, hours, minutes, seconds? There still exist measurements that haven't been decimalised.

      • Quick: in your head: how much is 6 dozen and 3 times 7 and 1/2 score? This is the kind of math they used to teach in elementary school in the 1800s.

        They don't anymore.
        • Quick: in your head: how much is 6 dozen and 3 times 7 and 1/2 score? This is the kind of math they used to teach in elementary school in the 1800s.

          Well, I'm not that old, but at primary school I certainly learned what a dozen was and my six times table, which covers the first. 3 times 7 is supposed to be difficult? I also learned at primary school what a score was, so I don't think I'm going to have any difficulty halving it.

          I don't understand the point of your example. I suspect any 10 year old chil
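          For the record, reading it as a plain sum of the three quantities:

            6 dozen   = 6 x 12 = 72
            3 times 7 =          21
            1/2 score = 20 / 2 = 10
                                ----
                                 103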

    • Re:Chomsky and stuff (Score:5, Interesting)

      by monecky ( 32097 ) on Friday December 12, 2003 @12:14AM (#7697733) Homepage
      I'm a programmer getting my masters in linguistics. Computer Science undergrad. Trust me. This is some tough stuff... until you learn the basics. Then everything starts making sense. There is a huge hurdle getting into any field... and it is usually because of the terminology. Every field has its own terminology because every field needs to be extremely precise in its explanations.

      Linguists don't think Knuth is very lucid.

      Linguistics is neat. Syntax (the study of the structure of language), Phonology (the study of the interactions of sounds and what a child has to actually 'learn'), Phonetics (the study of the human language system and the sounds that it can produce/hear), and Morphology (the study of the smallest possible unit that holds 'meaning') all work together to form an idea of what goes on in the human mind.
      • I was exposed to some of Chomsky's linguistic work (as opposed to his political writing/interviews) awhile back, and it was indeed neat. I wasn't taking the course myself, and so didn't dig too deep into it, but I was trying to help someone else with their homework, and even the surface bits I comprehended while I was helping were pretty cool. Chomsky is a smart guy.

        I still find it hard to believe the original parent was serious, though... Roman numerals... :)

        • Re:Chomsky and stuff (Score:4, Informative)

          by monecky ( 32097 ) on Friday December 12, 2003 @01:05AM (#7697989) Homepage
          There is no talk of linguistics complete without mentioning Chomsky's political diatribes. :)

          He pretty much defined linguistic theory for the past 40 years. Once he had a voice he turned into somewhat of a political critic. A conspiracy-theorist. I don't see him solving any political problems, and I don't know how well respected he is by those who study such things, but I think he's a loon. (But, oh god, I wish I could study with him. :) )

          Chomsky's papers are tough to comprehend for beginners. (Which I am.) Those who are interested in learning Chomskian theory may wish to pick up some Andrew Radford. (He is very understandable, and his book "Transformational Grammar" is aimed at the undergraduate-level syntax class. Once you tackle that, you can read Haegeman's "Introduction to Government and Binding Theory," which seems to be the most used graduate-level book... but this one is quite boring.)

          In the meantime, a linguistic glossary which may help you get through some of the papers you may find: http://tristram.let.uu.nl/UiL-OTS/Lexicon/

          • I'm not sure how much Chomsky you've read, but it sounds like you've read a lot. I think that based on the following facts it is likely that syntax (the field) will have a tough time making unified scientific progress:

            1) There are few people who are both trained syntacticians and native speakers of all of the obscure languages needed to provide data to test an aspect of a theory of syntax, and so native speaker judgments are required in order to "prove" a given theoretical contribution... There end up be
      • by WFFS ( 694717 )
        Um, you forgot Semantics (the meaning of language), currently one of the more important topics.

        I'm doing my BSc, majored in maths and CS, and am currently doing honours in CS. However, my project/thesis is on Language Technology, based squarely around semantics (for verbs, to be precise).

        Now, my point is basically agreeing with the above poster. I can't really go in depth about my project with the average Joe/Jo, because it is just too complicated. There is too much jargon and linguistic basics that would
    • back when all mathematicians only had Roman Numerals, the process for explaining how to multiply 3-digit numbers was extremely opaque, and it was nearly impossible to describe how to do long division.

      Especially considering that 3, 7 and 12 were all 3 digit numbers, whereas 2, 6, and 9 had 2 digits, and 1, 5 and 10 had one; and 8 had four! Holy crap!

      This has to be the funniest troll I've read in ages. My compliments!

    • Re:Chomsky and stuff (Score:5, Interesting)

      by kramer2718 ( 598033 ) on Friday December 12, 2003 @12:38AM (#7697866) Homepage
      Well, I'll answer your questions both with respect to NLP, and also more generally.

      First of all, most practical NLP techniques aren't *that* complicated simply because they must be able to be computed quickly. There are quite a few statistical hacks prevalent

      Most NLP techniques use probabilistic variants of two models: finite automata and pushdown automata (both models are actually pretty simple, but if you don't know what they are, they may sound complicated).

      Finite automata consume input and transition to different states (a finite number of them) based on that input. They can also be interpreted as generating output instead of consuming input.

      Pushdown automata are almost the same except that they have a stack that they can push symbols onto. Pushdown automata aren't literally another name for Context Free Grammars, but the two are equivalent in power: pushdown automata recognize exactly the languages that CFGs generate.

      As I said above, most NLP techniques use probabilistic variants of and small extensions to these two concepts.

      The reason that Markov models (probabilistic finite automata) work so well to model speech is because they are flexible, simple, and linear just like speech. The reason that CFGs work so well to model language is that they are flexible, and hierarchical, and so can capture the recursive nature of language (think about "the man who killed the horse who killed the dog who...").

      Having said all of that, I don't think that these models capture the way that humans process language/speech. I think that neural networks have the potential to capture this better. They just aren't mature enough. We also don't really have a good architecture to run neural networks. A human brain has about 10^11 neurons (within an order of magnitude) running in parallel, with something like 10^14 synapses between them. Try simulating that on today's serial architectures, and you'll run into problems.
      So my hypothesis is that there is probably some inherently simple learning algorithm for neural networks that we just don't know yet that will help solve many different types of problems (there is some biological evidence of there being a single learning algorithm implemented in the brain).

      So yes, there is likely a simpler answer, but until we know it, we have to use heuristics and statistical hacks in order to build systems that work.

      As to science in general, the reason it all sounds complicated is twofold:

      First, things interact in a very chaotic way. Even if the interactions are simple, when you compose many very small interactions, you find complex behavior.

      Secondly, even if the interactions are actually simple, we humans with our Newtonian intuitions have a hard time understanding non-Newtonian interactions.

      Hope that helped.
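      To see the recursion point concretely, here is a three-rule toy CFG (invented for the example) that already generates the unbounded "the man who killed the horse who..." pattern, with a tiny random generator in Python:

        import random

        # NP -> DET N | DET N "who" V NP  -- the second NP rule is the recursive
        # one, and it is what lets the grammar nest relative clauses forever.
        GRAMMAR = {
            "NP":  [["DET", "N"], ["DET", "N", "who", "V", "NP"]],
            "DET": [["the"]],
            "N":   [["man"], ["horse"], ["dog"]],
            "V":   [["killed"], ["saw"]],
        }

        def generate(symbol="NP", depth=0):
            if symbol not in GRAMMAR:        # a terminal word
                return [symbol]
            rules = GRAMMAR[symbol]
            # Past a small depth, always take the non-recursive rule so we stop.
            rule = rules[0] if depth > 3 else random.choice(rules)
            return [w for part in rule for w in generate(part, depth + 1)]

        print(" ".join(generate()))
        # e.g. "the man who killed the horse who saw the dog"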
      • Re:Chomsky and stuff (Score:3, Informative)

        by dido ( 9125 )

        Actually, Chomsky (or one of his contemporaries anyhow) discovered early on that almost no natural language can be represented solely by regular languages, or even context-free languages. Chomsky initially even tried to use unrestricted/semi-Thue grammars to represent natural languages, but realized just as quickly that this HUGE class of languages is much, much too big (in fact, it's actually Turing complete, and only useful to those doing research in the theory of computation, not the theory behind human

    • The reason I think this is true: back when all mathematicians only had Roman Numerals, the process for explaining how to multiply 3-digit numbers was extremely opaque, and it was nearly impossible to describe how to do long division. Now we can teach 3rd/4th graders how to do it before they watch "Barney".

      That's also why none of the good stuff was made by the Romans - it was the Greeks, then the Arabs that had good numerals, made the discoveries, before the knowledge of a proper number system finally retu
      • If you think that's ever going to be something you can slap up on the blackboard in an hour, you're wrong.

        All I'm saying is that 2000 years ago it took a 60 year old man hundreds of pages to describe techniques for long division, and they had LONG, LONG discussions about how stuff was made of earth, wind, and fire . You *seriously* believe similar advancements won't be made 1000 years from now that put our science in a similar light?

        I'm not saying the techniques 2000 years ago weren't valid, or the

      • But in reality it's simply that for every finite number there is a conventional, finite proof. Let's say I want to prove it for f(325266235235352): f(1) is true. Since f(1) is true, f(2) must be true. Since f(2) is true, f(3) must be true. .... Since f(325266235235352 - 1) is true, f(325266235235352) is true.

        That's not really explaining how it works. It's not infinity magic, yes, but demonstrating it with an arbitrarily large number doesn't explain it any more than saying it is infinity magic. Which makes me wo
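        For reference, the schema that unrolling tacitly appeals to is just:

          P(1)  and  ( for all n: P(n) -> P(n+1) )   =>   for all n: P(n)

        The unrolling for any one particular number is exactly the computation the second premise guarantees can always be carried out; induction packages all of those finite unrollings into a single step.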

      • Would you go back and explain that to the math professor who gave me a zero on one problem I proved by induction? There was no mistake in my proof, the initial condition was right, as was the induction step, and about half the class got exactly the same answer (ignoring a few trivial mistakes).

        Seems that induction, for all its power, isn't perfect, and it took less than 1 minute to demonstrate a contradiction (which was obvious to the other half of the class).

        As we were then reminded, induction is not a m

    • You're likely correct. I've heard that often, the first person to prove a theory in mathematics does it in a very complex way. Later, other mathematicians figure out how to simplify it. It's a little like cleaning up someone else's code.
    • The reason I think this is true: back when all mathematicians only had Roman Numerals, the process for explaining how to multiply 3-digit numbers was extremely opaque, and it was nearly impossible to describe how to do long division. Now we can teach 3rd/4th graders how to do it before they watch "Barney".

      As with everything pretty much. You have to understand it before you can express it simply. Just because the "smart guys" don't express it simply yet, doesn't mean they should just give it up. They s

  • Omission of GATE (Score:3, Informative)

    by use_compress ( 627082 ) on Thursday December 11, 2003 @11:33PM (#7697450) Journal
    I was surprised to read that GATE [gate.ac.uk] was not listed in the package list [nlplab.cn]. It's the best piece of software to tie together the discrete components that were included. Another complaint is that there are a lot of so-so implementations of very good algorithms. (#define NOT_FLAMEBAIT 1) I suppose that you have to turn to corporate software to get the really robust implementations and to free software when you want the cutting edge.
  • Can the idea of producing a modular-on-a-CD OS be patented?
    Because if it can be, we have to secure it with something before a corporation patents it!
    • Done.

      It's called prior art.
    • Morphix being the first (afaik), some people (no, hordes of people) have told me to do this; they believe such a patent would make me rich. I counter it with: if it wasn't free, nobody would use it. In only 11 months, there have been many people who have used Morphix to build their own live CDs, and that's the whole idea of the project. Make live CDs without having to rebuild the whole damn thing at every update.

      So, unless the borg get me, this is one patent that won't fly :)

      OT: Having said that, there seem to

      • I didn't mean patenting it so it could be charged for.
        I meant that that way, a corporation can't come along and patent it, even if the patent is not just.

        Take Microsoft patenting the long-filename extensions to FAT.

        It's NOT a just patent, but because they are a huge corporation, and can use lawyers to scare people, they'll probably get fees back from media manufacturers that ship their devices FAT32 ready-formatted anyway, because no one can afford to go to court to defend something l
  • Memories (Score:3, Interesting)

    by gidds ( 56397 ) <[ku.em.sddig] [ta] [todhsals]> on Thursday December 11, 2003 @11:40PM (#7697489) Homepage
    I remember when I was first let loose on a Unix system, and discovered tools like 'lex' and 'yacc' for lexical analysis and parsing. I was amazed that advanced language processing was so well supported - it was a short while before I discovered that they weren't for natural language processing :)
  • by joelparker ( 586428 ) <joel@school.net> on Thursday December 11, 2003 @11:48PM (#7697546) Homepage
    Can anyone here comment on if/how
    any of these natural language tools
    can be helpful for spam filtering?

    Cheers, Joel

    • i want to know
    • Let's see... if it had a good language guesser that could be fit into a plugin, then we could toss all messages in languages we can't read (or see no use for); for instance, all messages I get that are in English are either from some mailing list, or spam. I've actually been working on a "spot English" plugin to use on the mail that isn't automatically shunted into the mailing-list folders, but if the work is already done, yay!

      You might think that looking at the charset used would be enough but 'taint so! Frequ
      • You might think that looking at the charset used would be enough but 'taint so! Frequency of letters isn't good enough either; two good ways are checking for the most frequent words or the most frequent letter trigrams.

        Try looking at mguesser (http://mnogosearch.org). It's been quite accurate for me, but I've never tried it on spam.
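        The letter-trigram trick fits in a screenful. A toy guesser in Python (the training snippets are invented and far too small; something like mguesser trains its per-language profiles on megabytes of text):

          from collections import Counter

          def trigrams(text):
              text = " " + " ".join(text.lower().split()) + " "
              return Counter(text[i:i + 3] for i in range(len(text) - 2))

          # Per-language trigram profiles; real ones come from large corpora.
          PROFILES = {
              "english": trigrams("the quick brown fox jumps over the lazy dog"),
              "dutch":   trigrams("de snelle bruine vos springt over de luie hond"),
          }

          def guess(text):
              sample = trigrams(text)
              # Score a language by how much of the sample's trigram mass
              # its profile can account for.
              def score(profile):
                  return sum(min(n, profile[g]) for g, n in sample.items())
              return max(PROFILES, key=lambda lang: score(PROFILES[lang]))

          print(guess("the dog jumps over the fox"))  # english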
  • Re: (Score:2, Informative)

    Comment removed based on user account deletion
    • Well, that _was_ the whole idea of the project now :)

      Have you joined our mailing list/forums? It's great that all these derivatives are getting this much press [insert proud father-of-morphix photo]; it would be even better to keep in touch and exchange bug reports & feature requests. If you have, just ignore me, doing my best to get as much feedback as possible on the different modules...

  • by Anonymous Coward
    While NLP has many benefits, it can also freeze certain linguistic elements that should be removed or amended.

    As a simple example, take spell checking. When the computer can remember the spelling for every word and fix it automatically, who is going to worry about spelling simplification or reform? Yet changing to a standardized phonetic spelling would probably help people in the long run, if only by allowing children time to actually *write* rather than spending time in rote memorization and spelling bees
  • and there's still a lot of place on the CD

    OK, I get that it's a Chinese scientist working on this, but it's about language. Should the Slashdot article really have been written in Chinglish?

  • by Charles Dodgeson ( 248492 ) <jeffrey@goldmark.org> on Friday December 12, 2003 @01:46AM (#7698203) Homepage Journal
    I'm a PhD drop-out in linguistics, and happen to know precisely what a head-lexicalized context-free grammar is. (And, no, reading Chomsky is not the way to find out what it is). Below are some random musings on the geekiness of linguists.

    Linguists have always been geeky. Don't forget that Larry Wall is a linguist first.

    The only computer class I ever took, in 1983, was called "Computer tools for natural language analysis". It was an introductory Unix course. We learned grep, awk, and sed, as well as tools like vi, Mail, and rogue. And a tiny little bit of C. But since then I've taught C at the graduate level.

    Linguistics is all about the representation and manipulation of information. But instead of it being about languages we design for particular purposes, it is about the language system that we use naturally.

    Suppose you have a few thousand languages that you know were written with the same tools (like lex and yacc, but not lex and yacc), but you have no access to those tools. Suppose you are trying to figure out what those tools are from examining the languages (not the compilers) that have been specified using those tools. That is what theoretical linguistics is trying to do. We know that the specification of English and the specification of Dyirbal and every other human language out there are somehow "written" with the same tools. It's pretty neat stuff.

    Linguists were early adopters of TeX, have had a Unix affinity for a while, and as people who are interested in how information is internally represented and manipulated, like reading the source.

    I remember once nagging the sys admins to always make sure that there is a man page for anything added to /usr/bin or /usr/local/bin. The next day, they asked me to look at the manpage for something to see if it met with my approval. The DESCRIPTION was the C source. I was happy to say that it did, indeed, meet with my approval.

    At one point, a well known professor (Geoffrey Pullum) had written a little essay for a newsletter on the "grammar of Unix" using linguistic-style analyses of the shell. Naturally several of us feigned outrage at his confusion of "Unix" with the shell. Another linguist (Bill Poser) went so far as to write a shell which was verb (command) final, and post-positional. That is, instead of saying
    cat foo bar > bang
    you would say
    foo bar bang > cat
    That is, the arguments precede the command, and the redirect symbols go after the filename they redirect to or from. Now for various reasons, I had root access on a machine that Pullum used. So I changed his shell to this command-final one. He actually caught on remarkably quickly. And after a quick
    /bin/sh chsh
    he was ready to concede the point.

    For me, there is no surprise that linguists, and particularly computational linguists, are OSS enthusiasts. But that is enough of my random musings for now.
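    (The head-final surface syntax in the example above is regular enough that you don't need a whole shell to play with it; a few lines of Python can rewrite it back into Bourne order. This sketch only knows about the one redirect pattern shown, and the real Poser shell surely handled much more:)

      def head_final_to_sh(line):
          # "foo bar bang > cat" -> "cat foo bar > bang": the last token is
          # the command, and a filename *followed by* a redirect symbol means
          # redirect to/from that file.
          toks = line.split()
          cmd, rest = toks[-1], toks[:-1]
          args, redirects = [], []
          i = 0
          while i < len(rest):
              if i + 1 < len(rest) and rest[i + 1] in (">", ">>", "<"):
                  redirects += [rest[i + 1], rest[i]]
                  i += 2
              else:
                  args.append(rest[i])
                  i += 1
          return " ".join([cmd] + args + redirects)

      print(head_final_to_sh("foo bar bang > cat"))  # cat foo bar > bang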

    • Linguists have always been geeky. Don't forget that Larry Wall is a linguist first.

      Let's not forget about Douglas Hofstadter either. He has written some books I think every geek should read: The Mind's I, and Godel Escher Bach. If you can get through those, you should try Metamagical Themas. As melon-scratchers go, it's a honey-doodle.

      Funny story, that I am sure nobody cares about: My wife (then girlfriend) and I were both in a bookstore looking for books, and were in different parts of the store.

    • Hey, that was one of my favorite tools too, back in the 80's. Can't think of anything better for finding the Amulet of Yendor.
    • Are you a former UCSC student?
  • is figure out which one of us is going to download this and torrent it. all the rest of us will stop downloading immediately.

    sure, that'll work

    but seriously, my wife is very interested in linguistics (Spanish major, almost a Russian minor, some ESL) and I'm curious as to how easy this will be to use for someone with no Linux experience
  • I totally read it as neuro-linguistic programming. Or... maybe it IS really neuro-linguistic programming: you listen to their CD for a while and you end up in one of these stories. [mcstories.com]
  • Where are the "All Your Base" trolls when it's actually relevant?
  • can be found here [rosettaproject.org].

    It's either much harder or much easier to read, depending on your point of view.

  • Since both are languages, can these tools be used, for example, for translation of software requirements to code?
  • Back when I was an undergrad, I was taking Principles of Compiler Design in one building on campus and Principles of Linguistics in another. However, the division seemed purely arbitrary.

    In Compiler Design we were learning all about lexical analysis, parse trees, and context free grammars. In Linguistics we were learning all about...lexical analysis, parse trees, and context free grammars. It was really interesting taking the two classes back-to-back, and observing the similarities (and differences).

    Don
