Improving Open Source Speech Recognition

kmaclean writes, "VoxForge collects free, GPL-licensed transcribed speech audio that can be used to create Acoustic Models for Open Source Speech Recognition engines. We are essentially creating a user-submitted repository of the 'source' speech audio; those Speech Audio files will then be 'compiled' into Acoustic Models for use with engines such as Sphinx, HTK, CAVS and Julius." Read on for why we need free GPL speech audio.


Why free GPL Speech Audio?

Speech Recognition Engines require two types of files to recognize speech. The first is an Acoustic Model, which is created by taking a very large number of audio recordings of speech and their transcriptions (called a Speech Corpus; plural, Corpora) and 'compiling' them into statistical representations of the sounds that make up each word. The second is a Language Model or Grammar file. A Language Model is a very large file containing the probabilities of certain sequences of words. A Grammar is a much smaller file containing predefined combinations of words.
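
To make the distinction concrete, here is a toy sketch in Python. All of the data is invented, and the structures are simplified stand-ins rather than any engine's actual file format: real language models store log-probabilities for millions of n-grams, and real grammars are written in BNF-like formats.

```python
# Toy illustration of the two file types. The numbers are invented and
# the structures are simplified stand-ins for real model formats.

# A Language Model stores probabilities of word sequences -- here,
# made-up bigram probabilities for (word, next_word) pairs:
bigram_probs = {
    ("open", "the"): 0.40,
    ("the", "mail"): 0.15,
    ("the", "browser"): 0.12,
    ("read", "slashdot"): 0.05,
}

# A Grammar instead enumerates the exact phrases the engine should
# listen for -- much smaller, suited to command-and-control:
grammar = {"open mail", "open browser", "read slashdot"}

def grammar_accepts(utterance):
    """A grammar simply accepts or rejects a word sequence."""
    return utterance in grammar

def bigram_score(words):
    """A language model scores any word sequence, seen or unseen."""
    score = 1.0
    for pair in zip(words, words[1:]):
        score *= bigram_probs.get(pair, 1e-6)  # small floor for unseen pairs
    return score

print(grammar_accepts("open mail"))              # True
print(bigram_score(["open", "the", "browser"]))  # 0.4 * 0.12 = 0.048
```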

Most Acoustic Models used by 'Open Source' Speech Recognition engines are 'closed source'. They do not give you access to the speech audio (the 'source') used to create the Acoustic Model, or if they do, there are licensing restrictions on the distribution of that 'source' (i.e. you can only use it for personal or research purposes). The reason for this is that there are no free Speech Corpora in a form that can readily be used to create Acoustic Models for Speech Recognition Engines. Open Source projects are required to purchase Speech Corpora with restrictive licensing: they are not permitted to distribute the 'source' speech audio, but they are permitted to distribute the 'compiled' Acoustic Model.

Why GPL?

A GPL-style license will ensure that user contributions will always benefit the open source community, since it requires any distribution of derivative Acoustic Models to include access to the 'source' speech audio.
  • I don't think that's quite right. I remember messing around with voice recognition in the 90s, but the CPU power wasn't there to do real-time voice. That, and you got complete gibberish half the time.

    Are there people out there who use voice as their main method of inputting text? For older people who type incredibly slowly, would this software be worthwhile for composing emails?
    • Re: (Score:1, Interesting)

      by Anonymous Coward
      One of the guys in my class last year wrote a DJ application with a mic you could speak commands into. It could find you music based on genre, artist, song title and lots of other stuff. The cool part was that it would announce the songs as well as any commands it was currently executing. He had it running on his laptop using the new speech engine in Vista. It was really, really cool and worked very well. Having an open source tool to do stuff like this would be fantastic.
    • Re: (Score:3, Interesting)

      by vertinox ( 846076 )
      Are there people out there who use voice as their main method of inputting text? For older people who type incredibly slowly, would this software be worthwhile for composing emails?

      I knew a few old people who asked about it and tried it, but I think the real holy grail for voice recognition is not a replacement for typing text, but rather understanding the context of what you want it to do.

      You know... "Computer, go to Red Alert!" like in Star Trek.

      But in our case it would be...

      "Computer. Go to email an
      • "Computer. Go to Slashdot and alert me if there is a dupe."

        Well, it wouldn't take much [artificial] intelligence for that one.

      • "Computer. Go to Slashdot and alert me if there is a dupe."

        Error: Insufficient computing power.


      • how about make world?

      • 70-80% accuracy (well, 20-30% error rate) seems enough in practice. 100% never happens, even for humans.

            OG.
        • True, but humans then use outside context to figure out the missing/misunderstood words. If the computer could use the 70-80%, combined with a larger context than the current phrase to infer an additional 15% like humans do (pulling numbers out of my hindquarters), that'd be the key.

          I look at the subject "Muffin for Jew to Ski here?" and use both my knowledge of Slashdot and of similar-sounding words to infer what the writer is getting at. The knowledge of Slashdot is an important factor in my accuracy in d
    • Re: (Score:3, Interesting)

      by AJWM ( 19027 )
      I remember messing around with voice recognition in the 90s, but the CPU power wasn't there to do real-time voice.

      Depends on the approach. I recall that around 1980 a prof at Concordia U. had a speech recognizer on a VAX 11/780 (with an A/D adapter). It didn't have to be trained on the speaker, and recognized my "Mary had a little lamb, its fleece was white as snow"(*) in a mere 10 or 15 minutes.

      Okay, hardly real time, but that was on a 1 MIPS machine. It was also logging all the steps it took to analyze
      • What would be better than real-time for voice recognition software? Predicting what you're going to say?
        • by AJWM ( 19027 )
          What would be better than real-time for voice recognition software?

          Doing speech-to-text from a speeded-up recording, or simultaneously doing multiple transcripts from different audio inputs. Or doing .wav file to text in less time than playing the .wav file takes.

    • Re: (Score:3, Informative)

      by jthayden ( 811997 )
      I remember messing around with voice recognition in the 90s


      The article is about speech recognition as is your post. Speech recognition is about recognizing what was said. Voice recognition is about recognizing who said it. The distinction is important since the coding and the problems associated with them are very different.

  • Aren't people recognising open source speech well enough already? Perhaps we need to tone down the zealotry.
  • by LiquidCoooled ( 634315 ) on Tuesday October 10, 2006 @04:19PM (#16382963) Homepage Journal
    Dear Aunt, let's set so double the killer delete select all.
  • GPL? (Score:3, Interesting)

    by PhrostyMcByte ( 589271 ) <phrosty@gmail.com> on Tuesday October 10, 2006 @04:19PM (#16382969) Homepage
    Wouldn't a Creative Commons license be better for this? Correct me if I'm wrong, but the GPL was made for code, not audio.
    • Re: (Score:1, Interesting)

      by Anonymous Coward
      At least "Creative Commons Attribution-NonCommercial-NoDerivs 2.5" probably won't do if you consider some models to be derivatives of the audio samples.
      • Well, that particular CC license would be especially bad (actually I don't know what it would be good for; might as well just say "All Rights Reserved" and save space), but there are others that would be fine.

        Creative Commons ShareAlike is GFDL compatible, at least according to WikiMedia. Or heck, why not just use the GFDL itself?

        The reason not to use the GPL on something like this is because there's not a clear separation between "source" and "binary" like there would be for a programming project; there's
    • Re: (Score:2, Informative)

      by cheater512 ( 783349 )
      Actually, the summary hints at this, but the GPL fits rather nicely.

      There is the 'source' data which is 'compiled' into something useful.
      Sound familiar?
      • Re: (Score:2, Interesting)

        by SpokenLang ( 792981 )
        The difference between using audio data to "compile" an acoustic model, and using source code to compile an executable is that when you create acoustic models from audio data, you don't modify the acoustic data, you use it "as is". So, it doesn't really make sense to require me to distribute an identical copy of the data along with my acoustic models.

        On a related note, how much data are they planning to collect and how will it be distributed? The web site says that they will make available 8kHz and 16kHz

        • Actually the compiled data is just statistical data based on the audio. The audio isn't directly used.
          You can't get the original audio from the compiled version.

          What rock have you been hiding under? Don't you know what an MP3 is? ;)
          I highly doubt they are even considering distributing raw PCM data. It will be compressed in one form or another.
          1000 hours of CD-quality MP3 is only roughly 60 GB (your numbers are wrong I think) and voice doesn't need CD quality.
          Anyway they dont *need* to distribute the audio to eve
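
To make the "just statistics" point concrete, here is a toy Python sketch. Everything in it is invented: it reduces labeled acoustic feature values to per-phone means and variances, the kind of summary an acoustic model keeps, and the original values cannot be recovered from the output. Real engines model sequences of multi-dimensional spectral feature vectors, typically with Gaussian mixtures, not single numbers.

```python
# Toy "compile" step: labeled feature values in, summary statistics out.
# All data here is invented for illustration.
from collections import defaultdict
from statistics import mean, pvariance

# (phone, feature_value) pairs standing in for labeled audio frames:
frames = [("ax", 0.12), ("ax", 0.15), ("ow", 0.71), ("ow", 0.69), ("ow", 0.74)]

by_phone = defaultdict(list)
for phone, value in frames:
    by_phone[phone].append(value)

# Keep only mean and variance per phone -- the audio itself is discarded:
model = {p: (mean(vals), pvariance(vals)) for p, vals in by_phone.items()}
print(model)
```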
          • Actually the compiled data is just statistical data based on the audio. The audio isn't directly used. You can't get the original audio from the compiled version.

            Yes, I realize that you can't recover the audio from the acoustic models. But my point was that using the GPL in this context seems wrong because it would require that I (as the builder of the "derivative" work, aka the acoustic models) make the audio available (unless I misunderstand the GPL.) So, given that I haven't changed the original audio in t

            • Uncompressed audio is 700 MB for 80 mins. Look on a pack of blank CDs. Your math is wrong. :P

              FLAC is a good candidate for a format. It's open source and lossless.
    • IFA Dutch Corpus (Score:4, Informative)

      by finiteSet ( 834891 ) on Tuesday October 10, 2006 @09:15PM (#16386405)
      Correct me if I'm wrong, but the GPL was made for code, not audio.
      There is more to it than the poster mentions (I don't know if the site addresses this - it is Slashdotted). You don't just need audio - speech audio is abundant - you need annotated audio. In most cases, this annotation is phonetic (or phonemic) transcription, which labels segments of the audio according to the speech sound present in each segment. Most state-of-the-art speech systems use a machine learning approach: the system is "trained" on training data, in the hope that the patterns learned generalize well to new data. This training is a supervised process: it requires the answers, and the answers are found in the annotation. It is this combination of audio and annotation that is valuable, and that is hard to come by. If their system prompts you to read phrases, it could use an existing recognition system to produce a roughly aligned phonetic transcription. It would be far from perfect, but useful nonetheless.
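
As a concrete example of what such annotation looks like, here is a small Python sketch that reads an HTK-style label file, in which each line pairs a time span (in 100-nanosecond units, per the HTK convention) with a phone label. The file contents below are invented for illustration.

```python
# Parse an HTK-style phonetic label file: each line is
# "start end phone", with times in 100 ns units. The data is made up.
from io import StringIO

fake_lab = StringIO("""\
0 2500000 sil
2500000 4200000 hh
4200000 6100000 ax
6100000 9000000 l
9000000 12000000 ow
""")

def read_labels(f):
    """Yield (start_seconds, end_seconds, phone) triples."""
    for line in f:
        start, end, phone = line.split()
        yield int(start) / 1e7, int(end) / 1e7, phone

for start, end, phone in read_labels(fake_lab):
    print(f"{phone}: {start:.2f}s - {end:.2f}s")
```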

      From TFA:
      The reason for this is that there are no free Speech Corpora in a form that can readily be used to create Acoustic Models for Speech Recognition Engines.
      What? The IFA Dutch "Open-Source" Corpus [hum.uva.nl] is a phonemically-annotated speech corpus released under the GNU GPL (read more - pdf). [let.uva.nl] They even have an SQL interface. Did you mean English speech corpora?
      • by oergiR ( 992541 )
        The text people have to read is given, i.e. the orthographic transcription is available. It is possible to bootstrap a speech recognition system from these transcriptions. It will not be particularly good, though.

        The more important problem is that current speech recognisers do not generalise well. If you train only on read speech, the performance on spontaneous speech will most likely be horrible. Transcribing spontaneous speech, however, takes enormous amounts of time. And it is not the kind of job you wan
        • Transcribing spontaneous speech, however, takes enormous amounts of time. ... So I don't see how a good speech recogniser can be produced without money.

          Aha, that's what undergraduate RAs (+lots of funding) are for. But seriously, this is really what I was getting at in my post.

          The best systems produced at the institution where I am studying are trained on about a thousand hours of speech.

          An IFA Corpus trained system won't be state-of-the-art, admittedly. The key word here is "free" - beggars can't be

  • It's about time (Score:5, Informative)

    by jesuscyborg ( 903402 ) on Tuesday October 10, 2006 @04:21PM (#16382989)
    Improving open source speech rec and TTS will be a HUGE improvement in the grand scheme of progress as far as human-computer interaction is concerned. The main reason is that Nuance has a near monopoly in this market and they charge INSANE licensing fees to do anything with their technology. Whenever closed-source competition comes along, they just buy them out. Heck, their sales people even talk down to you on the phone because they know they're the only game in town.

    Having a viable open source alternative will ensure that everyone has access to this technology and there will be many new innovations that will just continue to make technology cooler.

    Please people, take the time out of your schedule to record prompts. It will do everyone a lot of good.
    • Re:It's about time (Score:5, Interesting)

      by smilindog2000 ( 907665 ) <bill@billrocks.org> on Tuesday October 10, 2006 @05:10PM (#16383767) Homepage
      I'll pitch in. I lost the use of my hands for three years due to a repetitive motion injury, and had to code by voice. That was 1997, nine years ago. I figured that within a couple of years, the technology would be so great I would out-code my peers. Then the web bubble came, and Dragon Systems lost their focus on helping disabled people and focused instead on letting people dictate to Word. The creators of this great technology eventually sold out and moved on. Nine years later, the best product for coding is nine years old: the original Dragon Dictate. It doesn't even use the CPU for its signal processing: it runs that on the sound card, because in the early '90s the sound card had more power.

      We've gotta do something to get this beast moving forward.
      • holy crap, you ok now?

        • Yes, I eventually recovered, after switching to a laptop (which I actually use in my lap), and after having a child. Half of the problem with many repetitive motion injuries is stress, and having a family refocused mine.
          • ok, glad to hear that.

            For the longest time I had a speech recognition system sketched out on a whiteboard in my office. Maybe once I get done with all my current projects (ww.com, daz.com and a bunch of smaller stuff) I'll restart it. It's one of the things I really don't like about computers: the fact that our whole 'navigation' experience and knowledge seems to revolve around large-surfaced displays. If we could somehow get rid of that, I think computers would be *far* more useful.

            best regards, & congratula
  • by Speare ( 84249 ) on Tuesday October 10, 2006 @04:24PM (#16383037) Homepage Journal

    It's helpful to understand that there are two very different modes of speech recognition.

    Continuous speech to text is useful for flat-out dictation, where the speaker should be allowed to speak in a clear voice at a normal or almost normal speed, saying whatever the speaker wants to say, and the result should be a sequence of recognized words.

    Prompted speech to text is useful if your program is trying to carry on a dialogue with a human, such as a voice prompt or simply a set of useful user-voice commands. In this mode, the listening routine has a set of "expected" responses and should try only to recognize one of those responses.

    The latter form usually requires a lot less training from an individual human, and is more robust in noisy environments, since the range of recognized expression is very tightly controlled to a few possibilities. The former mode, continuous speech, is much harder to accurately recognize without personal training for each human speaker, or significant statistical work in background processing.
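
Here is a sketch of the prompted mode in Python: the application supplies the small set of expected responses, and recognition reduces to deciding which expected phrase, if any, best matches what was heard. The fuzzy string match below is only a stand-in for a real engine's acoustic scoring.

```python
# Prompted recognition sketch: constrain results to a known response set.
# difflib's fuzzy matching stands in for real acoustic scoring.
import difflib

EXPECTED = ["yes", "no", "repeat that", "operator"]

def recognize_prompted(raw_hypothesis):
    """Map a noisy hypothesis onto an expected response, or None."""
    matches = difflib.get_close_matches(raw_hypothesis, EXPECTED,
                                        n=1, cutoff=0.6)
    return matches[0] if matches else None

print(recognize_prompted("operater"))  # 'operator'
print(recognize_prompted("bananas"))   # None
```

Because anything outside the expected set is rejected outright, this mode degrades gracefully in noise, which is exactly the robustness the comment above describes.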

  • by schwaang ( 667808 ) on Tuesday October 10, 2006 @04:28PM (#16383077)
    Record Your Speech and Submit it to VoxForge [voxforge.org]

    Donate your speech for a GPL speech data collection so they can do better recognition.

    Includes separate instructions for Windows and Linux users. (Wonder if there will be any significant differences in the quality of the data based on OS...)
  • Wreck A Nice Beach...Recognize Speech. Call me when it can tell them apart.
    • Re: (Score:3, Informative)

      by bdwoolman ( 561635 )
      wreck a nice beach

      recognize speech

      Entered with Dragon Systems 9.

      Not trying to be snotty. Just informative. Dragon Systems has been pretty good since version 7. Eight was a real improvement. Nine is totally awesome. Almost magic. There is a user learning curve, however. One does have to dictate the punctuation, for example. Nine works with Firefox very nicely.

      Wordos do happen from time to time when you 'wreck a nice beach' (sic), but then so do typos. Everything needs to be edited no matter h

      • OK, my comment was a bit snotty. I'm somewhat entitled to be dubious, because speech recognition was a big disappointment when it first hit the market.

        I remember eagerly anticipating not having to type anymore when I bought IBM's Via Voice. This was about 10 years ago, back when "powerful computer" meant a P90 with 8 Meg Ram. After training the software for about an hour, I could, by. talking. like. William. Shatner. on. Ritalin. produce text that was maybe 60% - 80% accurate. It was definitely oversold
        • Didn't think your comment was snotty at all. I was just worried mine might be perceived as such.

          I think you will be pleasantly surprised if you try Dragon Systems. Dragon Systems is special among speech engines. It is the long-term pet project of a couple of gifted scientists who decided to solve the problem of speech recognition a generation ago. They filed hundreds of patents over a couple of decades and solved many engineering problems one at a time. IBM took a long term interest in speech recogn

    • Fired up iListen from MacSpeech (who license the Philips Voice Recognition model). Spoke both phrases at normal pace and tone. Initial accuracy 75%. Took 30 seconds to correct errors. Accuracy 100%. Even before training/correction, "recognize speech" was at 100%. The training was to teach the difference between "wreck" and "rack" (although it offered "wreck" as one of the options in the correction mode).

      It ain't perfect, but training is easy these days and accuracies over 95% are arrived at fairly quickly.
  • by StateOfTheUnion ( 762194 ) on Tuesday October 10, 2006 @04:30PM (#16383117) Homepage
    What about data conditioning?

    This project seems to be gathering a "Wild Type" sampling of submitted data. What if the data is not representative . . . for example, a bunch of people in China decide to submit English-language files with the best of intentions, but the data is heavily accented (or, to be fair, a bunch of native English speakers submit heavily accented recordings of Mandarin speech)?

    Without controlling the data source or making sure that the data is valid, one could become a victim of GIGO (Garbage In, Garbage Out). In all fairness, this may not be a problem if the sample size is large enough to overwhelm any outlying data, but I'm not sure that this project has sufficiently addressed this concern . . .

    • by suv4x4 ( 956391 )
      Without controlling the data source or making sure that the data is valid, one could become a victim of GIGO (Garbage In, Garbage Out).

      They had to invent an acronym for this too, didn't they!? Jesus what is going on with this world!

      Wait... who are they?...
      • Do you realize that GIGO is an older acronym than 99% of all the acronyms you've read on Slashdot?

        So your question should be: "Jesus, what was going on in this world way back before I was born!"
    • Re: (Score:3, Interesting)

      by NamShubCMX ( 595740 )
      But shouldn't there be many different accents for such a program to work? I am French Canadian, and I sure hope I don't have to imitate an accent from France for my voice to be faithfully recognized. Because although I have a strong accent from the point of view of French people, I don't from the point of view of Quebec people.

      On the same line of thought, I hope I can use this tool with my heavy (ok, not so bad) English accent...

      I have no clue how those programs work so I might be off-base, but it seems to me that
    • You don't need Chinese people to get heavily accented English. In fact, English varies a lot. If you're in Yorkshire (England) or in Ayrshire (Scotland), in Singapore, Brisbane or Nashville, you'll find extremely different accents. A good speech corpus will contain large samples of as many accents as possible, including meta-data that allows people to filter this to produce an acoustic model that is tailored to intended target users.

      But the same applies to recording modalities. Depending on whether you're b
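
Below is a sketch, in Python, of the metadata filtering described above. The field names and tag values are invented for illustration; the point is simply that a corpus carrying accent and channel metadata lets you select training data for a specific target population.

```python
# Filter a speech corpus by metadata before training an acoustic model.
# All field names and tag values here are invented for illustration.
recordings = [
    {"file": "spk001.wav", "accent": "en-GB-yorkshire", "mic": "headset"},
    {"file": "spk002.wav", "accent": "en-US-southern",  "mic": "telephone"},
    {"file": "spk003.wav", "accent": "en-SG",           "mic": "headset"},
]

def select(corpus, accents=None, mic=None):
    """Keep entries whose accent starts with a given prefix and whose
    recording channel matches, when those filters are supplied."""
    out = []
    for rec in corpus:
        if accents and not any(rec["accent"].startswith(a) for a in accents):
            continue
        if mic and rec["mic"] != mic:
            continue
        out.append(rec)
    return out

# e.g. build a model aimed at British and Singaporean headset users:
print(select(recordings, accents=["en-GB", "en-SG"], mic="headset"))
```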
      • by penix1 ( 722987 )
        So I guess what you are saying is my Mr. Microphone won't cut it? Damn! Back to the drawing board.

        B.
        • hey better than nothing. or as the statistical NLP people tend to say: there's only one thing that's better than data. more data!
    • What if the data is not representative . . . for example, a bunch of people in China decide to submit English-language files with the best of intentions, but the data is heavily accented

      Then it would work great for Chinese users of the software. I don't see a problem here, except that the data needs to be categorized properly.

    • "Without controlling the data source or making sure that the data is valid, one could become a victim of GIGO (Garbage In, Garbage Out). In all fairness, this may not be a problem if the sample size is large enough to overwhelm any outlying data, but I'm not sure that this project has sufficiently addressed this concern . . ." -StateOfTheUnion

      Part of the submission process is for you to classify your dialect. After your recordings are posted, people can rate your recordings & comment on them. I thi
    • for example, a bunch of people in China decide to submit English-language files with the best of intentions, but the data is heavily accented (or, to be fair, a bunch of native English speakers submit heavily accented recordings of Mandarin speech)?

      What I feel is that for English speech, this is exactly what they should do. The language is so common across the globe that all the accented variants should be included in the speech database. I remember using voice recognition software long

  • by bfree ( 113420 ) on Tuesday October 10, 2006 @04:34PM (#16383181)
    as if a million voices cried out in terror and were suddenly silenced
    • Re: (Score:3, Funny)

      by SeaFox ( 739806 )
      as if a million voices cried out in terror and were suddenly silenced

      But thanks to those millions of samples, we can now transcribe "AHHHHHHHHHHHHHHHH!" very accurately.
  • by 5plicer ( 886415 ) on Tuesday October 10, 2006 @04:35PM (#16383203)
    Why not make the files public domain? Is making them GPL really necessary?
    • The difference between PD and GPL is like this:

      1. PD is like a public park, with the problem that somebody could buy a certain section (say, grease a few palms and..) and lock YOU out of it.
      2. GPL is like a park owned by some old loony that leaves the gate open (or, in some cases, owned by a group of folks that HATE each other).
      PD is free now but could be non-free later.
      GPL is free FOREVER.

      (for some projects it's like getting a Jew, a Muslim, a Catholic and several subtypes of Protestants to agree on a "winter holida
      • Well, that's not exactly true. Think of it more like this:

        Public Domain is forever, but people are free to copy it and make the copy (plus all improvements) their own.
        GPL is forever, but the only people free to distribute it are those who provide all the original source IP plus their modifications under the GPL.

        Then, of course, there's also BSDL, where the only restriction is that, unlike the public domain, you are required to credit the original authors of any work you use.

    • by Britz ( 170620 )
      Parent is a troll.

      But what the heck: BSD vs. GPL, let me just get my flameproof stuff.
  • by porkThreeWays ( 895269 ) on Tuesday October 10, 2006 @04:46PM (#16383385)
    It'd be nice if someone could give an overview of the quality and simplicity of some open source speech recognition projects. I've used Sphinx 2, 3, and 4 before with little luck. I don't know if I've got marbles in my mouth or what. Either way, I'm sure there's got to be someone on Slashdot who's used a few and could give an overview to us weekend warriors.
  • by mgkimsal2 ( 200677 ) on Tuesday October 10, 2006 @05:11PM (#16383791) Homepage
    I wrote a bit about this (somewhat negatively) at http://fosterburgess.com/kimsal/?p=139 [fosterburgess.com] a few days ago. I've been looking for a solid option for having dictation automatically transcribed to text files, and having this run under Linux. Basically, anyone looking to do this is just out of luck. It'll be years before there's anything usable for the average person. In my post, I reference another article (http://www.theinquirer.net/default.aspx?article=34072) which also talks about the state of things.

    What's frustrating is that there *was* something halfway decent - IBM's ViaVoice - but that's gone. A few of the Linux apps I see out there are layers to run on top of ViaVoice. With that option gone for Linux, those tools are useless. It's like the rug was pulled out from underneath any progress in this arena for the foreseeable future.

    I found voxforge a few days ago, and while it seems admirable, it's a small part of the larger problem which I don't see getting any better any time soon.
    • Re: (Score:3, Informative)

      by Wumpus ( 9548 )
      IBM's ViaVoice technology can still be licensed from Wizzard Software (http://www.wizzardsoftware.com) and they're still selling the Linux SDK.
      • AFAICT it's a bit out of my price range - the cheapest price I can see is $3400.

        • by Wumpus ( 9548 )
          Well, it is a server product, targeted at developers. Desktop speech products weren't doing too well the last time I looked.
  • It's about time (Score:3, Interesting)

    by pestilence669 ( 823950 ) on Tuesday October 10, 2006 @05:12PM (#16383805)
    I've been waiting for something like this for a long time... Hiring voice actors isn't always feasible. I can work on engine code no problem, but my voice isn't the prettiest. Without repositories like this, projects like Sphinx can have a considerable barrier to entry for the uninitiated. The variety in sources can only improve quality.
    • by kent_eh ( 543303 )
      Hiring voice actors isn't always feasible.

      Nor, in this case I suppose, especially desirable.
      I would expect that you would want a wide variety of voices to train the thing.

      Listen to the general public sometime; do many of them sound like they have "professionally trained" voices?
      Probably not.

      If you train a speech rec. engine with "golden" voices, how can it be expected to figure out the average Joe/Jane on the street?

      I routinely hear from customers whose accent (or manner of enunciation) makes it nearly
      • by kent_eh ( 543303 )
        but if there was a volunteer voice corpus being assembled
        ...And of course, there is.

        (thwacks self on forehead, while chanting RTFA)

        Now where did I put that microphone....

  • I'll have to go back when it is not slashdotted. I am not sure what they need, but if they just need a good voice reading something I'll give it a try. I have been told that I sound like the guy on Moviefone. :)

    I would also love to see open source dictation software. [shameless plug](See my journal for why)[/shameless plug]
    • I am not sure what they need, but if they just need a good voice reading something I'll give it a try. I have been told that I sound like the guy on Moviefone. :)

      Speech recognition performance on low-noise, read "proper" speech is actually impressively good. The forefront of speech recognition research is on noisy, spontaneous and conversational speech - i.e. real world speech. Any speech data is helpful, but the state-of-the-art would actually be better served by contributions of sub-optimal speech fro

  • I wonder if a better understanding of speech recognition - and having accurate voice models - would allow us to tweak our advanced artificial speech programs? While they probably won't do too much to help a computer understand the actual structure of a sentence (word recognition and pronunciation), it might allow them to produce words or sentences that flow more realistically or have more realistic peaks and stresses on various words/syllables.

    I've seen some decent ones, and the OS ones aren't better than the
    • I could have sworn that Steve Gibson wrote an article quite a while ago on using DSP to join the words in a more natural-sounding manner. Can't seem to find it with a search of
      "Steve Gibson" speech DSP words
      Did turn up this reference though "world gay escort dating free of charge", heh. A result of keyword stuffing in the linked site rather than a legitimate hit. Steve must be proud that his name is deemed such a valuable search keyword.
  • what? you want to tickle my ass with a feather? oh, particularly nice weather.
  • by jesup ( 8690 ) * <(randellslashdot) (at) (jesup.org)> on Tuesday October 10, 2006 @07:40PM (#16385567) Homepage
    Back in roughly 1991 or 1992, I was working at Commodore/Amiga with AT&T DSP3210's (we were considering adding them to Amiga 3000's/4000's). They supported speech recognition, and due to that (somehow) I was asked to participate in a NIST program to collect speech samples over the telephone. You made calls where you were randomly connected to another participant, and talked about a given topic. I imagine they were later transcribed; the purpose of this was to create a natural (connected) speech database for speech recognition researchers and vendors.

    Since it was done by NIST, I imagine the database is available. Note that it will be limited by the telephone call quality (4 kHz).
    • by hbr ( 556774 )
      This is probably what you mean:

      http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S62 [upenn.edu]

      This kind of speech, um, yeah, is a - a world away, you know what I mean, from how most users speak to dictation software, command-and-control, etc.

      The Linguistic Data Consortium (LDC) http://www.ldc.upenn.edu/ [upenn.edu] is the main source of speech corpora that I know about. You have to pay and possibly be a member (depending on the corpus you want I think). The catalog covers all kinds of speech. Another sourc

    • Since it is funded by NIST, I imagine that the database is not available. This is the same organization that manages to conduct "open" testing of machine translation systems without making the actual translations public.
    • Just a thought: instead of a keyboard, use a phone?
  • How much data are they planning to collect and how will it be distributed? The web site says that they will make available 8kHz and 16kHz versions of the data, both with 16-bit samples. These days, decent applications are trained on hundreds and even thousands of hours of audio. So, let's say they want to collect and distribute 1000 hours of 16kHz, 16 bit audio. That's 32,000 bytes per second of audio, or about 115 megabytes per hour, or 115 gigabytes per 1000 hours! Even 500 hours (58 gigs) is a LOT of da
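
The arithmetic above holds up; here is a quick back-of-the-envelope check in Python, using the figures quoted in the comment (16 kHz, 16-bit mono PCM):

```python
# Storage check for uncompressed 16 kHz, 16-bit mono PCM speech audio.
bytes_per_second = 16_000 * 2                # 32,000 bytes per second
mb_per_hour = bytes_per_second * 3600 / 1e6  # ~115 MB per hour
print(f"{mb_per_hour:.1f} MB per hour")                      # 115.2
print(f"{mb_per_hour * 1000 / 1000:.1f} GB per 1000 hours")  # 115.2
print(f"{mb_per_hour * 500 / 1000:.1f} GB per 500 hours")    # 57.6
```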
  • Open sores peach wreck ignition is final ready.
  • ... can you copy and paste, "Acoustic Models to be used by Speech Recognition Engines"?

    Sorry, someone was excited about "Acoustic Models to be used by Speech Recognition Engines". [giggle]
  • by jhutch2000 ( 801707 ) on Wednesday October 11, 2006 @09:01AM (#16391501)
    Librivox has THOUSANDS of hours of audio books available. Every last second is public domain.

    I'm probably missing something in regards to why this stuff can't be used...

    • by beholder ( 35083 )
      Librivox's recordings are continuous speech. VoxForge is looking for command-and-control phrases (short and snappy).

      I would assume the phrasing patterns would be quite different.
  • This is the sort of effort in which commercial participation would be a strong benefit. If this were MIT- or BSD-licensed, I would put it into a specific one of my commercial products for the Nintendo DS, and I'd put a whole lot of work into it. But I can't. Every time I talk about how I can't help certain projects, I get modded down as a troll, because I'm saying something a GPL fan doesn't want to hear. I'm not trolling, and I'm not being flamebait. This is a serious problem. I can name several othe
