Archive for the 'Tech talk' Category

Speech Recognition & Sound Compression

CTOs often ask about sound compression, hence this dedicated post. SpeechMagic provides high sound compression (Philips CELP – 19.2 kbit/s) so that sound data can be transferred easily over band-limited channels with guaranteed high recognition rates. The following sound file formats can be processed by the SpeechMagic engine:

  • Philips CELP 16 kHz / 16 bit – 19 kbit/s (SpeechMagic native format, 8.24 MB/h)
  • Philips CELP 8 kHz / 16 bit – 19 kbit/s (SpeechMagic native format)
  • PCM 16 kHz / 16 bit – 256 kbit/s (PC)
  • PCM 11 kHz / 16 bit – 176 kbit/s (mobile input devices)
  • PCM 8 kHz / 8 bit and 16 bit – 64/128 kbit/s (telephone)
  • CCITT A-law, µ-law 8 kHz / 8 bit – 64 kbit/s (telephone)
  • DSS Standard Play
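The PCM figures in this list follow directly from sample rate times bit depth, and the 8.24 MB/h quoted for CELP matches the 19.2 kbit/s rate if one MB is taken as 1024 × 1024 bytes. Here is a minimal sketch of that arithmetic; the function names are mine, mono audio is assumed, and "11 kHz" is assumed to mean the usual 11,025 Hz.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Uncompressed PCM bit rate in kbit/s (1 kbit = 1000 bits)."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def megabytes_per_hour(bitrate_kbps: float) -> float:
    """Storage needed for one hour of audio, in MB of 1024 * 1024 bytes."""
    bytes_per_hour = bitrate_kbps * 1000 / 8 * 3600
    return bytes_per_hour / (1024 * 1024)


pcm_16k = pcm_bitrate_kbps(16_000, 16)        # PC input: 256 kbit/s
pcm_11k = pcm_bitrate_kbps(11_025, 16)        # mobile input devices: about 176 kbit/s
pcm_8k_8bit = pcm_bitrate_kbps(8_000, 8)      # telephone: 64 kbit/s
celp_mb_per_hour = megabytes_per_hour(19.2)   # Philips CELP: about 8.24 MB/h
pcm_mb_per_hour = megabytes_per_hour(pcm_16k) # same hour as 16 kHz PCM: about 110 MB
```

In other words, an hour of dictation in the native CELP format takes roughly one thirteenth of the space of the same hour stored as 16 kHz / 16 bit PCM.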

What is a Speech Recognition Context ?

Developed to avoid situations like these, a ConText is the “fuel” that feeds a speech recognition engine. Specific to one language AND one field of expertise (e.g. Radiology, UK), ConTexts were invented by Philips to tailor a speech recognition system to a given professional environment. Let’s see exactly what is inside a speech recognition ConText and what methodology is used to develop one.

Ingredient #1: ConText Lexicon. This is a list containing words and phrases specific to a particular usage and language, e.g. Radiology, UK. To create a valuable ConText Lexicon, approximately 100 million words from the language area are needed. In the case of a Radiology ConText, those words are best obtained from existing radiological reports in order to capture the terminology in general use. Philips or its integration partner would typically gather the required data in the form of fully anonymized reports from different healthcare organizations in a given country.

Ingredient #2: Background Lexicon. This is a dictionary containing between 300,000 and 800,000 words (depending on the language) whose usage is not considered frequent enough for inclusion in a specific ConText Lexicon. This background lexicon is used for reference when unknown words are added to the ConText during ConText Adaptation (the process that updates an author’s language model and ConText Lexicon based on his corrections of a “recognized report”, in order to improve the recognition rate).

Ingredient #3: Default Language Model. This is the ConText Lexicon plus a statistical model which represents word usage and sequences of words. It represents the way a group of people uses a language in a specific context, a professional one for instance. The language model is specific to an author and a ConText.

Ingredient #4: Acoustic Reference. This is a collection of statistical data describing the vocal characteristics of an individual user. The production of a phoneme varies from one person to another (variables include accent, age, pronunciation, etc.) and a language is not spoken in 2007 the way it was in the 1950s. The Acoustic Reference therefore “takes a picture” of how a language is spoken at a given point in time. To develop an Acoustic Reference for Swedish, for instance, several hundred hours of spoken Swedish covering all regions of the country are recorded and analyzed, resulting in an average model. Based on this average model, the speech recognition engine is then able to interpret an author’s speech input and optimize the recognition rate regardless of his dialect, age, etc. This specific data, unique to each author, is stored in an ARF (Acoustic Reference File).

The list of speech recognition ConTexts developed by Philips to date can be found here.

The beauty of the network approach

Why does professional speech recognition work so well compared with individual applications? Well, let’s think about it. What professional SR does is network multiple physicians from the same specialty around what they have in common: their language patterns and medical vocabulary. This collegial approach makes a huge difference in itself, since the SR engine will be loaded with vocabulary specific to Pathology or Cardiology, for instance. In specialties where medical terminology prevails in the reporting process (e.g. Radiology as opposed to Psychiatry), great results are achieved right from the start.

Achieving similar results as a consumer would require patience, not to mention advanced organization skills. Let’s say I’m a soccer fan using speech recognition to comment on game strategies: I’d better be part of a networked community sharing the exact same interest and using speech recognition for the exact same purpose…

Now, what about individual pronunciations? How does the engine work this out? Once a “speech recognized” report has been corrected and signed off, the speech recognition engine initiates what is probably the most important phase of all: Adaptation. During adaptation, the SR engine makes all the required adjustments by comparing the recognized draft report with its corrected final version, matching a specific pronunciation with a specific word here, collecting an unknown word to be added to the lexicon there. And because this lexicon is shared with other users in the department, every new word is automatically and immediately made available to everyone else on the network. That’s nothing more than the whole “United we stand, divided we fall” concept at work.
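To make the draft-versus-final comparison a little more concrete, here is a toy sketch of the text side of adaptation. It is not how SpeechMagic implements it: the function name, the word-level diff using Python’s standard `difflib`, and the plain set standing in for the shared lexicon are all assumptions made for illustration, and the acoustic side (matching pronunciations to words) is left out entirely.

```python
from difflib import SequenceMatcher


def adaptation_diff(draft: str, final: str, lexicon: set[str]):
    """Compare a recognized draft with its corrected version.

    Returns (corrections, new_words): the word-level changes the author
    made, and the corrected words that are missing from the lexicon.
    """
    draft_words = draft.lower().split()
    final_words = final.lower().split()
    matcher = SequenceMatcher(a=draft_words, b=final_words, autojunk=False)

    corrections = []
    new_words = set()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # the engine got this stretch right
        recognized = " ".join(draft_words[i1:i2])
        corrected = " ".join(final_words[j1:j2])
        corrections.append((recognized, corrected))
        new_words.update(w for w in final_words[j1:j2] if w not in lexicon)
    return corrections, new_words


# Hypothetical department lexicon and report pair.
shared_lexicon = {"the", "scan", "shows", "no", "evidence", "of", "a", "mass", "effusion"}
draft = "the scan shows no evidence of a plural effusion"
final = "the scan shows no evidence of a pleural effusion"

corrections, new_words = adaptation_diff(draft, final, shared_lexicon)
shared_lexicon |= new_words  # the new word is now available to every user of the lexicon
```

Here the only correction is “plural” to “pleural”, and because “pleural” was unknown it lands in the shared lexicon, where the next physician in the department benefits from it as well.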

Speaking hardware: what are the DSS and DSSPro standards?

Healthcare facilities setting sail for speech recognition are typically advised to equip their physicians with digital dictation devices that support the DSS or DSS Pro format. The reasons? Optimal sound quality and sampling rates, both key ingredients of a successful speech recognition experience.

A bit of history first. The .dss format was created by a voluntary organization called the International Voice Association (IVA), formed jointly by Grundig, Olympus and Philips back in 1994. DSS is maintained as a manufacturer-independent and international standard for professional speech processing that can be used – under certain conditions – by any manufacturer, as long as it is used in professional devices. This guarantees the user a secure investment in terms of the procurement, use and future compatibility of his systems.

DSS offers high audio quality and allows a high compression rate without noticeable loss of quality, as well as low energy consumption. The compression was designed to permit efficient memory usage and data transfer for digitized speech. The quality had to be retained so that even quietly spoken passages could be clearly understood and speech recognition could be applied. At the same time, everything had to be accomplished at a reasonable computational cost in order to keep power consumption in check, because mobile dictation devices are frequently used for extended periods.

DSS is often called “MP3 for Speech”. As a compression algorithm for speech, DSS is comparable with the music format MP3. Although the sound quality differs only negligibly from the uncompressed original, .dss files are very small. This allows them to be transferred quickly to the PC and easily sent by e-mail. Because the technology only compresses the parts of speech that are truly important, the standard effectively distills a dictation down to its speech content without losing quality. A 10-minute dictation that requires only about 1 MB in the .dss format requires up to 12 times as much memory with typical compression.
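As a rough sanity check on those numbers, the average bit rate implied by a file size and a duration is easy to work out. The helper below is my own and assumes 1 MB = 1,000,000 bytes; the results are back-of-the-envelope figures derived from the sizes quoted above, not values from the DSS specification.

```python
def implied_bitrate_kbps(size_mb: float, minutes: float) -> float:
    """Average bit rate implied by a file size and duration (1 MB = 10**6 bytes)."""
    bits = size_mb * 1_000_000 * 8
    return bits / (minutes * 60) / 1000


dss_kbps = implied_bitrate_kbps(1, 10)     # about 13.3 kbit/s for the 1 MB .dss file
other_kbps = implied_bitrate_kbps(12, 10)  # about 160 kbit/s for a file 12 times larger
```

Under these assumptions, a 10-minute .dss dictation averages a little over 13 kbit/s, while a file 12 times larger works out to about 160 kbit/s, which sits between the 128 and 176 kbit/s PCM rates.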

In March 2007, the IVA launched DSSPro, presenting it as being “far more than just a speech recording standard – DSSPro actually allowing far-reaching management functions for the workflow,” thanks to the following new functions:

  • Support of real-time file encryption during recording to protect confidential dictation data.
  • Higher 16 kHz sampling rate provides a more natural playback of human voice as well as optimized quality for speech recognition.

Popular digital dictation devices supporting the DSS format:

  • Philips SpeechMike range (PC microphone)
  • Philips Digital Pocket Memo 9360 (mobile recorder)
  • Olympus DS-3300 (mobile recorder)
  • Grundig Digta CordEx (PC microphone)

Popular digital dictation devices supporting the DSSPro format:

  • Philips Digital Pocket Memo 9600 (mobile recorder)
  • Grundig DigtaSonic xMic (PC microphone)
  • Olympus DS-4000 (mobile recorder)

The 4 Commandments of Intelligent Speech Recognition

We tend to think that speech recognition works by understanding the phonetics behind words and the way a user pronounces those very sounds. Well, that’s “voice” recognition, not “speech” recognition. To be beneficial in a professional document creation approach, a system must be able to interpret what the speaker means, beyond the successful sound-word association. So when you think about it, speech recognition is more about syntax and probability models than sound analysis. This is what Philips calls Intelligent Speech Interpretation, with a fourfold mission that I’m going to baptize the “4 Commandments of Speech Recognition”, as opposed to “Voice” Recognition.

Thou shall emulate the capabilities of a good medical Transcriptionist
Just like a medical transcriptionist, the system goes beyond simply typing what was dictated by the physician. The first step is to leave out the ‘um’s and ‘eh’s and ignore the “one latte and a chocolate donut, please” that doesn’t belong to the diagnosis. The system is then able to format and organize text, add section headings, numbered lists and standard blocks of text, and even rephrase sentences when needed.

Thou shall detect and filter background noise
ER physicians will understand what I mean by “background noise”… The challenge for a speech recognition system is to filter out those acoustic events that have no relevance to the current report. The system compares those events with known variations in speaker characteristics in order to compensate for deviations. The same rule is applied to dialects, pitch and speed variations, and clarity of pronunciation.

Thou shall not forget that a word is part of a sentence
As described by Marcel Wassink, Managing Director for Philips Speech Recognition Systems, “awareness of what people are likely to say not only helps recognize what they do say, it also helps identify what doesn’t belong, for example, “PET” (positron emission tomography) is more likely in a radiologist’s report than “pet” (an animal kept at home). This awareness is also about knowing the probability of a particular word, given the words used before: the probability of “PET” being followed by “scan” is much higher than it being followed by “food”. Speech recognition thereby offers dedicated dictionaries related to the physician’s speciality that maximizes the recognition of complex profession-related terminology.”
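The “probability of a word given the words used before” idea can be illustrated with the simplest possible language model, a bigram count. This is only a toy sketch: the four report sentences are invented, the function names are mine, and a real engine works with far larger corpora, longer word histories and smoothing, none of which is shown here.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows each other word."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for previous, current in zip(words, words[1:]):
            follows[previous][current] += 1
    return follows


def next_word_probability(follows, previous, candidate):
    """Estimate P(candidate | previous) from raw counts, without smoothing."""
    total = sum(follows[previous].values())
    return follows[previous][candidate] / total if total else 0.0


# Invented radiology-style sentences standing in for a corpus of reports.
reports = [
    "pet scan shows increased uptake",
    "pet scan is unremarkable",
    "follow-up pet scan recommended",
    "pet imaging was performed",
]
model = train_bigrams(reports)

p_scan = next_word_probability(model, "pet", "scan")  # "scan" follows "pet" in 3 of 4 cases
p_food = next_word_probability(model, "pet", "food")  # "food" never follows "pet" here
```

Trained on radiology reports, such a model will favour “scan” after “PET”; trained on a pet-shop catalogue, the same code would favour “food”. That is the whole point of building the model per specialty.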

Thou shall think twice
“The system works internally with phonetic representations of words, and rules for the structures of phrases, sentences and documents. Basic representations and rules, along with suitable vocabulary, are initially entered into the system, which then statistically examines large numbers of existing texts. When transcribing a dictation, the system compares the words on hand with these statistics to imply the word, phrase, sentence or document section, and adjust the output accordingly.”

These are some of the big breakthroughs that have changed the speech recognition industry over the past decade and, at the same time, split the market in two: the professional market and the consumer market. Indeed, I don’t see how off-the-shelf, basic voice recognition software could be of any help to healthcare users looking to automate the entire documentation workflow. In my opinion, that would be like trying to build a six-lane highway using backyard-digging and earth-moving equipment from the Home Depot…

To find out more on Intelligent Speech Interpretation, you can refer to the following white paper or article from the e-Health Insider.
