Archive for August, 2007

European speech recognition saga: Episode 3

Munich, Germany: Looks like our tour of Europe is far from over, as news just came through that…

- The Munich public hospital network (5 sites - 3,500 beds) will be rolling out a speech recognition system to a total of 100 physicians and 32 workstations in Radiology.

- The AZ Sint-Jan Hospital in Brugge, Belgium has completed an interface between speech recognition and their electronic health record (EHR) system. Driven by the Radiology and the Pathology departments, the 900-bed hospital has deployed the new reporting solution across all specialties including Cardiology, Gynecology and Orthopedics.

- 60 radiologists at the Aberdeen Royal Infirmary (ARI) claim they are now in full control of the creation of medical reports thanks to the implementation of a front-end speech recognition system.

No apparent summer break for speech recognition. News of more installations keeps coming in every week. Now, this is getting really exciting…

What is a Speech Recognition Context?

Developed to avoid situations like these, a ConText is the “fuel” that feeds a speech recognition engine. Specific to one language AND one field of expertise (e.g., Radiology, UK), ConTexts were invented by Philips to tailor a speech recognition system to a given professional environment. Let’s see what exactly is inside a speech recognition ConText and what methodology is used to develop one.

Ingredient #1: ConText Lexicon. This is a list containing words and phrases specific to a particular usage and language, e.g. Radiology, UK. To create a valuable ConText Lexicon, approximately 100 million words from the language area are needed. In the case of a Radiology ConText, those words are best obtained from existing radiological reports in order to capture the generally applied terminology. Philips or their integration partner would typically gather the required data in the form of fully anonymized reports from different healthcare organizations in a given country.

Ingredient #2: Background Lexicon. This is a dictionary containing between 300,000 and 800,000 words (depending on the language) whose usage is not considered frequent enough for inclusion in a specific ConText Lexicon. This background lexicon is used for reference when unknown words are added to the ConText during ConText Adaptation (the process that updates an author’s language model and ConText Lexicon based on his corrections of a “recognized report”, in order to improve the recognition rate).

Ingredient #3: Default Language Model. This is the ConText Lexicon plus a statistical model which represents word usage and sequences of words. It represents the way a group of people uses a language in a specific context, a professional one for instance. The language model is specific to an author and a ConText.

Ingredient #4: Acoustic Reference. This is a collection of statistical data describing the vocal characteristics of an individual user. The production of a phoneme varies from one person to another (variables include accent, age, pronunciation, etc.) and a language is not spoken in 2007 the way it was in the 1950s. The Acoustic Reference will thereby “take a picture” of how a language is spoken at a given point in time. To develop an Acoustic Reference, for Swedish for instance, several hundred hours of spoken Swedish covering all regions of the country are recorded and analyzed, resulting in an average model. Based on this average model, the speech recognition engine will then be able to interpret an author’s speech input and optimize the recognition rate regardless of his dialect, age, etc. This specific data, unique to each author, is stored in an ARF (Acoustic Reference File).

The list of speech recognition ConTexts developed by Philips to date can be found here.

Thought of the day: “Contextual Intelligence Matters…”

contextual intelligence matters

Ooooppss…doesn’t it?

The beauty of the network approach

Why does professional speech recognition work so well as opposed to individual applications? Well, let’s think about it. What professional SR does is network multiple physicians from the same specialty across what they have in common: their language patterns and medical vocabulary. This collegial approach makes a huge difference in itself, since the SR engine will be loaded with vocabulary specific to Pathology or Cardiology for instance. In specialties where medical terminology prevails in the reporting process (e.g., Radiology as opposed to Psychiatry), great results are achieved right from the start.

Achieving similar results as a consumer would require patience, not to mention advanced organizational skills. Let’s say I’m a soccer fan using speech recognition to comment on game strategies: I’d better be part of a networked community sharing the exact same interest and using speech recognition for the exact same purpose…

Now, what about individual pronunciations? How does the engine work this out? Once a “speech recognized” report has been corrected and signed off, the speech recognition engine initiates what is probably the most important phase of all; it is called Adaptation. During adaptation, the SR engine makes all the required adjustments by comparing the recognized -draft- report and its corrected -final- version, matching a specific pronunciation with a specific word here, collecting an unknown word to be added to the lexicon there. And because this lexicon is shared with other users in the department, every new word is automatically and immediately made available to everyone else on the network. That’s nothing more than the whole “United we stand, divided we fall” concept at work.
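To make the draft-versus-final comparison a little more concrete, here is a minimal sketch in Python. It is not how the Philips engine works internally (that is not documented here); it only illustrates the idea with the standard difflib module. The function name, the sample report and the tiny lexicon are all made up for illustration.

```python
import difflib

def adaptation_candidates(draft, final, lexicon):
    """Compare a recognized draft with its corrected final version.

    Returns (corrections, new_words):
      corrections: list of (recognized text, corrected text) pairs
      new_words:   words in the final report missing from the lexicon
    Hypothetical helper for illustration only.
    """
    draft_words = draft.lower().split()
    final_words = final.lower().split()
    matcher = difflib.SequenceMatcher(a=draft_words, b=final_words, autojunk=False)
    corrections = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":  # the corrector swapped these words
            corrections.append(
                (" ".join(draft_words[i1:i2]), " ".join(final_words[j1:j2]))
            )
    new_words = sorted({w for w in final_words if w not in lexicon})
    return corrections, new_words

# Shared departmental lexicon (made-up sample)
shared_lexicon = {"there", "is", "a", "small", "effusion", "plural"}

corrections, new_words = adaptation_candidates(
    "there is a small plural effusion",
    "there is a small pleural effusion",
    shared_lexicon,
)
shared_lexicon.update(new_words)  # every user of the lexicon now has the new word
```

Because the lexicon object is shared, a single update after one physician's correction is enough to make the word available to everybody else, which is the point of the network approach.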

Speaking hardware: what are the DSS and DSSPro standards?

Healthcare facilities setting sail for speech recognition are typically advised to equip their physician staff with digital dictation devices that support the DSS or DSS Pro format. The reasons? Optimal sound quality and sampling rates; both key ingredients to a successful speech recognition experience.

A bit of history first. The .dss format was created by a voluntary organization called the International Voice Association (IVA), formed jointly by Grundig, Olympus and Philips back in 1994. DSS is maintained as a manufacturer-independent and international standard for professional speech processing that can be used - under certain conditions - by any manufacturer, as long as it is used in professional devices. This guarantees the user a secure investment in terms of the procurement, use and future compatibility of his systems.

DSS offers high audio quality and allows a high compression rate without noticeable loss of quality, as well as low energy consumption. The compression was to permit efficient memory usage and data transfer for digitized speech. The quality had to be retained so that even quietly spoken passages could be clearly understood and speech recognition could be applied. At the same time, everything had to be accomplished at a reasonable computational expense in order to keep power consumption in check because mobile dictation devices are frequently used for extended periods.

DSS is often called “MP3 for Speech”. As a compression algorithm for speech, DSS is comparable with the music format MP3. Although the sound quality differs only negligibly from the uncompressed original, .dss files are very small. This allows them to be transferred quickly to the PC and easily sent by e-mail. Because the technology only compresses the parts of speech that are truly important, the standard practically filters out the concentrated speech of a dictation without losing quality. A 10-minute dictation that requires only about 1 MB in the .dss format requires up to 12 times as much memory with typical compression.
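For a feel of what those figures mean, here is a quick back-of-the-envelope calculation in Python. It only uses the numbers quoted above (about 1 MB per 10 minutes, and up to 12 times more with typical compression) and assumes 1 MB means one million bytes; the function name is mine, not part of any DSS tooling.

```python
def average_bitrate_kbps(size_mb, minutes):
    """Average bitrate in kbit/s for a recording of the given size and length.

    Assumes 1 MB = 1,000,000 bytes (an assumption; the source does not say).
    """
    bits = size_mb * 1_000_000 * 8
    return bits / (minutes * 60) / 1000

dss_rate = average_bitrate_kbps(1, 10)      # roughly 13.3 kbit/s
typical_rate = average_bitrate_kbps(12, 10) # the "up to 12 times" case: 160 kbit/s
```

In other words, the quoted file size works out to an average of about 13 kbit/s, which is why a dictation travels so easily by e-mail.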

In March 2007, the IVA launched DSSPro, presenting it as being “far more than just a speech recording standard - DSSPro actually allowing far-reaching management functions for the workflow,” thanks to the following new functions:

  • Support of real-time file encryption during recording to protect confidential dictation data.
  • Higher 16 kHz sampling rate provides a more natural playback of human voice as well as optimized quality for speech recognition.

Popular digital dictation devices supporting the DSS format:

Philips SpeechMike range (PC microphone)
Philips Digital Pocket Memo 9360 (mobile recorder)
Olympus DS-3300 (mobile recorder)
Grundig Digta CordEx (PC microphone)

Popular digital dictation devices supporting the DSSPro format:

Philips Digital Pocket Memo 9600 (mobile recorder)
Grundig DigtaSonic xMic (PC microphone)
Olympus DS-4000 (mobile recorder)

Bilingual speech recognition doesn’t let physicians get lost in translation

In border regions like Eastern Ontario, Canada, two languages are spoken. When a patient comes into a hospital saying either “it hurts” or “j’ai mal”, healthcare organizations have to add the language factor to an already complex documentation workflow. As obvious as it may sound, isn’t documenting a case in the patient’s native language the very first step to accurate and quality healthcare? And since nothing seems to stop 21st-century hospitals on their way to delivering this long-awaited modern healthcare, a hospital in the bilingual Ottawa region has decided to take on the challenge using bilingual speech recognition: a North American, if not worldwide, first.

Ottawa-based Hôpital Montfort is a 206-bed facility that boasts 100 physicians on its active medical staff and twelve medical transcriptionists. After implementing an integrated document creation platform including digital dictation, transcription and distribution in 2007, the hospital is now set to implement an additional, bilingual speech recognition module to further accelerate the processing of reports based on the patient’s language. Based on the language set either within the physician’s profile or upon the physician’s login, the speech recognition engine would launch the proper language ConText in the background. So if a French-speaking patient comes in, the physician would dictate in French using a French-Canadian speech recognition ConText. The voice file would then be automatically routed to a French-speaking correction resource, and the final report issued in French.
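The routing logic described above can be sketched in a few lines of Python. This is only an illustration of the idea (language chosen at login, otherwise taken from the profile), not the hospital's actual system; the class, the function, the ConText labels and the queue names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical ConText labels, keyed by language code
CONTEXTS = {
    "fr": "French-Canadian ConText",
    "en": "English ConText",
}

@dataclass
class Physician:
    name: str
    profile_language: str  # default language stored in the physician's profile

def route_dictation(physician: Physician, login_language: Optional[str] = None) -> dict:
    """Pick the ConText, correction queue and report language for a dictation.

    A language chosen at login overrides the one stored in the profile.
    """
    language = login_language or physician.profile_language
    if language not in CONTEXTS:
        raise ValueError(f"No ConText available for language: {language}")
    return {
        "context": CONTEXTS[language],
        "correction_queue": f"{language}-transcription",  # made-up queue name
        "report_language": language,
    }

doctor = Physician(name="Dr. Tremblay", profile_language="en")
routing = route_dictation(doctor, login_language="fr")  # French-speaking patient
```

Keeping the language as a single value that drives all three decisions is what lets the voice file, the correction step and the final report stay in the same language from end to end.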

In addition, to ensure the instant availability of up-to-date patient results and demographics to all relevant medical staff, bi-directional HL7 interfaces have already been implemented between the dictation-transcription platform and the Hospital Information System (Admission, Discharge and Transfer, or ADT). A similar interface has also been implemented with the hospital’s Patient Care Inquiry module, thus enabling instant viewing of pathology reports, once again, regardless of the language used.

The project has raised interest in other parts of the world, and a large hospital in France is currently looking to implement a similar solution.

More on European Speech Recognition Projects

Industry expert Nick van Terheyden, MD, Chief Medical Officer, Philips Speech Recognition Systems, wanted to comment on my previous thread about European speech recognition projects with a few additional figures. So it is with pleasure that I am posting van Terheyden’s thread today for another tour of Europe.

Speech recognition has definitely reached a “tipping point” in the Old Continent, with a rather impressive number of projects underway:

  • The Dutch are driving forward with an adoption rate estimated at 80% in several specialties across the Netherlands.
  • The Spanish are surging ahead with 50% of Spain’s radiologists using front-end speech recognition in the Valencia region.
  • The Norwegians are notably ahead with 100% of Norway’s Healthcare regions implementing speech recognition.
  • The Danes are delivering value at the Vejle County Hospital, where speech recognition is fully integrated with their electronic health record system for 1,400 users; an overall productivity rise of 5 to 7 percent that represents savings of several million Danish Kroner (1m DKK = 184,000 USD).
  • The French are forging forward with all 39 Public hospitals in Paris (15,000 physicians and transcriptionists in total) to be equipped with speech recognition by 2010.
  • And the Italians in all this? With no less than 22 hospitals in the idyllic Friuli-Venezia Giulia region having recently adopted front-end speech recognition, legions of physicians are just about to cross the RubiCon-Text. Alea jacta est.

Note: a ConText is a collection of acoustic data and vocabulary that reflects the spoken and written language used by professionals in a specific medical specialty, as developed by Philips Speech Recognition Systems.

The Initial Training Myth

How many times have I heard physicians voice concern over the initial time required to “train” a speech recognition system in these words: “too long” and “not worth the effort”. Well, that might have been true 10 years ago. And that might still be true for consumer products, which are not tailored to a specific profile of users the way professional speech recognition is for healthcare. Sit back and relax, as here comes the good news:

With professional speech recognition, the voice model training (initial training) typically takes two minutes and is often not necessary for native speakers. For non-native speakers or speakers with a strong accent, up to ten minutes of initial training is recommended. Typically, voice model training is carried out using a wizard that asks the physician to read a given text aloud, against which the voice model is adjusted.

An example of integrated PACS / speech recognition

With over 110 physicians, the Buffalo Medical Group is one of the oldest and largest multi-specialty physician group practices in New York State. In 2004, the organization chose to implement a back-end speech recognition system fully interfaced with their PACS in order to streamline the documentation workflow in their Radiology Department. Michelle Roesler, director of radiology, shared her experience in a 2006 issue of Health Imaging & IT:

At this point, all of the reports generated through speech recognition go through transcription for quality assurance and are then sent back to the radiologist for sign off.

Speech recognition is integrated with the PACS, which simplifies workflow and increases productivity, enabling radiologists to begin their dictation immediately without the need to enter details such as bar codes. Completed results can be seamlessly uploaded to the PACS through an HL7 connection. All information is exchanged electronically through a bidirectional interface. Once a dictation is complete, the report is processed in the back-end by the speech recognition server and made available to a transcriptionist for correction. The integration between the two environments also eliminates the need for the user to log on twice: the digital dictation application is started and ended automatically by logging on and off PACS. When a radiologist pulls up images on PACS, the medical record number is blown in for them.

Voice streaming over IP is used in parallel. This service was designed by the vendor to automatically determine the fastest method for sending files to the central system, thereby eliminating bandwidth-intensive methods for voice file transfer purposes such as FTP, e-mail, or file copying. Dictations, whether completed in the office or on the go, are immediately made available for transcription. Only the required information needs to be transferred to the transcriptionist and no file is ever stored on the local PC. Centralizing all client/patient information, document types and templates on the main server allows BMG to preserve the integrity of voice files while ensuring an optimal level of security in the document production process.

The 4 Commandments of Intelligent Speech Recognition

We tend to think that speech recognition works by understanding the phonetics behind words and the way a user pronounces those very sounds. Well, that’s “voice” recognition, not “speech” recognition. To be beneficial in a professional document creation approach, a system must be able to interpret what the speaker means, beyond the successful sound-word association. So when you think about it, speech recognition is more about syntax and probability models than sound analysis. This is what Philips calls Intelligent Speech Interpretation, with a fourfold mission that I’m going to baptize the “4 Commandments of Speech Recognition” as opposed to “Voice” Recognition.

Thou shall emulate the capabilities of a good medical Transcriptionist
Just like a medical transcriptionist, the system goes beyond simply typing what was dictated by the physician. The first step is to leave out the ‘um’s and ‘eh’s and ignore the “one latte and chocolate donut, please” that doesn’t belong to the diagnosis. The system is then able to format and organize text, add section headings, numbered lists and standard blocks of text, and even rephrase sentences when needed.

Thou shall detect and filter background noise
ER physicians will understand what I mean by “background noise…” The challenge for a speech recognition system is to be able to filter out those acoustic events which have no relevance for the current report. What the system does is compare those events with known variations in speaker characteristics in order to compensate for deviations. The same rule is applied to dialects, pitch and speed variations, and clarity of pronunciation.

Thou shall not forget that a word is part of a sentence
As described by Marcel Wassink, Managing Director for Philips Speech Recognition Systems, “awareness of what people are likely to say not only helps recognize what they do say, it also helps identify what doesn’t belong, for example, “PET” (positron emission tomography) is more likely in a radiologist’s report than “pet” (an animal kept at home). This awareness is also about knowing the probability of a particular word, given the words used before: the probability of “PET” being followed by “scan” is much higher than it being followed by “food”. Speech recognition thereby offers dedicated dictionaries related to the physician’s speciality that maximize the recognition of complex profession-related terminology.”
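The “PET scan” versus “PET food” point is easy to illustrate with a toy bigram model in Python. To be clear, this is a textbook-style sketch and not the Philips language model; the four-sentence corpus is invented, and a real ConText would be trained on millions of words.

```python
from collections import Counter

# Invented mini-corpus standing in for a pile of radiology reports
corpus = [
    "pet scan shows no abnormal uptake",
    "pet scan was performed today",
    "follow up pet scan recommended",
    "the pet food was recalled",
]

def train_bigrams(sentences):
    """Count how often each word is followed by each other word."""
    predecessors, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        predecessors.update(words[:-1])        # words that have a successor
        bigrams.update(zip(words, words[1:]))  # (previous word, next word) pairs
    return predecessors, bigrams

def next_word_probability(predecessors, bigrams, prev, word):
    """Estimate P(word | prev) as count(prev, word) / count(prev)."""
    if predecessors[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / predecessors[prev]

predecessors, bigrams = train_bigrams(corpus)
p_scan = next_word_probability(predecessors, bigrams, "pet", "scan")  # 0.75
p_food = next_word_probability(predecessors, bigrams, "pet", "food")  # 0.25
```

With a corpus drawn from radiology reports, “scan” wins comfortably after “PET”; feed the same code a corpus of pet-shop reviews and the numbers would flip, which is exactly why a ConText is tied to one specialty.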

Thou shall think twice
“The system works internally with phonetic representations of words, and rules for the structures of phrases, sentences and documents. Basic representations and rules, along with suitable vocabulary, are initially entered into the system, which then statistically examines large numbers of existing texts. When transcribing a dictation, the system compares the words on hand with these statistics to imply the word, phrase, sentence or document section, and adjust the output accordingly.”

These are some of the big breakthroughs that changed the speech recognition industry during the past decade and, at the same time, split the market in two: the professional market and the consumer market. And indeed, I don’t see how off-the-shelf, basic voice recognition software could be of any help to healthcare users looking to automate the entire documentation workflow. In my opinion, that would be like trying to build a six-lane highway using backyard-digging and earth-moving equipment from the Home Depot…

To find out more on Intelligent Speech Interpretation, you can refer to the following white paper or article from the e-Health Insider.
