Developed to avoid situations like these, a ConText is the “fuel” that feeds a speech recognition engine. Specific to one language AND one field of expertise (e.g. radiology in the UK), ConTexts were invented by Philips to tailor a speech recognition system to a given professional environment. Let’s look at what exactly is inside a speech recognition ConText and at the methodology used to develop one.
Ingredient #1: ConText Lexicon. This is a list containing words and phrases specific to a particular usage and language, e.g. radiology in the UK. To create a valuable ConText Lexicon, approximately 100 million words from the language area are needed. In the case of a Radiology ConText, those words are best obtained from existing radiological reports in order to capture the generally applied terminology. Philips or their integration partner would typically gather the required data in the form of fully anonymized reports from different healthcare organizations in a given country.
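The idea of distilling a lexicon out of a large report corpus can be illustrated with a minimal sketch. The tokenizer and the frequency threshold below are illustrative assumptions, not Philips’ actual method:

```python
import re
from collections import Counter

def build_lexicon(reports, min_count=5):
    """Count word frequencies across a corpus of reports and keep
    words seen at least min_count times (an assumed cutoff)."""
    counts = Counter()
    for text in reports:
        # Crude tokenizer: lowercase, keep alphabetic runs.
        counts.update(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))
    return {word for word, c in counts.items() if c >= min_count}

# Toy corpus standing in for millions of words of real reports.
reports = ["The chest X-ray shows no acute disease."] * 5
lexicon = build_lexicon(reports, min_count=5)
```

In practice the cutoff would be tuned so that rare but domain-critical terms survive while one-off typos are filtered out.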
Ingredient #2: Background Lexicon. This is a dictionary containing between 300,000 and 800,000 words (depending on the language), whose usage is not considered frequent enough for inclusion in a specific ConText Lexicon. This Background Lexicon is used for reference when unknown words are added to the ConText during ConText Adaptation (a process that updates an author’s language model and ConText Lexicon based on his or her corrections of a “recognized report”, in order to improve the recognition rate).
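The lookup role of the Background Lexicon during adaptation can be sketched as follows. The function name and the promotion rule are assumptions for illustration; the real adaptation process also updates the statistical language model:

```python
def adapt_lexicon(corrected_words, context_lexicon, background_lexicon):
    """Words from a corrected report that are missing from the ConText
    Lexicon are looked up in the Background Lexicon and, if found there,
    promoted into the ConText Lexicon (assumed simplification)."""
    promoted = set()
    for word in corrected_words:
        if word not in context_lexicon and word in background_lexicon:
            context_lexicon.add(word)
            promoted.add(word)
    return promoted

context_lexicon = {"lung", "effusion"}
background_lexicon = {"pneumothorax", "lung", "kayak"}
promoted = adapt_lexicon(["pneumothorax", "lung"], context_lexicon, background_lexicon)
```

Words found in neither lexicon would need separate handling (e.g. a manual spelling confirmation), which this sketch omits.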
Ingredient #3: Default Language Model. This is the ConText Lexicon plus a statistical model that represents word usage and sequences of words. It represents the way a group of people uses a language in a specific context, a professional one for instance. The language model is specific to an author and a ConText.
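A common way to model word sequences statistically is with n-gram probabilities. The bigram model below is a generic textbook sketch of this kind of statistic, not Philips’ actual implementation:

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Estimate P(next word | previous word) from word-pair counts —
    the kind of sequence statistics a language model layers on top
    of the lexicon."""
    pair_counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            pair_counts[prev][nxt] += 1
    # Normalize counts into conditional probabilities.
    model = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        model[prev] = {w: c / total for w, c in counter.items()}
    return model

model = train_bigram_model(["no acute disease", "no acute findings"])
# model["acute"] -> {"disease": 0.5, "findings": 0.5}
```

During recognition, such probabilities let the engine prefer word sequences that are typical for the domain, e.g. favoring “acute disease” over an acoustically similar but improbable phrase.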
Ingredient #4: Acoustic Reference. This is a collection of statistical data describing the vocal characteristics of an individual user. The production of a phoneme varies from one person to another (variables include accent, age, pronunciation, etc.), and a language is not spoken in 2007 the way it was in the 1950s. The Acoustic Reference therefore “takes a picture” of how a language is spoken at a given point in time. To develop an Acoustic Reference, for Swedish say, several hundred hours of spoken Swedish, covering all regions of the country, are recorded and analyzed, resulting in an average model. Based on this average model, the speech recognition engine is then able to interpret an author’s speech input and optimize the recognition rate regardless of dialect, age, etc. This specific data, unique to each author, is stored in an ARF (Acoustic Reference File).
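The averaging step can be pictured as pooling acoustic measurements per phoneme across many speakers and keeping summary statistics. This is a toy stand-in for real acoustic modeling, which works on high-dimensional feature vectors rather than single numbers:

```python
import statistics

def average_acoustic_model(speaker_samples):
    """Pool feature values per phoneme across many speakers and keep
    the mean and standard deviation — a toy stand-in for the
    statistical averaging behind an Acoustic Reference."""
    pooled = {}
    for samples in speaker_samples:  # one dict per speaker
        for phoneme, values in samples.items():
            pooled.setdefault(phoneme, []).extend(values)
    return {p: (statistics.mean(v), statistics.pstdev(v))
            for p, v in pooled.items()}

# Hypothetical per-speaker measurements of one feature for phoneme "a".
speakers = [{"a": [1.0, 1.2]}, {"a": [0.8, 1.0]}]
model = average_acoustic_model(speakers)
# model["a"] has mean 1.0 across all four samples
```

The spread (standard deviation) is what lets the engine tolerate dialectal and individual variation around the average pronunciation.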
The list of speech recognition Contexts developed by Philips to date can be found here.