Check out this website I found at aurix.com
The Aurix phonetic speech search engine
The Aurix phonetic speech search engine represents a significant advance over traditional speech-to-text engines because searches are based on the way a word sounds rather than on a specific spelling of the word. This means the speed and accuracy of searches are dramatically increased, without the need for the huge computing power required by speech-to-text solutions. As a result, it enables a much wider range of applications to unlock the business intelligence in both real-time and recorded audio material.
The Aurix phonetic speech search engine differs from speech-to-text or Large Vocabulary Continuous Speech Recognition (LVCSR) systems in several key respects:
Indexing: At the indexing stage, LVCSR systems search audio material for specific words against a dictionary containing hundreds of thousands of words. This makes indexing very slow and can lead to permanent errors, because a word that is misrecognized at this stage cannot be recovered at search time. In contrast, the Aurix phonetic audio search engine does not look for words but transforms audio recordings directly into possible phonemes, the basic sound units of human speech. A crucial benefit of the Aurix phonetic audio search engine is that all the phonetic intelligence in the audio signal is retained until search, unlike LVCSR mining, where much of the phonetic intelligence is discarded when the text-based transcription is generated. As a result, audio data is instantly searchable, supporting real-time monitoring of many thousands of calls or other audio material. The Aurix phonetic indexing method also allows a much higher volume of recordings to be processed more quickly and with less hardware power than LVCSR systems. In fact, audio is 'ingested', or indexed, at rates 80x faster than it is spoken, and the index files are compressed as they are generated. Finally, indexing is performed only once, and the resulting index can be searched for multiple terms as many times as required.
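To put the quoted 80x indexing rate in concrete terms, here is a minimal arithmetic sketch. It assumes a single indexing process running at a constant 80x real time with no I/O or scheduling overhead; the function name and the 1,000-hour archive are illustrative assumptions, not figures from Aurix.

```python
def indexing_hours(audio_hours: float, speedup: float = 80.0) -> float:
    """Estimate wall-clock hours needed to index `audio_hours` of recordings.

    Assumption: indexing runs at a constant `speedup` times real time
    (80x is the rate quoted in the text) on one process, with no overhead.
    """
    return audio_hours / speedup


# Hypothetical archive of 1,000 hours of recorded calls.
print(indexing_hours(1000))  # 12.5
```

Because the index is built only once, this cost is paid a single time per recording, however many search terms are run against it later.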
Search: Because phonetic search is not limited to the specific words in a dictionary, it can produce a better match for specific query terms, effectively casting a wider net than LVCSR systems and increasing accuracy. Searches are conducted using phoneme strings derived from the search words and phrases, and multiple results are returned, sorted by confidence. The search process uses Hidden Markov Models (HMMs) and dynamic programming algorithms to perform keyword searches on the audio stream.
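The general idea of searching a phoneme index with dynamic programming can be sketched as follows. This is a deliberately simplified illustration and not Aurix's algorithm: it uses plain approximate string matching (edit distance) over a single phoneme sequence instead of HMM scoring over multiple phoneme hypotheses. The tiny lexicon, the sample index, the `min_conf` threshold, and the confidence formula are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical toy pronunciation lexicon (ARPAbet-style symbols).
# A real system would use a full pronunciation model.
LEXICON = {
    "cancel": ["K", "AE", "N", "S", "AH", "L"],
    "account": ["AH", "K", "AW", "N", "T"],
}

# Hypothetical phoneme index for a short utterance ("I want to cancel my account").
AUDIO_INDEX = ["AY", "W", "AA", "N", "T", "T", "UW",
               "K", "AE", "N", "S", "AH", "L",
               "M", "AY", "AH", "K", "AW", "N", "T"]


def phrase_to_phonemes(phrase: str) -> List[str]:
    """Turn a query phrase into a phoneme string using the toy lexicon."""
    phones: List[str] = []
    for word in phrase.lower().split():
        phones.extend(LEXICON[word])
    return phones


def search(index: List[str], query: List[str],
           min_conf: float = 0.6) -> List[Tuple[int, float]]:
    """Return (end_position, confidence) hits, highest confidence first.

    Dynamic programming for approximate substring matching: prev[i] is the
    minimum number of edits needed to match query[:i] against some stretch
    of the index ending at the previous position.
    """
    m = len(query)
    prev = list(range(m + 1))  # before any audio: i edits to match query[:i]
    hits: List[Tuple[int, float]] = []
    for pos, phone in enumerate(index):
        cur = [0] * (m + 1)  # cur[0] == 0: a match may start at any position
        for i in range(1, m + 1):
            substitute = prev[i - 1] + (0 if query[i - 1] == phone else 1)
            cur[i] = min(substitute, prev[i] + 1, cur[i - 1] + 1)
        # Illustrative confidence: fraction of query phonemes matched without edits.
        conf = 1.0 - cur[m] / m
        if conf >= min_conf:
            hits.append((pos + 1, conf))
        prev = cur
    hits.sort(key=lambda hit: -hit[1])
    return hits


best = search(AUDIO_INDEX, phrase_to_phonemes("cancel"))[0]
print(best)  # (13, 1.0)
```

Positions next to an exact match also clear the threshold with lower confidence, so a practical system would merge overlapping hits before presenting results.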