What is CEASR?
CEASR is a Corpus for Evaluating the quality of Automatic Speech Recognition (ASR). It is a data set based on public speech corpora, containing speech utterances along with their transcripts generated by several modern state-of-the-art commercial and open-source ASR systems.
The speech corpora selected for CEASR are standard corpora often cited in the literature. They represent a variety of speaking styles (read-aloud vs. spontaneous, monologue vs. dialogue), speaker demographics (native vs. nonnative, different dialectal regions, age, gender and native language), recording environments and audio quality types (e.g. recording studio and telephone line), and thus allow for a nuanced evaluation of ASR performance.
The ASR systems we applied reflect the current market and development landscape and therefore include both commercial providers and open-source frameworks. The latest version of CEASR is a snapshot of state-of-the-art ASR technology from 2019.
To our knowledge, CEASR is the first corpus in which transcriptions from multiple ASR systems are collected and published. This allows researchers to explore the capabilities of ASR systems in various settings without the tedious and time-consuming effort of creating the transcriptions themselves. References and hypotheses are provided in a unified format, which facilitates the development of scripts and tools for their processing and analysis.
CEASR can be used for many applications, e.g.:
- a reference benchmark for ASR quality tracking
- detailed error analysis (e.g. error typology by acoustic setting, development of alternative error metrics)
- detailed ASR quality evaluation (e.g. in relation to speaker demographic profiles or spoken language properties)
- improving ASR, for example by developing ensemble learning methods based on the output of different systems
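
As an illustration of the benchmarking use case, the following is a minimal sketch of how word error rate (WER) could be computed for one reference against hypotheses from several systems. It does not read the CEASR file format: the reference, the system names and the transcripts are made-up placeholders, and the function name `word_error_rate` is our own. Text normalization (casing, punctuation, numerals) is left out, although it strongly affects WER in practice.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical data: system names and transcripts are invented for this sketch.
reference = "the cat sat on the mat"
hypotheses = {
    "system_a": "the cat sat on the mat",
    "system_b": "the cat sat on a mat",
    "system_c": "cat sat on the the mat",
}

wer_by_system = {
    name: word_error_rate(reference, hyp) for name, hyp in hypotheses.items()
}

# Print systems from lowest to highest WER.
for name, wer in sorted(wer_by_system.items(), key=lambda item: item[1]):
    print(f"{name}: WER = {wer:.3f}")
```

The same per-utterance scores could then be grouped by corpus, speaker profile or recording condition to support the more detailed analyses listed above.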