Data
CEASR contains transcriptions of nine English and six German speech corpora, produced by seven different ASR systems.
Due to legal constraints, not all of the corpus references and system hypotheses contained in the CEASR corpus can be made publicly available. We are also not allowed to publish the names of the commercial systems, so these names are replaced with IDs. Each ID consists of a letter, which stands for a system name, and a digit, which stands for the particular system configuration.
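The ID scheme described above can be split into its two parts with a few lines of Python. This is a minimal sketch, not part of the CEASR distribution: it assumes an ID is exactly one ASCII letter followed by exactly one digit (for example "A1"), and the names `SystemId` and `parse_system_id` are hypothetical.

```python
import re
from typing import NamedTuple


class SystemId(NamedTuple):
    """Parts of an anonymized system ID (hypothetical helper type)."""
    system: str         # letter standing in for the system name
    configuration: int  # digit identifying the system configuration


# Assumption: one ASCII letter followed by one digit, e.g. "A1".
_ID_PATTERN = re.compile(r"([A-Za-z])([0-9])")


def parse_system_id(system_id: str) -> SystemId:
    """Split an ID such as "A1" into its system letter and configuration digit."""
    match = _ID_PATTERN.fullmatch(system_id.strip())
    if match is None:
        raise ValueError(f"not a letter-plus-digit system ID: {system_id!r}")
    letter, digit = match.groups()
    return SystemId(system=letter.upper(), configuration=int(digit))
```

Grouping hypotheses by `parse_system_id(...).system` would then collect all configurations of the same anonymized system, provided the IDs in the released files follow the assumed format.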
The publicly available part of the CEASR corpus is presented below.
English Data Set
The data set comprises:
- 25’094 utterances with metadata (e.g. speaker gender, speaker accent, recording device)
- 38’465 tokens
- 35.1 hours of speech (audio recordings are not included in the corpus) transcribed with 6 ASR systems (3 commercial and 3 open-source ones)
- over 640 speakers including over 130 non-native speakers
Audio recordings, manual transcripts and metadata have been derived from six public corpora:
- Spontaneous dialogue speech corpus AMI
- Semi-spontaneous monologue speech corpus TedLium
- Read-aloud corpora: ST, LibriSpeech Clean and Other, VoxForge, and CommonVoice
German Data Set
The data set comprises:
- 13’689 utterances with metadata (e.g. speaker gender, speaker accent, recording device)
- 28’592 tokens
- 21.6 hours of speech (audio recordings are not included in the corpus) transcribed with 3 commercial ASR systems
- over 723 speakers including over 194 non-native speakers
Audio recordings, manual transcripts and metadata have been derived from six public corpora:
- Spontaneous monologue speech corpus Hempel
- Spontaneous dialogue speech corpus Verbmobil II v.21
- Read-aloud corpora: VoxForge, CommonVoice, Tuda-De
- Strange Corpus 10 containing spontaneous monologue and dialogue speech as well as read-aloud utterances