Data

CEASR contains transcriptions of nine English and six German speech corpora. The transcriptions were generated by seven different ASR systems.

Due to legal constraints, not all of the corpus references and system hypotheses contained in the CEASR corpus can be made publicly available. We are also not allowed to publish the names of the commercial systems, so these names are replaced with IDs. Each ID consists of a letter, which stands for a system name, and a digit, which stands for the particular system configuration.
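As an illustration of the ID scheme described above, the following is a minimal Python sketch that splits a system ID into its letter and digit parts. The function name parse_system_id, the SystemId type, and the sample ID "A1" are hypothetical and not part of the corpus. The sketch also assumes that an ID is exactly one ASCII letter followed by exactly one digit, which may not hold for every ID in the released files.

```python
import re
from typing import NamedTuple


class SystemId(NamedTuple):
    """Hypothetical container for the two parts of an anonymized system ID."""
    system: str         # letter standing for the (undisclosed) system name
    configuration: int  # digit standing for the particular system configuration


# Assumption: one ASCII letter followed by one ASCII digit, e.g. "A1".
_ID_PATTERN = re.compile(r"([A-Za-z])([0-9])")


def parse_system_id(raw: str) -> SystemId:
    """Split an ID such as "A1" into its system letter and configuration digit."""
    match = _ID_PATTERN.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"not a letter+digit system ID: {raw!r}")
    return SystemId(system=match.group(1), configuration=int(match.group(2)))


if __name__ == "__main__":
    # "A1" and "A2" are made-up IDs: the same system in two configurations.
    for raw_id in ("A1", "A2"):
        parsed = parse_system_id(raw_id)
        print(raw_id, "->", parsed.system, parsed.configuration)
```

Grouping hypotheses by the letter alone would then compare systems regardless of configuration, while the full ID distinguishes individual configurations.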

The publicly available part of the CEASR corpus is presented below.

English Data Set

The data set comprises:

25’094 utterances with metadata (e.g. speaker gender, speaker accent and recording device)
38’465 tokens
35.1 hours of speech (audio recordings are not included in the corpus) transcribed with 6 ASR systems (3 commercial and 3 open-source ones)
over 640 speakers including over 130 non-native speakers

Audio recordings, manual transcripts and metadata have been derived from six public corpora:

Spontaneous dialogue speech corpus AMI
Semi-spontaneous monologue speech corpus TedLium
Read-aloud corpora: ST, LibriSpeech Clean and Other, VoxForge, and CommonVoice

German Data Set

The data set comprises:

13’689 utterances with metadata (e.g. speaker gender, speaker accent and recording device)
28’592 tokens
21.6 hours of speech (audio recordings are not included in the corpus) transcribed with 3 commercial ASR systems
over 723 speakers including over 194 non-native speakers

Audio recordings, manual transcripts and metadata have been derived from six public corpora:

Spontaneous monologue speech corpus Hempel
Spontaneous dialogue speech corpus Verbmobil II v.21
Read-aloud corpora: VoxForge, CommonVoice, Tuda-De
Strange Corpus 10 containing spontaneous monologue and dialogue speech as well as read-aloud utterances