Data

CEASR contains transcriptions of nine English and six German speech corpora generated by seven different ASR systems.

Due to legal constraints, not all of the corpus references and system hypotheses contained in the CEASR corpus can be made publicly available. We are also not allowed to publish the names of the commercial systems, so these names are replaced with IDs. An ID consists of a letter, which stands for the system name, and a digit, which stands for the particular system configuration.
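
The ID scheme described above can be illustrated with a minimal Python sketch. It assumes that an ID is exactly one ASCII letter followed by one digit (for example `A1`); the example IDs, the `SystemId` type and the `parse_system_id` helper are hypothetical and are not part of the corpus distribution.

```python
import re
from typing import NamedTuple


class SystemId(NamedTuple):
    """Hypothetical container for an anonymized system ID."""
    system: str         # letter standing for the (undisclosed) system name
    configuration: int  # digit standing for the system configuration


# Assumption: exactly one ASCII letter followed by exactly one digit.
_ID_PATTERN = re.compile(r"([A-Za-z])([0-9])")


def parse_system_id(raw: str) -> SystemId:
    """Split an ID such as 'A1' into its system letter and configuration digit."""
    match = _ID_PATTERN.fullmatch(raw.strip())
    if match is None:
        raise ValueError(f"not a letter-plus-digit system ID: {raw!r}")
    return SystemId(system=match.group(1), configuration=int(match.group(2)))


if __name__ == "__main__":
    # 'A1' and 'A2' would be two configurations of the same anonymized system.
    for raw_id in ("A1", "A2", "B1"):
        print(raw_id, parse_system_id(raw_id))
```

Splitting the ID this way makes it straightforward to group hypotheses by system (the letter) while still distinguishing between configurations (the digit).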

The publicly available part of the CEASR corpus is presented below.

English Data Set

The data set comprises:

  • 25’094 utterances with metadata (e.g. speaker gender, speaker accent and recording device)
  • 38’465 tokens
  • 35.1 hours of speech (audio recordings are not included in the corpus) transcribed with 6 ASR systems (3 commercial and 3 open-source ones)
  • over 640 speakers including over 130 non-native speakers

Audio recordings, manual transcripts and metadata have been derived from six public corpora:

German Data Set

The data set comprises:

  • 13’689 utterances with metadata (e.g. speaker gender, speaker accent and recording device)
  • 28’592 tokens
  • 21.6 hours of speech (audio recordings are not included in the corpus) transcribed with 3 commercial ASR systems
  • over 723 speakers including over 194 non-native speakers

Audio recordings, manual transcripts and metadata have been derived from six public corpora: