Utterance Structure

An utterance is the key data structure of the CEASR corpus.

CEASR stores the utterances of the underlying speech corpora, together with their metadata and transcripts, in a unified format.

Each utterance contains the following attributes:

Attribute                  Description
identifier                 Utterance identifier (unique within its corpus).
speaker_id                 Identifier of the speaker, if available.
dialect                    ISO 639-1 language code with a country code describing the speaker's dialect.
accent                     Accent of the speaker (native or non-native).
gender                     Gender of the speaker, normalised across all corpora to 'male' or 'female'.
reference                  Manual transcript provided as part of the speech corpus, stored in two variants: the original transcript and a processed transcript. Also contains the start and end time of the utterance, if provided, and a flag indicating whether the utterance consists only of non-lexical sounds.
audio                      Detailed information about the audio recording: file name, duration, sampling rate, bit depth, number of channels, encoding and number of samples.
recording                  Type of recording device and acoustic environment (where available).
overlappings               Information on whether the utterance overlaps with any other utterance.
speaker_noise_utterance    Whether the utterance contains only speaker noise, such as laughter.
speaking_rate              Number of words uttered per minute.
hypothesis                 Machine transcript generated by an ASR system and post-processed, together with the transcript confidence and the transcription language. Also contains a list of words with per-word confidence scores and time stamps.
extra                      Additional utterance properties, such as a non-native speaker's mother tongue, region, education or age (available for a limited number of utterances).

Attribute values are provided only if they are available in the given corpus or returned by the given ASR system.
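Code consuming the corpus should therefore treat these fields as optional. A minimal sketch in Python, assuming a single utterance is stored as a JSON object (the file name is hypothetical):

    import json

    # Load a single utterance record (the file name here is hypothetical).
    with open("utterance.json", encoding="utf-8") as f:
        utterance = json.load(f)

    # Optional attributes may be absent, so read them defensively.
    gender = utterance.get("gender", "unknown")
    dialect = utterance.get("dialect", "unknown")
    print(f"{utterance['identifier']}: gender={gender}, dialect={dialect}")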

The utterance data structure looks as follows:

{"identifier": String,
 "speaker_id": String,
 "dialect": String,
 "accent": String,
 "gender": String,
 "reference": {
     "text": String,
     "original_text": String,
     "only_non_lexical_sounds": Boolean,
     "start_time": Float,
     "end_time": Float},
"audio": {
    "audio_file_path": String,
    "duration": Float,
    "samplerate": Integer,
    "bitdepth": Integer,
    "channels": Integer,
    "encoding": String,
    "num_samples": Integer},
 "recording": {
     "acoustic_environment": String,
     "recording_device": String},
 "overlappings": Integer,
 "speaker_noise_utterance": Boolean,
 "speaking_rate": Float,
 "hypothesis": {
     "text": String,
     "confidence": Float,
     "transcription_language": String,
     "words": List},
 "extra": {
     "age": String,
     "network": String,
     "region": String,
     "mother_tongue": String,
     "date_of_birth": String,
     "primary_school": String,
     "education": String}
 }
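As an illustration of how the nested fields relate, the sketch below recomputes a speaking rate in words per minute from the reference transcript and its time stamps. Whether CEASR derives speaking_rate exactly this way is an assumption, not something the schema guarantees:

    def words_per_minute(utterance: dict) -> float | None:
        """Recompute the speaking rate from the reference transcript and timings.

        Returns None when the required fields are missing for this utterance.
        """
        ref = utterance.get("reference") or {}
        text = ref.get("text")
        start, end = ref.get("start_time"), ref.get("end_time")
        if not text or start is None or end is None or end <= start:
            return None
        return len(text.split()) / ((end - start) / 60.0)

The helper returns None rather than raising, since start_time and end_time are only present when the source corpus provides them.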