Utterance Structure
An utterance is the key data structure of the CEASR corpus.
CEASR stores the utterances of the speech corpora together with their metadata and transcripts in a unified format.
The utterances contain the following attributes:
Utterance Attribute | Description |
identifier | Utterance identifier (unique within the corpus). |
speaker_id | Identifier of the speaker, if available. |
dialect | The ISO 639-1 language code combined with a country code, describing the speaker's dialect. |
accent | The accent of the speaker (native or non-native). |
gender | The gender of the speaker, normalised across all corpora to the values 'male' or 'female'. |
reference | The manual transcript provided as part of a speech corpus. Two variants are stored: the original transcript and the processed transcript. The reference also contains the start and end time of the utterance, if provided, as well as a flag indicating whether the utterance contains only non-lexical sounds. |
audio | Detailed information about the audio recording of the utterance: audio file name, duration, sampling rate, bit depth, number of channels, encoding and number of samples. |
recording | Type of the recording device and acoustic environment (where available). |
overlappings | Information on whether the utterance overlaps with any other utterance. |
speaker_noise_utterance | Flag indicating whether the utterance contains only speaker noise, such as laughter. |
speaking_rate | The number of words uttered per minute. |
hypothesis | The machine transcript generated by an ASR system and post-processed, together with the transcript confidence and the transcription language. The hypothesis also contains a list of words with per-word confidence scores and time stamps. |
extra | Additional utterance properties, such as the mother tongue of a non-native speaker, region, education or age (available for a limited number of utterances). |
Attribute values are provided only if they are available in the given corpus or returned by the given system.
The utterance data structure looks as follows:
{"identifier": String,
"speaker_id": String,
"dialect": String,
"accent": String,
"gender": String,
"reference": {
"text": String,
"original_text": String,
"only_non_lexical_sounds": Boolean,
"start_time": Float,
"end_time": Float},
"audio": {
"audio_file_path": String,
"duration": Float,
"samplerate": Integer,
"bitdepth": Integer,
"channels": Integer,
"encoding": String,
"num_samples": Integer},
"recording": {
"acoustic_environment": String,
"recording_device": String},
"overlappings": Integer,
"speaker_noise_utterance": Boolean,
"speaking_rate": Float,
"hypothesis": {
"text": String,
"confidence": Float,
"transcription_language": String,
"words": List},
"extra": {
"age": String,
"network": String,
"region": String,
"mother_tongue": String,
"date_of_birth": String,
"primary_school": String,
"education": String}
}
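Because each utterance follows this unified structure, it can be processed with standard JSON tooling. The following Python sketch is illustrative only: it assumes utterances are stored one JSON object per line in a file named ceasr_utterances.jsonl (both the file name and the one-object-per-line layout are assumptions, not part of the format described above) and shows how the reference, the hypothesis, and the speaking rate could be accessed.

    import json

    def read_utterances(path):
        # Yield utterance dictionaries from a file with one JSON object per line.
        # The one-object-per-line layout is an assumed storage format.
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield json.loads(line)

    for utt in read_utterances("ceasr_utterances.jsonl"):  # hypothetical file name
        # Skip utterances that contain only speaker noise or non-lexical sounds.
        if utt["speaker_noise_utterance"] or utt["reference"]["only_non_lexical_sounds"]:
            continue
        reference = utt["reference"]["text"]    # processed manual transcript
        hypothesis = utt["hypothesis"]["text"]  # post-processed ASR transcript
        # Recompute words per minute from the reference and the audio duration,
        # and compare it with the stored speaking_rate attribute.
        duration_min = utt["audio"]["duration"] / 60.0
        wpm = len(reference.split()) / duration_min if duration_min > 0 else 0.0
        print(utt["identifier"], utt["hypothesis"]["confidence"],
              round(wpm, 1), utt["speaking_rate"])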