Utterance Structure
An utterance is the key data structure of the CEASR corpus.
CEASR stores the utterances of the speech corpora together with their metadata and transcripts in a unified format.
The utterances contain the following attributes:
Utterance Attribute | Description |
identifier | Utterance identifier (unique within the corpus). |
speaker_id | Identifier of the speaker, if available. |
dialect | The ISO 639-1 language code combined with a country code, describing the speaker's dialect. |
accent | The accent of the speaker (native or non-native). |
gender | The gender of the speaker, normalised across all corpora to the values 'male' or 'female'. |
reference | The manual transcript provided as part of a speech corpus. Two variants are stored: the original transcript and the processed transcript. The reference also contains the start and end time of the utterance, if provided, as well as a flag indicating whether the utterance contains only non-lexical sounds. |
audio | Detailed information about the audio recording of the utterance: audio file name, duration, sampling rate, bit depth, number of channels, encoding and number of samples. |
recording | Type of the recording device and acoustic environment (where available). |
overlappings | Information on whether the utterance overlaps with any other utterance. |
speaker_noise_utterance | Flag indicating whether the utterance contains only speaker noise, such as laughter. |
speaking_rate | The number of words uttered per minute. |
hypothesis | The machine transcript generated by an ASR system and post-processed, together with the transcript confidence and the transcription language. The hypothesis also contains a list of words with per-word confidence scores and time stamps. |
extra | Additional utterance properties, such as the mother tongue of a non-native speaker, region, education or age (available for a limited number of utterances). |
Attribute values are provided only if they are available in the given corpus or returned by the given system.
The utterance data structure looks as follows:
{"identifier": String,
"speaker_id": String,
"dialect": String,
"accent": String,
"gender": String,
"reference": {
"text": String,
"original_text": String,
"only_non_lexical_sounds": Boolean,
"start_time": Float,
"end_time": Float},
"audio": {
"audio_file_path": String,
"duration": Float,
"samplerate": Integer,
"bitdepth": Integer,
"channels": Integer,
"encoding": String,
"num_samples": Integer},
"recording": {
"acoustic_environment": String,
"recording_device": String},
"overlappings": Integer,
"speaker_noise_utterance": Boolean,
"speaking_rate": Float,
"hypothesis": {
"text": String,
"confidence": Float,
"transcription_language": String,
"words": List},
"extra": {
"age": String,
"network": String,
"region": String,
"mother_tongue": String,
"date_of_birth": String,
"primary_school": String,
"education": String}
}
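Because each utterance follows this unified structure, it can be processed with standard JSON tooling. The following Python sketch is illustrative only: it assumes utterances are stored one JSON object per line in a file named ceasr_utterances.jsonl (both the file name and the one-object-per-line layout are assumptions, not part of the format described above) and shows how the reference, the hypothesis, and the speaking rate could be accessed.

    import json

    def read_utterances(path):
        # Yield utterance dictionaries from a file with one JSON object per line.
        # The one-object-per-line layout is an assumed storage format.
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield json.loads(line)

    for utt in read_utterances("ceasr_utterances.jsonl"):  # hypothetical file name
        # Skip utterances that contain only speaker noise or non-lexical sounds.
        if utt["speaker_noise_utterance"] or utt["reference"]["only_non_lexical_sounds"]:
            continue
        reference = utt["reference"]["text"]    # processed manual transcript
        hypothesis = utt["hypothesis"]["text"]  # post-processed ASR transcript
        # Recompute words per minute from the reference and the audio duration,
        # and compare it with the stored speaking_rate attribute.
        duration_min = utt["audio"]["duration"] / 60.0
        wpm = len(reference.split()) / duration_min if duration_min > 0 else 0.0
        print(utt["identifier"], utt["hypothesis"]["confidence"],
              round(wpm, 1), utt["speaking_rate"])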