Audio Recordings Download

The CEASR data set provides transcripts of audio recordings but does not provide the recordings themselves.

To access the audio files, the corresponding speech corpora listed under What is CEASR need to be downloaded or purchased separately. All English corpora and three of the German corpora included in CEASR can be downloaded free of charge. The remaining three German corpora must be purchased. Information on corpus availability, together with download links or links to speech resource catalogues, is provided in the corpus object of each JSON file. An example for the CommonVoice corpus:

{"corpus": {
    "identifier": "commonvoice_de",
    "license": "https://www.mozilla.org/en-US/foundation/licensing/website-content/",
    "documentation_link": "https://voice.mozilla.org/de",
    "download_link": "https://voice.mozilla.org/de/datasets",
    "incomplete_utterances_removed_%": 0.0,
    "missing_corpus_data": {
        "missing_speaker_id": 5633,
        "missing_dialect": 4927,
        "missing_accent": 4874,
        "missing_gender": 4680},
    "original_reference_segmentation": "multiple_words",
    "original_audio_segmentation": "multiple_words",
    "free_of_charge": true}}
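
The availability check described above can be sketched in Python. This is a minimal sketch that assumes the corpus object sits at the top level of the file under the key "corpus", as in the example; the function names load_ceasr and corpus_availability are hypothetical helpers, not part of CEASR. Because corpora that must be purchased may carry a catalogue link under a different key, a missing download_link is returned as None rather than treated as an error.

```python
import json


def load_ceasr(json_path):
    """Parse one CEASR JSON data file into a Python dict."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)


def corpus_availability(data):
    """Return (identifier, free_of_charge, download_link) for a parsed file.

    Assumes the layout shown in the example: a top-level "corpus" object.
    download_link is None when the key is absent.
    """
    corpus = data["corpus"]
    return (
        corpus["identifier"],
        corpus["free_of_charge"],
        corpus.get("download_link"),
    )
```

For instance, corpus_availability(load_ceasr("kaldi_aspire__commonvoice.json")) would report whether the corpus can be downloaded at no cost and, if present, where.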

After downloading a speech corpus, you can retrieve the names of the audio files corresponding to each utterance from the JSON data file.

Open a JSON data file that contains transcripts of this corpus. For example, after downloading CommonVoice, open the CEASR file kaldi_aspire__commonvoice.json (or any other file ending with __commonvoice.json).

In kaldi_aspire__commonvoice.json, go to the utterances list in the dataset object. For each utterance in the list, the audio_file_path is provided in the utterance's audio object, e.g.:

"audio": {
    "audio_file_path": "/commonvoice/cv_corpus_v1/cv-valid-test/sample-000000.mp3",
    "duration": 3.192,
    "samplerate": 8000,
    "bitdepth": 16,
    "channels": 1,
    "encoding": "Signed Integer PCM",
    "num_samples": 25536},

This is the path to the original audio file as provided by the corpus.
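
Collecting these paths for a whole file can be sketched as follows. The sketch assumes that the dataset object is stored under the top-level key "dataset" and its utterance list under "utterances"; these key names are inferred from the description above and should be checked against the actual file. The function name audio_paths is a hypothetical helper.

```python
import json


def audio_paths(data):
    """Return the audio_file_path of every utterance in a parsed CEASR file.

    Assumed layout: data["dataset"]["utterances"] is a list of utterances,
    each holding an "audio" object as in the example above.
    """
    utterances = data["dataset"]["utterances"]
    return [utt["audio"]["audio_file_path"] for utt in utterances]


def audio_paths_from_file(json_path):
    """Convenience wrapper: parse the JSON file, then collect the paths."""
    with open(json_path, encoding="utf-8") as f:
        return audio_paths(json.load(f))
```

The returned paths are relative to the corpus as distributed, so the leading directories may need to be mapped onto wherever the corpus was unpacked locally.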

If the corpus contains unsegmented audio recordings, as is the case for the AMI corpus (whose audio files contain recordings of whole meetings), you need to trim the audio according to the start_time and end_time provided in the reference object of the utterance. The path to the original unsegmented audio file is also provided in the audio object as audio_file_path. For example:

"reference": {
    "text": "last",
    "original_text": "Last .",
    "only_non_lexical_sounds": false,
    "start_time": 142.832,
    "end_time": 144.896},
"audio": {
    "audio_file_path": "/ami/amicorpus/IS1000a/audio/IS1000a.Headset-2.wav",
    "duration": 2.064,
    "samplerate": 8000,
    "bitdepth": 16,
    "channels": 1,
    "encoding": "Signed Integer PCM",
    "num_samples": 16512}
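
The trimming step can be sketched with Python's standard wave module. This is a minimal sketch under stated assumptions: the source file is an uncompressed PCM WAV file (as the .wav extension of the AMI example suggests), start_time and end_time are in seconds, and trim_wav is a hypothetical helper name. The sample rate is read from the WAV header rather than from the samplerate field, because the audio object in the example describes the 2.064-second segment and may not match the original recording.

```python
import wave


def trim_wav(src_path, dst_path, start_time, end_time):
    """Write the span from start_time to end_time (seconds) to a new WAV file.

    Works only for PCM WAV input. Returns the number of frames written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()  # rate of the original file, from its header
        total = src.getnframes()
        # Convert seconds to frame indices and clamp them to the file length.
        start_frame = min(max(int(round(start_time * rate)), 0), total)
        end_frame = min(max(int(round(end_time * rate)), start_frame), total)
        src.setpos(start_frame)
        frames = src.readframes(end_frame - start_frame)

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)    # same channels, sample width, and rate
        dst.writeframes(frames)  # header frame count is corrected on close
    return end_frame - start_frame
```

For the utterance above, a call such as trim_wav("IS1000a.Headset-2.wav", "utterance.wav", 142.832, 144.896) would extract the segment, with the local file name adjusted to where the corpus was stored.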