Benchmarks & Datasets


Three benchmarks have been defined so far: ABX-15, ABX-17 and ABX-LS. These benchmarks are chiefly intended for evaluation (test sets), but we also provide associated training sets. The benchmarks can also be used with other training sets, provided that no label or supervision other than speaker ID is used to train the systems.


ABX-15 contains conversational English, a fragment of the Buckeye dataset (Buckeye corpus of conversational speech, 2nd release), and Xitsonga, a fragment of the NCHLT dataset (read speech). In both cases, the test set was also used as the training set in the original 2015 challenge. This is unusual, but acceptable here because the systems are trained without supervision.

ABX-17 was aimed at testing the robustness of algorithms across languages and speakers. There were 3 dev languages (English, French and Mandarin) and 2 held-out test languages (German and Wolof). In addition, the training and test sets were separated. The training set was deliberately set up with a power-law imbalance in speakers. The test set was split into small files of varying durations (1 s, 10 s and 120 s) in order to evaluate algorithms that perform speaker normalization on the fly at test time; the same data is distributed in all three durations to allow for comparison.

ABX-LS is based on the popular Librispeech dataset and is intended to test scaled-up versions of unit discovery. It provides a welcome split into a dev and a test set, each further split into ‘clean’ and ‘other’ to test for robustness to noisy data. As training sets, participants can use the different sections of Librispeech (100, 360, 960) or Librilight (60k, 6k, etc.).

Table. Characteristics of the different ZRC ABX Benchmarks.

Benchmark | Language   | Dataset     | Type          | Train set (duration, speakers) | Test set (duration, speakers)
----------|------------|-------------|---------------|--------------------------------|------------------------------
ABX-15    | English    | Buckeye     | conversations | same as test set               | 5h, 12 spk
ABX-15    | Xitsonga   | NCHLT       | Timit-like    | same as test set               | 2h30, 24 spk
ABX-17    | English    | Librivox    | audiobook     | 45h, 69 spk                    | 27h, 9 spk
ABX-17    | French     | Librivox    | audiobook     | 24h, 28 spk                    | 17h, 10 spk
ABX-17    | Mandarin   | THCHS-30    | read speech   | 2h30, 12 spk                   | 25h, 4 spk
ABX-17    | German (L1)| Librivox    | audiobook     | 25h, 30 spk                    | 11h, 10 spk
ABX-17    | Wolof (L2) |             | Timit-like    | 10h, 14 spk                    | 5.9h, 4 spk
ABX-LS    | English    | Librispeech | audiobook     | Librispeech, Libri-light, etc. | dev/test x clean/other: 5h each, 40 spk (clean), 33 spk (other)

Dataset References

  • Buckeye: Buckeye corpus of conversational speech (2nd release). Columbus, OH: Department of Psychology, Ohio State University (Distributor).
  • NCHLT: The NCHLT speech corpus of the South African languages. 4th International Workshop on Spoken Language Technologies for Under-Resourced Languages, St Petersburg, Russia.
  • Librivox
  • THCHS-30: THCHS-30: A free Chinese speech corpus. arXiv preprint arXiv:1512.01882.
  • Wolof: Collecting Resources in Sub-Saharan African Languages for Automatic Speech Recognition: a Case Study of Wolof.
  • Librispeech: Librispeech: An ASR corpus based on public domain audio books. IEEE.


The datasets used for ABX evaluations can be downloaded from our repository using the zrc toolkit.


The abx15 benchmark uses the zr-2015 dataset, which is based on the Buckeye corpus and the NCHLT Xitsonga corpus. We cannot bundle these datasets because of restrictive licensing, so you will have to download them from their respective websites and then import them with our toolkit using the following commands:

> zrc datasets:import zr2015-buckeye /path/to/buckeye-corpus
> zrc datasets:import zr2015-nchlt /path/to/nchlt_tso
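
Because the two corpora above have to be downloaded by hand, it can help to check that each local directory exists before passing it to the toolkit. The following is a minimal sketch under stated assumptions: the helper name import_corpus and the DRY_RUN switch are hypothetical and not part of the zrc toolkit; the only toolkit command used is zrc datasets:import, exactly as shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the zrc toolkit): import a corpus only if
# its local directory exists. With DRY_RUN=1 the command is printed, not run.
import_corpus() {
    name="$1"   # dataset name expected by the toolkit, e.g. zr2015-buckeye
    path="$2"   # local directory holding the manually downloaded corpus
    if [ ! -d "$path" ]; then
        echo "missing corpus directory: $path" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "zrc datasets:import $name $path"
    else
        zrc datasets:import "$name" "$path"
    fi
}
```

After sourcing the function, a call such as import_corpus zr2015-buckeye /path/to/buckeye-corpus would run the import only when the directory is present; setting DRY_RUN=1 beforehand prints the command instead of running it.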

The abx17 benchmark uses the zr-2017 dataset, which you can download using the following command:

> zrc datasets:pull zrc2017-test-dataset

The abxLS benchmark uses the zr2021-phonetic dataset, which can be downloaded from our repository using the following command:

> zrc datasets:pull abxLS-dataset
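
The two downloadable datasets can also be fetched in one go. The sketch below is illustrative only: the function name pull_datasets and the DRY_RUN switch are hypothetical, and the only toolkit command assumed is zrc datasets:pull with the dataset names given above.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of the zrc toolkit): pull every dataset named
# in the arguments, stopping at the first failure.
# With DRY_RUN=1 the commands are printed instead of being run.
pull_datasets() {
    for name in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "zrc datasets:pull $name"
        else
            zrc datasets:pull "$name" || return 1
        fi
    done
}
```

For example, pull_datasets zrc2017-test-dataset abxLS-dataset would pull the ABX-17 and ABX-LS data one after the other.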