tl;dr

• The ultimate aim of the Zero Resource Speech Challenge (ZRC) series is to learn a full spoken dialogue system from raw audio only (no text, no labels!), in a replicable and cumulative way.
• It currently supports 4 challenging subtasks (T1: Acoustic Unit Discovery, T2: Spoken Term Discovery, T3: Discrete Resynthesis, and T4: Spoken Language Modeling) with associated metrics, benchmarks and leaderboards.

The ZRC series is now open on a rolling basis. Here you’ll find step-by-step guides on how to submit your own contributions and appear on the leaderboards. You can also download past results to run your own analyses and model comparisons.

What?

For hearing humans, speech is the primary form of communication. Most spoken languages have little or no associated textual resources, and in all cultures young children learn to speak before they learn to read (Dupoux, 2016; Bavin, 2009). Yet current language technology is overwhelmingly based on text. Even spoken dialogue systems rely on text internally, using ASR and TTS to convert speech to text and back to speech (Figure 1a). Could we get rid of text altogether and build language processing directly from raw audio?

Why?

• It is an interesting self-supervised machine learning problem.
• It would open up applications for thousands of languages that are mostly or entirely unwritten, making AI more inclusive.
• Even in “high-resource” languages, speech conveys aspects of language that text represents poorly (prosody, emotional and non-verbal vocalizations, oral expressions, etc.). Speech-based systems could therefore be more expressive than text-based ones.
• Self-supervised systems could provide predictive models of language development, in both typical and atypical settings (e.g., dyslexia).

How?

The ZRC series addresses two interlocking research problems: the task problem and the evaluation problem.

The task problem is to break down the overall objective into a series of well-defined sub-problems. The ZRC series follows the general architecture of Figure 1b: acoustic model, lexicon, language model, waveform generation, and so on. But instead of using phonemes, characters, or words as intermediate representations, the components develop their own latent representations. For instance, instead of outputting letters, the acoustic model outputs “acoustic units”, which may or may not be discrete.
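As a toy illustration (hypothetical code, not any ZRC baseline): the sketch below stands in for the acoustic-model stage, quantizing acoustic frames to discrete “acoustic units” by nearest-centroid lookup and collapsing repeats; a unit language model and waveform generator would consume these units downstream. The codebook is assumed given here, whereas real systems must learn it from raw audio.

```python
import numpy as np

def quantize(frames, codebook):
    """Map each acoustic frame to its nearest codebook entry (a toy
    stand-in for learned acoustic-unit discovery)."""
    # squared distances, shape (n_frames, n_units)
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def textless_pipeline(frames, codebook):
    """Sketch of the Figure 1b front end: audio frames -> discrete units.
    Repeated units are collapsed, as is common before unit language modelling."""
    units = [int(u) for u in quantize(frames, codebook)]
    return [units[0]] + [u for prev, u in zip(units, units[1:]) if u != prev]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))  # 8 hypothetical acoustic units
# six noisy frames drawn near units 2, 2, 5, 5, 5, 1
frames = codebook[[2, 2, 5, 5, 5, 1]] + 0.01 * rng.normal(size=(6, 4))
print(textless_pipeline(frames, codebook))  # -> [2, 5, 1]
```

The point of the sketch is only the interface: downstream components never see phonemes or letters, just whatever unit inventory the front end discovered.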

| Task | Ling. level | Metric | Model outputs examined | Example |
|---|---|---|---|---|
| T1. Unit Discovery | Phonetic | ABX: $d(a,x)<d(b,x)$? with $a,x\in A$ ($a\neq x$), $b\in B$ | triplets of frame embeddings | /bit$_{T1}$/, /bet$_{T1}$/, /bit$_{T2}$/ |
| T2. Spoken Term Discovery | Lexical (matching) | NED: $ED(a,b)/\max(\vert a\vert,\vert b\vert)$; COV: fraction of corpus covered | pairs of speech segments (segment: pair of time-stamps) | a $[rose]$ is a $[rose]$ ... |
| | Lexical (clustering) | Grouping F-score; Type F-score | clusters of speech segments | a $[rose]$ is a $[rose]$ is a $[rose]$ |
| | Lexical (segmentation) | Token F-score; Boundary F-score (as in text segmentation) | list of time-stamps | a\|rose\|is\|a\|rose\|isa\|ro\|se\| |
| T3. Unsup. Discrete Resynthesis (“TTS without T”) | Phonetic | Bitrate: $-\frac{n}{D(U)}\sum_{i} p(s_i)\log_2 p(s_i)$; MOS: human evaluation | series of discrete units; waveforms | $U=s_1,\dots,s_n$ |
| T4. Spoken LM | Lexical (frequency) | spot-the-word: $\hat{p}(a)>\hat{p}(b)$? | pairs of (pseudo-)probabilities | $\hat{p}$(brick) vs. $\hat{p}$($^*$blick) |
| | Lexical (semantics) | similarity: $d(a,b)\propto d_{h}(a,b)$? | pairs of word embeddings | $d_{h}$(abduct, kidnap): 8.63; $d_{h}$(abduct, rotate): 0.5 |
| | Syntax | acceptability judgment: $\hat{p}(a)>\hat{p}(b)$? | pairs of (pseudo-)probabilities | $\hat{p}$(dogs eat meat) vs. $\hat{p}$($^*$dogs eats meat) |
Table I. Summary of the metrics and tasks used in the Zero Resource Challenge Series.
• $d$ is a dissimilarity measure between embeddings ($d_h$ is from human judgments).
• $\hat{p}$ is a pseudo-probability computed by the LM over the entire input sequence.
• $ED$ is the edit distance over the phonetic transcriptions of the discovered segments.
• $D(U)$ is the duration of the utterance $U$, in seconds.
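To make the T1 ABX metric concrete, here is a minimal sketch on toy data with a plain Euclidean distance over pooled embeddings; the official evaluation compares frame sequences with an aligned frame-wise distance, but the decision rule per triplet is the same.

```python
import numpy as np

def abx_error(A, B, dist):
    """ABX error over all triplets: a, x drawn from category A (a != x),
    b from category B. A trial is correct when d(a, x) < d(b, x)."""
    correct = total = 0
    for i, a in enumerate(A):
        for j, x in enumerate(A):
            if i == j:
                continue  # x must be a different token than a
            for b in B:
                correct += dist(a, x) < dist(b, x)
                total += 1
    return 1 - correct / total

euclid = lambda u, v: float(np.linalg.norm(u - v))

# Toy, well-separated categories: tokens of /bit/ near (0, 0), /bet/ near (5, 5).
A = [np.array([0.0, 0.1]), np.array([0.1, 0.0]), np.array([0.0, 0.0])]
B = [np.array([5.0, 5.0]), np.array([5.1, 4.9])]
print(abx_error(A, B, euclid))  # -> 0.0, every triplet is discriminated
```

An error of 0.5 is chance level; good acoustic units give low ABX error even when $a$ and $x$ come from different talkers, as in the Table I example.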

The evaluation problem is to define metrics that enable model comparison and cumulative progress. The ZRC series uses zero-shot probe tasks inspired by human psycholinguistics: they require no model retraining, and they reflect the quality of the latent representations more directly than training a downstream classifier would.
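The spot-the-word probe is a good example of this zero-shot philosophy: the model only has to assign a higher pseudo-probability to a real word than to a matched non-word. The sketch below uses a toy character-bigram model as a stand-in for a spoken LM (real ZRC systems score sequences of discrete acoustic units, not letters), trained on a made-up six-word lexicon.

```python
import math
from collections import Counter

def train_bigram(words):
    """Toy character-bigram LM standing in for a spoken LM's
    pseudo-probability. Returns a scoring function."""
    pairs, left = Counter(), Counter()
    for w in words:
        w = f"#{w}#"  # '#' marks word boundaries
        for a, b in zip(w, w[1:]):
            pairs[a, b] += 1
            left[a] += 1

    def pseudo_logp(word, smooth=1e-3):
        # add-smooth over a nominal 27-symbol alphabet (a-z plus '#')
        w = f"#{word}#"
        return sum(
            math.log((pairs[a, b] + smooth) / (left[a] + 27 * smooth))
            for a, b in zip(w, w[1:])
        )

    return pseudo_logp

lexicon = ["brick", "bring", "broke", "trick", "stick", "click"]
logp = train_bigram(lexicon)
# Spot-the-word: the real word should get the higher pseudo-probability.
print(logp("brick") > logp("blick"))  # -> True
```

No retraining or classifier is involved: the probe simply reads out the scores the model already computes, which is what makes it applicable across very different systems.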

For each task, zero-shot metrics were developed that probe for the different levels of linguistic knowledge learned by the system. They only require the extraction of information readily available across systems (embeddings, pseudo-probabilities). The evaluation metrics that go with the tasks are in Table I.
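As an illustration, two of the Table I metrics can be sketched in a few lines. The inputs here are toy stand-ins: the official toolkits operate on time-stamped discovered segments and their phone transcriptions, and the bitrate is computed over a system's full unit inventory.

```python
import math
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance ED(a, b) between two phone transcriptions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def ned(a, b):
    """Normalized edit distance ED(a, b) / max(|a|, |b|), as in Table I."""
    return edit_distance(a, b) / max(len(a), len(b))

def bitrate(units, duration):
    """Bitrate of U = s_1..s_n over D(U) seconds: n / D(U) times the
    entropy (in bits) of the empirical unit distribution, as in Table I."""
    n = len(units)
    probs = [c / n for c in Counter(units).values()]
    return -n / duration * sum(p * math.log2(p) for p in probs)

print(ned("rose", "roze"))  # -> 0.25, one substitution over 4 phones
print(round(bitrate([3, 3, 7, 7, 7, 1], 2.0), 2))  # -> 4.38 bits/sec
```

For T3 the bitrate matters because a trivial system could resynthesize perfectly by copying the input; a low bitrate forces the discrete units to be genuinely compressive.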

Each of the tasks, with its metrics and benchmarks, is explained in its own section:

We also give a link to the archives of the past challenges (2015, 2017, 2019, 2020, 2021 editions).

Bibliography

$^*$The full bibliography can be found here.

Bavin (2009)
Bavin, E. (2009). The Cambridge handbook of child language. Cambridge University Press. Retrieved from http://site.ebrary.com/id/10303044
Dupoux (2016)
Dupoux, E. (2016). Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner. arXiv preprint arXiv:1607.08723.