Acoustic Unit Discovery / Speech Representation Learning

Task and Goals

The goal of acoustic unit discovery is to learn representations (embeddings) of speech sounds that retain linguistically relevant information and discard linguistically irrelevant acoustic information such as speaker voice type or recording conditions (additive noise, reverberation, etc.). In text-based systems, such representations are phonemes (as defined by a pronunciation dictionary) or characters. Here, the representations are latent and may take any form (dense vectors for each frame, probabilistic codes, discrete codes, etc.), as long as they can be aligned with the original signal (for instance, a vector of values every 10 ms).

To evaluate these representations, we take the view that while they may not correspond one-to-one to linguistically interpretable units (phonemes, phonetic features, syllables, etc.), or may not even be discrete, they should at least support the same key function: linguistic contrast. Phonemes are defined as the smallest elements of speech that make a difference in meaning between words (e.g., /bit/ versus /bet/). Here, we require representations to distinguish pairs of triphones that differ minimally in their center phone, ignoring variations in speaker or recording condition, and irrespective of whether the triphones are actual words or not. Large numbers of triphone pairs are obtained automatically by mining a speech dataset, and discriminability is computed by running an ABX discrimination test (Schatz, 2016).

Metrics: minimal triphone ABX

The minimal-pair ABX task (Schatz et al., 2013; Schatz et al., 2014) does not require any training; it only requires defining a dissimilarity $d$ between the representations of speech tokens. It is inspired by match-to-sample tasks used in human psychophysics and is a simple way to measure the discriminability between two sound categories: given sounds $a$ and $b$ belonging to different categories $\mathbf{A}$ and $\mathbf{B}$, respectively, the task is to decide whether a sound $x$ belongs to one category or the other.

Specifically, we define the ABX discriminability of category $\mathbf{A}$ from category $\mathbf{B}$ as the probability that $a$ and $x$ are closer together than $b$ and $x$, according to some dissimilarity $d$ over the (model-dependent) representations of these sounds, when $a$ and $x$ are drawn from category $\mathbf{A}$ and $b$ from category $\mathbf{B}$. Given a set of sounds $S(\mathbf{A})$ from category $\mathbf{A}$ and a set of sounds $S(\mathbf{B})$ from category $\mathbf{B}$, we estimate this probability using the following formula:

$$ \hat{\theta}(\mathbf{A}, \mathbf{B}) := \frac{1}{m(m-1)n} \sum_{a\in S(\mathbf{A})} \sum_{b\in S(\mathbf{B})} \sum_{x\in S(\mathbf{A}) \setminus \{a\}} \left( \mathbb{1}_{d(a,x)<d(b,x)} + \frac{1}{2}\,\mathbb{1}_{d(a,x)=d(b,x)} \right) $$

where $m$ and $n$ are the numbers of sounds in $S(\mathbf{A})$ and $S(\mathbf{B})$, respectively, and $\mathbb{1}$ denotes the indicator function. Note that $\hat{\theta}(\mathbf{A}, \mathbf{B})$ is asymmetric in the two categories. We obtain a symmetric measure by averaging the ABX discriminability of $\mathbf{A}$ from $\mathbf{B}$ and that of $\mathbf{B}$ from $\mathbf{A}$.
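To make the estimator concrete, here is a minimal, unoptimized Python sketch of the computation. The `dist` argument stands for any token-level dissimilarity (such as the DTW divergences described below); the function names are illustrative and not part of the challenge toolkit.

```python
def abx_score(S_A, S_B, dist):
    # Estimate theta_hat(A, B): the probability that a token x of A is closer
    # to another token a of A than to a token b of B (ties count for 1/2).
    m, n = len(S_A), len(S_B)
    total = 0.0
    for i, a in enumerate(S_A):
        for j, x in enumerate(S_A):
            if i == j:                      # x ranges over S(A) \ {a}
                continue
            d_ax = dist(a, x)
            for b in S_B:
                d_bx = dist(b, x)
                if d_ax < d_bx:
                    total += 1.0
                elif d_ax == d_bx:
                    total += 0.5
    return total / (m * (m - 1) * n)


def symmetric_abx(S_A, S_B, dist):
    # Symmetric score: average of theta_hat(A, B) and theta_hat(B, A).
    return 0.5 * (abx_score(S_A, S_B, dist) + abx_score(S_B, S_A, dist))
```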

We do not require $d$ to be a metric in the mathematical sense. The default distances provided in this challenge are based on DTW divergences, with the underlying frame-to-frame distance being either the angular distance (arccos of the normalized dot product) or the KL divergence. For most systems (signal processing, embeddings) the angular distance usually gives good results, while for others (posteriorgrams) the KL divergence is more appropriate. Contestants can experiment with their own distance functions if they wish, as long as they are not obtained through supervised training. A sketch of the angular-distance variant is given below.
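The following is an illustrative re-implementation of such a DTW divergence with the angular frame distance, under stated assumptions (in particular, the normalization of the accumulated DTW cost by the summed token lengths); it is not the challenge's reference scoring code.

```python
import numpy as np


def frame_angular_distance(u, v):
    # Angular distance between two frames: arccos of the normalized dot product.
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def dtw_divergence(tok_a, tok_b, frame_dist=frame_angular_distance):
    # Accumulated DTW cost between two variable-length tokens (arrays of frames),
    # normalized here by the sum of the token lengths (one common convention).
    n, m = len(tok_a), len(tok_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(tok_a[i - 1], tok_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i - 1, j - 1], cost[i, j - 1])
    return cost[n, m] / (n + m)
```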

In all iterations of the ZRC series so far, categories $\mathbf{A}$ and $\mathbf{B}$ are sequences of 3 phonemes that differ in the central sound (not necessarily real words, e.g., “beg”–“bag”, “api”–“ati”, etc.). The compound measure aggregates over all minimal pairs of this type found in the corpus, in a structured manner that depends on the task.

Within-speaker ABX

For the within-speaker version of the task, all of the phone triplets belong to the same speaker (e.g., $A = beg_{T_1}$, $B = bag_{T_1}$, $X = bag'_{T_1}$). The scores for a given minimal pair are first averaged across all the speakers for which this minimal pair exists. The resulting scores are then averaged over all found contexts for a given pair of central phones (e.g., for the pair /a/–/e/, average the scores for the existing contexts such as /b_g/, /r_d/, /f_s/, etc.). Finally, the scores for every pair of central phones are averaged and subtracted from 1 to yield the reported within-speaker ABX error rate, as sketched below.
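As an illustration of this aggregation (hypothetical data layout, not the official scoring tool), assume symmetric ABX scores have already been computed for every (central-phone pair, context, speaker) cell:

```python
from collections import defaultdict
from statistics import mean


def within_speaker_error_rate(cell_scores):
    # cell_scores: {(phone_pair, context, speaker): symmetric ABX score}
    # 1. Average over all speakers for which a given minimal pair exists.
    by_pair_context = defaultdict(list)
    for (pair, context, speaker), score in cell_scores.items():
        by_pair_context[(pair, context)].append(score)
    pair_context_avg = {k: mean(v) for k, v in by_pair_context.items()}

    # 2. Average over all found contexts for each pair of central phones.
    by_pair = defaultdict(list)
    for (pair, _context), score in pair_context_avg.items():
        by_pair[pair].append(score)

    # 3. Average over central-phone pairs and convert to an error rate.
    return 1.0 - mean(mean(scores) for scores in by_pair.values())
```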

Across-speaker ABX

For the across-speaker task, $A$ and $B$ belong to the same speaker, and $X$ to a different one (e.g., $A = beg_{T_1}$, $B = bag_{T_1}$, $X = bag'_{T_2}$). The scores for a given minimal pair are first averaged across all the pairs of speakers for which this contrast can be made. As above, the resulting scores are then averaged over all contexts and over all pairs of central phones, and converted to an error rate.
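In a sketch like the one above, the across-speaker variant keeps the same averaging hierarchy; the only change is that each cell is indexed by the pair of speakers involved (the one providing $A$ and $B$, and the one providing $X$) rather than by a single speaker.

```python
def across_speaker_error_rate(cell_scores):
    # cell_scores: {(phone_pair, context, (spk_ab, spk_x)): symmetric ABX score};
    # the averaging hierarchy is identical to the within-speaker case.
    return within_speaker_error_rate(cell_scores)
```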

Relationship with other metrics

As shown in Schatz (2016), ABX is an unbiased and statistically efficient measure of discriminability (compared to supervised methods such as LDA, or unsupervised ones such as k-means and k-NN) and can be seen as a good predictor of the outcome of an unsupervised clustering method.

Figure 2. Illustration of category discriminability for different values of ABX.

Bibliography

$^*$The full bibliography can be found here

Schatz, Peddinti, Bach, Jansen, Hermansky & Dupoux (2013). Evaluating speech features with the minimal-pair ABX task: Analysis of the classical MFC/PLP pipeline.

Schatz, Peddinti, Cao, Bach, Hermansky & Dupoux (2014). Evaluating speech features with the minimal-pair ABX task (II): Resistance to noise.

Schatz (2016). ABX-discriminability measures and applications (Doctoral dissertation). Université Paris 6 / École Normale Supérieure.