# How to participate

## Choosing a train dataset

You can train on any of the standard ZeroSpeech Task 4 train sets listed on the Benchmarks and Datasets page, individually or in combination. You can also train on external datasets, as long as they are publicly available. During the submission process, you will be asked to specify which dataset was used to train your system, providing a link (or publication reference) if it is an external dataset.

The provided datasets can be downloaded using our toolkit, or directly using the URLs provided in our repository.

## Using our toolkit

We recommend installing and using our toolkit to manage, evaluate & upload your submissions. The toolkit is a python package containing evaluation scripts, scripts to download datasets & other relevant files, and scripts to facilitate uploading results to the leaderboards. Instructions on how to download and use the toolkit can be found in our repository.

## Submission Preparation

Each benchmark requires a specific set of files to be prepared.

To facilitate this, you can use the zrc submission:init sLM21 <location> command from the toolkit to create an empty submission template folder, where <location> is the path at which the directory will be created.
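Based on the files described in the rest of this section, the generated template should look roughly like the sketch below (exact contents may vary with the toolkit version):

```
<location>/
├── meta.yaml
├── params.yaml
├── lexical/
│   ├── dev.txt
│   └── test.txt
├── syntactic/
│   ├── dev.txt
│   └── test.txt
└── semantic/
    ├── dev/
    │   ├── librispeech/
    │   └── synthetic/
    └── test/
        ├── librispeech/
        └── synthetic/
```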

### meta.yaml

This file contains meta information about the author and how this submission was created.

Example:

    model_info:
      model_id: null
      gpu_budget: 60
      system_description: "CPC-big (trained on librispeech 960), kmeans (trained on librispeech 100), LSTM. See https://zerospeech.com/2021 for more details."
      train_set: "librispeech 960, librispeech 100"
    publication:
      author_label: "Nguyen et al."
      authors: "Nguyen, T., Seyssel, M., Rozé, P., Rivière, M., Kharitonov, E., Baevski, A., Dunbar, E. & Dupoux, E."
      paper_title: "The zero resource speech benchmark 2021: Metrics and baselines for unsupervised spoken language modeling."
      paper_url: "https://arxiv.org/abs/2011.11589"
      publication_year: 2021
      institution: "EHESS, ENS, PSL Research University, CNRS and Inria"
      team: "CoML Team"
    code_url: "https://github.com/zerospeech/zerospeech2021_baseline"
    open_source: true


Note:

While most of the information in meta.yaml is optional, we would appreciate it if you took the time to fill it in, as it allows us to verify submissions and keep track of all the systems that use our benchmarks.

We would also appreciate it if you made your code open source and provided a link to it, although we understand that this is not always possible.

### params.yaml

This file contains various parameters that can override the defaults of each benchmark.

    semantic:
      metric: <str>    # the metric to use for semantic evaluation; may be any
                       # metric supported by scipy.spatial.distance.cdist
      n_jobs: <int>    # accelerate semantic evaluation by using multiple processes
      pooling: <str>   # the pooling method to use for semantic evaluation; must be
                       # 'min', 'max', 'mean', 'sum', 'last' or 'lastlast'
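To give an intuition for these two parameters, here is a minimal sketch of how a pooling method and a cdist metric could combine to compare two feature files. This is illustrative only; the actual evaluation code lives in the toolkit.

```python
import numpy as np
from scipy.spatial.distance import cdist


def pool(frames: np.ndarray, method: str) -> np.ndarray:
    """Collapse a (n_frames, feature_dim) array into a single (feature_dim,) vector."""
    if method == "min":
        return frames.min(axis=0)
    if method == "max":
        return frames.max(axis=0)
    if method == "mean":
        return frames.mean(axis=0)
    if method == "sum":
        return frames.sum(axis=0)
    if method == "last":
        return frames[-1]
    if method == "lastlast":
        return frames[-2]
    raise ValueError(f"unknown pooling method: {method}")


# Two hypothetical feature files with different durations but the same dimension
a = np.random.rand(53, 256)
b = np.random.rand(41, 256)

# Pool each file to one vector, then compare them with the chosen cdist metric
pooled_a = pool(a, "mean")
pooled_b = pool(b, "mean")
distance = cdist(pooled_a[None, :], pooled_b[None, :], metric="cosine")[0, 0]
```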


### model outputs

For each of the tasks a model output is required.

#### /lexical and /syntactic

The /lexical and /syntactic folders of the submission must contain the two files dev.txt and test.txt. Each *.wav file in the dataset must have a corresponding line in dev.txt or test.txt with its pseudo-probability (order does not matter). For example, if the dev dataset contains:

    /path/to/dataset/lexical/dev
    ├── aAAfmkmQpVz.wav
    ├── AaaggUZsvkR.wav
    ├── aAakhKfuvQI.wav
    ├── aAaOswLeeBL.wav
    ├── AaasVuoMJnS.wav


The submitted file dev.txt must contain entries like:

    aAAfmkmQpVz -313.37445068359375
    AaaggUZsvkR -447.8950500488281
    aAakhKfuvQI -383.8902587890625
    aAaOswLeeBL -430.2048645019531
    AaasVuoMJnS -356.9426574707031
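Before running the evaluation, it can be useful to check that every .wav file in a split has a score line. Here is a small hypothetical helper (not part of the toolkit) that parses a score file and reports missing entries:

```python
from pathlib import Path


def parse_scores(text: str) -> dict[str, float]:
    """Parse '<file_id> <pseudo-probability>' lines into a dict."""
    scores = {}
    for line in text.splitlines():
        if line.strip():
            file_id, value = line.split()
            scores[file_id] = float(value)
    return scores


def missing_entries(dataset_dir: Path, score_file: Path) -> set[str]:
    """Return the .wav stems that have no line in the score file."""
    expected = {p.stem for p in dataset_dir.glob("*.wav")}
    scored = set(parse_scores(score_file.read_text()))
    return expected - scored
```

For example, missing_entries(Path("/path/to/dataset/lexical/dev"), Path("submission/lexical/dev.txt")) should return an empty set for a complete submission.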

#### /semantic

The semantic folder of the submission must contain the following subdirectories: dev/synthetic, dev/librispeech, test/synthetic and test/librispeech.

• Each .wav file in the dataset must have a corresponding .npy file in the submission under the same directory structure. For example, the dataset file /path/to/dataset/semantic/dev/synthetic/aAbcsWWKCz.wav must have a corresponding submission file /path/to/submission/semantic/dev/synthetic/aAbcsWWKCz.npy.

• Each .npy file encodes a single 2D numpy array of floats, with one feature frame per row.

• The number of columns (the feature dimension) must be constant across files; the number of rows depends on the speech sample duration.

• The metric and pooling method used for evaluation must be specified in params.yaml.
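A minimal sketch of how such files could be written, mirroring the dataset layout in the submission directory (paths are placeholders; how you compute the features is up to your system):

```python
import numpy as np
from pathlib import Path


def save_features(features: np.ndarray, wav_path: Path,
                  dataset_root: Path, submission_root: Path) -> Path:
    """Write a (n_frames, feature_dim) array as .npy under the same
    relative path as the source .wav file."""
    assert features.ndim == 2, "each file must hold a single 2D array"
    rel = wav_path.relative_to(dataset_root).with_suffix(".npy")
    out = submission_root / rel
    out.parent.mkdir(parents=True, exist_ok=True)
    np.save(out, features.astype(np.float32))
    return out
```

Calling save_features(feats, Path(".../semantic/dev/synthetic/aAbcsWWKCz.wav"), dataset_root, submission_root) writes .../semantic/dev/synthetic/aAbcsWWKCz.npy under the submission directory.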

## Running the evaluation

Once the submission has been successfully created, you can run the evaluation:

• zrc benchmarks:run sLM21 </path/to/submission> -o scores_dir

Your results are created in the scores_dir directory.

Notes:

• A validation step runs before each evaluation; to skip it, use the --skip-validation option.
• If the dataset has subsets, you can restrict the evaluation to a selected subset with --sets dev.
• If the benchmark has multiple sub-tasks, you can run only selected sub-tasks using --task lexical semantic.

We would appreciate it if you uploaded your results so that we can compile them into our leaderboards; this helps us in several ways:

• It allows us to follow new systems that are evaluated on our benchmarks and compare them.
• It also helps us with creating a central place where all systems trying to solve unsupervised speech processing can be indexed.
• It shows that interest in our benchmarks is still active and motivates us to create more.

To submit your results, you need to create an account on our website (if you do not already have one).

Using the toolkit, create a local session with zrc user:login and provide your username & password.

Once this is done, you can upload your results with the command zrc upload:scores <score_dir> <submission_dir>.

## Multiple Submissions

If your system can be used for multiple tasks (for example, Task 1 and Task 3, or Task 1 and Task 4), you are strongly encouraged to make submissions to all the tasks you can. To link the submissions of a single system, reuse the same model_id in your meta.yaml; it is auto-generated after your first submission.