
2022-08-20 15:17

Mozilla Common Voice dataset grows by 30% and reaches 87 languages

Mozilla Common Voice is an open-source initiative to make voice technology more inclusive. Contributors donate speech data to a public dataset, which anyone can then use to train voice-enabled technology. Voice technology is no longer just the remit of smart speakers: banking, government services and health tech are all increasingly voice operated. If we want to make sure nobody is left behind, projects like Common Voice are essential.

Common Voice 8 is the most diverse multilingual open speech corpus in the world. It is the largest release yet, thanks to a growing, committed community and multi-sector resourcing from partners such as the Gates Foundation, NVIDIA, and GIZ. The dataset now holds 18,000 hours of audio in 13 million voice clips, generated entirely by 200,000+ volunteer contributors around the world.

New languages in Common Voice 8 include Igbo, Marathi, Danish, Norwegian Nynorsk, Central Kurdish, Malayalam, Swahili, Erzya, Moksha, Macedonian and Santali (Ol Chiki).

Our communities of contributors around the world have collaborated with, inspired and supported one another in the crowdsourcing efforts that made this dataset possible. Each member brings a unique, lived perspective on their language's experiences and cultural context.

As part of this dataset release, we would like to highlight the contributions of the Common Voice Language Reps, Chris Chinenye Emezue, Joan Montané and Nart for exceptional sentence collection efforts via the CC0 process; Bülent Özden for community building in the Turkish community; and Stefania Deleprete for their Common Voice advocacy efforts. We would also like to congratulate the Uzbek, Luganda, Serbian, Hausa, Belarusian and Abkhaz communities on their amazing growth.

Partners like NVIDIA use the data to fuel exciting open source innovation projects. Research Scientist Vitaly Lavrukhin says, “The latest release of Mozilla Common Voice is a great thing for the research communities. The data continues to be a core component of NVIDIA’s open source NeMo Automatic Speech Recognition models and we congratulate the team on significant growth to the dataset. NVIDIA will also release data preprocessing scripts in NeMo to facilitate reproducibility of research.”

The collaborative support of the Gates Foundation, GIZ and FCDO in growing voice innovation to address inequality in East Africa is also bearing fruit: Swahili has hit 500 hours in a matter of months. This is thanks to the work of amazing community fellows Britone Mwasaru (Kenya) and Rebecca Ryakitimbo (DRC/Tanzania) and machine learning fellow Kathleen Siminyu (Kenya).

You can download the Common Voice dataset here for free.

On the subject

MITC, IT Park and UNDP in Uzbekistan organize hackathon "Voice AI Challenge Uzbekistan"

On December 24-26, 2021, the Voice Ai Challenge Uzbekistan hackathon will be held at Inha University in Tashkent, organized by

2021-11-19

Winners of the Voice Ai Challenge Uzbekistan hackathon!

From December 24 to 26 this year, the INHA University in Tashkent hosted the “Voice Ai Challenge Uzbekistan” hackathon, organized

2021-12-27

🏕 The Voice-Camp project, organized as part of the UzbekVoice marathon has ended!

Out of 8,000+ applications submitted for participation in the camp, 150 participants were selected from different regions of our republic.

From

2022-11-16
