ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts
Resource type
Authors/contributors
- Garg, Ashi (Author)
- Cai, Zexin (Author)
- Xinyuan, Henry Li (Author)
- García-Perera, Leibny Paola (Author)
- Duh, Kevin (Author)
- Khudanpur, Sanjeev (Author)
- Wiesner, Matthew (Author)
- Andrews, Nicholas (Author)
Title
ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts
Abstract
This repository introduces 🌀 ShiftySpeech, a large-scale synthetic speech dataset with distribution shifts.
🔥 Key Features
3000+ hours of synthetic speech
Diverse Distribution Shifts: The dataset spans 7 key distribution shifts, including:
📖 Reading Style
🎙️ Podcast
🎥 YouTube
🗣️ Languages (three languages)
🌎 Demographics (including variations in age, accent, and gender)
Multiple Speech Generation Systems: Includes data synthesized from various TTS models and vocoders.
💡 Why We Built This Dataset
Driven by advances in self-supervised learning for speech, state-of-the-art synthetic speech detectors have achieved low error rates on popular benchmarks such as ASVspoof. However, prior benchmarks do not address the wide range of real-world variability in speech. Are reported error rates realistic in real-world conditions? To assess detector failure modes and robustness under controlled distribution shifts, we introduce ShiftySpeech, a benchmark with more than 3000 hours of synthetic speech from 7 domains, 6 TTS systems, 12 vocoders, and 3 languages.
Citation Key
_bg
Citation
Garg, A., Cai, Z., Xinyuan, H. L., García-Perera, L. P., Duh, K., Khudanpur, S., Wiesner, M., & Andrews, N. (n.d.). ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts [Dataset]. Retrieved from https://huggingface.co/datasets/ash56/ShiftySpeech
Audio Data