PhonemeDF: A Synthetic Speech Dataset for Audio Deepfake Detection and Naturalness Evaluation
Resource type
Dataset
Authors/contributors
- Nallaguntla, Vamshi (Author)
- Fursule, Aishwarya (Author)
- Kshirsagar, Shruti (Author)
- Avila, Anderson R. (Author)
Title
PhonemeDF: A Synthetic Speech Dataset for Audio Deepfake Detection and Naturalness Evaluation
Abstract
PhonemeDF is a large-scale phoneme-level parallel dataset of real and synthetic speech (approximately 730 hours), designed for audio deepfake detection and speech naturalness evaluation.
The dataset consists of real speech samples derived from a subset of the LibriSpeech corpus (train-clean-100) and corresponding synthetic speech generated using four Text-to-Speech (TTS) systems (MeloTTS, XTTS v2, Chatterbox TTS, and VITS) and three Voice Conversion (VC) systems (Chatterbox VC, FreeVC, and StarGAN VC). Each audio sample is paired with phoneme-level alignments obtained using the Montreal Forced Aligner (MFA) with the ARPAbet phoneme set.
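As a rough illustration of how such phoneme-level alignments can be read, the sketch below extracts intervals from a TextGrid in Praat's long text format using only the Python standard library. The sample text, its labels and timings, the helper name `read_tier`, and the tier names "words" and "phones" are assumptions made for illustration; the actual file layout and encoding in PhonemeDF should be checked against the dataset itself.

```python
import re

# Hypothetical excerpt in Praat's long TextGrid format. Labels and times are
# invented for illustration and are not taken from PhonemeDF.
SAMPLE = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0
xmax = 0.35
tiers? <exists>
size = 2
item []:
    item [1]:
        class = "IntervalTier"
        name = "words"
        xmin = 0
        xmax = 0.35
        intervals: size = 1
        intervals [1]:
            xmin = 0
            xmax = 0.35
            text = "hi"
    item [2]:
        class = "IntervalTier"
        name = "phones"
        xmin = 0
        xmax = 0.35
        intervals: size = 2
        intervals [1]:
            xmin = 0
            xmax = 0.12
            text = "HH"
        intervals [2]:
            xmin = 0.12
            xmax = 0.35
            text = "AY1"
'''

# An interval is an xmin/xmax pair immediately followed by a text label.
# Tier-level xmin/xmax pairs are followed by "intervals: size" and do not match.
INTERVAL = re.compile(r'xmin = ([\d.]+)\s+xmax = ([\d.]+)\s+text = "([^"]*)"')


def read_tier(textgrid_text, tier_name):
    """Return (start_sec, end_sec, label) tuples for one named interval tier."""
    # Split into per-tier chunks at each numbered "item [n]:" header.
    for chunk in re.split(r'item \[\d+\]:', textgrid_text)[1:]:
        if re.search(r'name = "%s"' % re.escape(tier_name), chunk):
            return [(float(a), float(b), label)
                    for a, b, label in INTERVAL.findall(chunk)]
    return []


phones = read_tier(SAMPLE, "phones")
for start, end, label in phones:
    print(f"{label}\t{start:.2f}\t{end:.2f}")
```

This sketch handles only the long text format with plain quoted labels; a dedicated TextGrid library is likely a better choice for real use.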
The dataset contains 28,539 real utterances and 199,773 synthetic utterances, totaling approximately 730 hours of speech, along with corresponding TextGrid files. All audio is standardized to 16 kHz.
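The reported counts can be cross-checked with a few lines of arithmetic. The sketch below assumes that the synthetic utterances are spread evenly over the seven listed systems, which is what a fully parallel design would produce; the record does not state this explicitly. The mean duration is only a rough estimate, since the 730-hour total is approximate.

```python
# Figures as stated in the abstract.
REAL_UTTERANCES = 28_539
SYNTHETIC_UTTERANCES = 199_773
TOTAL_HOURS = 730  # approximate, per the abstract

SYSTEMS = [
    "MeloTTS", "XTTS v2", "Chatterbox TTS", "VITS",   # TTS
    "Chatterbox VC", "FreeVC", "StarGAN VC",          # VC
]

# Assumption: an even split of synthetic utterances across systems.
per_system = SYNTHETIC_UTTERANCES / len(SYSTEMS)

total_utterances = REAL_UTTERANCES + SYNTHETIC_UTTERANCES
mean_seconds = TOTAL_HOURS * 3600 / total_utterances

print(f"synthetic utterances per system: {per_system:.0f}")
print(f"total utterances: {total_utterances}")
print(f"approximate mean utterance length: {mean_seconds:.1f} s")
```

Under that assumption, the synthetic count divides exactly into one utterance per system for each real utterance.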
RESOURCES
Code and additional resources: https://github.com/Vamshi-Nallaguntla/PhonemeDF
Paper (arXiv): https://doi.org/10.48550/arXiv.2603.15037
Version
1.0.1
Date
2026-04-11
Repository
Zenodo
Citation Key
nallaguntla.etal_2026
Accessed
14/05/2026, 18:18
Short Title
PhonemeDF
Language
en
Library Catalog
DOI.org (Datacite)
License
Creative Commons Attribution 4.0 International
Citation
Nallaguntla, V., Fursule, A., Kshirsagar, S., & Avila, A. R. (2026). PhonemeDF: A Synthetic Speech Dataset for Audio Deepfake Detection and Naturalness Evaluation (Version 1.0.1) [Dataset]. Zenodo. https://doi.org/10.5281/ZENODO.19362790
Audio Data