PhonemeDF: A Synthetic Speech Dataset for Audio Deepfake Detection and Naturalness Evaluation

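Resource type

Dataset

Authors/contributors

Nallaguntla, V.; Fursule, A.; Kshirsagar, S.; Avila, A. R.

Title

PhonemeDF: A Synthetic Speech Dataset for Audio Deepfake Detection and Naturalness Evaluation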

Abstract

PhonemeDF is a large-scale, phoneme-level parallel dataset of real and synthetic speech, designed for audio deepfake detection and speech naturalness evaluation. The real speech comes from a subset of the LibriSpeech corpus (train-clean-100); the corresponding synthetic speech was generated with four Text-to-Speech (TTS) systems (MeloTTS, XTTS v2, Chatterbox TTS, and VITS) and three Voice Conversion (VC) systems (Chatterbox VC, FreeVC, and StarGAN VC). Each audio sample is paired with phoneme-level alignments obtained using the Montreal Forced Aligner (MFA) with the ARPAbet phoneme set. The dataset contains 28,539 real utterances and 199,773 synthetic utterances, totaling approximately 730 hours of speech, along with the corresponding TextGrid files. All audio is standardized to 16 kHz.

Resources
Code and additional resources: https://github.com/Vamshi-Nallaguntla/PhonemeDF
Paper (arXiv): https://doi.org/10.48550/arXiv.2603.15037
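The utterance counts quoted in the abstract can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it assumes that every real utterance was rendered exactly once by each of the seven listed systems, which the record does not state explicitly but which the figures are consistent with. The variable and function names are hypothetical, and the mean duration is only a rough estimate because the 730-hour total is itself approximate.

```python
# Figures quoted in the abstract of this record.
REAL_UTTERANCES = 28_539
SYNTHETIC_UTTERANCES = 199_773
APPROX_TOTAL_HOURS = 730  # approximate total stated in the abstract

TTS_SYSTEMS = ["MeloTTS", "XTTS v2", "Chatterbox TTS", "VITS"]
VC_SYSTEMS = ["Chatterbox VC", "FreeVC", "StarGAN VC"]


def expected_synthetic_count(real_utterances: int, n_systems: int) -> int:
    """Synthetic count under the ASSUMPTION of one rendering per system."""
    return real_utterances * n_systems


n_systems = len(TTS_SYSTEMS) + len(VC_SYSTEMS)
expected = expected_synthetic_count(REAL_UTTERANCES, n_systems)
total_utterances = REAL_UTTERANCES + SYNTHETIC_UTTERANCES

# Rough mean utterance length in seconds, derived from the approximate total.
mean_seconds = APPROX_TOTAL_HOURS * 3600 / total_utterances

print(f"systems: {n_systems}")
print(f"expected synthetic utterances: {expected}")
print(f"matches stated count: {expected == SYNTHETIC_UTTERANCES}")
print(f"total utterances: {total_utterances}")
print(f"approximate mean duration: {mean_seconds:.1f} s")
```

Under that assumption the stated synthetic count equals the real count multiplied by the number of systems, so the dataset is fully parallel at the utterance level.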

Version

1.0.1

Date

2026-04-11

Repository

Zenodo

DOI

10.5281/ZENODO.19362790

Citation Key

nallaguntla.etal_2026

URL

https://zenodo.org/doi/10.5281/zenodo.19362790

Accessed

14/05/2026, 18:18

Short Title

PhonemeDF

Language

Library Catalog

DOI.org (Datacite)

License

Creative Commons Attribution 4.0 International

Citation

Nallaguntla, V., Fursule, A., Kshirsagar, S., & Avila, A. R. (2026). PhonemeDF: A Synthetic Speech Dataset for Audio Deepfake Detection and Naturalness Evaluation (Version 1.0.1) [Dataset]. Zenodo. https://doi.org/10.5281/ZENODO.19362790

Audio Data

Synthetic Speech