Dual-Space Constrained Face-Based Zero-Shot Text-to-Speech Synthesis
Audio Samples Comparison
For the modular baselines (Face2Speech, SYNTHE-SEES, Face-StyleSpeech), we adopt their original face-voice alignment architectures within a unified TTS framework. All systems share the same acoustic model and speaker encoder, differing only in the design of the alignment module and in how the speaker embeddings are learned.
VoxCeleb2 Dataset
Face images from VoxCeleb2 and text sentences from the LibriTTS test-clean set are used for synthesis.
| Face | Text | Ground Truth | FaceTTS | Face2Speech | SYNTHE-SEES | Face-StyleSpeech | DSC-TTS (Ours) |
|---|---|---|---|---|---|---|---|
LRS2 Dataset
Sample-level evaluation on the LRS2 tri-modal dataset under corpus mismatch.
| Face | Text | Ground Truth | FaceTTS | Face2Speech | SYNTHE-SEES | Face-StyleSpeech | DSC-TTS (Ours) |
|---|---|---|---|---|---|---|---|