Dual-Space Constrained Face-Based Zero-Shot Text-to-Speech Synthesis

Anonymous Submission
Interspeech 2026

Audio Samples Comparison

For the modular baselines (Face2Speech, SYNTHE-SEES, Face-StyleSpeech), we adopt their original face-voice alignment architectures within a unified TTS framework. All systems share the same acoustic model and speaker encoder; they differ in the design of the alignment module and in how speaker embeddings are learned.

VoxCeleb2 Dataset

Face images are taken from VoxCeleb2, and the text sentences come from the LibriTTS test-clean set.

Face | Text | Ground Truth | FaceTTS | Face2Speech | SYNTHE-SEES | Face-StyleSpeech | DSC-TTS (Ours)

LRS2 Dataset

Sample-level evaluation on the tri-modal LRS2 dataset under corpus mismatch.

Face | Text | Ground Truth | FaceTTS | Face2Speech | SYNTHE-SEES | Face-StyleSpeech | DSC-TTS (Ours)