Dual-Space Constrained Face-Based Zero-Shot Text-to-Speech Synthesis

Anonymous Submission
Interspeech 2026

Audio Samples Comparison

For the modular baselines (Face2Speech, SYNTHE-SEES, Face-StyleSpeech), we adopt their original face-voice alignment architectures within a unified TTS framework. All systems share the same acoustic model and speaker encoder, and differ in alignment module design and speaker embedding learning.

VoxCeleb2 Dataset

Face images from VoxCeleb2 and text sentences from the LibriTTS test-clean set are used for synthesis.

Face	Text	Ground Truth	FaceTTS	Face2Speech	SYNTHE-SEES	Face-StyleSpeech	DSC-TTS (Ours)

LRS2 Dataset

Sample-level evaluation on the LRS2 tri-modal dataset under corpus mismatch.

Face	Text	Ground Truth	FaceTTS	Face2Speech	SYNTHE-SEES	Face-StyleSpeech	DSC-TTS (Ours)