StreamVoiceAnon+: Emotion-Preserving Streaming Speaker Anonymization via Frame-Level Acoustic Distillation

Anonymous submission to Interspeech 2026

Abstract

We address the challenge of preserving emotional content in streaming speaker anonymization (SA). Neural audio codec language models trained for audio continuation tend to degrade source emotion: content tokens discard emotional information, and the model defaults to dominant acoustic patterns rather than preserving paralinguistic attributes. We propose supervised finetuning on neutral–emotional utterance pairs from the same speaker, combined with frame-level emotion distillation on acoustic token hidden states. All modifications are confined to finetuning, which takes less than 2 hours on 4 GPUs and adds zero inference overhead, maintaining a competitive 180 ms streaming latency. On the VoicePrivacy 2024 protocol, our approach achieves 49.2% UAR (emotion preservation) with 5.77% WER (intelligibility), a +24% relative UAR improvement over the baseline (39.7% → 49.2%) and +10% over the emotion-prompt variant, while maintaining strong privacy (EER 49.0%).

Index Terms: speaker anonymization, emotion preservation, streaming speech processing, knowledge distillation

Privacy–Emotion Trade-off

Figure 1: Privacy-emotion trade-off for streaming speaker anonymization methods. Our method (orange star) compared to prior streaming methods (triangles).

Proposed Method

Figure 2: Training and inference configurations. Training: prompt and source share the same speaker but differ in emotion, forcing the model to generate emotional output from source content rather than copying prompt-specific patterns. Frame-level emotion distillation (ℒemo, dashed) on Slow AR acoustic hidden states provides additional learning signal. Inference: a neutral utterance from the target anonymous speaker conceals source identity while the finetuned model preserves source emotion; no latency is added over the baseline.

Supervised Finetuning (SFT) — Finetune on neutral–emotional utterance pairs from the same speaker: a neutral prompt is paired with an emotional source, combined with frame-level emotion distillation on acoustic token hidden states. All modifications are confined to finetuning.
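The pairing scheme from Figure 2 can be sketched as below. The `utterances` metadata list and its `speaker`/`emotion`/`path` fields are hypothetical and purely illustrative of the idea: every pair keeps the speaker fixed while the prompt is neutral and the source is emotional.

```python
from itertools import product

def make_sft_pairs(utterances):
    """Build (neutral prompt, emotional source) finetuning pairs.

    Both utterances in a pair come from the same speaker, so the model
    cannot simply copy prompt-specific acoustic patterns and must carry
    the source emotion into its continuation.
    """
    pairs = []
    speakers = {u["speaker"] for u in utterances}
    for spk in speakers:
        neutral = [u for u in utterances
                   if u["speaker"] == spk and u["emotion"] == "neutral"]
        emotional = [u for u in utterances
                     if u["speaker"] == spk and u["emotion"] != "neutral"]
        # Every neutral prompt is paired with every emotional source.
        pairs.extend(product(neutral, emotional))
    return pairs

corpus = [
    {"speaker": "s1", "emotion": "neutral", "path": "s1_neu.wav"},
    {"speaker": "s1", "emotion": "sad",     "path": "s1_sad.wav"},
    {"speaker": "s1", "emotion": "angry",   "path": "s1_ang.wav"},
    {"speaker": "s2", "emotion": "neutral", "path": "s2_neu.wav"},
    {"speaker": "s2", "emotion": "happy",   "path": "s2_hap.wav"},
]
pairs = make_sft_pairs(corpus)
# s1 contributes 1x2 pairs, s2 contributes 1x1 -> 3 pairs in total.
```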

Acoustic Emotion Distillation — Frame-level distillation from a pretrained emotion extractor into acoustic hidden states. The acoustic branch provides cleaner gradient flow, avoiding interference with content supervision on the semantic branch.
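A minimal sketch of a frame-level distillation objective of this kind, assuming (hypothetically) that acoustic hidden states are passed through a learned projection into the emotion-embedding space of a frozen extractor and matched per frame with an MSE loss; the names, dimensions, and the choice of MSE are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def frame_emotion_distill_loss(acoustic_hidden, teacher_emotion, proj):
    """L_emo: per-frame distillation on acoustic hidden states.

    acoustic_hidden : (T, d_model) hidden states from the acoustic branch.
    teacher_emotion : (T, d_emo) frame-level embeddings from a frozen,
                      pretrained emotion extractor.
    proj            : (d_model, d_emo) learned projection head
                      (discarded at inference).
    """
    student = acoustic_hidden @ proj  # project into the emotion space
    return float(np.mean((student - teacher_emotion) ** 2))

rng = np.random.default_rng(0)
T, d_model, d_emo = 50, 256, 64
loss = frame_emotion_distill_loss(
    rng.standard_normal((T, d_model)),
    rng.standard_normal((T, d_emo)),
    rng.standard_normal((d_model, d_emo)) / np.sqrt(d_model),
)
```

Because the loss is averaged over frames rather than a single pooled vector, the gradient signal reaches every acoustic time step, which is what lets fine-grained prosodic cues survive.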

Zero Inference Overhead — At inference, the distillation head and Emotion Encoder are removed; the model operates with the same architecture and latency as the baseline.
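A toy illustration of the zero-overhead property: the distillation head is a separate module consumed only by the training loss, so deleting it after finetuning leaves the generation path untouched. All class and attribute names here are hypothetical stand-ins, not the actual model code.

```python
class Generator:
    """Stand-in for the baseline streaming generation path."""
    def forward(self, frame):
        return frame * 2  # placeholder for codec-token prediction

class FinetunedModel:
    def __init__(self):
        self.generator = Generator()
        self.distill_head = object()  # used only by the training loss

    def infer(self, frame):
        # The distillation head is never on this path.
        return self.generator.forward(frame)

model = FinetunedModel()
before = model.infer(3)
del model.distill_head  # strip training-only modules for deployment
after = model.infer(3)
assert before == after  # same outputs, same latency path as the baseline
```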

Main Results

Method                                Type     WER ↓   UAR ↑   EER-L ↑   EER-S ↑
Original                              –        1.83    –       5.16      –
EASY [5]                              Offline  2.70    63.81   45.89     –
GenVC-small [12]                      Semi     8.20    34.23   48.48     15.94
SLT24 [17]                            Online   5.70    57.00   31.40     10.12
DarkStream [18]                       Online   8.75    34.73   47.26     21.83
TVTSyn [11]                           Online   5.35    37.32   47.55     14.57
StreamVoiceAnon [4] (vctk-1fix)       Online   4.54    39.72   47.19     15.92
StreamVoiceAnon [4] (crema-emo-4rnd)  Online   6.59    44.59   46.53     18.63
Ours (pool-distill)                   Online   5.08    46.30   48.62     18.32
Ours (frame-distill)                  Online   5.77    49.22   48.98     18.30

Table 1: Comparison with prior methods. ↑/↓: higher/lower is better. Bold: best streaming result; underline: second best. SLT24 is grayed out due to insufficient privacy (EER-L < 40%). Our two variants correspond to Exp4 (pool-distill) and Exp7 (frame-distill) in the ablation study (Table 2).

Emotion Spotlight: Sadness

The most dramatic improvement occurs for "sad": from 8.0% UAR (baseline) to 42.6% (ours), a +431% relative gain. The emotion-prompt baseline nearly destroys sadness recognition (8.0% UAR, far below the 25% chance level for four classes). Our method recovers sad emotion to 42.6% UAR, demonstrating that acoustic-level distillation captures the subtle prosodic cues that prompt-based approaches miss entirely.

Utterance 1: Original | crema-emo-4rnd | Ours (frame-distill)
Utterance 2: Original | crema-emo-4rnd | Ours (frame-distill)

Full Audio Comparison

3 utterances per emotion, 6 methods each. Click a waveform to play, or use "Play all" to hear methods sequentially.

Original | vctk-1fix | emo-4rnd | DarkStream | TVTSyn | Ours

Ablation Study

Model      Distill   WER ↓   Avg ↑   Ang    Hap    Neu    Sad    EER-L ↑   EER-S ↑
Baseline   –         4.54    39.7    35.8   81.9   33.1   8.0    47.19     15.92
Exp1       –         5.00    41.1    36.3   79.6   35.5   13.2   45.70     14.88
Exp2       –         5.16    45.3    35.3   75.9   48.2   21.7   47.31     16.73
Exp3       –         5.25    47.4    34.8   72.9   50.7   31.2   47.46     16.53
Exp4       –         5.08    46.3    40.9   75.1   44.2   25.0   48.62     18.32
Exp5       –         5.32    48.5    40.3   65.2   53.6   34.8   48.19     16.78
Exp6       Sem       6.23    48.2    48.7   66.7   49.7   27.7   47.93     17.10
Exp7       Aco       5.77    49.2    38.8   62.8   52.7   42.6   48.98     18.30

Table 2: Ablation study. Ablated components: FT-CREMA, Neu-Emo, [SEP], and StatPool/Causal (aggregation approach). Distill: distillation target branch (Sem = semantic, Aco = acoustic; – = acoustic by default). Emotion columns (Avg, Ang, Hap, Neu, Sad) report UAR (%); Avg is the unweighted mean of the four per-class values. Bold: best among Exp1–7; underline: second best. All metrics on IEMOCAP following the VoicePrivacy 2024 protocol.
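As a quick sanity check on the reported numbers, the Average emotion column is the unweighted average recall (UAR), i.e. the plain mean of the four per-class recalls; for the frame-distill model (Exp7):

```python
# Per-class recalls (%) for Exp7 (frame-distill) from Table 2.
per_class = {"angry": 38.8, "happy": 62.8, "neutral": 52.7, "sad": 42.6}

# UAR = unweighted mean of per-class recalls.
uar = sum(per_class.values()) / len(per_class)
print(round(uar, 1))  # 49.2, matching the reported Average
```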

Ablation Audio Samples

Compare key ablation configurations on sad utterances to hear the incremental effect of each component.