Updated Feb 15, 2025
Voice → avatar workflow 2025: ElevenLabs Studio 3.0 + HeyGen
We built a full pipeline: record/clone in ElevenLabs Studio 3.0, dub in multiple languages, then feed the tracks into an avatar generator without losing lip-sync. Here are the settings and the export checklist.
Tests were run on Studio 3.0 late-2024 models with 48 kHz WAV exports, then rendered in HeyGen and CapCut using translation and lip-sync enabled.
Use this as a repeatable template. Every step below is tested on short hooks (30–45 seconds) and mid-form explainers (4–8 minutes) so you can see where sync drifts and how to fix it quickly.
Baseline preset we keep loaded in Studio 3.0
Start with one consistent preset so you’re not guessing per language. This is the build that survived the most tool handoffs:
- Voice settings: Stability 62–68, Clarity + Similarity 42–48, Style exaggeration off.
- Safety: Watermark on, blocklist for brands/medical terms, “preserve punctuation” enabled for dubbing.
- Input normalisation: Loudness -16 LUFS with -3 dB peak ceiling; de-ess at 6–8 kHz, light gate at -38 dB.
- Export: WAV 48 kHz mono for voice; keep stems for music/SFX if you’ll remix later.
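The preset above can live as a small config next to your batch scripts so the ranges are checkable before export. A minimal sketch; the field names are ours, not an ElevenLabs API:

```python
# Illustrative preset constants mirroring the baseline above.
# Field names are our own convention, not an official API.
STUDIO_PRESET = {
    "stability": (62, 68),             # slider range, percent
    "clarity_similarity": (42, 48),
    "style_exaggeration": False,
    "watermark": True,
    "preserve_punctuation": True,
    "loudness_lufs": -16.0,
    "peak_ceiling_dbfs": -3.0,
    "deess_band_hz": (6000, 8000),
    "gate_threshold_dbfs": -38.0,
    "export": {"format": "wav", "rate_hz": 48000, "channels": 1},
}

def in_range(value: float, bounds: tuple) -> bool:
    """Check a slider value against the preset's allowed range."""
    low, high = bounds
    return low <= value <= high
```

Keeping the ranges as tuples (rather than single values) lets a pre-flight script warn when someone nudges a slider outside the tested band.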
What changed in Studio 3.0
- Lower latency on short clips, steadier consonants, fewer duplicated phonemes on concatenated exports.
- Watermark on by default; optional blocked phrases for brand names and compliance terms.
- Dubbing preserves punctuation and pauses better than 2024, so captions stay closer to source timing.
- Stems export (voice/ambience) for cleaner mixing and easier last-mile tweaks in your NLE.
Capture and cleanup before cloning
- Record 20–40 seconds of clean tone per speaker. Avoid room tone longer than two seconds so the model doesn’t learn extra noise.
- Normalise to -16 LUFS with a transparent limiter; trim mouth clicks under -42 dB to avoid robotic tails after translation.
- Add 200 ms of silence at head and tail; Studio 3.0 keeps those pauses, which helps captions align in later cuts.
- Run a short listen test on plosives (“p/b/t”) and fricatives (“s/f”) before cloning. If they splash, redo the take instead of over-EQing.
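The 200 ms head/tail padding can be applied in a batch with Python's stdlib wave module rather than by hand per take. A minimal sketch for PCM WAVs; the function name is ours:

```python
import wave

def pad_with_silence(src_path: str, dst_path: str, pad_ms: int = 200) -> None:
    """Prepend and append pad_ms of digital silence to a PCM WAV file."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    pad_frames = int(params.framerate * pad_ms / 1000)
    # PCM silence is all-zero bytes, regardless of sample width.
    silence = b"\x00" * (pad_frames * params.sampwidth * params.nchannels)
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(silence + frames + silence)
```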
Recommended dubbing settings
- Keep watermark on; add sensitive names to the blocklist so the translator never rewrites them.
- Enable “preserve punctuation”; manually tighten any pause longer than 900 ms on short hooks.
- For multilingual runs, generate EN → FR → ES in one session so the tone stays consistent; DE/PL benefit from a -2% tempo reduction.
- Export WAV + SRT per language, plus stems when music/SFX need to be remixed downstream.
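To find the pauses worth tightening, you can scan the exported SRT for gaps over 900 ms instead of listening through every hook. A minimal stdlib sketch; the helper names are ours:

```python
import re

# Matches one SRT timecode line, e.g. 00:00:03,400 --> 00:00:07,000
TIME = re.compile(r"(\d+):(\d+):(\d+),(\d+)\s*-->\s*(\d+):(\d+):(\d+),(\d+)")

def to_ms(h, m, s, ms):
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

def long_pauses(srt_text: str, limit_ms: int = 900):
    """Return (cue_number, gap_ms) pairs where the silence before the
    next cue exceeds limit_ms."""
    cues = [(to_ms(*m.group(1, 2, 3, 4)), to_ms(*m.group(5, 6, 7, 8)))
            for m in TIME.finditer(srt_text)]
    flagged = []
    for i in range(len(cues) - 1):
        gap = cues[i + 1][0] - cues[i][1]
        if gap > limit_ms:
            flagged.append((i + 1, gap))
    return flagged
```

Run it per language after export; anything it flags gets tightened in Studio before the avatar handoff.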
Step-by-step dubbing workflow
- Drop your cleaned script or SRT into Studio 3.0; keep sentences under 18 words for avatar tools that struggle with long visemes.
- Render a reference pass, mark any phoneme repeats, then regenerate only those lines. Avoid whole-paragraph re-renders.
- Export SRT with original timecodes. If you retime in CapCut/Descript later, keep a copy of this “source SRT” for back-sync.
- Label files with lang_version_scene_take.wav so the avatar tool and NLE stay aligned.
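A tiny helper keeps that naming convention consistent across tools; the exact version/scene/take formatting below is illustrative, only the field order comes from the convention above:

```python
import re

# lang_version_scene_take.wav, e.g. fr_v2_s03_t01.wav (formatting is ours)
NAME = re.compile(
    r"^(?P<lang>[a-z]{2})_(?P<version>v\d+)_(?P<scene>s\d+)_(?P<take>t\d+)\.wav$"
)

def build_name(lang: str, version: int, scene: int, take: int) -> str:
    """Build a lang_version_scene_take.wav filename."""
    return f"{lang}_v{version}_s{scene:02d}_t{take:02d}.wav"

def parse_name(filename: str) -> dict:
    """Split a filename back into its fields; raises on anything off-pattern."""
    m = NAME.match(filename)
    if not m:
        raise ValueError(f"unexpected filename: {filename}")
    return m.groupdict()
```

Failing loudly on off-pattern names catches mislabeled takes before they reach the avatar tool.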
Avatar handoff
Import the clean WAV into an avatar video tool and let it handle translation/lip-sync. Tests: EN→FR→ES stayed synced on short hooks; DE needed one manual retime for plosives.
- Disable auto-normalise inside the avatar tool if you already mastered to -16 LUFS; double-normalising adds pumping.
- Keep viseme smoothing at default; cranking it up makes consonants drift out of sync after translation.
- For portrait avatars, avoid jump cuts tighter than 6 frames; the mouth pose resets and looks like a glitch.
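Frame counts only mean something at a known frame rate, so it helps to convert the 6-frame minimum into milliseconds for whichever timeline you're cutting in. A one-line sketch:

```python
def frames_to_ms(frames: int, fps: float) -> float:
    """Duration of a cut gap or pre-roll in milliseconds."""
    return frames * 1000.0 / fps
```

At 30 fps the 6-frame minimum works out to 200 ms of spacing between cuts; at 25 fps it's 240 ms.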
Export recipe: Studio → HeyGen → CapCut
- Studio 3.0: Export WAV 48 kHz mono + SRT; keep stems if you plan to add music later.
- HeyGen: Import WAV, set language to match file, leave lip-sync strength at default. Render a 1080p draft to inspect mouth shapes.
- CapCut: Swap in the final 4K render only after checking SRT against the draft. Apply light compression (-2 dB makeup, ratio 2:1) if you add music.
- Final QC: Peaks below -1 dBFS; SRT lines under 42 characters; no brand terms translated; visual frames free of jump-cut mouth resets.
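The final peak check can be scripted for 16-bit PCM exports instead of eyeballing meters. A minimal stdlib sketch (assumes native little-endian samples, which is what WAV stores on typical platforms):

```python
import array
import math
import wave

def peak_dbfs(path: str) -> float:
    """Peak level of a 16-bit PCM WAV in dBFS (0 dBFS = full scale)."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "expects 16-bit PCM"
        samples = array.array("h")
        samples.frombytes(w.readframes(w.getnframes()))
    peak = max(abs(s) for s in samples) if samples else 0
    if peak == 0:
        return float("-inf")
    return 20 * math.log10(peak / 32768)
```

Anything above -1.0 dBFS fails the QC step above and goes back for a quieter render.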
Checklist before export
- Waveform peaks below -1 dBFS; loudness -16 LUFS ±1; no broadband hiss above -55 dB in the tail.
- SRT lines under 42 characters; two lines max; no orphaned punctuation after translation.
- For multilingual, verify brand terms are not auto-translated and diacritics render correctly in the avatar output.
- Export 1080p draft from the avatar tool, then final 4K once timing is locked; archive stems for remix requests.
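The SRT limits above (42 characters, two lines max) are easy to script per language. A minimal sketch that flags offending cues; the helper name is ours:

```python
def caption_violations(srt_text: str, max_chars: int = 42, max_lines: int = 2):
    """Flag SRT cues that break the line-length or line-count limits.

    Assumes standard SRT blocks: index line, timecode line, then text lines,
    separated by blank lines.
    """
    problems = []
    for block in srt_text.strip().split("\n\n"):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # malformed or empty block
        index, text_lines = lines[0], lines[2:]
        if len(text_lines) > max_lines:
            problems.append((index, "too many lines"))
        for line in text_lines:
            if len(line) > max_chars:
                problems.append((index, f"line over {max_chars} chars"))
    return problems
```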
Common failure modes (and fixes)
- Choppy plosives after translation: Drop clarity to 40–42 and regenerate the affected line only.
- Captions drift mid-sentence: Split the sentence into two lines in Studio, regenerate, keep the SRT split.
- Avatar mouth lags on jump cuts: Insert a 6–8 frame pre-roll of silence before the line; keeps visemes in sync.
- Music pumping: Disable per-track normalisation in the avatar tool; compress in CapCut/Premiere instead.
FAQ
- Does the watermark stay? Yes by default; remove only if you own rights.
- Which languages stayed in sync? EN, FR, ES stable; DE needs a quick review.
- Music bed? Add after avatars are rendered to avoid ducking issues.
Templates you can copy
Paste this timing-safe SRT skeleton into Studio before dubbing; adjust only the text to keep visemes predictable:
1
00:00:00,000 --> 00:00:03,200
Hook text here, under 18 words.

2
00:00:03,400 --> 00:00:07,000
Keep pauses short; avoid stacked commas.
Affiliate transparency: some links may earn a commission at no extra cost to you.