Updated Feb 15, 2025

Voice → avatar workflow 2025: ElevenLabs Studio 3.0 + HeyGen

We built a full pipeline: record/clone in ElevenLabs Studio 3.0, dub in multiple languages, then feed the tracks into an avatar generator without losing lip-sync. Here are the settings and the export checklist.

Tests were run on Studio 3.0's late-2024 models with 48 kHz WAV exports, then rendered in HeyGen and CapCut with translation and lip-sync enabled.

Use this as a repeatable template. Every step below is tested on short hooks (30–45 seconds) and mid-form explainers (4–8 minutes) so you can see where sync drifts and how to fix it quickly.

Baseline preset we keep loaded in Studio 3.0

Start with one consistent preset so you're not guessing per language. The build below, spelled out across the capture, dubbing, and export steps, is the one that survived the most tool handoffs.

Capture and cleanup before cloning

  1. Record 20–40 seconds of clean tone per speaker. Avoid room tone longer than two seconds so the model doesn’t learn extra noise.
  2. Normalise to -16 LUFS with a transparent limiter; trim mouth clicks under -42 dB to avoid robotic tails after translation.
  3. Add 200 ms of silence at head and tail; Studio 3.0 keeps those pauses, which helps captions align in later cuts (steps 2–3 are scripted after this list).
  4. Run a short listen test on plosives (“p/b/t”) and fricatives (“s/f”) before cloning. If they splash, redo the take instead of over-EQing.
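If you batch this cleanup, steps 2 and 3 reduce to one ffmpeg filter chain. A minimal Python wrapper, assuming ffmpeg is on your PATH; one-pass loudnorm is a touch less transparent than a dedicated limiter plus two-pass normalisation, but fine for clone takes, and the script and file names are ours:

import subprocess, sys

def prep_take(src, dst):
    """Normalise to ~-16 LUFS, then pad 200 ms of silence at head and tail."""
    filters = ",".join([
        "loudnorm=I=-16:TP=-1.5:LRA=11",  # one-pass loudness normalisation
        "adelay=200|200",                 # 200 ms silence at the head
        "apad=pad_dur=0.2",               # 200 ms silence at the tail
    ])
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-af", filters, "-ar", "48000", dst],
        check=True,
    )

if __name__ == "__main__":
    prep_take(sys.argv[1], sys.argv[2])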

Recommended dubbing settings

  1. Keep watermark on; add sensitive names to the blocklist so the translator never rewrites them.
  2. Enable “preserve punctuation”; manually tighten any pause longer than 900 ms on short hooks.
  3. For multilingual runs, generate EN → FR → ES in one session so the tone stays consistent; DE/PL benefit from a -2% tempo reduction (a post-export version is sketched after this list).
  4. Export WAV + SRT per language, plus stems when music/SFX need to be remixed downstream.
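If a language needs the tempo trim after export instead of inside Studio, ffmpeg's atempo filter changes speed without shifting pitch. A minimal sketch (ffmpeg on PATH assumed; the function name is ours):

import subprocess, sys

def tempo_trim(src, dst, factor=0.98):
    """Pitch-preserving tempo change; 0.98 is the -2% trim for DE/PL."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-af", f"atempo={factor}", dst],
        check=True,
    )

if __name__ == "__main__":
    tempo_trim(sys.argv[1], sys.argv[2])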

Step-by-step dubbing workflow

  1. Drop your cleaned script or SRT into Studio 3.0; keep sentences under 18 words for avatar tools that struggle with long viseme sequences (a checker follows this list).
  2. Render a reference pass, mark any phoneme repeats, then regenerate only those lines. Avoid whole-paragraph re-renders.
  3. Export SRT with original timecodes. If you retime in CapCut/Descript later, keep a copy of this “source SRT” for back-sync.
  4. Label files with lang_version_scene_take.wav so the avatar tool and NLE stay aligned.
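The 18-word ceiling from step 1 and the naming convention from step 4 are both easy to lint before handoff. A stdlib-only sketch; the cue parser is deliberately naive, and the filename pattern is our guess at lang_version_scene_take.wav:

import re, sys

MAX_WORDS = 18  # avatar tools struggle past this; see step 1

def long_cues(srt_path):
    """Return (cue_number, word_count) for every cue over MAX_WORDS."""
    flagged = []
    text = open(srt_path, encoding="utf-8-sig").read().strip()
    for block in text.split("\n\n"):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # malformed cue; skip rather than guess
        count = len(" ".join(lines[2:]).split())
        if count > MAX_WORDS:
            flagged.append((lines[0].strip(), count))
    return flagged

# Our guess at the step-4 convention, e.g. fr_v2_s01_t03.wav
NAME_RE = re.compile(r"^[a-z]{2}_v\d+_s\d+_t\d+\.wav$")

if __name__ == "__main__":
    for cue, count in long_cues(sys.argv[1]):
        print(f"cue {cue}: {count} words (max {MAX_WORDS})")
    for name in sys.argv[2:]:
        if not NAME_RE.match(name):
            print(f"bad name: {name}")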

Avatar handoff

Import the clean WAV into an avatar video tool and let it handle translation/lip-sync. In our tests, EN→FR→ES stayed synced on short hooks; DE needed one manual retime for plosives.

Export recipe: Studio → HeyGen → CapCut

  1. Studio 3.0: Export WAV 48 kHz mono + SRT; keep stems if you plan to add music later.
  2. HeyGen: Import WAV, set language to match file, leave lip-sync strength at default. Render a 1080p draft to inspect mouth shapes.
  3. CapCut: Swap in the final 4K render only after checking SRT against the draft. Apply light compression (-2 dB makeup, ratio 2:1) if you add music.
  4. Final QC: Peaks below -1 dBFS; SRT lines under 42 characters; no brand terms translated; visual frames free of jump-cut mouth resets (scriptable; see the sketch after this list).
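The audio half of step 4 is scriptable. A stdlib-only QC sketch that assumes 16-bit PCM WAV on a little-endian host (file paths and names are ours):

import math, sys, wave

def peak_dbfs(wav_path):
    """Max sample level in dBFS; assumes 16-bit PCM, little-endian host."""
    with wave.open(wav_path, "rb") as w:
        assert w.getsampwidth() == 2, "sketch only handles 16-bit PCM"
        raw = w.readframes(w.getnframes())
    samples = memoryview(raw).cast("h")  # int16 view of the raw frames
    peak = max(abs(s) for s in samples)
    return 20 * math.log10(peak / 32768) if peak else float("-inf")

def wide_srt_lines(srt_path, limit=42):
    """Caption text lines over the 42-character budget."""
    bad = []
    for line in open(srt_path, encoding="utf-8-sig"):
        text = line.rstrip("\n")
        stripped = text.strip()
        if not stripped or stripped.isdigit() or "-->" in stripped:
            continue  # skip blanks, cue indices, and timecode lines
        if len(text) > limit:
            bad.append(text)
    return bad

if __name__ == "__main__":
    wav, srt = sys.argv[1], sys.argv[2]
    print(f"peak: {peak_dbfs(wav):.2f} dBFS (target: below -1.0)")
    for text in wide_srt_lines(srt):
        print(f"over 42 chars: {text}")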

Templates you can copy

Paste this timing-safe SRT skeleton into Studio before dubbing; adjust only the text to keep visemes predictable:

1
00:00:00,000 --> 00:00:03,200
Hook text here, under 18 words.

2
00:00:03,400 --> 00:00:07,000
Keep pauses short; avoid stacked commas.
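
For longer scripts, generating the skeleton beats hand-typing timecodes. A minimal Python sketch (names ours) that keeps the same 200 ms gap between cues; called with durations 3.2 and 3.6 it reproduces the two cues above:

import sys

def skeleton(durations_s, gap_ms=200):
    """Emit an SRT skeleton with a fixed gap between cues; replace only the text."""
    def stamp(ms):
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
    cues, start = [], 0
    for i, dur in enumerate(durations_s, start=1):
        end = start + int(dur * 1000)
        cues.append(f"{i}\n{stamp(start)} --> {stamp(end)}\nCue {i} text here, under 18 words.")
        start = end + gap_ms
    return "\n\n".join(cues)

if __name__ == "__main__":
    # python srt_skeleton.py 3.2 3.6  reproduces the two-cue example above
    print(skeleton([float(arg) for arg in sys.argv[1:]]))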

Affiliate transparency: some links may earn a commission at no extra cost to you.