Skill: heygen-translate · Invoke: /heygen-translate [video_url_or_path] [--to language] · Source
heygen-translate localizes a finished video into 175+ languages. The presenter keeps their face, their voice is cloned into the target language, and their lips re-sync to the new audio — viewers see the same person speaking natively. You supply a source video and target languages; the engine handles transcription, translation, voice cloning, lip-sync, and optional burned-in captions. This is not new-video generation. The performance, framing, and brand assets in the original are preserved — translation rides on top of what’s already there.

When to Use

  • Localize an existing video: “translate this to Spanish”, “dub this into Japanese”, “I need this in 10 languages for a launch”.
  • Same presenter, another language (not a new presenter — that’s heygen-video).
  • Podcast / audio-only translation: “translate this podcast, keep my video”.
  • High-stakes content where you want to review/edit subtitles before final render (the proofreads workflow).
Not for: creating new videos from scratch (use heygen-video and write the script in the target language), avatar creation (use heygen-avatar), or text-only translation. Use this skill only when a source video already exists.

Workflow

Four phases. Discovery is the only place the skill asks questions; the rest is silent until delivery.
1. Discovery

Gathers the minimum: source video (URL, file path, or HeyGen asset_id) and target language(s) — asked open-ended, no picker. Confirms speaker count, content type, captions, and duration flexibility with smart defaults.
2. Pre-flight (silent)

Validates the language string against the canonical list, routes the source (public URL passes through; local files upload, max 32 MB), and classifies the content into one of five profiles to set the right flags.
3. Submit + Poll (silent)

Submits one job per language (batchable), then background-polls — 30s for the first 3 min, then 60s. Most finish in 5–15 min; some take 30+. Surfaces only on completion, hard failure, or a stall.
4. Deliver

Posts one message per completed language with the video URL, duration, mode, and caption status. Delivers each as it finishes rather than waiting for the whole batch.
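
The Pre-flight routing and the polling cadence described above can be sketched as follows. This is a minimal illustration, not the skill's actual code: the function names, the return labels, the reading of 32 MB as 32 × 1024 × 1024 bytes, and the fallback that treats any other string as an asset_id are all assumptions.

```python
from pathlib import Path
from urllib.parse import urlparse

# Assumption: "32 MB" is read as 32 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 32 * 1024 * 1024


def route_source(source: str) -> str:
    """Classify a source as 'url', 'upload', or 'asset_id' (hypothetical labels)."""
    if urlparse(source).scheme in ("http", "https"):
        return "url"  # public URLs pass through unchanged
    path = Path(source)
    if path.is_file():
        if path.stat().st_size > MAX_UPLOAD_BYTES:
            raise ValueError(f"{source} exceeds the 32 MB upload limit")
        return "upload"  # local files are uploaded first
    # Assumption: anything that is neither a URL nor a local file is an asset_id.
    return "asset_id"


def poll_interval(elapsed_seconds: float) -> int:
    """Seconds to wait before the next poll: 30 s for the first 3 min, then 60 s."""
    return 30 if elapsed_seconds < 180 else 60
```

A polling loop would call `poll_interval` with the time since submission and sleep for the returned number of seconds before checking job status again.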

Content Profiles

The skill picks one profile from the source and sets flags accordingly — you rarely need to specify:
| Profile | Use when | Key flags |
| --- | --- | --- |
| Talking head (default) | One person to camera, clean audio | precision, speech enhancement, captions, dynamic duration |
| Podcast / audio-only | Static or absent visual | translate_audio_only, speech enhancement |
| Music / high-soundtrack | Background music interferes with speech | disable_music_track, speech enhancement |
| Multi-speaker | Two+ distinct speakers | Talking-head defaults + speaker_num (required; don’t guess) |
| Corporate / branded | Brand voice, glossary discipline | Talking-head defaults + brand_voice_id; consider proofreads |
Always precision mode unless the user explicitly asks for speed. Always confirm speaker count for non-obvious cases — wrong speaker count is the #1 quality killer, causing voices to bleed across speakers.
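
One way to picture the profile-to-flag mapping is a small lookup function. This is a sketch under stated assumptions: only `translate_audio_only`, `disable_music_track`, `speaker_num`, and `brand_voice_id` are names taken from the table; the profile labels, the other keys (`mode`, `speech_enhancement`, `captions`, `dynamic_duration`), and the function itself are hypothetical.

```python
# Hypothetical keys for the talking-head defaults listed in the table.
TALKING_HEAD_DEFAULTS = {
    "mode": "precision",
    "speech_enhancement": True,
    "captions": True,
    "dynamic_duration": True,
}


def flags_for(profile, speaker_num=None, brand_voice_id=None):
    """Return a flag dict for one of the five content profiles."""
    # Precision mode is always the default unless the user asks for speed.
    base = {"mode": "precision", "speech_enhancement": True}
    if profile == "podcast":
        return {**base, "translate_audio_only": True}
    if profile == "music":
        return {**base, "disable_music_track": True}

    flags = dict(TALKING_HEAD_DEFAULTS)
    if profile == "multi_speaker":
        if speaker_num is None:
            # The speaker count must be confirmed with the user, never guessed.
            raise ValueError("speaker_num is required for multi-speaker content")
        flags["speaker_num"] = speaker_num
    elif profile == "corporate":
        if brand_voice_id is not None:
            flags["brand_voice_id"] = brand_voice_id
    elif profile != "talking_head":
        raise ValueError(f"unknown profile: {profile}")
    return flags
```

Raising on a missing `speaker_num` mirrors the rule that a wrong speaker count is the main quality killer: the sketch refuses to proceed rather than fall back to a default.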

Proofread Workflow

For high-stakes content, run a proofread session so the user can review and edit translated subtitles before the engine commits to a final render. Proofreading defaults to ON for long videos (>3 min), corporate/branded content, high-stakes legal, medical, or educational material, and target languages the user reads natively.
proofreads create → poll → fetch SRT → edit (glossary, register, names) →
upload edited SRT → generate final → poll → deliver
This gives you an editable SRT you can also use as a sidecar caption file — preferable to burned-in captions when you might want to restyle them later.
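
The default-ON rule for proofreading can be expressed as a single predicate. This is a rough sketch: the function name, parameter names, and the `"corporate"` profile label are hypothetical, and how the skill detects "high-stakes" content or a language the user reads is not specified here, so both are passed in as plain booleans.

```python
def should_proofread(duration_min, profile, high_stakes=False, user_reads_target=False):
    """Return True when the proofread workflow should default to ON."""
    return (
        duration_min > 3            # long videos (>3 min)
        or profile == "corporate"   # corporate/branded content
        or high_stakes              # legal, medical, or educational material
        or user_reads_target        # user can read the target language natively
    )
```

For example, `should_proofread(2, "talking_head")` is false, while the same 2-minute clip with `high_stakes=True` switches proofreading on.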

What to Know Before Translating

Translation can’t improve on the source. Muffled audio, fast cuts, heavy face occlusion, or low resolution all degrade lip-sync and voice quality. Warn the user upfront in Pre-flight — never after a bad result. Lip-sync is best on stable, front-facing, well-lit shots ≥720p with minimal cuts.
Translated speech changes length depending on the language pair: en→zh, en→ja, and en→ko run ~30% shorter; de→en and ja→en run longer; en→ar and en→he expand. Dynamic duration matters most for these pairs; without it, en→zh sounds artificially slow. Regional variants matter too: Spanish (Spain) ≠ Spanish (Mexico); “Chinese (Mandarin, Simplified)” ≠ Cantonese ≠ Taiwanese Mandarin.
Register-heavy languages (Japanese keigo, Korean honorifics, German Sie/du, Thai) default to neutral-formal — proofread if you need to match a conversational source. RTL languages (Arabic, Hebrew, Urdu, Persian) render captions right-to-left and can collide with source lower-thirds; consider audio-only or caption styling review.
Translations bill by source duration. A 5-min video × 5 languages = ~25 billable minutes, each rendering in 10–20 min. Surface source-minutes × language-count and an honest render-time range — don’t quote dollar figures (they vary by plan).
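
The billing arithmetic is simple enough to sketch directly. The helper below is hypothetical (name, parameters, and return shape are illustrative); it reports billable minutes and a render-time range only, and deliberately no dollar figures.

```python
def estimate_batch(source_minutes, languages, render_range_min=(10, 20)):
    """Estimate billable minutes and per-language render time for a batch."""
    billable = source_minutes * len(languages)  # source duration x language count
    low, high = render_range_min
    return {
        "billable_minutes": billable,
        "render_minutes_each": (low, high),
        "summary": (
            f"{source_minutes} source min x {len(languages)} languages = "
            f"~{billable} billable min; each render roughly {low}-{high} min"
        ),
    }


quote = estimate_batch(5, ["es", "fr", "de", "ja", "ko"])
print(quote["billable_minutes"])  # 25
```

The language codes in the usage line are placeholders; the skill itself validates language strings against its canonical list.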

Example Prompts

| Prompt | What happens |
| --- | --- |
| “Translate this video to Spanish: [URL]” | Single-language dub with voice clone + lip-sync. |
| “Dub this launch video into French, German, and Japanese.” | Batch translation, one job per language, delivered as each completes. |
| “Translate this podcast but keep my video.” | Audio-only mode; returns a translated MP3 track. |
| “Localize this for our enterprise launch — I want to review the subtitles first.” | Proofread workflow with editable SRT before final render. |

View the full SKILL.md

Includes the full proofreads playbook, language-locale guide, asset routing, and failure-mode decoder.