Skill: heygen-translate · Invoke: /heygen-translate [video_url_or_path] [--to language]
heygen-translate localizes a finished video into 175+ languages. The presenter keeps their face, their voice is cloned into the target language, and their lips re-sync to the new audio, so viewers see the same person speaking natively. You supply a source video and target languages; the engine handles transcription, translation, voice cloning, lip-sync, and optional burned-in captions.
This is not new-video generation. The performance, framing, and brand assets in the original are preserved — translation rides on top of what’s already there.
When to Use
- Localize an existing video — “translate this to Spanish”, “dub this into Japanese”, “I need this in 10 languages for a launch”.
- Same presenter, another language (not a new presenter — that’s heygen-video).
- Podcast / audio-only translation — “translate this podcast, keep my video”.
- High-stakes content where you want to review/edit subtitles before final render (the proofreads workflow).
Not for: creating new videos from scratch (use heygen-video and write the script in the target language), avatar creation (use heygen-avatar), or text-only translation. Use this skill only when a source video already exists.
Workflow
Four phases. Discovery is the only place the skill asks questions; the rest is silent until delivery.
Discovery
Gathers the minimum: source video (URL, file path, or HeyGen asset_id) and target language(s) — asked open-ended, no picker. Confirms speaker count, content type, captions, and duration flexibility with smart defaults.
Pre-flight (silent)
Validates the language string against the canonical list, routes the source (public URL passes through; local files upload, max 32 MB), and classifies the content into one of five profiles to set the right flags.
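The source-routing rule above can be sketched as a small function. This is an illustration under stated assumptions, not the skill's actual code: `route_source` and `MAX_UPLOAD_BYTES` are hypothetical names, the 32 MB cap is read as 32 × 1024 × 1024 bytes, and HeyGen asset_id sources are left out for brevity.

```python
import os
from typing import Optional
from urllib.parse import urlparse

# Assumption: "32 MB" is interpreted as 32 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 32 * 1024 * 1024


def route_source(source: str, size_bytes: Optional[int] = None) -> str:
    """Hypothetical helper: decide how a source video reaches the engine.

    Returns "passthrough" for an http(s) URL and "upload" for a local file
    within the size cap. Raises ValueError for an oversized local file.
    """
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        return "passthrough"

    # Anything else is treated as a local path; stat it if no size was given.
    if size_bytes is None:
        size_bytes = os.path.getsize(source)
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError(
            f"{source} is {size_bytes} bytes; local uploads are capped at {MAX_UPLOAD_BYTES}"
        )
    return "upload"
```

Failing on an oversized file at this stage matches the intent of Pre-flight: the problem surfaces before any job is submitted.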
Submit + Poll (silent)
Submits one job per language (batchable), then background-polls — 30s for the first 3 min, then 60s. Most finish in 5–15 min; some take 30+. Surfaces only on completion, hard failure, or a stall.
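The polling cadence described above (30s for the first 3 minutes, then 60s) might look like the following. The function names are hypothetical and the sketch only computes when polls would fire; it does not call any HeyGen endpoint.

```python
from typing import List


def poll_interval(elapsed_seconds: float) -> int:
    """Seconds to wait before the next poll: 30s during the first 3 min, 60s after."""
    return 30 if elapsed_seconds < 180 else 60


def poll_times(max_wait_seconds: int) -> List[int]:
    """Elapsed-time marks (in seconds) at which a poll fires, up to max_wait_seconds."""
    times: List[int] = []
    elapsed = 0
    while True:
        elapsed += poll_interval(elapsed)
        if elapsed > max_wait_seconds:
            break
        times.append(elapsed)
    return times


print(poll_times(300))  # [30, 60, 90, 120, 150, 180, 240, 300]
```

A 15-minute job under this schedule costs 18 polls: six in the first 3 minutes and twelve in the remaining 12.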
Content Profiles
The skill picks one profile from the source and sets flags accordingly; you rarely need to specify one:
| Profile | Use when | Key flags |
|---|---|---|
| Talking head (default) | One person to camera, clean audio | precision, speech enhancement, captions, dynamic duration |
| Podcast / audio-only | Static or absent visual | translate_audio_only, speech enhancement |
| Music / high-soundtrack | Background music interferes with speech | disable_music_track, speech enhancement |
| Multi-speaker | Two+ distinct speakers | Talking-head defaults + speaker_num (required — don’t guess) |
| Corporate / branded | Brand voice, glossary discipline | Talking-head defaults + brand_voice_id; consider proofreads |
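One way to picture the table is as a lookup from profile to flag set. The sketch below is an assumption-heavy illustration: `translate_audio_only`, `disable_music_track`, `speaker_num`, and `brand_voice_id` come from the table, while the profile keys and the names `precision`, `speech_enhancement`, `captions`, and `dynamic_duration` are hypothetical stand-ins for the descriptive labels and may not match the real request fields.

```python
from typing import Any, Dict, Optional

# Hypothetical flag names for the talking-head defaults listed in the table.
TALKING_HEAD_DEFAULTS: Dict[str, Any] = {
    "precision": True,
    "speech_enhancement": True,
    "captions": True,
    "dynamic_duration": True,
}


def flags_for(
    profile: str,
    speaker_num: Optional[int] = None,
    brand_voice_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: map a content profile to its flag set."""
    if profile == "talking_head":
        return dict(TALKING_HEAD_DEFAULTS)
    if profile == "podcast":
        return {"translate_audio_only": True, "speech_enhancement": True}
    if profile == "music":
        return {"disable_music_track": True, "speech_enhancement": True}
    if profile == "multi_speaker":
        # The table marks speaker_num as required, so refuse to guess it.
        if speaker_num is None or speaker_num < 2:
            raise ValueError("multi_speaker requires an explicit speaker_num >= 2")
        return {**TALKING_HEAD_DEFAULTS, "speaker_num": speaker_num}
    if profile == "corporate":
        flags = dict(TALKING_HEAD_DEFAULTS)
        if brand_voice_id is not None:
            flags["brand_voice_id"] = brand_voice_id
        return flags
    raise ValueError(f"unknown profile: {profile}")
```

Raising on a missing `speaker_num` mirrors the "required, don't guess" note: the count should come from the user during Discovery rather than from a default.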
Proofread Workflow
For high-stakes content, run a proofread session so the user can review and edit translated subtitles before the engine commits to a final render. Default ON for long videos (>3 min), corporate/branded content, high-stakes legal/medical/educational material, and languages the user reads natively.
What to Know Before Translating
Source quality is the ceiling
Translation can’t improve on the source. Muffled audio, fast cuts, heavy face occlusion, or low resolution all degrade lip-sync and voice quality. Warn the user upfront in Pre-flight — never after a bad result. Lip-sync is best on stable, front-facing, well-lit shots ≥720p with minimal cuts.
Locale pairs compress and expand
en→zh, en→ja, en→ko run ~30% shorter; de→en, ja→en run longer; en→ar/he expand. Dynamic duration matters most for these — without it, en→zh sounds artificially slow. Regional variants matter too: Spanish (Spain) ≠ Spanish (Mexico); “Chinese (Mandarin, Simplified)” ≠ Cantonese ≠ Taiwanese Mandarin.
Formality and RTL
Register-heavy languages (Japanese keigo, Korean honorifics, German Sie/du, Thai) default to neutral-formal — proofread if you need to match a conversational source. RTL languages (Arabic, Hebrew, Urdu, Persian) render captions right-to-left and can collide with source lower-thirds; consider audio-only or caption styling review.
Cost & time
Translations bill by source duration. A 5-min video × 5 languages = ~25 billable minutes, each rendering in 10–20 min. Surface source-minutes × language-count and an honest render-time range — don’t quote dollar figures (they vary by plan).
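The billing arithmetic above reduces to one multiplication. The helper names below are hypothetical; the sketch reports billable minutes only and, following the guidance above, quotes no prices.

```python
from typing import List


def billable_minutes(source_minutes: float, language_count: int) -> float:
    """Billable minutes = source duration x number of target languages."""
    return source_minutes * language_count


def cost_summary(source_minutes: float, languages: List[str]) -> str:
    """One-line summary suitable for surfacing to the user before submission."""
    total = billable_minutes(source_minutes, len(languages))
    return (
        f"{source_minutes:g} source min x {len(languages)} languages "
        f"= ~{total:g} billable min"
    )


print(cost_summary(5, ["Spanish", "French", "German", "Japanese", "Portuguese"]))
# 5 source min x 5 languages = ~25 billable min
```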
Example Prompts
| Prompt | What happens |
|---|---|
| “Translate this video to Spanish: [URL]” | Single-language dub with voice clone + lip-sync. |
| “Dub this launch video into French, German, and Japanese.” | Batch translation, one job per language, delivered as each completes. |
| “Translate this podcast but keep my video.” | Audio-only mode: returns a translated MP3 track. |
| “Localize this for our enterprise launch. I want to review the subtitles first.” | Proofread workflow with editable SRT before final render. |
View the full SKILL.md
Includes the full proofreads playbook, language-locale guide, asset routing, and failure-mode decoder.

