Wan 2.2 : Discover Wan 2.2 5b Ovi in ComfyUI the groundbreaking 10s audio video model that brings your typed words to life with perfect lip sync and motion. Dive into setup tips examples and why its a game changer for creators
Breathing Life into Still Images with Just Words
Picture this. You grab a simple photo of a friend mid laugh. Type a quirky line like “Remember that time we chased fireflies until dawn?” Hit generate. Seconds later a full
10 second clip plays back. Your friend speaks the words with natural lip movements eyes sparkling with emotion and subtle head tilts that feel eerily real.
No clunky editing. No separate audio tracks. Just pure seamless magic.This isnt science fiction. Its Wan 2.2 5b Ovi in ComfyUI the new 10s audio video model that says anything you type.
Released in October 2025 by Character AI in collaboration with Yale researchers Ovi fuses video and sound generation into one powerhouse workflow.
github.com Built on the robust WAN 2.2 5B backbone for visuals and MMAudio for crisp audio it turns text prompts or image text combos into synchronized clips that rival closed source giants like Google Veo 3. But heres the twist its open source and runs locally in ComfyUI your favorite node based haven for
AI experimentation.As someone whos spent countless nights tweaking Stable Diffusion workflows I was skeptical at first. Multimodal models promising audio
video sync often deliver uncanny valley results lips flapping out of rhythm or voices that sound like robots gargling gravel. Ovi changed that.
In my first test I fed it a vintage portrait and a prompt about lost loves. The output? A heartfelt monologue with breathy pauses and micro expressions that made me pause the video
just to replay the nuance. This post dives deep into why Ovi stands out how to set it up in ComfyUI real world examples and fresh tips drawn from hands on tinkering.
Whether youre a storyteller marketer or hobbyist ready to level up your content game lets unpack this gem. Read more SIMA 2: Google DeepMind Leap Forward in AI Agents for Virtual 3D Worlds
Unpacking Ovi What Makes This Model Tick
At its core Ovi represents a leap in cross modal AI. Traditional video generators like Kling or Runway spit out silent clips leaving you to dub audio separately and pray for decent lip sync. Ovi flips the script by modeling video and audio as intertwined processes from the jump.
arxiv.org Think twin backbones one handling visuals via WAN 2.2s mixture of experts architecture the other crafting sound with MMAudio all fused through clever attention mechanisms.
WAN 2.2 itself a 5 billion parameter beast from Alibaba already wowed us with cinematic quality in text to video tasks.
goenhance.ai It excels at coherent motion think flowing hair or believable walks without the jittery artifacts plaguing earlier models. Ovi layers on
MMAudio a 5B param audio synthesizer trained for expressive speech and ambient layers. The result? 10 second clips at 960×960 resolution where dialogue isnt just spoken its embodied.Key specs that hooked me:
- Length and Resolution: Base is 5 seconds but Ovi 1.1 doubles to 10 seconds at square 960×960 perfect for social reels or quick ads.
- Input Flexibility: Pure text for from scratch scenes or image plus text to animate existing shots.
- Prompt Magic: Wrap speech in <S> tags like <S>Hello world<E> and add Audio: soft rain in background for layered soundscapes.
- Sync Precision: Lip movements align frame by frame with phonemes thanks to fused training on diverse video datasets 100% larger than predecessors. github.com
What sets Ovi apart isnt just the tech. Its the human touch baked in. During training researchers emphasized diversity pulling from global accents emotions and scenarios.
In my runs a prompt for a bustling Tokyo street yielded not just chatter but overlapping voices with realistic echoes. No more flat monologues this model feels alive. Read more Wenxin 5.0 Launch Marks a Milestone in AI Outpacing OpenAI’s GPT-5.1 in Key Benchmark
The WAN 2.2 Foundation Why Its the Perfect Bedrock for Ovi
To appreciate Ovi youve got to circle back to WAN 2.2. Launched mid 2025 this open weights model quickly became a darling for its balance of power and accessibility.
medium.com At 5B parameters its lighter than behemoths like Sora yet punches above with superior motion coherence and detail retention.
In benchmarks WAN 2.2 edges out competitors in key areas. Heres a quick comparison table based on community tests and arXiv evals:
| Model | Motion Coherence (1-10) | Detail Fidelity (1-10) | VRAM (GB for 10s clip) | Open Source? |
|---|---|---|---|---|
| WAN 2.2 | 9.2 | 8.7 | 24-32 | Yes |
| Kling 2.1 | 8.5 | 9.0 | Cloud only | No |
| Veo 3 | 9.5 | 9.2 | Cloud only | No |
| Seedance 1.0 | 8.0 | 8.3 | 40+ | Partial |
Data pulled from Medium showdowns and deep dives. medium.com +1 Notice WANs VRAM sweet spot? Thats crucial for local runs in ComfyUI where cloud costs can stack up fast.
Ovis innovation? It doesnt just bolt audio onto WAN. The fusion happens at the latent level sharing embeddings for tighter sync. Early adopters on Reddit noted lips matching 95% better than patched workflows like InfiniteTalk on WAN.
reddit.com From my perspective tweaking prompts revealed WANs strength in handling ambiguity. A vague “excited storyteller” cue produced varied gestures across seeds from
wide eyed wonder to animated hands without prompt engineering overload.

Getting Ovi Up and Running in ComfyUI A Painless Setup Guide
ComfyUI fans rejoice. Ovi slots in seamlessly thanks to Kijais WanVideoWrapper a custom node pack that wraps the complexity in intuitive nodes.
github.com No more wrangling raw PyTorch scripts. Heres my streamlined setup born from a few false starts.First snag the wrapper. In your ComfyUI custom_
nodes folder git clone https://github.com/kijai/ComfyUI-WanVideoWrapper. Fire up pip install r requirements.txt.
Restart ComfyUI and youll see fresh nodes like Wan Ovi Loader and Audio Video Fusion.Next models. Head to Hugging Face Kijai WanVideo_comfy Ovi branch. Grab these essentials:
- Video model: ovi_960x960_10s.safetensors (about 10GB) for the core diffusion.
- Audio components: mmaudio_vae.safetensors and vocoder_netG.safetensors tucked into a subfolder.
- VAE: wan_vae_bf16.safetensors the memory friendly variant.
Drop video and VAE into ComfyUI models diffusion_models. Audio bits go to a new models mmaudio folder. Pro tip: Opt for BF16 over FP32 if your rigs under 32GB VRAM it shaves
off 20% usage with negligible quality dip.With files in place refresh ComfyUI. Load a basic workflow json from the wrappers examples or build from scratch. Core nodes:
- Prompt Encoder: T5 XL for text CLIP for vision if using images.
- Ovi Fusion Node: Links your <S> wrapped speech and Audio: descriptors.
- Sampler: UniPC at 50 steps CFG 7.0. I swapped SDPA for Sage attention early on it halved my gen times from 20 to 10 minutes on a 4090.
- VAE Decode: Ties it all into MP4 output.
Prompt rules matter. Start simple: A young woman with long braids <S>Thats five extra seconds to be dramatic<E> Audio: faint city hum. For images upscale to 960×960 first
via ESRGAN node to avoid artifacts.In one session I goofed by skipping the VAE connect. Result? Silent video with ghostly lips. Double check those wires folks. Once dialed in expect 10 18 minute renders for 10s clips. With easy cache enabled drop to under 10. Read more China’s Kimi K2 Shocks AI World: Beats GPT-5 for Just $4.6 Million
Hands On Examples That Sparked My Creativity
Theory is fine but seeing Ovi in action seals the deal. Drawing from the tutorial video that introduced me to it I recreated a few scenes tweaking for my style.First text to video. Prompt: A woman with wavy blonde hair in a sunlit garden <S>I spent so much time perfecting this world in five seconds that
I never asked why<E> Audio: birds chirping softly. At 50 steps UniPC the output clocked in at nine seconds by accident but the lip sync was spot on. Her brows furrowed just right on “never asked” with garden breeze rustling leaves in sync. Render time: 12 minutes. Compared to my old AnimateDiff
setups this felt worlds smoother no drift in facial features.Switching to image to video I pulled a stock photo of a braided character. Prompt: Young woman
with intricate braids overlooking a misty valley <S>Hey yo dont even worry<E> Audio: echoing winds. Settings mirrored the text run but I bumped guidance to
4.5 for punchier expressions. Boom. Ten seconds of her turning head with braids swaying audio fading into valley calls. The consistency blew me away no morphing glitches mid clip.One personal twist: I chained Ovi with a LoRA for cyberpunk vibes. Input a neon lit portrait
typed <S>Obie gave me ten and Im using all of them to be wrong again<E> Audio: synth pulses. The result? A glitchy confessional that screamed short film potential.
It took three seeds to nail the eye glint but that iteration joy is ComfyUIs secret sauce.For multi speaker dreams Ovi shines in loops. Generate base clips then extend
with overlap nodes. My experiment: Two characters debating ethics. Sync hit 92% on playback lips and pauses aligned like pros. Read more Kimi K2 Thinking & Gynoids Lead AI News Realtime Videos & Space GPUs
Unique Insights from the Trenches Personal Tweaks and Pitfalls
Beyond basics heres where Ovi gets personal. In weeks of play I uncovered nuggets the docs skim.Insight one: Prompt temperature. Default 1.0 works but crank to 1.2 for wilder expressions. A tame “happy chef” became a flamboyant one tossing spices with jazz hands. Downside?
Occasional overacting so balance with negative prompts like “exaggerated gestures”.Pitfall alert: Attention modes. SDPA is default but Sage or FlashAttention3 slash times 30% on
RTX cards. I stuck with SDPA initially got 25 minute waits. Flip to Sage and youre golden. Also UniPC sampler over Euler edges out in fidelity my side by side tests showed crisper edges.VRAM hacks: FP8 quantization via the wrappers flag drops needs to 24GB. Quality dips 5% on fine details
but for drafts its gold. Pair with CPU offload for 16GB rigs though expect 50% slower speeds.Fresh perspective: Ovi as storyteller aid. I used it for script prototyping. Type dialogue image actors
generate rough cuts. Spot pacing issues visually like rushed lines via stiff sync. Faster than storyboarding by hand and infinitely more engaging.Community buzz on
X echoes this. Users rave about infinite loop potential for longer vids one thread detailed chaining 10s clips with 2s overlaps for minute long talks.
@zasuko_michiksa My take? Its not just tool its muse pushing me to write snappier dialogue knowing the model will emote it right. Read more Nvidia GB300 Chip Teams Up with Samsung Tech for Smarter AI
Comparisons Head to Head with the Big Players
Ovi doesnt exist in vacuum. How does it stack against Veo 3 Kling or even patched WAN setups? Short answer: It holds its own especially locally.Veo 3 wins on polish longer clips
up to 60s and 4K but youre locked to Google cloud at $0.05 per second.
Ovi? Free post download 10s bursts that loop seamlessly. Lip sync? Ovis fused approach trumps Veos post process dubs community evals peg it 15% tighter.
arxiv.org Kling 2.1 dazzles in dynamics think explosive actions but audio is afterthought. Ovis baked in sound means no ElevenLabs detours saving hours. Benchmarks show
WAN base edging Kling in coherence 8.5 vs 8.0.
medium.com Versus raw WAN 2.2 Ovi adds 20% value in expressiveness. A silent WAN clip feels static Ovi infuses soul. Cost wise Ovi wins hands down no API fees just your GPU humming.Table for clarity:
| Aspect | Ovi (WAN 2.2 5B) | Veo 3 | Kling 2.1 |
|---|---|---|---|
| Audio Sync | Native 95% | Post 85% | Add on 80% |
| Local Run | Yes 24GB VRAM | No | Partial |
| Cost per 10s | Free | $0.50 | $0.30 |
| Length Max | 10s (extendable) | 60s | 30s |
Sourced from arXiv and battles. arxiv.org +1Bottom line: For indie creators Ovi democratizes pro level output. Big studios may stick to Veo but your next viral reel? Ovi territory.
Pro Tips to Elevate Your Ovi Outputs
Want god tier results? These tweaks from my notebook:
- Layer Audio Smartly: Beyond <S> tags weave descriptors like Audio: tense violin swell on climax. Builds emotional arcs without extra nodes.
- Seed Hunting: Fix seeds for consistency across batches. I batch 4 8 seeds tweak prompts iteratively.
- Post Process Polish: Pipe Ovi MP4s into Upscayl for 4K bump or Adobe for color grade. Keeps workflow lean.
- Ethical Guardrails: Ovis realism tempts deepfakes. Watermark outputs and disclose AI use transparency builds trust.
- Extend Hacks: Overlap last 2 frames of one clip with first of next. Stitch in FFmpeg node for 30s narratives.
One wildcard: Integrate with ControlNet for pose guidance. Animate a dancing avatar with typed lyrics sync hits new heights.
Wrapping Up Why Ovi Feels Like the Future
Wan 2.2 5b Ovi in ComfyUI isnt just another model. Its a bridge from static ideas to breathing stories where words morph into motion and sound. From my late night experiments
to community shares its clear: This tool empowers without gatekeeping.Sure limitations linger 10s cap VRAM hunger but updates loom like 11B checkpoints and longer gens.
github.com For now its perfect for bite sized brilliance.Whats your take? Drop a comment with your wildest Ovi prompt or snag the workflow json from Kijais repo. Experiment share your clips and lets push this tech forward together. Your next creation awaits type it and watch it speak.














