Chapter 08 / 10

AI Video and Animation Prompting: The 5-Layer Prompt Stack That Stops the Scroll

HookAds Team · 10 min read

If you don't know the specific prompting techniques for a model like Seedance 2.0, you'll generate slop every single time, regardless of how creative your idea is or how much you're paying per generation.

The model has its own language for camera, lighting, motion, and constraints. Typing normal English descriptions into the prompt box is like speaking French to someone who only understands Japanese. You get an output, but it's the average of everything the model has seen, which is exactly the generic look that gets scrolled past.

The tools from the last chapter are only as good as the prompts you feed them. This chapter is the language: the exact 5-layer structure that turns a $0.60 generation into something that stops a scroll, the keyword library, the constraints that actually work, plus the animation and persona-building workflows that sit on top of it.

What you're actually working with

A modern video model like Seedance 2.0 is a multimodal film set, not a text-to-video box. In a single generation you can feed it up to 9 reference images (character sheets, mood boards, product photos), up to 3 video clips (camera motion, choreography, pacing), up to 3 audio tracks (voiceover, music, sound effects), plus your text prompt. The per-type limits add up to 15, but the combined ceiling is 12 reference files processed at once, so you can't max out every category in the same generation. The model generates synchronized video and audio in a single pass, with lip-synced speech across multiple languages, at 4 to 15 seconds and up to 1080p.

If you're only typing text into the box, you're using about 15% of the tool while paying the same price as someone using all of it.

The 5-layer prompt stack

Community testing compressed the official formula into five layers that consistently beat longer, looser prompts. The order matters.

Subject > Action > Camera > Style > Constraints

Subject first pins the model to a center of gravity so it doesn't split attention. Action second gives the kinetic anchor. Camera third locks framing before the model re-decides the lens every few seconds. Style late adds flavor without hijacking motion. Constraints last close whatever gaps the other layers left open.

Layer 1, Subject. Specificity is load-bearing. "A woman" is bad. "A woman in her late 20s, tight dark curls at ear length, small silver hoop in left ear, fitted black turtleneck, neutral expression" is far better. Every identity marker you provide is one the model doesn't hallucinate. One subject per generation is safest. Two work if you separate them spatially and tag them as @Character_A and @Character_B. Three or more is where it falls apart.

Layer 2, Action. One primary movement, present tense, written as a direction not a state. "She looks happy and is enjoying the sunset" gives the model a photograph to approximate. "She slowly turns toward the camera, breeze lifting the hem of her skirt, eyes narrowing against the light" gives it a sequence to execute. The rule almost nobody follows: separate subject movement from camera movement every time. "Spinning camera around a dancing person" confuses the model about who spins. "The dancer spins slowly, camera holds fixed framing" splits it into two clear directives and kills most of the shaky output people blame on the model.

Layer 3, Camera. One primary camera movement per generation. Describe rhythm (slow, smooth, gentle) rather than technical specs, because the model responds to descriptive language, not f-stops and ISO numbers.

Layer 4, Style. Lighting has the single biggest impact on video quality, bigger than style adjectives or resolution requests. If you add only one thing to a weak prompt, add a lighting description.

Layer 5, Constraints. The guardrails that separate AI-looking video from video that passes.
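To make the ordering concrete, here is a minimal Python sketch that assembles a prompt from the five layers in the fixed order. The `PromptStack` class and its `build` method are hypothetical helpers invented for illustration, not part of any model's API, and joining the layers with periods is an assumption about formatting rather than a documented requirement.

```python
from dataclasses import dataclass

# Layer order as given in the text: Subject > Action > Camera > Style > Constraints.
LAYER_ORDER = ("subject", "action", "camera", "style", "constraints")


@dataclass
class PromptStack:
    """Hypothetical container for the five layers (illustration only)."""
    subject: str
    action: str
    camera: str
    style: str
    constraints: str

    def build(self) -> str:
        # Join the layers in the fixed order, skipping any that were left empty.
        parts = [getattr(self, name).strip() for name in LAYER_ORDER]
        return ". ".join(p.rstrip(".") for p in parts if p) + "."


stack = PromptStack(
    subject="A woman in her late 20s, tight dark curls at ear length, fitted black turtleneck",
    action="She slowly turns toward the camera, eyes narrowing against the light",
    camera="Slow push-in, smooth and controlled",
    style="Golden hour, soft key from 45 degrees, cinematic film tone, 35mm",
    constraints="Avoid jitter, avoid bent limbs, maintain face consistency",
)
prompt = stack.build()
print(prompt)
```

Keeping the layers as separate fields also makes it easy to swap one layer at a time when you iterate.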

The keyword library

Camera movements: fixed / locked-off (no movement), push-in / dolly in (tension, emotional close-ups), pull-out / dolly out (reveals, context), pan left/right (scanning), tracking / follow (action), orbit / arc / 360 (product showcases, hero moments), aerial / drone (landscapes), handheld (documentary, UGC authenticity), crane up/down (dramatic height), gimbal (smooth cinematic), steadicam walk (following a character), whip pan (urgency, transitions), dolly zoom (the vertigo effect), rack focus (redirecting attention).

Speed modifiers: imperceptible / barely (almost unnoticeable), slow / gentle / gradual (the safest default), smooth / controlled (natural), dynamic / swift (high impact, use with caution). The word "fast" is the single most dangerous keyword: fast camera plus fast subject plus busy scene almost guarantees jitter. Make only one element fast and hold everything else steady. For compound movement, sequence it rather than stacking: "start: slow dolly-in, then: gentle pan right for the final 2 seconds."

Lighting (the highest-impact layer): golden hour (the single highest quality-per-word improvement), rim light / dramatic rim light (cinematic edge separation), soft key from 45 degrees (flattering talking-head light), overcast daylight (eliminates flicker), backlit silhouette at sunset (mood), motivated lighting from a practical source (realism), volumetric fog (atmospheric depth), chiaroscuro (high contrast).

Color grading: teal and orange (classic Hollywood), bleach bypass (gritty), warm/amber (nostalgic), crushed blacks (deep shadow), pastel (soft fashion or anime).

Film references as style anchors: "cinematic film tone, 35mm" (the most reliable all-purpose anchor), "16mm film, handheld" (raw indie), "anamorphic lens flare" (widescreen), "documentary-style handheld framing" (observational realism). "Cinematic" alone produces nothing predictable. Pair it with texture, lighting, or a film reference.

Constraints to append to every character prompt: avoid jitter, avoid bent limbs (use in every character prompt without exception), avoid identity drift, avoid temporal flicker, no distortion, maintain face consistency. A reliable quality suffix: "sharp clarity, natural colors, stable picture, no blur, no ghosting, no flickering." The model reads plain-language constraint statements written into the prompt ("avoid X," "maintain Y") more reliably than negative-prompt syntax.

Keywords that actively degrade output: "fast" unqualified (accelerates everything), "cinematic" alone (too vague), "epic" (no visual meaning), "amazing / beautiful / stunning" (feelings, not instructions), "lots of movement" (triggers jitter), "glow / glimmer / glints" (create specular flicker, use "steady intensity" or "diffuse" instead). The principle underneath all of these: if a word describes how the viewer should feel rather than what the camera should see, the model has to guess, and it guesses wrong.
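The degrading keywords above are easy to check mechanically before you spend a generation. The sketch below is a rough Python linter for that list. The function name `lint_prompt` is hypothetical, and the `CINEMATIC_ANCHORS` tuple is an assumption about what counts as pairing "cinematic" with texture, lighting, or a film reference; adjust it to your own keyword library.

```python
import re

# Flagged terms and the reason the text gives for each.
FLAGGED = {
    "fast": "accelerates everything; make only one element fast",
    "epic": "no visual meaning",
    "amazing": "a feeling, not an instruction",
    "beautiful": "a feeling, not an instruction",
    "stunning": "a feeling, not an instruction",
    "lots of movement": "triggers jitter",
    "glow": "specular flicker; try 'steady intensity' or 'diffuse'",
    "glimmer": "specular flicker; try 'steady intensity' or 'diffuse'",
    "glints": "specular flicker; try 'steady intensity' or 'diffuse'",
}

# ASSUMPTION: terms that count as an anchor for the word "cinematic".
CINEMATIC_ANCHORS = ("35mm", "16mm", "film", "anamorphic", "rim light", "golden hour")


def lint_prompt(prompt: str) -> list[str]:
    """Return one warning per flagged keyword found in the prompt."""
    text = prompt.lower()
    warnings = []
    for term, reason in FLAGGED.items():
        # Whole-word match so "breakfast" does not trigger "fast".
        if re.search(rf"\b{re.escape(term)}\b", text):
            warnings.append(f"'{term}': {reason}")
    if re.search(r"\bcinematic\b", text) and not any(a in text for a in CINEMATIC_ANCHORS):
        warnings.append("'cinematic': too vague alone; pair it with texture, lighting, or a film reference")
    return warnings


for warning in lint_prompt("Epic cinematic shot, fast camera, lots of movement"):
    print(warning)
```

A flagged "fast" is a prompt for review, not an automatic failure: the text allows one fast element as long as everything else holds steady.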

Time-coded multi-shot prompting

You can direct individual shots inside a single 15-second generation by writing timestamps into the prompt. This is where the model becomes something genuinely different from every other tool.

[0-4s]: wide establishing shot, static camera, misty bamboo forest at dawn, golden hour light filtering through leaves
[4-9s]: medium shot, slow push-in, the fighter steps forward, white silk kimono billowing, determined expression
[9-15s]: close-up, orbit shot, the fighter strikes, slow motion, impact visible in the fabric ripple

Each shot specifies camera position, subject action, and lighting state. Transition language between shots ("hard cut to," "seamless morph into") gives explicit cut instructions. The universal escalation pattern maps straight onto a 15-second window: wide, then tighter, then tight, then closest. Establish the world, build tension, approach the emotional peak, then land the reveal.
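If you write time-coded prompts often, a small helper can keep the timestamps contiguous and inside the clip length. This is a sketch only: `format_shots` is a hypothetical function, the bracketed `[start-ends]:` format simply mirrors the example above, and the 15-second ceiling is taken from the text.

```python
MAX_SECONDS = 15  # upper bound on clip length stated in the text


def format_shots(shots: list[tuple[int, int, str]]) -> str:
    """Render (start, end, description) tuples as time-coded prompt lines."""
    expected_start = 0
    lines = []
    for start, end, description in shots:
        # Each shot must begin exactly where the previous one ended.
        if start != expected_start:
            raise ValueError(f"shot starting at {start}s leaves a gap or overlap")
        if end <= start or end > MAX_SECONDS:
            raise ValueError(f"shot {start}-{end}s is outside the 0-{MAX_SECONDS}s window")
        lines.append(f"[{start}-{end}s]: {description}")
        expected_start = end
    return "\n".join(lines)


shot_prompt = format_shots([
    (0, 4, "wide establishing shot, static camera, misty bamboo forest at dawn"),
    (4, 9, "medium shot, slow push-in, the fighter steps forward"),
    (9, 15, "close-up, orbit shot, the fighter strikes, slow motion"),
])
print(shot_prompt)
```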

The @ reference system

Operators getting outputs that don't read as AI are uploading 6 to 12 reference files and tagging every one with a specific role. An image without an @ tag gets processed ambiguously, and ambiguity produces averaging, which is the visual equivalent of mush.

The most underused shortcut is the first-last frame technique: upload your desired first frame as @Image1 and your desired last frame as @Image2, describe what happens between them, and the model interpolates coherent motion connecting the two endpoints, no storyboarding needed.

A full multimodal prompt tags each file's job:

@Image1 as character reference (maintain exact facial features and outfit)
@Image2 as environment reference (match lighting and color palette)
@Video1 for camera motion reference (replicate the slow orbit movement)
@Audio1 as background music (sync scene transitions to beat positions)
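A tagged block like this can be generated and validated in a few lines of Python. The sketch below is illustrative: `tag_references` is a hypothetical helper, the per-type limits are the ones quoted earlier in the chapter, and treating 12 as a combined cap across all file types is this sketch's reading of the "12 reference files" figure rather than a confirmed platform rule. It also writes every tag with "as", where the example above uses "for" on the video line.

```python
from collections import Counter

# Per-type limits quoted earlier in the chapter.
LIMITS = {"Image": 9, "Video": 3, "Audio": 3}
# ASSUMPTION: 12 is a combined cap across all reference files.
MAX_TOTAL = 12


def tag_references(refs: list[tuple[str, str]]) -> str:
    """Turn (kind, role) pairs into numbered @ tags, one per line."""
    if len(refs) > MAX_TOTAL:
        raise ValueError(f"{len(refs)} references exceeds the cap of {MAX_TOTAL}")
    counts = Counter()
    lines = []
    for kind, role in refs:
        if kind not in LIMITS:
            raise ValueError(f"unknown reference kind: {kind}")
        if not role.strip():
            # An untagged file is exactly the ambiguity the text warns about.
            raise ValueError("every reference needs a role")
        counts[kind] += 1
        if counts[kind] > LIMITS[kind]:
            raise ValueError(f"too many {kind} references (limit {LIMITS[kind]})")
        lines.append(f"@{kind}{counts[kind]} as {role}")
    return "\n".join(lines)


tags = tag_references([
    ("Image", "character reference (maintain exact facial features and outfit)"),
    ("Image", "environment reference (match lighting and color palette)"),
    ("Video", "camera motion reference (replicate the slow orbit movement)"),
    ("Audio", "background music (sync scene transitions to beat positions)"),
])
print(tags)
```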

The iteration rule

Generate 2 to 3 baseline options, then change one variable: the camera, or the lighting, or the speed modifier. Score each for continuity and adherence, keep the best, change one more variable. The instinct after a failed generation is to rewrite the entire prompt at once, which means you can never isolate what helped and what hurt. Controlled iteration with one variable per pass is slower per cycle but converges faster. It's the same reason A/B testing beats redesigns.
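One way to enforce the one-variable rule is to generate the variants programmatically, so no pass can change two things at once. The Python sketch below assumes you keep your prompt layers in a dictionary; `one_variable_variants` and the field names are hypothetical, chosen only for illustration.

```python
def one_variable_variants(baseline: dict[str, str], alternatives: dict[str, list[str]]):
    """Yield (changed_key, variant) pairs that differ from the baseline in exactly one field."""
    for key, options in alternatives.items():
        if key not in baseline:
            raise KeyError(key)
        for option in options:
            if option != baseline[key]:
                # Copy the baseline and override only this one field.
                yield key, {**baseline, key: option}


baseline = {
    "camera": "slow push-in",
    "lighting": "golden hour",
    "speed": "gentle",
}
alternatives = {
    "camera": ["fixed framing", "slow orbit"],
    "lighting": ["overcast daylight"],
}

variants = list(one_variable_variants(baseline, alternatives))
for changed, variant in variants:
    print(changed, variant)
```

Score each variant against the baseline, promote the winner to be the new baseline, and run the next pass.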

AI animation ads: a parallel workflow

Animation ads (claymation, Pixar-style, paper cutout, Lego, retro cartoon) have been scaling hard, and a model like Gemini Omni makes them cheap. Each animation video costs roughly 15 credits. On a 25,000-credit plan, that's about 1,666 videos a month, which works out to around 12 cents per fully generated video. No editor fees, no three-day turnarounds, no revision loops.
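The arithmetic behind those figures is worth checking against your own plan. In the Python sketch below, the credit numbers come from the text, but the monthly plan price is not stated anywhere in this chapter: the $200 value is an assumption, back-solved from the 12-cent figure, so substitute whatever you actually pay.

```python
CREDITS_PER_VIDEO = 15    # from the text
PLAN_CREDITS = 25_000     # from the text
PLAN_PRICE_USD = 200.00   # ASSUMPTION: not stated in the text; back-solved from the 12-cent figure

# Whole videos only: leftover credits can't buy a partial video.
videos_per_month = PLAN_CREDITS // CREDITS_PER_VIDEO
cost_per_video = PLAN_PRICE_USD / videos_per_month

print(videos_per_month)          # 1666
print(round(cost_per_video, 2))  # 0.12
```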

There are two methods. Method 1 is faster: feed an animation-ad structure into Claude, give it your brand details, and let it output scene-by-scene prompts that drop straight into the animation model. Generate each scene, then stitch with voiceover (ElevenLabs), music, and captions (CapCut). Ten to fifteen minutes per finished ad. Method 2 gives more control: generate each starting frame manually in a stylized image model (Nano Banana Pro), animate the first scene, then take the last frame of that clip and use it as the starting frame for the next scene. Chaining the last frame into the next is the technique most people don't know, and it's how you get long, continuous animation with no jarring cuts or style shifts, because each scene literally starts where the previous one ended.
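The last-frame chaining in Method 2 is a simple loop, sketched below in Python. None of this is a real API: `animate` and `last_frame_of` are hypothetical callables standing in for whatever your animation model and frame-extraction step actually are, and the stand-in lambdas in the usage example only trace the data flow.

```python
from typing import Callable


def chain_scenes(
    first_frame: str,
    scene_prompts: list[str],
    animate: Callable[[str, str], str],
    last_frame_of: Callable[[str], str],
) -> list[str]:
    """Animate scenes in order, seeding each one with the previous clip's last frame."""
    clips = []
    start_frame = first_frame
    for prompt in scene_prompts:
        clip = animate(start_frame, prompt)  # hypothetical call to the animation model
        clips.append(clip)
        start_frame = last_frame_of(clip)    # hypothetical frame-extraction step
    return clips


# Stand-ins that only show which frame feeds which clip.
clips = chain_scenes(
    "frame0.png",
    ["scene one", "scene two"],
    animate=lambda frame, prompt: f"clip({frame})",
    last_frame_of=lambda clip: f"last[{clip}]",
)
print(clips)
```

Because each clip is seeded by the previous one, the scenes have to be generated in sequence rather than in parallel.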

The whole appeal is that you can do any style. Just specify it clearly at the start and keep it consistent across every scene.

Building an AI persona

If you're running the same face across hundreds of ads, build a consistent persona once. The build is the same regardless of what you do with it.

The face. Generate the persona in a character-consistency tool (Higgsfield Soul ID or Nano Banana Pro). Aim for 30 to 90 source images across angles, expressions, and lighting, holding a 70/20/10 ratio: 70% on-brand looks, 20% expression variants, 10% wildcards.
The training set (optional). If you're going beyond about 100 posts, train a LoRA on the face to lock consistency across thousands of generations.
The voice. ElevenLabs voice cloning takes about 60 seconds of clean audio.
The motion. A video model (Seedance 2.0) animates the persona, combined with an image model for product placement and text rendering.
The script and automation. Claude for hook, body, and CTA: 10 hook variants per UGC ad, or DM and caption variants keyed to the platform. Overnight workflows can queue the whole batch while you're away, and you review the finished videos in the morning.

Total tool cost across all five steps: $50 to $150 a month. The build stays constant. What changes is what you post and where.
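For the face step, the 70/20/10 ratio translates into concrete image counts once you pick a set size. A minimal Python sketch follows; `split_source_images` is a hypothetical helper, and giving any rounding remainder to the on-brand bucket is a design choice of this sketch, not something the ratio itself dictates.

```python
def split_source_images(total: int) -> dict[str, int]:
    """Split a source-image budget into the 70/20/10 buckets from the text."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Integer arithmetic avoids floating-point rounding surprises.
    expressions = total * 20 // 100
    wildcards = total * 10 // 100
    # Any remainder goes to the on-brand bucket so the counts always sum to total.
    on_brand = total - expressions - wildcards
    return {"on_brand": on_brand, "expressions": expressions, "wildcards": wildcards}


print(split_source_images(60))  # {'on_brand': 42, 'expressions': 12, 'wildcards': 6}
```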

The checklist

Write every prompt in the 5-layer order: subject, action, camera, style, constraints
Make the subject specific — every identity marker you skip is one the model hallucinates
Separate subject movement from camera movement in every prompt to kill jitter
Lead your style layer with a lighting description — it's the highest-impact word you can add
Append the constraint suffix ("avoid jitter, avoid bent limbs, maintain face consistency, sharp clarity, stable picture") to every character generation
Cut the feeling-words — "epic," "cinematic" alone, "glow," "fast" unqualified all degrade output
Use time codes and @ references for multi-shot control, and chain the last frame into the next scene for continuous animation
Iterate one variable at a time — change the camera or the lighting, not the whole prompt

Next: [Organic Distribution Engines — How to Get Millions of Views With $0 in Ad Spend →](09-organic-distribution-engines-how-to-get-millions-of-views-with-0-in-ad-spend.md)
