Chapter 11: Text-to-Video Workflow with AI Prompt Generation¶

Video: Watch this chapter on YouTube (2:15:06)

Overview¶

This chapter extends the text-to-image workflow to video generation using Google's VO3 model through WaveSpeed AI. The workflow follows a similar pattern but with video-specific prompt engineering and longer processing times.

Detailed Summary¶

Differences from Text-to-Image¶

While structurally similar to text-to-image, video generation has key differences:

Video prompt engineering: Different considerations than image prompts
Longer processing times: Videos take more time to generate
Higher costs: VO3 Fast costs ~$3.20 per generation
Additional parameters: Duration, aspect ratio, audio generation

About Google VO3¶

VO3 (Video Object 3) is one of the highest-quality video generation models: - Produces high-fidelity videos with audio - Available in "Fast" and standard versions - Accessed through WaveSpeed AI or directly via Vertex AI - Premium pricing reflects quality

Workflow Architecture¶

Chat Trigger → Video Prompt Agent → WaveSpeed POST → Wait → WaveSpeed GET → If Loop → Gmail

Step 1: Pre-configured Trigger and Agent¶

In the labs, the first two nodes are pre-configured:

Chat Trigger: Same as before
Video Prompt Agent: Similar to image agent but video-focused

Video System Prompt Differences¶

The system prompt is modified for video:

You are an expert text-to-video generation prompt engineer working inside an n8n automation.

Your task is to generate clear, vivid, and effective prompts for video generation models.

Guidelines:
- Describe the scene, action, and movement
- Specify camera angles and motion (pan, zoom, tracking)
- Include temporal elements (what happens over time)
- Define style and mood
- Mention lighting and atmosphere
- Keep prompts detailed but focused
- Output ONLY the prompt

Step 2: WaveSpeed API Configuration for VO3¶

Selecting VO3 Model¶

Go to WaveSpeed → Explore Models
Search: "VO3"
Filter: "text-to-video"
Choose: "VO3 Fast" (faster, more affordable)

Note on pricing: VO3 is expensive (~$3.20 per request). Consider alternatives like Seedance or Kling for lower costs.

HTTP Request Setup¶

Add HTTP Request node
Rename: "WaveSpeed POST"
Copy cURL from VO3 API documentation
Import cURL

Authentication¶

Use the same WaveSpeed credential from previous chapter: - Generic Credential Type - Header Auth - Previously created "WaveSpeed Credential Demo"

Body Configuration Issue¶

When importing VO3 cURL, parameters may not map correctly:

Solution: Use raw JSON body

Change body type to JSON
Paste the JSON body directly from documentation
Replace the prompt with the expression from Video Prompt Agent

Example JSON Body¶

{
  "prompt": "{{ $node['Video Prompt Agent'].json.content }}",
  "aspect_ratio": "16:9",
  "duration": 8,
  "generate_audio": true
}

Note: Duration must be 8 seconds for text-to-video (5 seconds not supported).

Step 3: Test with Sample Prompt¶

Example user input: "Create a video of five gorillas on a boat having a great fishing trip"

Video Prompt Agent output: "A lively cinematic scene of five gorillas on a wooden fishing boat in the middle of a sunlit lake. Laughing, cheering as they reel in big thrashing fish, splashes of water in golden sunlight..."

Step 4: Wait Node¶

Add Wait node
Set: 15 seconds (minimum—may need longer)
Video generation takes more time than images

Important: Pin the POST data to save API costs during testing.

Step 5: GET Request for Video Result¶

Add HTTP Request node
Rename: "WaveSpeed GET"
Import GET cURL from documentation
Configure URL with dynamic request ID
Use same authentication credential
Toggle off manual headers

Step 6: If Loop¶

Same pattern as text-to-image:

Add If node
Condition: status equals completed
True branch: Continue to output
False branch: Wait 15 seconds → Loop back to GET

Video generation often requires multiple polling attempts due to longer processing.

Step 7: Gmail Output¶

Configure similarly to text-to-image: 1. Add Gmail node to True branch 2. Subject: "Video generated on [timestamp]" 3. Message: Drag video output URL 4. Disable n8n attribution

Testing the Complete Workflow¶

Execute entire workflow
Wait for video processing
Check email for video link
Download and review video

Sample Output Review¶

Generated video typically includes: - The scene described in the prompt - Motion and action specified - Audio (if enabled) - Style matching prompt description

Quality depends on prompt specificity—vague prompts produce vague results.

Alternative Video Models¶

VO3 is premium; consider alternatives on WaveSpeed:

Model	Cost	Quality	Speed
VO3	$$$	Highest	Moderate
VO3 Fast	$$	High	Fast
Seedance	$	Good	Fast
Kling	$	Good	Moderate
Wand	$	Good	Fast

Improving Video Quality¶

Tips for better results: 1. Be specific in prompts: Include style, mood, camera movement 2. Iterate on system prompts: Refine the prompt engineer instructions 3. Test different models: Each model has strengths 4. Consider aspect ratio: Match intended use (social media, presentations) 5. Enable audio thoughtfully: Not always needed

Key Takeaways¶

Similar architecture to images: Same pattern applies—POST, wait, GET, check, deliver.
Video prompts need motion: Include action, camera movement, temporal progression.
VO3 is premium pricing: Consider alternatives for cost-sensitive projects.
Longer wait times required: Videos take more processing time than images.
8-second minimum for VO3: Duration requirements vary by model.
Pin POST data religiously: Video generation costs add up quickly.
Raw JSON sometimes needed: When cURL import fails, paste JSON body directly.
Audio is optional: Enable only when needed.
If loops are essential: Video processing time varies significantly.
System prompts differ: Video needs temporal and motion-specific instructions.

Conclusion¶

Text-to-video generation extends the patterns learned in text-to-image with video-specific considerations. The higher costs and longer processing times make efficient workflow design more critical—pinning data and implementing proper polling loops prevent wasted resources. Google's VO3 represents cutting-edge video generation, but alternatives exist for budget-conscious projects. The prompt engineering aspect becomes even more important for video, where temporal elements and camera motion significantly impact quality. This workflow serves as a foundation for the image-to-video workflow in the next chapter, where an existing image becomes the starting point for video generation.