Chapter 11: Text-to-Video Workflow with AI Prompt Generation¶
Video: Watch this chapter on YouTube (2:15:06)
Overview¶
This chapter extends the text-to-image workflow to video generation using Google's VO3 model through WaveSpeed AI. The workflow follows a similar pattern but with video-specific prompt engineering and longer processing times.
Detailed Summary¶
Differences from Text-to-Image¶
While structurally similar to text-to-image, video generation has key differences:
- Video prompt engineering: Different considerations than image prompts
- Longer processing times: Videos take more time to generate
- Higher costs: VO3 Fast costs ~$3.20 per generation
- Additional parameters: Duration, aspect ratio, audio generation
About Google VO3¶
VO3 (Video Object 3) is one of the highest-quality video generation models: - Produces high-fidelity videos with audio - Available in "Fast" and standard versions - Accessed through WaveSpeed AI or directly via Vertex AI - Premium pricing reflects quality
Workflow Architecture¶
Step 1: Pre-configured Trigger and Agent¶
In the labs, the first two nodes are pre-configured:
- Chat Trigger: Same as before
- Video Prompt Agent: Similar to image agent but video-focused
Video System Prompt Differences¶
The system prompt is modified for video:
You are an expert text-to-video generation prompt engineer working inside an n8n automation.
Your task is to generate clear, vivid, and effective prompts for video generation models.
Guidelines:
- Describe the scene, action, and movement
- Specify camera angles and motion (pan, zoom, tracking)
- Include temporal elements (what happens over time)
- Define style and mood
- Mention lighting and atmosphere
- Keep prompts detailed but focused
- Output ONLY the prompt
Step 2: WaveSpeed API Configuration for VO3¶
Selecting VO3 Model¶
- Go to WaveSpeed → Explore Models
- Search: "VO3"
- Filter: "text-to-video"
- Choose: "VO3 Fast" (faster, more affordable)
Note on pricing: VO3 is expensive (~$3.20 per request). Consider alternatives like Seedance or Kling for lower costs.
HTTP Request Setup¶
- Add HTTP Request node
- Rename: "WaveSpeed POST"
- Copy cURL from VO3 API documentation
- Import cURL
Authentication¶
Use the same WaveSpeed credential from previous chapter: - Generic Credential Type - Header Auth - Previously created "WaveSpeed Credential Demo"
Body Configuration Issue¶
When importing VO3 cURL, parameters may not map correctly:
Solution: Use raw JSON body
- Change body type to JSON
- Paste the JSON body directly from documentation
- Replace the prompt with the expression from Video Prompt Agent
Example JSON Body¶
{
"prompt": "{{ $node['Video Prompt Agent'].json.content }}",
"aspect_ratio": "16:9",
"duration": 8,
"generate_audio": true
}
Note: Duration must be 8 seconds for text-to-video (5 seconds not supported).
Step 3: Test with Sample Prompt¶
Example user input: "Create a video of five gorillas on a boat having a great fishing trip"
Video Prompt Agent output: "A lively cinematic scene of five gorillas on a wooden fishing boat in the middle of a sunlit lake. Laughing, cheering as they reel in big thrashing fish, splashes of water in golden sunlight..."
Step 4: Wait Node¶
- Add Wait node
- Set: 15 seconds (minimum—may need longer)
- Video generation takes more time than images
Important: Pin the POST data to save API costs during testing.
Step 5: GET Request for Video Result¶
- Add HTTP Request node
- Rename: "WaveSpeed GET"
- Import GET cURL from documentation
- Configure URL with dynamic request ID
- Use same authentication credential
- Toggle off manual headers
Step 6: If Loop¶
Same pattern as text-to-image:
- Add If node
- Condition:
statusequalscompleted - True branch: Continue to output
- False branch: Wait 15 seconds → Loop back to GET
Video generation often requires multiple polling attempts due to longer processing.
Step 7: Gmail Output¶
Configure similarly to text-to-image: 1. Add Gmail node to True branch 2. Subject: "Video generated on [timestamp]" 3. Message: Drag video output URL 4. Disable n8n attribution
Testing the Complete Workflow¶
- Execute entire workflow
- Wait for video processing
- Check email for video link
- Download and review video
Sample Output Review¶
Generated video typically includes: - The scene described in the prompt - Motion and action specified - Audio (if enabled) - Style matching prompt description
Quality depends on prompt specificity—vague prompts produce vague results.
Alternative Video Models¶
VO3 is premium; consider alternatives on WaveSpeed:
| Model | Cost | Quality | Speed |
|---|---|---|---|
| VO3 | $$$ | Highest | Moderate |
| VO3 Fast | $$ | High | Fast |
| Seedance | $ | Good | Fast |
| Kling | $ | Good | Moderate |
| Wand | $ | Good | Fast |
Improving Video Quality¶
Tips for better results: 1. Be specific in prompts: Include style, mood, camera movement 2. Iterate on system prompts: Refine the prompt engineer instructions 3. Test different models: Each model has strengths 4. Consider aspect ratio: Match intended use (social media, presentations) 5. Enable audio thoughtfully: Not always needed
Key Takeaways¶
-
Similar architecture to images: Same pattern applies—POST, wait, GET, check, deliver.
-
Video prompts need motion: Include action, camera movement, temporal progression.
-
VO3 is premium pricing: Consider alternatives for cost-sensitive projects.
-
Longer wait times required: Videos take more processing time than images.
-
8-second minimum for VO3: Duration requirements vary by model.
-
Pin POST data religiously: Video generation costs add up quickly.
-
Raw JSON sometimes needed: When cURL import fails, paste JSON body directly.
-
Audio is optional: Enable only when needed.
-
If loops are essential: Video processing time varies significantly.
-
System prompts differ: Video needs temporal and motion-specific instructions.
Conclusion¶
Text-to-video generation extends the patterns learned in text-to-image with video-specific considerations. The higher costs and longer processing times make efficient workflow design more critical—pinning data and implementing proper polling loops prevent wasted resources. Google's VO3 represents cutting-edge video generation, but alternatives exist for budget-conscious projects. The prompt engineering aspect becomes even more important for video, where temporal elements and camera motion significantly impact quality. This workflow serves as a foundation for the image-to-video workflow in the next chapter, where an existing image becomes the starting point for video generation.