Skip to content

Chapter 11: Text-to-Video Workflow with AI Prompt Generation

Video: Watch this chapter on YouTube (2:15:06)

Overview

This chapter extends the text-to-image workflow to video generation using Google's VO3 model through WaveSpeed AI. The workflow follows a similar pattern but with video-specific prompt engineering and longer processing times.

Detailed Summary

Differences from Text-to-Image

While structurally similar to text-to-image, video generation has key differences:

  1. Video prompt engineering: Different considerations than image prompts
  2. Longer processing times: Videos take more time to generate
  3. Higher costs: VO3 Fast costs ~$3.20 per generation
  4. Additional parameters: Duration, aspect ratio, audio generation

About Google VO3

VO3 (Video Object 3) is one of the highest-quality video generation models: - Produces high-fidelity videos with audio - Available in "Fast" and standard versions - Accessed through WaveSpeed AI or directly via Vertex AI - Premium pricing reflects quality

Workflow Architecture

Chat Trigger → Video Prompt Agent → WaveSpeed POST → Wait → WaveSpeed GET → If Loop → Gmail

Step 1: Pre-configured Trigger and Agent

In the labs, the first two nodes are pre-configured:

  1. Chat Trigger: Same as before
  2. Video Prompt Agent: Similar to image agent but video-focused

Video System Prompt Differences

The system prompt is modified for video:

You are an expert text-to-video generation prompt engineer working inside an n8n automation.

Your task is to generate clear, vivid, and effective prompts for video generation models.

Guidelines:
- Describe the scene, action, and movement
- Specify camera angles and motion (pan, zoom, tracking)
- Include temporal elements (what happens over time)
- Define style and mood
- Mention lighting and atmosphere
- Keep prompts detailed but focused
- Output ONLY the prompt

Step 2: WaveSpeed API Configuration for VO3

Selecting VO3 Model

  1. Go to WaveSpeed → Explore Models
  2. Search: "VO3"
  3. Filter: "text-to-video"
  4. Choose: "VO3 Fast" (faster, more affordable)

Note on pricing: VO3 is expensive (~$3.20 per request). Consider alternatives like Seedance or Kling for lower costs.

HTTP Request Setup

  1. Add HTTP Request node
  2. Rename: "WaveSpeed POST"
  3. Copy cURL from VO3 API documentation
  4. Import cURL

Authentication

Use the same WaveSpeed credential from previous chapter: - Generic Credential Type - Header Auth - Previously created "WaveSpeed Credential Demo"

Body Configuration Issue

When importing VO3 cURL, parameters may not map correctly:

Solution: Use raw JSON body

  1. Change body type to JSON
  2. Paste the JSON body directly from documentation
  3. Replace the prompt with the expression from Video Prompt Agent

Example JSON Body

{
  "prompt": "{{ $node['Video Prompt Agent'].json.content }}",
  "aspect_ratio": "16:9",
  "duration": 8,
  "generate_audio": true
}

Note: Duration must be 8 seconds for text-to-video (5 seconds not supported).

Step 3: Test with Sample Prompt

Example user input: "Create a video of five gorillas on a boat having a great fishing trip"

Video Prompt Agent output: "A lively cinematic scene of five gorillas on a wooden fishing boat in the middle of a sunlit lake. Laughing, cheering as they reel in big thrashing fish, splashes of water in golden sunlight..."

Step 4: Wait Node

  1. Add Wait node
  2. Set: 15 seconds (minimum—may need longer)
  3. Video generation takes more time than images

Important: Pin the POST data to save API costs during testing.

Step 5: GET Request for Video Result

  1. Add HTTP Request node
  2. Rename: "WaveSpeed GET"
  3. Import GET cURL from documentation
  4. Configure URL with dynamic request ID
  5. Use same authentication credential
  6. Toggle off manual headers

Step 6: If Loop

Same pattern as text-to-image:

  1. Add If node
  2. Condition: status equals completed
  3. True branch: Continue to output
  4. False branch: Wait 15 seconds → Loop back to GET

Video generation often requires multiple polling attempts due to longer processing.

Step 7: Gmail Output

Configure similarly to text-to-image: 1. Add Gmail node to True branch 2. Subject: "Video generated on [timestamp]" 3. Message: Drag video output URL 4. Disable n8n attribution

Testing the Complete Workflow

  1. Execute entire workflow
  2. Wait for video processing
  3. Check email for video link
  4. Download and review video

Sample Output Review

Generated video typically includes: - The scene described in the prompt - Motion and action specified - Audio (if enabled) - Style matching prompt description

Quality depends on prompt specificity—vague prompts produce vague results.

Alternative Video Models

VO3 is premium; consider alternatives on WaveSpeed:

Model Cost Quality Speed
VO3 $$$ Highest Moderate
VO3 Fast $$ High Fast
Seedance $ Good Fast
Kling $ Good Moderate
Wand $ Good Fast

Improving Video Quality

Tips for better results: 1. Be specific in prompts: Include style, mood, camera movement 2. Iterate on system prompts: Refine the prompt engineer instructions 3. Test different models: Each model has strengths 4. Consider aspect ratio: Match intended use (social media, presentations) 5. Enable audio thoughtfully: Not always needed


Key Takeaways

  1. Similar architecture to images: Same pattern applies—POST, wait, GET, check, deliver.

  2. Video prompts need motion: Include action, camera movement, temporal progression.

  3. VO3 is premium pricing: Consider alternatives for cost-sensitive projects.

  4. Longer wait times required: Videos take more processing time than images.

  5. 8-second minimum for VO3: Duration requirements vary by model.

  6. Pin POST data religiously: Video generation costs add up quickly.

  7. Raw JSON sometimes needed: When cURL import fails, paste JSON body directly.

  8. Audio is optional: Enable only when needed.

  9. If loops are essential: Video processing time varies significantly.

  10. System prompts differ: Video needs temporal and motion-specific instructions.

Conclusion

Text-to-video generation extends the patterns learned in text-to-image with video-specific considerations. The higher costs and longer processing times make efficient workflow design more critical—pinning data and implementing proper polling loops prevent wasted resources. Google's VO3 represents cutting-edge video generation, but alternatives exist for budget-conscious projects. The prompt engineering aspect becomes even more important for video, where temporal elements and camera motion significantly impact quality. This workflow serves as a foundation for the image-to-video workflow in the next chapter, where an existing image becomes the starting point for video generation.