// AI Video Generation

1920×1080 video from a consumer GPU, zero cloud.

Local T2V/I2V pipeline on RTX 3070 Laptop 8GB — Sulphur 2 22B GGUF with LTX-2.3 joint AV denoising, LCM refine pass, and Real-ESRGAN upscaling. Three stages, designed around the 8 GB VRAM budget.

Sulphur v1 — Demo Clip

Sulphur v1 — 32-second local demo on RTX 3070 Laptop 8 GB. Sulphur 2 22B GGUF + LTX-2.3 joint AV denoising, three-stage pipeline (draft → LCM refine → Real-ESRGAN) to 1920×1080. Zero cloud compute.

Screen-reader description: short animated sequence showing anime-style characters and environments generated by the Sulphur 2 local AI video pipeline. The clip loops continuously at 1920×1080.

Figure: AI video generation pipeline overview. Six-stage pipeline: Raw Prompt feeds a Prompt Enhancer (Gemma 3 12B), which feeds Pass 1 Draft (Sulphur 2 22B, 512×288). Pass 1 feeds Pass 2 LCM Refine (LTXVUpsampler, 1024×576). Pass 2 feeds Real-ESRGAN upscaling to produce the final 1920×1080 MP4. An I2V feedback arrow runs from the Pass 1 last frame back to the next Pass 1 first-frame anchor for continuous narrative. Pass 1 also carries a Joint AV annotation indicating simultaneous video and audio latent generation.

Diagram labels:
Raw Prompt: user text, expanded into cinematic language
Prompt Enhancer: Gemma 3 12B Q3, llama.cpp, ~10 s
Pass 1 (Draft): Sulphur 2 22B GGUF, 512×288, 8-step, CFG=1.0, ~300 s, joint AV latent
Pass 2 (LCM Refine): LTXVUpsampler ×2, 1024×576, 4-step LCM, ~180 s, VRAM ~7.8 GB (saturated)
Upscale: Real-ESRGAN x2plus (RRDBNet), 1920×1080, ~150 s
Output: 1920×1080 MP4, H.264 + LTX joint AV
I2V chain: last frame → next first frame

Pipeline flow: a text prompt is expanded by Gemma 3 12B into cinematic language, then fed to Pass 1 Draft (Sulphur 2 22B GGUF, 512×288, LTX joint AV). Pass 2 LCM Refine upsamples to 1024×576 near the VRAM ceiling. Real-ESRGAN ×2 delivers the final 1920×1080 H.264 MP4. The I2V feedback arrow anchors continuous multi-segment video: each segment's last frame becomes the first-frame anchor of the next.
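The flow above can be written down as plain data, which makes the per-segment time budget and the I2V chaining rule easy to check. This is a minimal sketch rather than the project's actual code: the `Stage` class, the stage names, and the anchor labels are hypothetical, and the timings are the approximate figures from the diagram. It also assumes the prompt enhancer runs once per segment, which the page does not state.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Stage:
    name: str                            # hypothetical identifier, not from the project
    out_size: Optional[Tuple[int, int]]  # (width, height); None for text-only stages
    approx_seconds: int                  # rough timing quoted in the diagram


# Approximate figures taken from the pipeline overview.
STAGES = [
    Stage("prompt_enhancer", None, 10),
    Stage("pass1_draft", (512, 288), 300),
    Stage("pass2_lcm_refine", (1024, 576), 180),
    Stage("realesrgan_upscale", (1920, 1080), 150),
]


def segment_seconds(stages: List[Stage] = STAGES) -> int:
    """Sum the approximate stage timings for one segment."""
    return sum(s.approx_seconds for s in stages)


def chain_anchors(num_segments: int) -> List[Tuple[int, Optional[str]]]:
    """Plan I2V chaining: segment 0 has no anchor (pure T2V); every later
    segment is anchored on the previous segment's last frame."""
    plan = []
    anchor = None
    for i in range(num_segments):
        plan.append((i, anchor))
        anchor = f"segment_{i}_last_frame"  # placeholder label for the saved frame
    return plan


if __name__ == "__main__":
    print(f"{segment_seconds()} s per segment (~{segment_seconds() / 60:.1f} min)")
    for index, anchor in chain_anchors(3):
        print(index, anchor)
```

The stage timings sum to about 640 s, which is in line with the roughly 10 minutes per segment quoted under Key Numbers.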

Key Numbers

RTX 3070
Laptop 8 GB — consumer GPU, zero cloud
22B GGUF
Sulphur 2 Q3_K_M — ~12 GB quantized, CPU+GPU offload
3 stages
Draft / Refine / Upscale — designed around 8 GB VRAM
+10.43 dB
PSNR-Y gain after CFG alignment (21.79 → 32.22)
1920×1080
H.264 MP4 + LTX joint AV ambient audio
~10 min
Per segment, warm cache — full three-stage run

Technical Highlights

VRAM asymmetry forces the three-stage architecture

The 8 GB VRAM budget is consumed unevenly across the pipeline: Pass 1 uses ~5.8 GB (~2 GB headroom), Pass 2 saturates at ~7.8 GB, and Real-ESRGAN runs in system RAM after the 22B model is evicted. The design implication is direct: any modification that enlarges Pass 2 attention will run out of memory (OOM). Future additions (IPAdapter, ControlNet) can only attach to Pass 1, where headroom exists.
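The headroom argument is simple arithmetic, sketched below as a budget check. The function names, the 0.5 GB safety margin, and the 1.5 GB add-on size are assumptions made for illustration; only the 8 GB, ~5.8 GB, and ~7.8 GB figures come from the text, and the memory actually usable on a given machine is typically somewhat below the nominal 8 GB.

```python
VRAM_TOTAL_GB = 8.0

# Approximate peak usage per pass, as quoted in the text.
STAGE_VRAM_GB = {
    "pass1_draft": 5.8,
    "pass2_lcm_refine": 7.8,
}


def headroom_gb(stage: str) -> float:
    """Nominal VRAM left over while the given stage is running."""
    return round(VRAM_TOTAL_GB - STAGE_VRAM_GB[stage], 2)


def fits(stage: str, extra_gb: float, safety_margin_gb: float = 0.5) -> bool:
    """True if an add-on needing extra_gb fits alongside the stage.
    The safety margin is an assumed allowance, not a measured value."""
    return STAGE_VRAM_GB[stage] + extra_gb + safety_margin_gb <= VRAM_TOTAL_GB


if __name__ == "__main__":
    hypothetical_addon_gb = 1.5  # illustrative size only
    for stage in STAGE_VRAM_GB:
        print(stage, headroom_gb(stage), fits(stage, hypothetical_addon_gb))
```

Under these assumptions the hypothetical add-on fits next to Pass 1 but not next to Pass 2, which is the asymmetry the section describes.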

CFG=1.0 is both required and faster

The distilled LoRA was trained at CFG=1.0. Any higher CFG value pushes guidance off the training distribution, producing inter-frame flicker. After locking CFG=1.0, PSNR-Y rose by 10.43 dB (21.79 → 32.22). The bonus: at CFG=1.0 each denoising step needs a single forward pass instead of two, cutting sampling cost roughly in half compared with classifier-free guidance at CFG > 1.
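The single-pass claim follows from the standard classifier-free guidance combination, `uncond + scale * (cond - uncond)`, which reduces to the conditional prediction when the scale is 1.0. The sketch below illustrates this with NumPy. `guided_prediction` and the `model` callable are hypothetical stand-ins, not the pipeline's sampler, and whether a particular sampler actually skips the unconditional pass at scale 1.0 depends on its implementation.

```python
import numpy as np


def cfg_combine(cond: np.ndarray, uncond: np.ndarray, scale: float) -> np.ndarray:
    """Standard classifier-free guidance combination."""
    return uncond + scale * (cond - uncond)


def guided_prediction(model, latents, prompt_emb, null_emb, scale):
    """Return (prediction, number_of_forward_passes).

    `model` is any callable (latents, embedding) -> prediction; it stands in
    for the denoiser here and is an assumption of this sketch.
    """
    cond = model(latents, prompt_emb)
    if scale == 1.0:
        # cfg_combine(cond, uncond, 1.0) equals cond, so the
        # unconditional forward pass can be skipped entirely.
        return cond, 1
    uncond = model(latents, null_emb)
    return cfg_combine(cond, uncond, scale), 2


if __name__ == "__main__":
    toy_model = lambda x, emb: 0.5 * x + emb  # toy denoiser for illustration
    x = np.ones((2, 3))
    prompt = np.full((2, 3), 2.0)
    null = np.zeros((2, 3))
    for s in (1.0, 3.0):
        _, passes = guided_prediction(toy_model, x, prompt, null, s)
        print(f"scale={s}: {passes} forward pass(es) per step")
```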

ESRGAN ×2 not ×4 — constrained by RAM, not VRAM

The ×4 upscale path generates ~25 GB of intermediate buffers, OOM-killing even a 32 GB RAM configuration. Switching to RealESRGAN_x2plus.pth delivers the 1920×1080 output from the 1024×576 LCM result (an effective scale of 1.875) with no tiling required. The constraint is system RAM, not VRAM, which is an important distinction when planning higher-resolution outputs.
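A quick resolution check makes the ×2 versus ×4 trade-off concrete. A native ×2 of 1024×576 is 2048×1152, so the 1920×1080 target corresponds to an effective scale of 1.875; how the pipeline gets there (an output-scale setting or a final resize) is not stated on this page. The per-frame byte counts assume uncompressed float32 RGB buffers, which is an assumption of this sketch, and it does not attempt to reproduce the ~25 GB figure because the frame count and buffering strategy are not given.

```python
from typing import Tuple

Size = Tuple[int, int]  # (width, height)


def upscaled(size: Size, factor: float) -> Size:
    """Output size after scaling both dimensions by factor."""
    w, h = size
    return (round(w * factor), round(h * factor))


def frame_bytes(size: Size, channels: int = 3, bytes_per_value: int = 4) -> int:
    """Bytes for one uncompressed frame (float32 RGB assumed by default)."""
    w, h = size
    return w * h * channels * bytes_per_value


REFINE: Size = (1024, 576)   # Pass 2 output
TARGET: Size = (1920, 1080)  # final delivery size

if __name__ == "__main__":
    for factor in (2, 4):
        size = upscaled(REFINE, factor)
        print(f"x{factor}: {size}, {frame_bytes(size) / 2**20:.0f} MiB per frame")
    print("effective scale to target:", TARGET[0] / REFINE[0])
```

Each ×4 frame is four times the size of a ×2 frame, so any stage that holds many frames in RAM at once scales its footprint by the same factor.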

Quantifying flicker — turning “looks wrong” into a number

A custom PSNR-Y analysis tool computes luminance PSNR frame by frame, reports summary statistics, and flags the worst inter-frame transitions. Every hypothesis during the CFG alignment debug session, including one false lead, was validated or rejected through this quantitative signal rather than visual inspection alone. Treating quality as a measurable metric made the investigation reproducible and conclusive.
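A metric of this kind can be sketched in a few lines of NumPy. This is not the author's tool: it assumes PSNR is measured between consecutive frames, uses BT.601 luma weights on 8-bit RGB, and the function names are illustrative. The exact methodology is covered in the deep dive rather than on this page.

```python
import numpy as np


def luma_bt601(rgb: np.ndarray) -> np.ndarray:
    """Luma (Y) from an HxWx3 RGB frame using BT.601 weights."""
    rgb = rgb.astype(np.float64)
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]


def psnr_y(a: np.ndarray, b: np.ndarray, peak: float = 255.0) -> float:
    """PSNR in dB between the luma planes of two frames."""
    mse = np.mean((luma_bt601(a) - luma_bt601(b)) ** 2)
    if mse == 0:
        return float("inf")  # identical luma planes
    return float(10.0 * np.log10(peak ** 2 / mse))


def interframe_report(frames, worst_k: int = 3) -> dict:
    """PSNR-Y for each consecutive frame pair, plus the lowest-scoring
    transitions (low PSNR means a large frame-to-frame change)."""
    scores = [psnr_y(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
    finite = [s for s in scores if np.isfinite(s)]
    worst = sorted(range(len(scores)), key=lambda i: scores[i])[:worst_k]
    return {
        "scores": scores,
        "mean": float(np.mean(finite)) if finite else float("inf"),
        "min": min(scores) if scores else float("inf"),
        "worst_transitions": worst,  # index i means frames i -> i+1
    }


if __name__ == "__main__":
    # Four flat gray frames with one abrupt brightness jump between 1 and 2.
    levels = [100, 101, 150, 151]
    frames = [np.full((8, 8, 3), v, dtype=np.uint8) for v in levels]
    report = interframe_report(frames)
    print(report["worst_transitions"][0], round(report["min"], 2))
```

In this toy input the abrupt jump between frames 1 and 2 is the transition that gets flagged, which is the behaviour wanted from a flicker detector.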

Want the full technical write-up?
Three-way parameter alignment table, CLI flow, PSNR methodology, tuning history, troubleshooting — all the non-obvious decisions documented end-to-end.
Read the deep dive →