// AI Video Generation

1920×1080 video from a consumer GPU, zero cloud.

Local T2V/I2V pipeline on RTX 3070 Laptop 8GB — Sulphur 2 22B GGUF with LTX-2.3 joint AV denoising, LCM refine pass, and Real-ESRGAN upscaling. Three stages, designed around the 8 GB VRAM budget.

Sulphur v1 — Demo Clip

Sulphur v1 — 32-second local demo on RTX 3070 Laptop 8 GB. Sulphur 2 22B GGUF + LTX-2.3 joint AV denoising, three-stage pipeline (draft → LCM refine → Real-ESRGAN) to 1920×1080. Zero cloud compute.

Pipeline flow: a text prompt is expanded by Gemma 3 12B into cinematic language, then fed to Pass 1 Draft (Sulphur 2 22B GGUF, 512×288, LTX joint AV). Pass 2 LCM Refine upsamples to 1024×576 near the VRAM ceiling. Real-ESRGAN ×2 delivers the final 1920×1080 H.264 MP4. The I2V feedback arrow anchors continuous multi-segment video: each segment's last frame becomes the first-frame anchor of the next.

Key Numbers

RTX 3070

Laptop 8 GB — consumer GPU, zero cloud

22B GGUF

Sulphur 2 Q3_K_M — ~12 GB quantized, CPU+GPU offload

3 stages

Draft / Refine / Upscale — designed around 8 GB VRAM

+10.43 dB

PSNR-Y gain after CFG alignment (21.79 → 32.22)

1920×1080

H.264 MP4 + LTX joint AV ambient audio

~10 min

Per segment, warm cache — full three-stage run

Technical Highlights

VRAM asymmetry forces the three-stage architecture

8 GB VRAM is unevenly distributed across the pipeline: Pass 1 uses ~5.8 GB (~2 GB headroom), Pass 2 saturates at ~7.8 GB, and Real-ESRGAN runs on RAM after the 22B model is evicted. The design implication is precise — any modification that expands Pass 2 attention will OOM. Future additions (IPAdapter, ControlNet) can only attach to Pass 1 where headroom exists.

CFG=1.0 is both required and faster

The distilled LoRA was trained at CFG=1.0. Any higher CFG value pushes guidance off the training distribution, producing inter-frame flicker. After locking CFG=1.0, PSNR-Y improved by +10.43 dB. The bonus: at CFG=1 each denoising step runs a single forward pass instead of two, cutting sampling cost roughly in half compared to classifier-free guidance.

ESRGAN ×2 not ×4 — constrained by RAM, not VRAM

The ×4 upscale path generates ~25 GB of intermediate buffers, OOM-killing even a 32 GB RAM configuration. Switching to RealESRGAN_x2plus.pth delivers 1920×1080 directly from the 1024×576 LCM output with no tiling required. The constraint is system RAM, not VRAM — an important distinction when planning higher-resolution outputs.

Quantifying flicker — turning “looks wrong” into a number

A custom PSNR-Y analysis tool computes per-frame luminance SNR, reports statistics, and flags the worst inter-frame transitions. Every hypothesis during the CFG alignment debug session — including one false lead — was validated or rejected through this quantitative signal rather than visual inspection alone. Treating quality as a measurable metric made the investigation reproducible and conclusive.

Want the full technical write-up?

Three-way parameter alignment table, CLI flow, PSNR methodology, tuning history, troubleshooting — all the non-obvious decisions documented end-to-end.

Read the deep dive →