1920×1080 video from a consumer GPU, zero cloud.
Local T2V/I2V pipeline on RTX 3070 Laptop 8GB — Sulphur 2 22B GGUF with LTX-2.3 joint AV denoising, LCM refine pass, and Real-ESRGAN upscaling. Three stages, designed around the 8 GB VRAM budget.
Sulphur v1 — Demo Clip
Sulphur v1 — 32-second local demo on RTX 3070 Laptop 8 GB. Sulphur 2 22B GGUF + LTX-2.3 joint AV denoising, three-stage pipeline (draft → LCM refine → Real-ESRGAN) to 1920×1080. Zero cloud compute.
Screen-reader description: short animated sequence showing anime-style characters and environments generated by the Sulphur 2 local AI video pipeline. The clip loops continuously at 1920×1080.
Pipeline flow: a text prompt is expanded by Gemma 3 12B into cinematic language, then fed to Pass 1 Draft (Sulphur 2 22B GGUF, 512×288, LTX joint AV). Pass 2 LCM Refine upsamples to 1024×576 near the VRAM ceiling. Real-ESRGAN ×2 delivers the final 1920×1080 H.264 MP4. The I2V feedback arrow anchors continuous multi-segment video: each segment's last frame becomes the first-frame anchor of the next.
Key Numbers
Technical Highlights
8 GB VRAM is unevenly distributed across the pipeline: Pass 1 uses ~5.8 GB (~2 GB headroom), Pass 2 saturates at ~7.8 GB, and Real-ESRGAN runs on RAM after the 22B model is evicted. The design implication is precise — any modification that expands Pass 2 attention will OOM. Future additions (IPAdapter, ControlNet) can only attach to Pass 1 where headroom exists.
The distilled LoRA was trained at CFG=1.0. Any higher CFG value pushes guidance off the training distribution, producing inter-frame flicker. After locking CFG=1.0, PSNR-Y improved by +10.43 dB. The bonus: at CFG=1 each denoising step runs a single forward pass instead of two, cutting sampling cost roughly in half compared to classifier-free guidance.
The ×4 upscale path generates ~25 GB of intermediate buffers, OOM-killing even a
32 GB RAM configuration. Switching to RealESRGAN_x2plus.pth delivers
1920×1080 directly from the 1024×576 LCM output with no tiling required.
The constraint is system RAM, not VRAM — an important distinction when planning
higher-resolution outputs.
A custom PSNR-Y analysis tool computes per-frame luminance SNR, reports statistics, and flags the worst inter-frame transitions. Every hypothesis during the CFG alignment debug session — including one false lead — was validated or rejected through this quantitative signal rather than visual inspection alone. Treating quality as a measurable metric made the investigation reproducible and conclusive.