CVPR 2026 Findings

Video2LoRA

Unified semantic-controlled video generation via per-reference-video LoRA.

Video2LoRA predicts a dedicated semantic LoRA from a reference video, making controllable video generation more flexible, compact, and zero-shot friendly.

<50KB LoRA parameters per semantic condition
<150MB Final model size
Zero-shot Generalization to unseen semantics
Video2LoRA teaser
Why it matters

A more flexible route to video control

Semantic alignment across diverse video generation conditions is hard. Explicit structural guidance can become rigid, while per-condition models are difficult to reuse. Video2LoRA treats the reference video as the semantic source and predicts compact LoRA weights that adapt a frozen diffusion backbone.

Interactive System Map

Reference Video

A short reference clip provides the semantic cue: motion, style, camera behavior, or visual effect.

Video2LoRA framework diagram
01

LightLoRA Representation

Compact LoRA factors reduce per-condition parameter footprint while keeping semantic adaptability.

02

HyperNetwork Prediction

A transformer-based HyperNetwork consumes reference-video features and predicts semantic-specific LoRA weights.

03

Frozen Diffusion Backbone

The predicted adapters plug into a frozen video diffusion model, avoiding per-condition fine-tuning.

Citation

BibTeX

BibTeX
@misc{wu2026video2loraunifiedsemanticcontrolledvideo,
title={Video2LoRA: Unified Semantic-Controlled Video Generation via Per-Reference-Video LoRA},
author={Wu, Zexi and Li, Baolu and Dai, Jing and Zhang, Yiming and Ma, Yue and Wang, Qinghe and Jia, Xu and Xu, Hongming},
year={2026},
eprint={2603.08210},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.08210}
}