{"type":"rich","version":"1.0","provider_name":"Transistor","provider_url":"https://transistor.fm","author_name":"Daily Paper Cast","title":"CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models","html":"<iframe width=\"100%\" height=\"180\" frameborder=\"no\" scrolling=\"no\" seamless src=\"https://share.transistor.fm/e/855e569e\"></iframe>","width":"100%","height":180,"duration":1526,"description":"\n            🤗 Upvotes: 54 | cs.CV\n\n            Authors:\n            Joowon Kim, Seungho Shin, Joonhyung Park, Eunho Yang\n\n            Title:\n            CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models\n\n            Arxiv:\n            http://arxiv.org/abs/2605.08735v1\n\n            Abstract:\n            Recent \"Thinking with Video\" approaches use Video Generation Models (VGMs) for visual reasoning by producing temporally coherent Chain-of-Frames as reasoning artifacts. Even strong VGMs, however, exhibit two recurring failure modes on goal-directed tasks: long-horizon drift on multi-step tasks and mid-clip simulation errors that compound. Both stem from the absence of explicit reasoning built upon the VGM's short-horizon visual prior, a role naturally filled by Vision-Language Models (VLMs), but where to place the VLM is non-trivial: upfront plans commit before any frame is generated and post-hoc critiques over whole videos intervene too late. We propose VLM-VGM Collaborative Video Reasoning (CollabVR), a closed-loop framework that couples the VLM with the VGM at step-level granularity: the VLM plans the immediate next action, inspects the clip the VGM generates, and folds the verifier's diagnosis directly into the next action prompt to repair detected failures. On Gen-ViRe and VBVR-Bench, CollabVR improves both open-source and closed-source VGMs over single-inference, Pass@$k$, and prior test-time scaling baselines at matched compute, with the largest gains on the hardest tasks. It also yields further improvements on top of a reasoning-fine-tuned VGM, indicating that step-level VLM supervision is orthogonal to and stackable with reasoning-oriented fine-tuning. We provide video samples and additional qualitative results at our project page: https://joow0n-kim.github.io/collabvr-project-page.\n            ","thumbnail_url":"https://img.transistorcdn.com/8lOVNnuwhrA3rxrDMv7Osu4j_t1-jORooO6NfGcQhcw/rs:fill:0:0:1/w:400/h:400/q:60/mb:500000/aHR0cHM6Ly9pbWct/dXBsb2FkLXByb2R1/Y3Rpb24udHJhbnNp/c3Rvci5mbS81Zjg1/YzRhODczMDU4MmE4/OGMwN2FiNDlmYzI2/MDliMi5qcGVn.webp","thumbnail_width":300,"thumbnail_height":300}