GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training
Abstract
Multi-turn reinforcement learning (RL) for multi-modal agents built on vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but they rely on costly, often privileged models as the teacher, limiting practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR that matches GTR's performance without training or querying an expensive teacher model. Specifically, GTR-Turbo merges the weights of checkpoints produced during ongoing RL training, then uses this merged model as a "free" teacher to guide subsequent RL via supervised fine-tuning or soft logit distillation. This design removes dependence on privileged VLMs (e.g., GPT or Gemini), mitigates the "entropy collapse" observed in prior work, and keeps training stable. Across diverse visual agentic tasks, GTR-Turbo improves the accuracy of the baseline model by 10-30% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.
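The mechanism lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch illustration of the two ingredients the abstract describes: averaging the weights of checkpoints saved during the RL run, and using the merged model as a teacher via soft logit (KL) distillation. The uniform merge rule, the loss weighting `beta`, and all names here are assumptions for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of GTR-Turbo's two ingredients; merge rule,
# loss weighting, and names are illustrative assumptions.
import copy
import torch
import torch.nn.functional as F

def merge_checkpoints(state_dicts):
    """Uniformly average the weights of checkpoints saved during the
    ongoing RL run; the merged model acts as the "free" teacher."""
    merged = copy.deepcopy(state_dicts[0])
    for key in merged:
        merged[key] = torch.stack(
            [sd[key].float() for sd in state_dicts]
        ).mean(dim=0)
    return merged

def soft_logit_distillation(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over token logits -- the 'soft logit
    distillation' option mentioned in the abstract."""
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s_logp, t_logp, log_target=True,
                    reduction="batchmean") * temperature ** 2

# During subsequent RL updates the merged teacher densifies the signal,
# e.g. total_loss = rl_loss + beta * soft_logit_distillation(s, t).
```

In the supervised fine-tuning variant the abstract also mentions, the merged teacher's outputs would presumably serve as SFT targets for the student instead of a logit-level loss.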