InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?
Abstract
InteractWeb-Bench presents the first multimodal interactive benchmark for website generation under non-expert low-code conditions, addressing semantic misalignment through diverse user agents and interactive execution environments.
With the advancement of multimodal large language models (MLLMs) and coding agents, website development has shifted from manual programming to agent-based, project-level code synthesis. Existing benchmarks rely on idealized assumptions, in particular well-structured, information-rich inputs and static execution settings. In contrast, real-world development is constrained by a critical bottleneck: the semantic misalignment between ambiguous, low-quality instructions from non-expert users and model understanding, which results in a failure mode that we term blind execution. To address this gap, we introduce InteractWeb-Bench, the first multimodal interactive benchmark for website generation under non-expert, low-code user conditions. InteractWeb-Bench introduces four types of user agents and persona-driven instruction perturbations to systematically simulate diverse user behaviors, including ambiguity, redundancy, and contradiction, grounded in requirements-engineering defect taxonomies. We develop an interactive execution environment for agents, featuring a unified action space comprising Clarify, Implement, Verify, and Submit, enabling iterative intent refinement, code synthesis, and visual feedback-based validation. Extensive experiments and analysis reveal that frontier MLLM-based agents remain trapped in blind execution, exposing limitations in intent recognition and adaptive interaction.
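To make the unified action space concrete, here is a minimal sketch of the interaction loop the abstract describes. The four action names (Clarify, Implement, Verify, Submit) come from the paper; `run_episode` and the `agent`, `user_agent`, and `env` interfaces are illustrative assumptions, not the released API.

```python
# Minimal sketch (not the authors' code) of an InteractWeb-Bench-style episode.
# Action names follow the paper's unified action space; everything else is assumed.
from enum import Enum

class Action(Enum):
    CLARIFY = "clarify"      # ask the user agent a question to refine intent
    IMPLEMENT = "implement"  # synthesize or edit the website code
    VERIFY = "verify"        # render the site and inspect visual feedback
    SUBMIT = "submit"        # finalize the website for evaluation

def run_episode(agent, user_agent, env, max_turns=20):
    """Iterative intent refinement, code synthesis, and visual validation."""
    observation = env.reset(user_agent.initial_instruction())
    for _ in range(max_turns):
        action, payload = agent.act(observation)
        if action is Action.CLARIFY:
            observation = user_agent.answer(payload)   # refined user intent
        elif action is Action.IMPLEMENT:
            observation = env.apply_code(payload)      # build or update the site
        elif action is Action.VERIFY:
            observation = env.render_screenshot()      # visual feedback
        elif action is Action.SUBMIT:
            return env.evaluate(payload)               # final score
    return env.evaluate(None)  # forced submission at the turn limit
```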
Community
InteractWeb-Bench is a multimodal interactive benchmark for evaluating website generation agents under real-world, non-expert user conditions.
It simulates ambiguous, noisy, and conflicting user instructions through persona-driven user agents, and introduces a dynamic action space (Clarify, Implement, Verify, Submit) to assess agents’ ability to escape “blind execution” and align with user intent.
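For intuition, here is a hedged sketch of how persona-driven perturbations could inject the three defect types (ambiguity, redundancy, contradiction) into a clean requirement. The templates, persona label, and function name are illustrative assumptions and are not taken from the benchmark's released code.

```python
# Illustrative sketch of persona-driven instruction perturbation (assumed, not official).
import random

DEFECT_TEMPLATES = {
    "ambiguity":     lambda req: req.replace("three product cards", "a few cards"),
    "redundancy":    lambda req: req + " Also, remember to add the cards I mentioned.",
    "contradiction": lambda req: req + " Actually, use a dark theme instead of the light one.",
}

def perturb_instruction(requirement: str, persona: str, rng: random.Random) -> str:
    """Apply a randomly chosen defect to a clean requirement, tagged with a persona."""
    defect = rng.choice(list(DEFECT_TEMPLATES))
    noisy = DEFECT_TEMPLATES[defect](requirement)
    return f"[{persona}] {noisy}"

print(perturb_instruction(
    "Build a landing page with a light theme and three product cards.",
    persona="non-expert shop owner",
    rng=random.Random(0),
))
```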
If you find our work interesting, we would really appreciate your support and upvote! 🌿🚀
InteractWeb-Bench is open-sourced at https://github.com/AIforIP/InteractWeb-Bench
Project page: https://interactweb-bench.wangqiyao.me/
This is an automated message from the Librarian Bot. I found the following papers similar to this paper, recommended by the Semantic Scholar API:
- HATS: Hardness-Aware Trajectory Synthesis for GUI Agents (2026)
- WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models (2026)
- GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents (2026)
- CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation (2026)
- Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification (2026)
- Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging (2026)
- UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark (2026)