Paper: ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use (arXiv:2504.07981)
This model was converted with mlx_vlm from ByteDance-Seed/UI-TARS-1.5-7B.
UI-TARS-1.5 is ByteDance's open-source multimodal agent built upon a powerful vision-language model. It is capable of effectively performing diverse tasks within virtual worlds.
The released UI-TARS-1.5-7B focuses primarily on general computer use and is not specifically optimized for game-based scenarios, where the full UI-TARS-1.5 model still holds a significant advantage.
| Benchmark Type | Benchmark | UI-TARS-1.5-7B | UI-TARS-1.5 |
|---|---|---|---|
| Computer Use | OSWorld | 27.5 | 42.5 |
| GUI Grounding | ScreenSpotPro | 49.6 | 61.6 |
Note: the table above reports the performance of UI-TARS-1.5-7B and UI-TARS-1.5 on OSWorld and ScreenSpotPro.
mlx_vlm.generate --model flin775/UI-Tars-1.5-7B-4bit-mlx \
--max-tokens 1024 \
--temperature 0.0 \
--prompt "List all contacts’ names and their corresponding grounding boxes([x1, y1, x2, y2]) from the left sidebar of the IM chat interface, return the results in JSON format." \
--image https://wechat.qpic.cn/uploads/2016/05/WeChat-Windows-2.11.jpg
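The command above asks the model to return contact names and grounding boxes as JSON. A minimal sketch of how that response might be post-processed in Python follows. The response schema is an assumption: the keys `name` and `box`, the optional markdown fence around the JSON, and the sample text are all hypothetical, since the model card does not document the exact output format or whether coordinates are in pixels or normalized units. The helper names `extract_json` and `parse_contacts` are illustrative only.

```python
import json
import re

# Built indirectly so this snippet can itself sit inside a fenced block.
FENCE = "`" * 3


def extract_json(text):
    """Return the first JSON array or object found in a model response."""
    # If the model wrapped its answer in a markdown fence, keep only the inside.
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    fenced = re.search(pattern, text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    # Find where the JSON value starts: the earliest "[" or "{".
    starts = [i for i in (text.find("["), text.find("{")) if i != -1]
    if not starts:
        raise ValueError("no JSON found in response")
    # raw_decode parses one JSON value and ignores any trailing text.
    obj, _ = json.JSONDecoder().raw_decode(text[min(starts):])
    return obj


def parse_contacts(text):
    """Keep entries that have a name and a well-formed [x1, y1, x2, y2] box.

    The "name" and "box" keys are assumed, not documented by the model card.
    """
    data = extract_json(text)
    if isinstance(data, dict):
        data = [data]
    contacts = []
    for item in data:
        if not isinstance(item, dict):
            continue
        name, box = item.get("name"), item.get("box")
        if name is None or not isinstance(box, list) or len(box) != 4:
            continue
        if not all(isinstance(v, (int, float)) for v in box):
            continue
        x1, y1, x2, y2 = (float(v) for v in box)
        # Drop degenerate boxes where the corners are out of order.
        if x2 <= x1 or y2 <= y1:
            continue
        contacts.append({"name": name, "box": [x1, y1, x2, y2]})
    return contacts


# Hypothetical response text, for illustration only.
sample = (
    "Here are the contacts:\n" + FENCE + "json\n"
    '[{"name": "Alice", "box": [10, 20, 110, 60]},'
    ' {"name": "Bob", "box": [10, 70, 5, 110]}]\n' + FENCE
)

if __name__ == "__main__":
    for contact in parse_contacts(sample):
        print(contact["name"], contact["box"])
```

Validating the boxes before use is a defensive choice: with `--temperature 0.0` the output is deterministic, but nothing guarantees that every box is well-formed, so malformed entries are skipped rather than raising.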
Base model: ByteDance-Seed/UI-TARS-1.5-7B