qwen-tool

This model is a fine-tuned version of Qwen/Qwen3-4B, optimized for complex functional calling and multi-step tool use via the Model Context Protocol (MCP).

The model was aligned using Proximal Policy Optimization (PPO) in a closed-loop agentic environment. It leverages execution-based feedback from an MCP server to drastically reduce tool hallucinations, adhere to strict JSON formatting, and self-correct based on execution error states.

Model Details

Developed by: Igriscodes
Base Model: Qwen/Qwen3-4B
License: Mozilla Public License 2.0 (MPL 2.0)
Training Framework: Hugging Face trl & peft (LoRA)
Alignment Method: PPO (Proximal Policy Optimization) with Execution-Based Reward Guidance

Intended Uses & Limitations

Intended Use Cases

Structured Tool Calling: Interfacing natively with Model Context Protocol (MCP) servers.
Multi-step Agentic Tasks: Iterative problem-solving across math, web searching, database queries, and data processing.
Error-Resilient Agents: Handling tool-execution errors gracefully by rewriting payload schemas based on environment exceptions.

Training Architecture & Alignment Loop

The model was trained as the Policy (Actor) within a custom gymnasium environment (MCPGymEnv). The environment tracks an execution loop between the model's textual outputs and a backend mock MCP server.

Reward Specification Matrix

The PPO agent was optimized against a dense, feedback-driven execution reward model:

Trigger Status	Reward	Evaluation Logic
Success	`+10.0`	Tool executed cleanly; returned data matches the expected task state.
Tool Execution	`0.0`	Tool ran successfully, but the overarching objective is incomplete.
Tool Error	`-0.5`	Target tool was hit, but threw a runtime exception (e.g., bad arguments).
Invalid JSON	`-0.8`	Failed to output a syntactically valid JSON tool-call schema.
Structural Fail	`-1.0`	Severe divergence from agentic system instructions or tool hallucination.

Hyperparameters & Efficiency Stack

Quantization: 4-bit NormalFloat (NF4) via bitsandbytes (for base model loading).
PEFT Adaptation: LoRA targeted all linear layers (q_proj, v_proj, k_proj, o_proj, etc.).
Memory Optimization: 8-bit Paged AdamW optimizer, gradient checkpointing, and parallel rollout sampling to balance the Actor-Critic-Reference model triplet footprint.

Acknowledgements

We express our gratitude to the following organizations, communities, and tools that made this project possible:

Qwen (Alibaba Cloud) - For providing the foundational Qwen3 model weights and architecture.
Hugging Face - For the incredible ecosystem and libraries used to load, manage, and train the model.
PyTorch - For the robust, deep learning framework that powered the underlying tensor computations and GPU acceleration during fine-tuning.
Google Gemini 3 - For providing assistance in optimizing, and debugging the fine-tuning code scripts.

License

Mozilla Public License Version 2.0 - Feel free to use and modify

Downloads last month: -

Safetensors

Model size

4B params

Tensor type

F16

Model tree for Igriscodes/qwen3-4b-tool

Base model

Qwen/Qwen3-4B-Base

Finetuned

Qwen/Qwen3-4B

Adapter

(1021)

this model

Quantizations

1 model