qwen-tool

This model is a fine-tuned version of Qwen/Qwen3-4B, optimized for complex function calling and multi-step tool use via the Model Context Protocol (MCP).

The model was aligned using Proximal Policy Optimization (PPO) in a closed-loop agentic environment. Execution-based feedback from an MCP server trains it to avoid hallucinated tools, emit strictly valid JSON tool calls, and correct itself when a tool call returns an error.

Model Details

  • Developed by: Igriscodes
  • Base Model: Qwen/Qwen3-4B
  • License: Mozilla Public License 2.0 (MPL 2.0)
  • Training Framework: Hugging Face trl & peft (LoRA)
  • Alignment Method: PPO (Proximal Policy Optimization) with Execution-Based Reward Guidance

Intended Uses & Limitations

Intended Use Cases

  • Structured Tool Calling: Interfacing natively with Model Context Protocol (MCP) servers.
  • Multi-step Agentic Tasks: Iterative problem-solving across math, web searching, database queries, and data processing.
  • Error-Resilient Agents: Handling tool-execution errors gracefully by rewriting the tool-call payload in response to the exception returned by the environment.
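
The error-resilient pattern above can be illustrated with a small harness that feeds execution errors back into the prompt. This is only a sketch under stated assumptions: `run_with_retries`, `generate`, `call_tool`, and the two `fake_*` stubs are hypothetical names introduced here, and the tool-call shape (`name` plus `arguments`) is assumed rather than taken from this model card.

```python
import json
from typing import Any, Callable, Dict


def run_with_retries(
    generate: Callable[[str], str],
    call_tool: Callable[[str, Dict[str, Any]], Any],
    task: str,
    max_attempts: int = 3,
) -> Dict[str, Any]:
    """Request a tool call, execute it, and append any error to the prompt."""
    history = [f"TASK: {task}"]
    for attempt in range(1, max_attempts + 1):
        raw = generate("\n".join(history))
        try:
            call = json.loads(raw)
        except json.JSONDecodeError as exc:
            history.append(f"FORMAT ERROR: {exc}")
            continue
        if not isinstance(call, dict) or "name" not in call:
            history.append("FORMAT ERROR: expected an object with a 'name' field")
            continue
        try:
            result = call_tool(call["name"], call.get("arguments", {}))
        except Exception as exc:  # the tool ran but raised
            history.append(f"TOOL ERROR: {exc}")
            continue
        return {"ok": True, "result": result, "attempts": attempt}
    return {"ok": False, "result": None, "attempts": max_attempts}


def fake_generate(prompt: str) -> str:
    """Stub model (assumption): fixes its arguments once it sees an error."""
    if "TOOL ERROR" in prompt:
        return '{"name": "add", "arguments": {"a": 1, "b": 2}}'
    return '{"name": "add", "arguments": {"x": 1, "y": 2}}'


def fake_call_tool(name: str, arguments: Dict[str, Any]) -> Any:
    """Stub tool backend (assumption): a single 'add' tool."""
    if name != "add":
        raise ValueError(f"unknown tool: {name}")
    return arguments["a"] + arguments["b"]


outcome = run_with_retries(fake_generate, fake_call_tool, "add 1 and 2")
```

In a real deployment `generate` would wrap the model and `call_tool` would wrap an MCP client; both are left abstract here because the card does not specify them.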

Training Architecture & Alignment Loop

The model was trained as the policy (actor) within a custom Gymnasium environment (MCPGymEnv). The environment runs an execution loop between the model's textual outputs and a backend mock MCP server.
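
A minimal sketch of such a loop follows. It is not the actual MCPGymEnv code: `MCPGymEnvSketch` and `MockMCPServer` are hypothetical classes that only mimic the Gymnasium `reset`/`step` return shapes in plain Python (no `gymnasium` import), the outcome labels are illustrative, and reward assignment is left to a caller-supplied `reward_fn`.

```python
import json
from typing import Any, Callable, Dict, Tuple


class MockMCPServer:
    """Hypothetical stand-in for the backend mock MCP server."""

    def __init__(self, tools: Dict[str, Callable[..., Any]]):
        self.tools = tools

    def call(self, name: str, arguments: Dict[str, Any]) -> Any:
        return self.tools[name](**arguments)


class MCPGymEnvSketch:
    """Plain-Python sketch that follows the Gymnasium reset/step convention."""

    def __init__(self, server: MockMCPServer, task: str, expected: Any,
                 reward_fn: Callable[[str], float] = lambda outcome: 0.0,
                 max_steps: int = 4):
        self.server, self.task, self.expected = server, task, expected
        self.reward_fn, self.max_steps = reward_fn, max_steps
        self.steps, self.transcript = 0, []

    def _observation(self) -> str:
        return "\n".join(self.transcript)

    def reset(self) -> Tuple[str, dict]:
        self.steps = 0
        self.transcript = [f"TASK: {self.task}"]
        return self._observation(), {}

    def _execute(self, action: str) -> str:
        """Run one model output against the server and label the outcome."""
        try:
            call = json.loads(action)
        except json.JSONDecodeError:
            self.transcript.append("ERROR: output was not valid JSON")
            return "invalid_json"
        name = call.get("name") if isinstance(call, dict) else None
        if not isinstance(name, str) or name not in self.server.tools:
            self.transcript.append("ERROR: unknown tool or malformed call")
            return "structural_fail"
        try:
            result = self.server.call(name, call.get("arguments", {}))
        except Exception as exc:  # tool was reached but raised
            self.transcript.append(f"ERROR: {exc}")
            return "tool_error"
        self.transcript.append(f"RESULT: {result}")
        return "success" if result == self.expected else "tool_execution"

    def step(self, action: str) -> Tuple[str, float, bool, bool, dict]:
        self.steps += 1
        outcome = self._execute(action)
        terminated = outcome == "success"
        truncated = not terminated and self.steps >= self.max_steps
        return (self._observation(), self.reward_fn(outcome),
                terminated, truncated, {"outcome": outcome})


env = MCPGymEnvSketch(MockMCPServer({"add": lambda a, b: a + b}),
                      task="add 2 and 3", expected=5)
first_obs, _ = env.reset()
```

Each error message is appended to the transcript, so the next observation carries the execution feedback the policy is expected to react to.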

Reward Specification Matrix

The PPO policy was optimized against a dense, execution-driven reward scheme:

| Trigger Status  | Reward | Evaluation Logic |
|-----------------|--------|------------------|
| Success         | +10.0  | Tool executed cleanly; returned data matches the expected task state. |
| Tool Execution  | 0.0    | Tool ran successfully, but the overarching objective is incomplete. |
| Tool Error      | -0.5   | Target tool was hit, but threw a runtime exception (e.g., bad arguments). |
| Invalid JSON    | -0.8   | Failed to output a syntactically valid JSON tool-call schema. |
| Structural Fail | -1.0   | Severe divergence from agentic system instructions or tool hallucination. |
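
As a worked example, the reward values above can be written as a lookup table and summed over an episode. This is a sketch, not the training code: the string keys, `step_reward`, and `episode_return` are hypothetical names, and the discount factor `gamma` is an assumption, since the card does not state one.

```python
from typing import Iterable

# Reward values copied from the reward specification; keys are illustrative labels.
REWARDS = {
    "success": 10.0,
    "tool_execution": 0.0,
    "tool_error": -0.5,
    "invalid_json": -0.8,
    "structural_fail": -1.0,
}


def step_reward(outcome: str) -> float:
    """Look up the reward for a single step outcome."""
    return REWARDS[outcome]


def episode_return(outcomes: Iterable[str], gamma: float = 1.0) -> float:
    """Sum per-step rewards; gamma=1.0 (assumed default) means no discounting."""
    total = 0.0
    for t, outcome in enumerate(outcomes):
        total += (gamma ** t) * step_reward(outcome)
    return total


# An episode that recovers from two bad calls before completing the task:
# -0.8 - 0.5 + 0.0 + 10.0, which is about 8.7 undiscounted.
trajectory = ["invalid_json", "tool_error", "tool_execution", "success"]
total_reward = episode_return(trajectory)
```

The large gap between the success reward and the penalties means an episode that eventually succeeds still scores well, while the graded penalties distinguish formatting failures from runtime failures.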

Hyperparameters & Efficiency Stack

  • Quantization: 4-bit NormalFloat (NF4) via bitsandbytes (for base model loading).
  • PEFT Adaptation: LoRA targeted all linear layers (q_proj, v_proj, k_proj, o_proj, etc.).
  • Memory Optimization: 8-bit Paged AdamW optimizer, gradient checkpointing, and parallel rollout sampling to keep the combined memory footprint of the actor, critic, and reference models manageable.
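
The stack above might translate into settings along the following lines. Treat this as a hedged sketch: the LoRA rank, alpha, and dropout values and the float16 compute dtype are assumptions not stated in this card, `build_configs` is a hypothetical helper, and how the optimizer settings are passed to TRL depends on the installed version.

```python
# 4-bit NF4 loading of the base model via bitsandbytes.
QUANT_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
}

# LoRA on all linear layers; "all-linear" is PEFT's shorthand for this.
LORA_KWARGS = {
    "r": 16,               # assumption: rank is not given in the card
    "lora_alpha": 32,      # assumption
    "lora_dropout": 0.05,  # assumption
    "target_modules": "all-linear",
    "task_type": "CAUSAL_LM",
}

# TrainingArguments-style fields for the memory optimizations listed above.
TRAINING_KWARGS = {
    "optim": "paged_adamw_8bit",
    "gradient_checkpointing": True,
}


def build_configs():
    """Build the config objects; imports are lazy so the dicts stay inspectable."""
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    quant = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.float16,  # assumption
        **QUANT_KWARGS,
    )
    lora = LoraConfig(**LORA_KWARGS)
    return quant, lora
```

The returned `quant` object would typically be passed as `quantization_config` to `AutoModelForCausalLM.from_pretrained`, and `lora` to `peft.get_peft_model`.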

Acknowledgements

We express our gratitude to the following organizations, communities, and tools that made this project possible:

  • Qwen (Alibaba Cloud) - For providing the foundational Qwen3 model weights and architecture.
  • Hugging Face - For the incredible ecosystem and libraries used to load, manage, and train the model.
  • PyTorch - For the robust deep learning framework that powered the underlying tensor computations and GPU acceleration during fine-tuning.
  • Google Gemini 3 - For providing assistance in optimizing and debugging the fine-tuning scripts.

License

Mozilla Public License Version 2.0 - Feel free to use and modify

Safetensors · Model size: 4B params · Tensor type: F16