Rimon-Math-3B-V1

Rimon-Math-3B-V1 is a specialized 3-billion-parameter causal language model, fine-tuned for high-accuracy mathematical reasoning and logical problem-solving. Built on the Llama-3.2-3B-Instruct architecture and optimized using the Unsloth framework, this model excels at generating structured, step-by-step solutions (Chain-of-Thought).

Highlights

Reasoning Focused: Trained specifically to break down complex problems into logical steps.
Lightweight & Efficient: Optimized for consumer-grade GPUs (T4, RTX 3060+) and edge deployment.
High Compatibility: Works seamlessly with transformers, vLLM, and supports GGUF conversion for local use.

Model Capabilities

The model is fine-tuned to handle various mathematical domains:

Algebra: Solving equations, inequalities, and systems of equations.
Calculus: Derivatives, integrals, and limit problems.
Geometry & Trigonometry: Properties of shapes and trigonometric identities.
Logic & Arithmetic: Multi-step word problems and sequence analysis.

Training Metrics (Approximate)

Epoch	Step	Training Loss	Validation Loss	LR
1.0	1000	0.7104	0.6952	1.5e-4
2.0	2000	0.5911	0.5843	5.0e-5
3.0	3000	0.5244	0.5102	1.0e-5
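If the losses in the table are mean per-token cross-entropy values in nats (an assumption; the card does not state how the loss is defined), they can be converted to perplexity with exp(loss). A minimal sketch using only the numbers reported above:

```python
import math

# (epoch, training loss, validation loss) copied from the table above
metrics = [
    (1.0, 0.7104, 0.6952),
    (2.0, 0.5911, 0.5843),
    (3.0, 0.5244, 0.5102),
]

def perplexity(loss: float) -> float:
    """Perplexity, assuming `loss` is mean token cross-entropy in nats."""
    return math.exp(loss)

val_perplexities = [perplexity(val) for _, _, val in metrics]

for (epoch, train, val), ppl in zip(metrics, val_perplexities):
    print(f"epoch {epoch:.0f}: train={train:.4f} val={val:.4f} val_ppl={ppl:.3f}")
```

A validation loss that keeps falling alongside the training loss, as it does here, suggests the run had not started to overfit by epoch 3.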

Usage Guide

Installation & Dependencies

To run Rimon-Math-3B-V1 efficiently, ensure you have the latest versions of the following libraries installed. Run this command in your terminal or a notebook cell:

pip install -U transformers torch accelerate bitsandbytes sentencepiece

Component	Minimum (4-bit)	Recommended (16-bit)
GPU	NVIDIA T4 / RTX 3050 (4GB VRAM)	RTX 3060 / A100 (12GB+ VRAM)
RAM	8 GB System RAM	16 GB System RAM
CUDA	11.8 or higher	12.1 or higher
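The GPU figures in the table can be sanity-checked with a back-of-the-envelope estimate of weight memory: number of parameters times bits per parameter. The sketch below is an illustration, not a measurement; the helper name `weight_memory_gb` is hypothetical, and the 3e9 parameter count is the rounded "3B" figure from this card.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 3e9  # rounded "3B" figure; the exact parameter count may differ

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 16-bit weights
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit weights, ignoring quantization metadata

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{int4_gb:.1f} GB")
```

Real usage is higher than these numbers: activations, the KV cache, the CUDA context, and any layers left unquantized all add to the total.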

How to Use the Model

You can load the model in two different modes depending on your hardware resources.

Option 1: 4-bit Quantization (Low VRAM Mode)

Best for users on Google Colab (Free T4) or laptops with limited GPU memory. This uses only ~3.5 GB of VRAM.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "rimon-dutta/Rimon-Math-3B-V1"

# 4-bit Configuration for memory efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)

Option 2: 16-bit Precision (High Accuracy Mode)

Best for users with 8GB+ VRAM (e.g., RTX 3060 12GB or higher). Loading the weights unquantized in bfloat16 avoids the precision loss that 4-bit quantization can introduce.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "rimon-dutta/Rimon-Math-3B-V1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16, 
    device_map="auto"
)
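Choosing between the two options can be reduced to a check on available GPU memory. The helper below is a rough sketch: the name `choose_load_mode` and the hard 8 GB cut-off are illustrative assumptions derived from the VRAM guidance given for each option, not measured limits.

```python
def choose_load_mode(vram_gb: float) -> str:
    """Return "16bit" or "4bit" based on available GPU memory.

    The 8 GB threshold follows this card's guidance for 16-bit loading;
    it is a rule of thumb, not a measured limit.
    """
    return "16bit" if vram_gb >= 8 else "4bit"

# To query the real device (requires PyTorch and a CUDA GPU):
#   import torch
#   vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
for vram_gb in (4, 8, 16):
    print(vram_gb, "GB ->", choose_load_mode(vram_gb))
```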

Running Inference (Example)

Once the model is loaded, you can solve math problems using the standard Llama 3.2 chat template.

# Define your math problem
messages = [
    {"role": "system", "content": "You are a specialized math tutor. Explain step-by-step."},
    {"role": "user", "content": "If x + 1/x = 3, find the value of x^5 + 1/x^5."}
]

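# Apply the chat template; return_dict=True yields input_ids and attention_mask,
# which is what model.generate(**inputs) expects
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

# Generate the response
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.1, # Low temperature is crucial for math accuracy
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

# Decode only the newly generated tokens, skipping the prompt
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))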
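To check the model's answer to the example problem, the reference value can be computed directly. Writing s_n = x^n + 1/x^n, multiplying s_1 by s_k gives the recurrence s_{k+1} = s_1 * s_k - s_{k-1} with s_0 = 2. The sketch below applies it; the function name `power_sum` is hypothetical and not part of the model's tooling.

```python
import math

def power_sum(s1: float, n: int) -> float:
    """Return x^n + 1/x^n given s1 = x + 1/x.

    Uses s_0 = 2, s_1 = s1 and the recurrence s_{k+1} = s1 * s_k - s_{k-1}.
    """
    prev, curr = 2, s1  # s_0, s_1
    if n == 0:
        return prev
    for _ in range(n - 1):
        prev, curr = curr, s1 * curr - prev
    return curr

expected = power_sum(3, 5)

# Independent numeric check: x + 1/x = 3 has the root x = (3 + sqrt(5)) / 2
x = (3 + math.sqrt(5)) / 2
numeric = x**5 + x**-5

print(expected)  # 123
```

The intermediate values are s_2 = 7, s_3 = 18, s_4 = 47, so a correct step-by-step response should end at 123.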

Troubleshooting Guide

GPU Memory Error (OOM): If you get an "Out of Memory" error, restart your runtime and use Option 1 (4-bit).
BitsAndBytes Issues: If load_in_4bit fails, ensure you are running on a Linux-based environment (or WSL2 on Windows) and that your bitsandbytes is up to date:

pip install -U bitsandbytes

CUDA Mismatch: If you encounter a runtime error regarding CUDA versions, reinstall PyTorch using the index URL that matches your CUDA version (the cu121 URL below targets CUDA 12.1):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
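Before reinstalling, it can help to see which CUDA version the installed PyTorch build targets. The following is a small diagnostic sketch; the function name `cuda_report` is hypothetical, and it returns placeholder values if PyTorch is not installed.

```python
def cuda_report() -> dict:
    """Collect the PyTorch and CUDA versions relevant to a mismatch error."""
    try:
        import torch
    except ImportError:
        return {"torch_version": None, "cuda_build": None, "cuda_available": False}
    return {
        "torch_version": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds)
        "cuda_build": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }

report = cuda_report()
print(report)
```

Compare `cuda_build` with the CUDA version shown by `nvidia-smi`: the driver generally needs to support a version at least as new as the one PyTorch was built with.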

Prompt Engineering Tips

Use a system prompt to control the reasoning style.
Keep temperature between 0.1 and 0.3 for math tasks.
Always request a step-by-step explanation.
Avoid ambiguous wording in problems.
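These tips can be bundled into a small helper that builds the chat messages and sampling settings in one place. This is a sketch under stated assumptions: the name `build_math_prompt`, the appended "Show each step" sentence, and the clamping behavior are illustrative choices, not part of the model's API.

```python
def build_math_prompt(problem: str, temperature: float = 0.1):
    """Build chat messages and sampling settings that follow the tips above."""
    # Clamp the temperature to the 0.1-0.3 range suggested for math tasks
    temperature = min(max(temperature, 0.1), 0.3)
    messages = [
        {"role": "system", "content": "You are a specialized math tutor. Explain step-by-step."},
        {"role": "user", "content": f"{problem.strip()} Show each step of the solution."},
    ]
    generation_kwargs = {
        "do_sample": True,
        "temperature": temperature,
        "max_new_tokens": 1024,
    }
    return messages, generation_kwargs

messages, generation_kwargs = build_math_prompt("Solve 2x + 3 = 11 for x.", temperature=0.7)
print(generation_kwargs["temperature"])  # 0.3
```

The returned `messages` can be passed to `tokenizer.apply_chat_template`, and `generation_kwargs` can be unpacked into `model.generate`.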

Author

Rimon Dutta
DevOps Engineer | AI & ML Learner
Kotwali, Bangladesh

Safetensors: 3B params, tensor type F16