Instructions for using Tiiny/SmallThinker-21BA3B-Instruct with libraries and local apps.

Libraries

How to use Tiiny/SmallThinker-21BA3B-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Tiiny/SmallThinker-21BA3B-Instruct")

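# Load model directly (AutoModelForCausalLM keeps the language-modeling head needed for text generation)
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Tiiny/SmallThinker-21BA3B-Instruct", dtype="auto")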

Local Apps

vLLM

How to use Tiiny/SmallThinker-21BA3B-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Tiiny/SmallThinker-21BA3B-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Tiiny/SmallThinker-21BA3B-Instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
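
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, not an official client: `build_completion_request` and `complete` are hypothetical helper names, and the response is assumed to follow the OpenAI completions shape (`choices[0].text`).

```python
import json
import urllib.request

MODEL_ID = "Tiiny/SmallThinker-21BA3B-Instruct"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build the POST request that mirrors the curl call (hypothetical helper)."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request to a running server and return the generated text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Assumption: the server answers in the OpenAI-compatible completions format.
    return body["choices"][0]["text"]
```

With a server running, `complete("Once upon a time,")` returns the generated continuation. Passing `base_url="http://localhost:30000"` targets a server listening on port 30000 instead, such as the SGLang one.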

Use Docker

docker model run hf.co/Tiiny/SmallThinker-21BA3B-Instruct

SGLang

How to use Tiiny/SmallThinker-21BA3B-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Tiiny/SmallThinker-21BA3B-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Tiiny/SmallThinker-21BA3B-Instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Tiiny/SmallThinker-21BA3B-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Tiiny/SmallThinker-21BA3B-Instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use Tiiny/SmallThinker-21BA3B-Instruct with Docker Model Runner:
```
docker model run hf.co/Tiiny/SmallThinker-21BA3B-Instruct
```

Improve model card: Add library, paper, GitHub links, and MoE tag

by nielsr HF Staff - opened Jul 29, 2025

base: refs/heads/main ← from: refs/pr/2

README.md +22 -20


@@ -1,14 +1,18 @@
 ---
-license: apache-2.0
 language:
 - en
 pipeline_tag: text-generation
 ---
 ## Introduction
 <p align="center">
-       &nbsp&nbsp🤗 <a href="https://huggingface.co/PowerInfer">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/PowerInfer">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://github.com/SJTU-IPADS/SmallThinker/blob/main/smallthinker-technical-report.pdf">Technical Report</a> &nbsp&nbsp
 </p>
 SmallThinker is a family of **on-device native** Mixture-of-Experts (MoE) language models specially designed for local deployment,
@@ -17,34 +21,32 @@ Designed from the ground up for resource-constrained environments,
 SmallThinker brings powerful, private, and low-latency AI directly to your personal devices,
 without relying on the cloud.
 ## Performance
 Note: The model is trained mainly on English.
-| Model                        | MMLU  | GPQA-diamond | MATH-500 | IFEVAL | LIVEBENCH | HUMANEVAL | Average |
-|------------------------------|-------|--------------|----------|--------|-----------|-----------|---------|
-| **SmallThinker-21BA3B-Instruct** | 84.43 | <u>55.05</u> | 82.4     | **85.77** | **60.3**      | <u>89.63</u>     | **76.26**   |
-| Gemma3-12b-it                | 78.52 | 34.85        | 82.4     | 74.68  | 44.5      | 82.93     | 66.31   |
-| Qwen3-14B                    | <u>84.82</u> | 50 | **84.6** | <u>85.21</u>| <u>59.5</u> | 88.41     | <u>75.42</u>   |
-| Qwen3-30BA3B                 | **85.1**  | 44.4     | <u>84.4</u> | 84.29  | 58.8      | **90.24**     | 74.54   |
-| Qwen3-8B                     | 81.79 | 38.89        | 81.6     | 83.92  | 49.5      | 85.9      | 70.26   |
-| Phi-4-14B                    | 84.58 | **55.45**    | 80.2     | 63.22  | 42.4      | 87.2      | 68.84   |
 For the MMLU evaluation, we use a 0-shot CoT setting.
 All models are evaluated in non-thinking mode.
 ## Speed
-| Model                               | Memory(GiB)         | i9 14900 | 1+13 8ge4 | rk3588 (16G) | Raspberry PI 5 |
-|--------------------------------------|---------------------|----------|-----------|--------------|----------------|
-| SmallThinker 21B+sparse              | 11.47               | 30.19    | 23.03     | 10.84        | 6.61           |
-| SmallThinker 21B+sparse+limited memory | limit 8G         | 20.30    | 15.50     | 8.56         | -              |
-| Qwen3 30B A3B                        | 16.20               | 33.52    | 20.18     | 9.07         | -              |
-| Qwen3 30B A3B+limited memory          | limit 8G            | 10.11    | 0.18      | 6.32         | -              |
-| Gemma 3n E2B                         | 1G, theoretically   | 36.88    | 27.06     | 12.50        | 6.66           |
-| Gemma 3n E4B                         | 2G, theoretically   | 21.93    | 16.58     | 7.37         | 4.01           |
 Note: i9 14900, 1+13 8ge4 use 4 threads, others use the number of threads that can achieve the maximum speed. All models here have been quantized to q4_0.
 You can deploy SmallThinker with offloading support using [PowerInfer](https://github.com/SJTU-IPADS/PowerInfer/tree/main/smallthinker)

 ---
 language:
 - en
+license: apache-2.0
 pipeline_tag: text-generation
+library_name: transformers
+tags:
+- moe
 ---
 ## Introduction
 <p align="center">
+       &nbsp&nbsp🤗 <a href="https://huggingface.co/PowerInfer">Hugging Face</a>&nbsp&nbsp | &nbsp&nbsp🤖 <a href="https://modelscope.cn/organization/PowerInfer">ModelScope</a>&nbsp&nbsp | &nbsp&nbsp 📑 <a href="https://github.com/SJTU-IPADS/SmallThinker/blob/main/smallthinker-technical-report.pdf">Technical Report</a> &nbsp&nbsp
+       &nbsp&nbsp 📚 <a href="https://huggingface.co/papers/2507.20984">Paper</a> &nbsp&nbsp | &nbsp&nbsp 💻 <a href="https://github.com/SJTU-IPADS/SmallThinker">GitHub Repo</a> &nbsp&nbsp
 </p>
 SmallThinker is a family of **on-device native** Mixture-of-Experts (MoE) language models specially designed for local deployment,
 SmallThinker brings powerful, private, and low-latency AI directly to your personal devices,
 without relying on the cloud.
 ## Performance
 Note: The model is trained mainly on English.
+| Model | MMLU | GPQA-diamond | MATH-500 | IFEVAL | LIVEBENCH | HUMANEVAL | Average |
+|---|---|---|---|---|---|---|---|
+| **SmallThinker-21BA3B-Instruct** | 84.43 | <u>55.05</u> | 82.4 | **85.77** | **60.3** | <u>89.63</u> | **76.26** |
+| Gemma3-12b-it | 78.52 | 34.85 | 82.4 | 74.68 | 44.5 | 82.93 | 66.31 |
+| Qwen3-14B | <u>84.82</u> | 50 | **84.6** | <u>85.21</u>| <u>59.5</u> | 88.41 | <u>75.42</u> |
+| Qwen3-30BA3B | **85.1** | 44.4 | <u>84.4</u> | 84.29 | 58.8 | **90.24** | 74.54 |
+| Qwen3-8B | 81.79 | 38.89 | 81.6 | 83.92 | 49.5 | 85.9 | 70.26 |
+| Phi-4-14B | 84.58 | **55.45** | 80.2 | 63.22 | 42.4 | 87.2 | 68.84 |
 For the MMLU evaluation, we use a 0-shot CoT setting.
 All models are evaluated in non-thinking mode.
 ## Speed
+| Model | Memory(GiB) | i9 14900 | 1+13 8ge4 | rk3588 (16G) | Raspberry PI 5 |
+|---|---|---|---|---|---|
+| SmallThinker 21B+sparse | 11.47 | 30.19 | 23.03 | 10.84 | 6.61 |
+| SmallThinker 21B+sparse+limited memory | limit 8G | 20.30 | 15.50 | 8.56 | - |
+| Qwen3 30B A3B | 16.20 | 33.52 | 20.18 | 9.07 | - |
+| Qwen3 30B A3B+limited memory | limit 8G | 10.11 | 0.18 | 6.32 | - |
+| Gemma 3n E2B | 1G, theoretically | 36.88 | 27.06 | 12.50 | 6.66 |
+| Gemma 3n E4B | 2G, theoretically | 21.93 | 16.58 | 7.37 | 4.01 |
 Note: i9 14900, 1+13 8ge4 use 4 threads, others use the number of threads that can achieve the maximum speed. All models here have been quantized to q4_0.
 You can deploy SmallThinker with offloading support using [PowerInfer](https://github.com/SJTU-IPADS/PowerInfer/tree/main/smallthinker)
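
The Average column in the Performance table appears to be the unweighted mean of the six benchmark scores. The README does not state this, so the sketch below treats it as an assumption and recomputes the means from the table values. The names `SCORES`, `REPORTED_AVERAGE`, `mean_score`, and `mismatched_averages` are illustrative only.

```python
# Scores copied from the Performance table, in column order:
# MMLU, GPQA-diamond, MATH-500, IFEVAL, LIVEBENCH, HUMANEVAL.
SCORES = {
    "SmallThinker-21BA3B-Instruct": [84.43, 55.05, 82.4, 85.77, 60.3, 89.63],
    "Gemma3-12b-it": [78.52, 34.85, 82.4, 74.68, 44.5, 82.93],
    "Qwen3-14B": [84.82, 50, 84.6, 85.21, 59.5, 88.41],
    "Qwen3-30BA3B": [85.1, 44.4, 84.4, 84.29, 58.8, 90.24],
    "Qwen3-8B": [81.79, 38.89, 81.6, 83.92, 49.5, 85.9],
    "Phi-4-14B": [84.58, 55.45, 80.2, 63.22, 42.4, 87.2],
}

# Values copied from the table's Average column.
REPORTED_AVERAGE = {
    "SmallThinker-21BA3B-Instruct": 76.26,
    "Gemma3-12b-it": 66.31,
    "Qwen3-14B": 75.42,
    "Qwen3-30BA3B": 74.54,
    "Qwen3-8B": 70.26,
    "Phi-4-14B": 68.84,
}


def mean_score(scores):
    """Unweighted mean of one model's benchmark scores."""
    return sum(scores) / len(scores)


def mismatched_averages(tolerance=0.01):
    """Return models whose reported average is more than `tolerance` from the mean."""
    return {
        model: mean_score(scores)
        for model, scores in SCORES.items()
        if abs(mean_score(scores) - REPORTED_AVERAGE[model]) > tolerance
    }
```

Under this assumption, every reported average falls within 0.01 of the recomputed mean, so `mismatched_averages()` returns an empty dict.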