---
library_name: mlx
tags:
- mlx
- quantized
- mixed-precision
- minimax
- minimax_m2
- moe
license: other
license_name: minimax-m2-license
license_link: LICENSE
base_model: MiniMaxAI/MiniMax-M2.7
base_model_relation: quantized
pipeline_tag: text-generation
language: en
---

# MiniMax-M2.7 — 100 GB (MLX)

Mixed-precision MLX build of [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7), prepared by [baa.ai](https://baa.ai).

## Metrics

| Metric | Value |
|---|---|
| **Size on disk** | **100.1 GB** (20 shards) |
| Group size | 64 |
| Framework | MLX (Apple Silicon) |

## Benchmarks

| Benchmark | Score | Notes |
|---|---|---|
| **HumanEval pass@1 (single-shot)** | **87.2%** (143/164) | 164/164 completed, 0 skipped |
| **HumanEval pass@1 (best-of-2)** | **94.5%** (155/164) | Retrying the 21 single-shot failures recovered 12 of them |
| Decode throughput (Apple Silicon) | **36.4 tok/s** (wall-gen) / 36.8 tok/s (task-mean) | 296,683 tokens generated over 136.1 min |

Settings for both runs match the **Recommended inference settings** below.

## Recommended inference settings

```python
sampler_params = {
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "repetition_penalty": 1.1,
    "max_tokens": 8192,
}
```

## Chat template — thinking mode

MiniMax-M2.7 uses a `<think>…</think>` reasoning block. **Important:** the base chat template injects `<think>\n` at the end of the prompt before generation, so the model's output begins *inside* the reasoning block with no opening tag. Strip everything up to and including the first `</think>`:

```python
def strip_thinking(text: str) -> str:
    if "</think>" in text:
        return text.split("</think>", 1)[1].strip()
    return text.strip()
```

Give the model enough token budget that it can finish reasoning and emit the closing `</think>` tag — we recommend at least 4096 tokens, and 8192 for harder problems.
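When the budget does run out mid-reasoning, the closing tag never appears and the entire output is reasoning text, so a plain strip would return the raw chain of thought. A minimal sketch of a stricter variant that distinguishes the two cases (the `split_reasoning` helper is illustrative, not part of this repo or of mlx_lm):

```python
from typing import Optional, Tuple

def split_reasoning(text: str) -> Tuple[str, Optional[str]]:
    """Split raw model output into (reasoning, answer).

    The answer is None when no closing tag was emitted, i.e. the token
    budget was exhausted before the model finished reasoning.
    """
    if "</think>" in text:
        reasoning, answer = text.split("</think>", 1)
        return reasoning.strip(), answer.strip()
    # No closing tag: everything generated so far is still reasoning.
    return text.strip(), None

# Completed run: reasoning closed, visible answer follows.
r, a = split_reasoning("2 + 2 is basic arithmetic...</think>The answer is 4.")
# Truncated run: no closing tag, so there is no answer to show the user.
tr, ta = split_reasoning("2 + 2 is basic arithmetic... hmm")
```

A caller can then retry with a larger `max_tokens` when the answer comes back `None` instead of surfacing reasoning text to the user.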
## Usage

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tokenizer = load("baa-ai/MiniMax-M2.7-RAM-100GB-MLX")

sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)
logits_processors = make_logits_processors(repetition_penalty=1.1)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a Python function that reverses a string."}],
    tokenize=False,
    add_generation_prompt=True,
)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=8192,
    sampler=sampler,
    logits_processors=logits_processors,
)

# The output begins inside the reasoning block; keep only the final answer.
if "</think>" in response:
    response = response.split("</think>", 1)[1].strip()
print(response)
```

## Hardware

- Apple Silicon Mac with **~112 GB** unified memory recommended for comfortable inference.
- Runs on less with swap, at substantially reduced throughput.

## Variants

| Variant | Size | Link |
|---|---|---|
| **100 GB** | **100.1 GB** | [**baa-ai/MiniMax-M2.7-RAM-100GB-MLX**](https://huggingface.co/baa-ai/MiniMax-M2.7-RAM-100GB-MLX) |
| 111 GB | 110.9 GB | [baa-ai/MiniMax-M2.7-RAM-111GB-MLX](https://huggingface.co/baa-ai/MiniMax-M2.7-RAM-111GB-MLX) |
| 116 GB | 116.0 GB | [baa-ai/MiniMax-M2.7-RAM-116GB-MLX](https://huggingface.co/baa-ai/MiniMax-M2.7-RAM-116GB-MLX) |
| 120 GB | 120.1 GB | [baa-ai/MiniMax-M2.7-RAM-120GB-MLX](https://huggingface.co/baa-ai/MiniMax-M2.7-RAM-120GB-MLX) |

## License

Inherited from the upstream [MiniMax-M2.7 license](https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE): non-commercial use is permitted; commercial use requires written authorization from MiniMax.

---

*Quantized by [baa.ai](https://baa.ai)*
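As an addendum to the hardware guidance: the memory recommendation can be sanity-checked before pulling 100 GB of shards. A small standard-library sketch, assuming a macOS or Linux host (the `physical_ram_gb` helper and the 112 GB threshold are illustrative, taken from the Hardware section above, and are not part of mlx_lm):

```python
import os

def physical_ram_gb() -> float:
    # POSIX sysconf keys available on macOS and Linux; other platforms
    # may not expose them, in which case this raises ValueError/OSError.
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9

if physical_ram_gb() < 112:
    print(f"~{physical_ram_gb():.0f} GB RAM detected; "
          "expect swapping with the 100 GB variant.")
```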