---
library_name: mlx
tags:
- mlx
- quantized
- mixed-precision
- minimax
- minimax_m2
- moe
license: other
license_name: minimax-m2-license
license_link: LICENSE
base_model: MiniMaxAI/MiniMax-M2.7
base_model_relation: quantized
pipeline_tag: text-generation
language: en
---

# MiniMax-M2.7 — 100 GB (MLX)

Mixed-precision MLX build of [MiniMaxAI/MiniMax-M2.7](https://huggingface.co/MiniMaxAI/MiniMax-M2.7), prepared by [baa.ai](https://baa.ai).

## Metrics

| Metric | Value |
|---|---|
| **Size on disk** | **100.1 GB** (20 shards) |
| Group size | 64 |
| Framework | MLX (Apple Silicon) |

## Benchmarks

| Benchmark | Score | Notes |
|---|---|---|
| **HumanEval pass@1 (single-shot)** | **87.2%** (143/164) | 164/164 completed, 0 skipped |
| **HumanEval pass@1 (best-of-2)** | **94.5%** (155/164) | Retrying the 21 single-shot failures recovered 12 of them |
| Decode throughput (Apple Silicon) | **36.4 tok/s** (wall-gen) / 36.8 tok/s (task-mean) | 296,683 tokens generated over 136.1 min |

Settings for both runs match the **Recommended inference settings** below.

## Recommended inference settings

```python
sampler_params = {
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "repetition_penalty": 1.1,
    "max_tokens": 8192,
}
```

## Chat template — thinking mode

MiniMax-M2.7 uses a `<think>…</think>` reasoning block. **Important:** the base chat template injects `<think>\n` at the end of the prompt before generation, so the model's output begins *inside* the reasoning block with no opening tag. Strip everything up to and including the first `</think>`:

```python
def strip_thinking(text: str) -> str:
    if "</think>" in text:
        return text.split("</think>", 1)[1].strip()
    return text.strip()
```

Give the model enough token budget that it can finish reasoning and emit the closing `</think>` tag — we recommend at least 4096 tokens, and 8192 for harder problems.
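When the budget does run out mid-reasoning, the closing tag never appears and the entire output is reasoning text, so a plain strip would return the raw chain of thought. A minimal sketch of a stricter variant that distinguishes the two cases (the `split_reasoning` helper is illustrative, not part of this repo or of mlx_lm):

```python
from typing import Optional, Tuple

def split_reasoning(text: str) -> Tuple[str, Optional[str]]:
    """Split raw model output into (reasoning, answer).

    The answer is None when no closing tag was emitted, i.e. the token
    budget was exhausted before the model finished reasoning.
    """
    if "</think>" in text:
        reasoning, answer = text.split("</think>", 1)
        return reasoning.strip(), answer.strip()
    # No closing tag: everything generated so far is still reasoning.
    return text.strip(), None

# Completed run: reasoning closed, visible answer follows.
r, a = split_reasoning("2 + 2 is basic arithmetic...</think>The answer is 4.")
# Truncated run: no closing tag, so there is no answer to show the user.
tr, ta = split_reasoning("2 + 2 is basic arithmetic... hmm")
```

A caller can then retry with a larger `max_tokens` when the answer comes back `None` instead of surfacing reasoning text to the user.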
## Usage

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tokenizer = load("baa-ai/MiniMax-M2.7-RAM-100GB-MLX")

sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)
logits_processors = make_logits_processors(repetition_penalty=1.1)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a Python function that reverses a string."}],
    tokenize=False,
    add_generation_prompt=True,
)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=8192,
    sampler=sampler,
    logits_processors=logits_processors,
)

# The output begins inside the reasoning block; keep only the final answer.
if "</think>" in response:
    response = response.split("</think>", 1)[1].strip()
print(response)
```

## Hardware

- Apple Silicon Mac with **~112 GB** unified memory recommended for comfortable inference.
- Runs on less with swap, at substantially reduced throughput.

## Variants

| Variant | Size | Link |
|---|---|---|
| **100 GB** | **100.1 GB** | [**baa-ai/MiniMax-M2.7-RAM-100GB-MLX**](https://huggingface.co/baa-ai/MiniMax-M2.7-RAM-100GB-MLX) |
| 111 GB | 110.9 GB | [baa-ai/MiniMax-M2.7-RAM-111GB-MLX](https://huggingface.co/baa-ai/MiniMax-M2.7-RAM-111GB-MLX) |
| 116 GB | 116.0 GB | [baa-ai/MiniMax-M2.7-RAM-116GB-MLX](https://huggingface.co/baa-ai/MiniMax-M2.7-RAM-116GB-MLX) |
| 120 GB | 120.1 GB | [baa-ai/MiniMax-M2.7-RAM-120GB-MLX](https://huggingface.co/baa-ai/MiniMax-M2.7-RAM-120GB-MLX) |

## License

Inherited from the upstream [MiniMax-M2.7 license](https://huggingface.co/MiniMaxAI/MiniMax-M2.7/blob/main/LICENSE): non-commercial use is permitted; commercial use requires written authorization from MiniMax.

---

*Quantized by [baa.ai](https://baa.ai)*
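As an addendum to the hardware guidance: the memory recommendation can be sanity-checked before pulling 100 GB of shards. A small standard-library sketch, assuming a macOS or Linux host (the `physical_ram_gb` helper and the 112 GB threshold are illustrative, taken from the Hardware section above, and are not part of mlx_lm):

```python
import os

def physical_ram_gb() -> float:
    # POSIX sysconf keys available on macOS and Linux; other platforms
    # may not expose them, in which case this raises ValueError/OSError.
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9

if physical_ram_gb() < 112:
    print(f"~{physical_ram_gb():.0f} GB RAM detected; "
          "expect swapping with the 100 GB variant.")
```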