mratsim committed on Nov 9

Commit

9dea919

verified

1 Parent(s): cdf07cc

Upload folder using huggingface_hub

Files changed (25)

README.md +163 -0
chat_template.jinja +87 -0
config.json +72 -0
generation_config.json +6 -0
model-00001-of-00015.safetensors +3 -0
model-00002-of-00015.safetensors +3 -0
model-00003-of-00015.safetensors +3 -0
model-00004-of-00015.safetensors +3 -0
model-00005-of-00015.safetensors +3 -0
model-00006-of-00015.safetensors +3 -0
model-00007-of-00015.safetensors +3 -0
model-00008-of-00015.safetensors +3 -0
model-00009-of-00015.safetensors +3 -0
model-00010-of-00015.safetensors +3 -0
model-00011-of-00015.safetensors +3 -0
model-00012-of-00015.safetensors +3 -0
model-00013-of-00015.safetensors +3 -0
model-00014-of-00015.safetensors +3 -0
model-00015-of-00015.safetensors +3 -0
model.safetensors.index.json +0 -0
recipe.yaml +6 -0
special_tokens_map.json +23 -0
tokenizer.json +0 -0
tokenizer.model +3 -0
tokenizer_config.json +0 -0

README.md ADDED Viewed

	@@ -0,0 +1,163 @@

+---
+base_model:
+- MarsupialAI/Monstral-123B-v2
+datasets:
+- neuralmagic/calibration
+- HuggingFaceH4/ultrachat_200k
+- nvidia/OpenCodeInstruct
+- CSJianYang/CodeArena
+- nvidia/OpenScienceReasoning-2
+- MegaScience/MegaScience
+- Gryphe/Opus-WritingPrompts
+- ServiceNow-AI/M2Lingual
+- anthracite-org/stheno-filtered-v1.1
+- zerofata/Instruct-Anime
+- zerofata/Instruct-Anime-CreativeWriting
+- sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo
+- nvidia/OpenMathInstruct-2
+- fka/awesome-chatgpt-prompts
+- databricks/databricks-dolly-15k
+- FreedomIntelligence/SocraticChat
+- ruggsea/stanford-encyclopedia-of-philosophy_instruct
+- mlfoundations-dev/stackexchange_philosophy
+- theoldmandthesea/17k_business_book
+- anthracite-org/nopm_claude_writing_fixed
+- PJMixers/grimulkan_physical-reasoning-ShareGPT
+- PJMixers/grimulkan_theory-of-mind-ShareGPT
+- HuggingFaceH4/no_robots
+- nvidia/HelpSteer
+- garage-bAInd/Open-Platypus
+- AquaV/US-Army-Survival-Sharegpt
+- AquaV/Interrogation-Sharegpt
+- AquaV/Multi-Environment-Operations-Sharegpt
+- AquaV/Resistance-Sharegpt
+- PocketDoc/Dans-Kinomaxx-VanillaBackrooms
+pipeline_tag: text-generation
+tags:
+- text adventure
+- roleplay
+- rpg
+- creative writing
+- nvfp4
+- vllm
+- conversational
+---
+# Monstral-123B-v2 (NVFP4 quant)
+This repo contains Monstral-123B-v2 quantized with NVFP4, a 4-bit compression format suited to maximum performance on Nvidia RTX 5000-series GPUs.
+> ℹ️ This model is limited to the Hopper and Blackwell families of GPUs and will not work on RTX 3000-series and RTX 4000-series GPUs.
+> On those GPUs, please use the NVFP4A16 model or enable slow emulation with `export VLLM_USE_NVFP4_CT_EMULATIONS=1`.
+- Original Model:
+  - [MarsupialAI/Monstral-123B-v2](https://huggingface.co/MarsupialAI/Monstral-123B-v2)
+- Fallback model for RTX 3000-series and 4000-series GPUs:
+  - TBD
+NVFP4 writeups:
+- https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/
+- https://arxiv.org/pdf/2509.25149
+## 📥 Usage & Running Instructions
+The model was tested with vLLM + 1x RTX Pro 6000.
+### Hardware
+As of October 2025, this quantized model can only be run on architectures with hardware FP4 support (Blackwell or later).
+Cheaper GPUs with 24GB of VRAM (RTX 5080 Super) that can run this model in pairs are expected in Q1 2026.
+You may still run this model with emulation, albeit slowly, by setting `export VLLM_USE_NVFP4_CT_EMULATIONS=1`;
+otherwise use the alternative [TBD].
+### Recommendations
+It is recommended to use at most 65K context to avoid significant degradation (https://fiction.live/stories/Fiction-liveBench-Sept-29-2025/oQdzQvKHw8JyXbN87).
+This model is recommended with "min-p" sampling. This sampler is available through
+both the older Text completions API and the Chat completions API (and there is a new Responses API);
+however, most LLM frontends only support modifying min-p when using Text completions.
+You can, however, use `--override-generation-config "${SAMPLER_OVERRIDE}"` to override the sampler defaults (which are a merge of generation_config.json and vLLM defaults).
+### Running script
+```bash
+# Model configuration (Mandatory)
+MODEL="mratsim/Monstral-123B-v2-NVFP4"
+MODELNAME="Monstral-123B-v2"
+CONTEXT_SIZE=32768
+GPU_UTIL=0.85
+# Sampling configuration (Optional, if departing from `generation_config.json`)
+# Using default vLLM values
+SAMPLER_OVERRIDE='{"temperature": 1, "min_p": 0, "top_p": 1, "repetition_penalty": 1}'
+# Prevent vLLM from using 100% CPU when idle (Very Recommended)
+export VLLM_SLEEP_WHEN_IDLE=1
+# Use FlashInfer backend (fastest, recommended, "instant" context reprocessing)
+export VLLM_ATTENTION_BACKEND=FLASHINFER
+vllm serve "${MODEL}" \
+  --served-model-name "${MODELNAME}" \
+  --gpu-memory-utilization ${GPU_UTIL} \
+  --max-model-len "${CONTEXT_SIZE}" \
+  --override-generation-config "${SAMPLER_OVERRIDE}"
+```
+> ℹ️ The FlashInfer backend may fail with an error similar to
+> `Failed to allocate memory for batch_prefill_tmp_v with size XYZ and alignment 16 in AlignedAllocator`.
+>
+> A workaround is to run the following sed replacement inside the vLLM installation directory to increase the workspace buffer size:
+> ```bash
+> sed -i 's/FLASHINFER_WORKSPACE_BUFFER_SIZE = 256 \* 1024 \* 1024/FLASHINFER_WORKSPACE_BUFFER_SIZE = 512 \* 1024 \* 1024/g' vllm/v1/attention/backends/flashinfer.py
+> ```
+> This will be fixed by PR https://github.com/vllm-project/vllm/pull/25344
+## 🔬 Quantization method
+The llmcompressor library was used with the following recipe:
+```yaml
+default_stage:
+  default_modifiers:
+    QuantizationModifier:
+      targets: [Linear]
+      ignore: [lm_head]
+      scheme: NVFP4
+```
+and calibrated on 3 samples from each of the following datasets (90 in total), with a sequence length of 8192:
+- [neuralmagic/calibration](https://huggingface.co/datasets/neuralmagic/calibration)
+- [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)
+- [nvidia/OpenCodeInstruct](https://huggingface.co/datasets/nvidia/OpenCodeInstruct)
+- [CSJianYang/CodeArena](https://huggingface.co/datasets/CSJianYang/CodeArena)
+- [nvidia/OpenScienceReasoning-2](https://huggingface.co/datasets/nvidia/OpenScienceReasoning-2)
+- [MegaScience/MegaScience](https://huggingface.co/datasets/MegaScience/MegaScience)
+- [Gryphe/Opus-WritingPrompts](https://huggingface.co/datasets/Gryphe/Opus-WritingPrompts)
+- [ServiceNow-AI/M2Lingual](https://huggingface.co/datasets/ServiceNow-AI/M2Lingual)
+- [anthracite-org/stheno-filtered-v1.1](https://huggingface.co/datasets/anthracite-org/stheno-filtered-v1.1)
+- [zerofata/Instruct-Anime](https://huggingface.co/datasets/zerofata/Instruct-Anime)
+- [zerofata/Instruct-Anime-CreativeWriting](https://huggingface.co/datasets/zerofata/Instruct-Anime-CreativeWriting)
+- [sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo](https://huggingface.co/datasets/sam-paech/gutenberg3-generalfiction-scifi-fantasy-romance-adventure-dpo)
+- [nvidia/OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2)
+- [fka/awesome-chatgpt-prompts](https://huggingface.co/datasets/fka/awesome-chatgpt-prompts)
+- [databricks/databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)
+- [FreedomIntelligence/SocraticChat](https://huggingface.co/datasets/FreedomIntelligence/SocraticChat)
+- [ruggsea/stanford-encyclopedia-of-philosophy_instruct](https://huggingface.co/datasets/ruggsea/stanford-encyclopedia-of-philosophy_instruct)
+- [mlfoundations-dev/stackexchange_philosophy](https://huggingface.co/datasets/mlfoundations-dev/stackexchange_philosophy)
+- [theoldmandthesea/17k_business_book](https://huggingface.co/datasets/theoldmandthesea/17k_business_book)
+- [anthracite-org/nopm_claude_writing_fixed](https://huggingface.co/datasets/anthracite-org/nopm_claude_writing_fixed)
+- [PJMixers/grimulkan_physical-reasoning-ShareGPT](https://huggingface.co/datasets/PJMixers/grimulkan_physical-reasoning-ShareGPT)
+- [PJMixers/grimulkan_theory-of-mind-ShareGPT](https://huggingface.co/datasets/PJMixers/grimulkan_theory-of-mind-ShareGPT)
+- [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)
+- [nvidia/HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer)
+- [garage-bAInd/Open-Platypus](https://huggingface.co/datasets/garage-bAInd/Open-Platypus)
+- [AquaV/US-Army-Survival-Sharegpt](https://huggingface.co/datasets/AquaV/US-Army-Survival-Sharegpt)
+- [AquaV/Interrogation-Sharegpt](https://huggingface.co/datasets/AquaV/Interrogation-Sharegpt)
+- [AquaV/Multi-Environment-Operations-Sharegpt](https://huggingface.co/datasets/AquaV/Multi-Environment-Operations-Sharegpt)
+- [AquaV/Resistance-Sharegpt](https://huggingface.co/datasets/AquaV/Resistance-Sharegpt)
+- [PocketDoc/Dans-Kinomaxx-VanillaBackrooms](https://huggingface.co/datasets/PocketDoc/Dans-Kinomaxx-VanillaBackrooms)
+NVFP4 quantization requires very few samples; llmcompressor uses 20 in its examples.
+By comparison, 512 are recommended for GPTQ and 64 for AWQ (https://minjiazhang.github.io/courses/fall24-resource/slides/awq.pdf).
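
Note on the README's sampling recommendation: for frontends that cannot set min-p, the request itself can carry the value. The sketch below builds a Chat completions payload for the server started by the running script. It assumes that vLLM's OpenAI-compatible server accepts `min_p` as an extra sampling field in the request body. The URL, the helper names, and the sampler values are illustrative assumptions, not settings taken from the README.

```python
import json
import urllib.request

# Assumed address of a local `vllm serve` instance (not stated in the README).
API_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_request(prompt, model="Monstral-123B-v2", min_p=0.05, temperature=1.0):
    """Build a Chat completions payload that sets min-p per request.

    `min_p` is not part of the OpenAI schema; it is assumed here to be read
    by vLLM as an extra sampling parameter. Values are placeholders, not
    tuned recommendations.
    """
    return {
        "model": model,  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "min_p": min_p,
        "max_tokens": 256,
    }


def send_chat_request(payload, url=API_URL):
    """POST the payload and return the assistant's reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    payload = build_chat_request("Describe the tavern the party just entered.")
    print(json.dumps(payload, indent=2))
    # Uncomment once the server from the running script is up:
    # print(send_chat_request(payload))
```

A per-request value of this kind would take precedence over the server-side defaults set with `--override-generation-config`, which remain useful for frontends that send no sampler fields at all.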

chat_template.jinja ADDED Viewed

	@@ -0,0 +1,87 @@

+{%- if messages[0]["role"] == "system" %}
+    {%- set system_message = messages[0]["content"] %}
+    {%- set loop_messages = messages[1:] %}
+{%- else %}
+    {%- set loop_messages = messages %}
+{%- endif %}
+{%- if not tools is defined %}
+    {%- set tools = none %}
+{%- endif %}
+{%- set user_messages = loop_messages | selectattr("role", "equalto", "user") | list %}
+{#- This block checks for alternating user/assistant messages, skipping tool calling messages #}
+{%- set ns = namespace() %}
+{%- set ns.index = 0 %}
+{%- for message in loop_messages %}
+    {%- if not (message.role == "tool" or message.role == "tool_results" or (message.tool_calls is defined and message.tool_calls is not none)) %}
+        {%- if (message["role"] == "user") != (ns.index % 2 == 0) %}
+            {{- raise_exception("After the optional system message, conversation roles must alternate user/assistant/user/assistant/...") }}
+        {%- endif %}
+        {%- set ns.index = ns.index + 1 %}
+    {%- endif %}
+{%- endfor %}
+{{- bos_token }}
+{%- for message in loop_messages %}
+    {%- if message["role"] == "user" %}
+        {%- if tools is not none and (message == user_messages[-1]) %}
+            {{- "[AVAILABLE_TOOLS] [" }}
+            {%- for tool in tools %}
+                {%- set tool = tool.function %}
+                {{- '{"type": "function", "function": {' }}
+                {%- for key, val in tool.items() if key != "return" %}
+                    {%- if val is string %}
+                        {{- '"' + key + '": "' + val + '"' }}
+                    {%- else %}
+                        {{- '"' + key + '": ' + val|tojson }}
+                    {%- endif %}
+                    {%- if not loop.last %}
+                        {{- ", " }}
+                    {%- endif %}
+                {%- endfor %}
+                {{- "}}" }}
+                {%- if not loop.last %}
+                    {{- ", " }}
+                {%- else %}
+                    {{- "]" }}
+                {%- endif %}
+            {%- endfor %}
+            {{- "[/AVAILABLE_TOOLS]" }}
+        {%- endif %}
+        {%- if loop.last and system_message is defined %}
+            {{- "[INST] " + system_message + "\n\n" + message["content"] + "[/INST]" }}
+        {%- else %}
+            {{- "[INST] " + message["content"] + "[/INST]" }}
+        {%- endif %}
+    {%- elif message.tool_calls is defined and message.tool_calls is not none %}
+        {{- "[TOOL_CALLS] [" }}
+        {%- for tool_call in message.tool_calls %}
+            {%- set out = tool_call.function|tojson %}
+            {{- out[:-1] }}
+            {%- if not tool_call.id is defined or tool_call.id|length != 9 %}
+                {{- raise_exception("Tool call IDs should be alphanumeric strings with length 9!") }}
+            {%- endif %}
+            {{- ', "id": "' + tool_call.id + '"}' }}
+            {%- if not loop.last %}
+                {{- ", " }}
+            {%- else %}
+                {{- "]" + eos_token }}
+            {%- endif %}
+        {%- endfor %}
+    {%- elif message["role"] == "assistant" %}
+        {{- " " + message["content"]|trim + eos_token}}
+    {%- elif message["role"] == "tool_results" or message["role"] == "tool" %}
+        {%- if message.content is defined and message.content.content is defined %}
+            {%- set content = message.content.content %}
+        {%- else %}
+            {%- set content = message.content %}
+        {%- endif %}
+        {{- '[TOOL_RESULTS] {"content": ' + content|string + ", " }}
+        {%- if not message.tool_call_id is defined or message.tool_call_id|length != 9 %}
+            {{- raise_exception("Tool call IDs should be alphanumeric strings with length 9!") }}
+        {%- endif %}
+        {{- '"call_id": "' + message.tool_call_id + '"}[/TOOL_RESULTS]' }}
+    {%- else %}
+        {{- raise_exception("Only user and assistant roles are supported, with the exception of an initial optional system message!") }}
+    {%- endif %}
+{%- endfor %}
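
Note on the template above: for conversations without tools, its behaviour reduces to a few string concatenations. The following is a plain-Python approximation written for illustration, not the code path that vLLM or transformers actually runs (they render the Jinja file). The function name `render_plain_chat` is hypothetical, and the token strings are taken from special_tokens_map.json in this commit.

```python
BOS, EOS = "<s>", "</s>"  # bos_token / eos_token from special_tokens_map.json


def render_plain_chat(messages):
    """Approximate chat_template.jinja for tool-free conversations."""
    system_message = None
    if messages and messages[0]["role"] == "system":
        system_message = messages[0]["content"]
        messages = messages[1:]

    parts = [BOS]
    for index, message in enumerate(messages):
        role, content = message["role"], message["content"]
        if role not in ("user", "assistant"):
            raise ValueError("only user and assistant roles are handled here")
        # Roles must alternate user/assistant, starting with user.
        if (role == "user") != (index % 2 == 0):
            raise ValueError("conversation roles must alternate user/assistant")
        if role == "user":
            is_last = index == len(messages) - 1
            if is_last and system_message is not None:
                # The system prompt is folded into the final message only.
                content = system_message + "\n\n" + content
            parts.append("[INST] " + content + "[/INST]")
        else:
            # Assistant text is trimmed and closed with the EOS token.
            parts.append(" " + content.strip() + EOS)
    return "".join(parts)


if __name__ == "__main__":
    chat = [
        {"role": "system", "content": "You are a narrator."},
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there. "},
        {"role": "user", "content": "Continue"},
    ]
    print(render_plain_chat(chat))
```

One consequence worth noticing is that the system message is not placed at the start of the prompt: it is prepended to the last message, and only when that message is a user turn.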

config.json ADDED Viewed

	@@ -0,0 +1,72 @@

+{
+  "architectures": [
+    "MistralForCausalLM"
+  ],
+  "attention_dropout": 0.0,
+  "bos_token_id": 1,
+  "dtype": "bfloat16",
+  "eos_token_id": 2,
+  "head_dim": 128,
+  "hidden_act": "silu",
+  "hidden_size": 12288,
+  "initializer_range": 0.02,
+  "intermediate_size": 28672,
+  "max_position_embeddings": 131072,
+  "model_type": "mistral",
+  "num_attention_heads": 96,
+  "num_hidden_layers": 88,
+  "num_key_value_heads": 8,
+  "quantization_config": {
+    "config_groups": {
+      "group_0": {
+        "format": "nvfp4-pack-quantized",
+        "input_activations": {
+          "actorder": null,
+          "block_structure": null,
+          "dynamic": "local",
+          "group_size": 16,
+          "num_bits": 4,
+          "observer": "minmax",
+          "observer_kwargs": {},
+          "strategy": "tensor_group",
+          "symmetric": true,
+          "type": "float"
+        },
+        "output_activations": null,
+        "targets": [
+          "Linear"
+        ],
+        "weights": {
+          "actorder": null,
+          "block_structure": null,
+          "dynamic": false,
+          "group_size": 16,
+          "num_bits": 4,
+          "observer": "minmax",
+          "observer_kwargs": {},
+          "strategy": "tensor_group",
+          "symmetric": true,
+          "type": "float"
+        }
+      }
+    },
+    "format": "nvfp4-pack-quantized",
+    "global_compression_ratio": null,
+    "ignore": [
+      "lm_head"
+    ],
+    "kv_cache_scheme": null,
+    "quant_method": "compressed-tensors",
+    "quantization_status": "compressed",
+    "sparsity_config": {},
+    "transform_config": {},
+    "version": "0.12.2"
+  },
+  "rms_norm_eps": 1e-05,
+  "rope_theta": 1000000.0,
+  "sliding_window": null,
+  "tie_word_embeddings": false,
+  "transformers_version": "4.56.2",
+  "use_cache": true,
+  "vocab_size": 32768
+}
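
Note on the config above: the attention fields are enough to estimate how much KV cache a given context length needs, which is what `CONTEXT_SIZE` and `GPU_UTIL` in the README script trade off against. The sketch below is a back-of-the-envelope calculation, assuming the cache is kept in 16-bit precision (an assumption drawn from `"kv_cache_scheme": null` and `"dtype": "bfloat16"`) and ignoring any allocator or paging overhead in the serving engine.

```python
# Values copied from config.json.
NUM_HIDDEN_LAYERS = 88
NUM_KEY_VALUE_HEADS = 8
HEAD_DIM = 128
# Assumption: 2 bytes per cached value (bfloat16, no KV cache quantization).
BYTES_PER_VALUE = 2


def kv_cache_bytes(num_tokens):
    """Bytes of key + value cache needed to hold `num_tokens` tokens."""
    per_token = (
        2  # one key tensor and one value tensor per layer
        * NUM_HIDDEN_LAYERS
        * NUM_KEY_VALUE_HEADS
        * HEAD_DIM
        * BYTES_PER_VALUE
    )
    return per_token * num_tokens


if __name__ == "__main__":
    for context in (32768, 65536, 131072):
        gib = kv_cache_bytes(context) / 2**30
        print(f"{context:>6} tokens -> {gib:.1f} GiB of KV cache")
```

Under these assumptions, one full-length sequence at the script's `CONTEXT_SIZE=32768` needs 11 GiB of cache, and the figure scales linearly with context length.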

generation_config.json ADDED Viewed

	@@ -0,0 +1,6 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "transformers_version": "4.56.2"
+}

model-00001-of-00015.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:eb1bb636dd11b91a6d7094aa210b72124792104f7a1dffb5e8b62817cdc8fc3d
+size 4882434912

model-00002-of-00015.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c5e7bbe004812157312d617881686fb7d5a14973a5f160e3af1cbcd8903ccff9
+size 4869903000

model-00003-of-00015.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:68ddadb0015c469d2dd31a224a81eaa67c20c11a917829cde5ffafd1335c96a7
+size 4869903136

model-00004-of-00015.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ca270cbd4b5264aa895d86bdc51b4d4cbcef718b1cd6594cdee1f9d8e0e22b22
+size 4969044352

model-00005-of-00015.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:015ddc0b600fb39d6d6f0bfed405cebe71133de48a422df50bff739a5c6c0736
+size 4954838264

model-00006-of-00015.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:512ef6751735dd70868ccfdb65a0793d5420731a47a87a208399fba927727e8f
+size 4869903136

model-00007-of-00015.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fc38472c190104df1f64c3c2c040145a4bc1ce6f132b46488eb1dadf20e03107
+size 4969044352

model-00008-of-00015.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a3fa5410e2ba4820e5ffecf4e569a31979f3b9218f00273977e813b9fb370703
+size 4954838264

model-00009-of-00015.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a8e77db4bd098a5cd7a12625b3bc375d305990b38021911e4afe3cae411ef395
+size 4869903136

model-00010-of-00015.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f8b17418eb1d683d62712baf719afa0ec9f39dedc3d52b4968bee8acb777e430
+size 4969044352

model-00011-of-00015.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:25dc77eadc5eb6ab43f3219627889015863aaa5f2e6261ce045fbf24992aba66
+size 4954838264

model-00012-of-00015.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:06d4fae5a2bcc22c075395bb8c66dbd919757b756e1e1dfa547a23acf34e70b8
+size 4869903136

model-00013-of-00015.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:75da9ea77cd10b8cfc3704a5c7b7246f68dc869f21cb6ead4be92979bc08b5a6
+size 4969044352

model-00014-of-00015.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8608daf67c0028261bc74ebd6c4f6e08c121bb9a956359a39fcdc976685f3927
+size 4954838264

model-00015-of-00015.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:58d340a00490def8c97844f4dd0d3074ffcb101cb18ea11b55e96ffd65e6120d
+size 1201743176
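
Note on the shard pointers above: summing the `size` fields of the 15 Git LFS pointers gives the on-disk footprint of the quantized weights, which is a lower bound on the GPU memory needed before any KV cache is allocated. The sizes below are copied from this commit; the variable names are illustrative.

```python
# `size` fields of model-00001 ... model-00015, copied from the LFS pointers.
SHARD_SIZES = [
    4882434912, 4869903000, 4869903136, 4969044352, 4954838264,
    4869903136, 4969044352, 4954838264, 4869903136, 4969044352,
    4954838264, 4869903136, 4969044352, 4954838264, 1201743176,
]

total_bytes = sum(SHARD_SIZES)

if __name__ == "__main__":
    print(f"shards:      {len(SHARD_SIZES)}")
    print(f"total bytes: {total_bytes}")
    print(f"total GB:    {total_bytes / 1e9:.2f}")
    print(f"total GiB:   {total_bytes / 2**30:.2f}")
```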

model.safetensors.index.json ADDED Viewed

The diff for this file is too large to render. See raw diff

recipe.yaml ADDED Viewed

	@@ -0,0 +1,6 @@

+default_stage:
+  default_modifiers:
+    QuantizationModifier:
+      targets: [Linear]
+      ignore: [lm_head]
+      scheme: NVFP4
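
Note on the recipe above: `scheme: NVFP4` together with `"group_size": 16`, `"num_bits": 4` and `"type": "float"` in config.json means each run of 16 values shares one scale and each value is stored as a 4-bit float. The sketch below illustrates that idea on plain Python lists. It is not llmcompressor's implementation: the scale is kept as a Python float, whereas the real format stores the per-group scale in FP8 and adds a per-tensor scale on top. The function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16  # "group_size" in config.json


def quantize_group(values):
    """Fake-quantize one group; returns (scale, dequantized values)."""
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the group onto the largest grid point.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    dequantized = []
    for v in values:
        # Snap the scaled magnitude to the nearest grid point, keep the sign.
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        dequantized.append((nearest if v >= 0 else -nearest) * scale)
    return scale, dequantized


def quantize_row(row):
    """Fake-quantize a row of weights, one group of GROUP_SIZE at a time."""
    out = []
    for start in range(0, len(row), GROUP_SIZE):
        _, dequantized = quantize_group(row[start:start + GROUP_SIZE])
        out.extend(dequantized)
    return out


if __name__ == "__main__":
    row = [0.01 * i - 0.1 for i in range(32)]
    restored = quantize_row(row)
    worst = max(abs(a - b) for a, b in zip(row, restored))
    print(f"max absolute round-trip error: {worst:.4f}")
```

Because every group gets its own scale, a single outlier only coarsens the 16 values around it rather than the whole tensor, which is the motivation for the small group size.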

special_tokens_map.json ADDED Viewed

	@@ -0,0 +1,23 @@

+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED Viewed

The diff for this file is too large to render. See raw diff

tokenizer.model ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:59f95e28944c062244741268596badc900df86c7f5ded05088d2da22a7379e06
+size 587583

tokenizer_config.json ADDED Viewed

The diff for this file is too large to render. See raw diff