caiovicentino1 committed 7 days ago

Commit

6837d50

0 Parent(s):

chore: squash history to reclaim orphaned LFS objects (HEAD unchanged)

Files changed (25)

.gitattributes +36 -0
README.md +218 -0
assets/before_after.png +0 -0
assets/speed_vram_scatter.png +0 -0
assets/vram_comparison.png +0 -0
assets/weight_distribution.png +0 -0
config.json +72 -0
configuration_nemotron_h.py +262 -0
download_nemotron.png +0 -0
generation_config.json +7 -0
model-00000-of-00007.safetensors +3 -0
model-00001-of-00007.safetensors +3 -0
model-00002-of-00007.safetensors +3 -0
model-00003-of-00007.safetensors +3 -0
model-00004-of-00007.safetensors +3 -0
model-00005-of-00007.safetensors +3 -0
model-00006-of-00007.safetensors +3 -0
model.safetensors.index.json +0 -0
pipeline_nemotron.png +0 -0
polar_config.json +0 -0
ppl_nemotron.png +0 -0
special_tokens_map.json +24 -0
speed_nemotron.png +0 -0
tokenizer.json +3 -0
tokenizer_config.json +0 -0

.gitattributes ADDED

	@@ -0,0 +1,36 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
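As a rough illustration of what these rules do, the sketch below checks a few file names from this commit against a handful of the patterns above. It uses Python's `fnmatch` on the basename as an approximation of Git's attribute matching, which is only a reasonable assumption for slash-free patterns such as these. `LFS_PATTERNS` and `is_lfs_tracked` are illustrative names, not part of Git or the Hub tooling.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# A subset of the slash-free patterns from the .gitattributes hunk.
LFS_PATTERNS = ["*.safetensors", "*.bin", "*.pt", "*.zip", "*tfevents*", "tokenizer.json"]


def is_lfs_tracked(path: str, patterns=LFS_PATTERNS) -> bool:
    """Approximate check: match the file's basename against each pattern."""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in patterns)


for path in ["model-00000-of-00007.safetensors", "tokenizer.json", "config.json", "README.md"]:
    kind = "LFS pointer" if is_lfs_tracked(path) else "regular Git blob"
    print(f"{path}: {kind}")
```

This agrees with the hunks in this commit: the `.safetensors` shards and `tokenizer.json` appear as three-line LFS pointers, while `config.json` and `README.md` appear as ordinary text diffs.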

README.md ADDED

	@@ -0,0 +1,218 @@

+---
+license: other
+license_name: nvidia-open-model-license
+base_model: nvidia/Nemotron-Cascade-2-30B-A3B
+tags:
+  - polarquant
+  - moe
+  - expert-offloading
+  - nemotron
+  - mamba
+  - consumer-gpu
+  - vllm
+library_name: transformers
+pipeline_tag: text-generation
+---
+# Nemotron-Cascade-2-30B-A3B — Expert Offloading + PolarQuant Q5
+**30B MoE model at 7.6 GB VRAM, 15+ tok/s, correct output.**
+![VRAM Before & After](assets/before_after.png)
+![Speed vs VRAM Tradeoff](assets/speed_vram_scatter.png)
+## Benchmark Results
+| Config | tok/s | Model VRAM | Quality |
+|--------|-------|------------|---------|
+| Full BF16 (baseline) | 54.5 | 92 GB | Perfect |
+| **Expert cache=8 (LFRU)** | **16.4** | **7.6 GB** | **Perfect** |
+| Expert cache=8 (LRU) | 14.6-16.9 | 7.6 GB | Perfect |
+| Expert cache=8 (patcher) | 15.6 | 38 GB* | Perfect |
+| Expert cache=16 (patcher) | 19.6 | 42 GB* | Perfect |
+| Expert cache=32 (patcher) | 24.4 | 48 GB* | Perfect |
+*Patcher: the table shows model VRAM after loading; peak VRAM during loading is 92 GB because experts are loaded to the GPU first. Fork: experts load directly to CPU, so peak VRAM stays at 7.6 GB.
+## Quick Start — Fork (Recommended)
+**RTX 4090 / RTX 3090 / any 24+ GB GPU:**
+```bash
+# Install the fork (uses pre-compiled C extensions, no CUDA build needed)
+VLLM_USE_PRECOMPILED=1 pip install \
+  "git+https://github.com/caiovicentino/vllm-expert-offload.git@nemotron-expert-offload"
+# Run
+FLASHINFER_DISABLE_VERSION_CHECK=1 python -c "
+from vllm import LLM, SamplingParams
+llm = LLM(
+    model='nvidia/Nemotron-Cascade-2-30B-A3B',
+    trust_remote_code=True,
+    dtype='bfloat16',
+    max_model_len=4096,
+    enforce_eager=True,
+    moe_expert_cache_size=8,
+    kernel_config={'moe_backend': 'triton'},
+    gpu_memory_utilization=0.95,
+)
+out = llm.generate(['What is 2+3?'], SamplingParams(max_tokens=200))
+print(out[0].outputs[0].text)
+"
+```
+### Cache Size Guide
+| Cache | Model VRAM | Speed | Target GPU |
+|-------|------------|-------|------------|
+| 8 | ~7.6 GB | ~15 tok/s | RTX 4090 (24 GB) |
+| 16 | ~11 GB | ~20 tok/s | RTX 4090 (24 GB) |
+| 32 | ~19 GB | ~25 tok/s | RTX 4090 (24 GB) |
+| 64 | ~34 GB | ~35 tok/s | A6000 (48 GB) |
+### Requirements
+- **GPU**: 24+ GB VRAM (RTX 3090/4090 or better)
+- **CPU RAM**: 64 GB (expert weights stored in CPU pinned memory)
+- **CUDA**: 12.0+
+- **Python**: 3.10+
+## Alternative: PolarQuant Q5 (Full VRAM)
+For GPUs with 64+ GB VRAM (A100/H100):
+```bash
+pip install polarengine-vllm
+polarquant-convert caiovicentino1/Nemotron-Cascade-2-30B-A3B-PolarQuant-Q5 /tmp/model
+vllm serve /tmp/model --trust-remote-code --dtype bfloat16
+```
+- Download: 20 GB (Q5 bit-packed, 3.15x smaller)
+- Speed: 175 tok/s (vLLM native)
+- PPL: 7.47 (+0.02 vs BF16 — near-lossless)
+![VRAM Comparison](assets/vram_comparison.png)
+![Weight Distribution](assets/weight_distribution.png)
+## How Expert Offloading Works
+Nemotron-Cascade-2-30B-A3B has 23 MoE layers with 128 routed experts each, but only 6 routed experts per layer are active for any given token. Expert weights make up **92.9% of all weights**, and most of them sit idle at any moment.
+```
+┌──────────────────┐     ┌─────────────────────┐
+│   GPU (~8 GB)    │     │   CPU (~60 GB)       │
+│                  │     │                      │
+│ Non-expert:      │     │ Expert weights:      │
+│  - Mamba SSM     │     │  128 experts × 23    │
+│  - Attention     │     │  layers (pinned mem) │
+│  - Norms/Router  │     │                      │
+│                  │     └──────────┬───────────┘
+│ LRU Cache:       │               │
+│  8 expert slots  │◄── H2D copy ──┘
+│  (GPU buffer)    │   on cache miss
+└──────────────────┘
+```
+Cache hit → zero transfer (fast). Cache miss → copy 1 expert (~20 MB).
+## Perplexity (WikiText-2)
+| Config | PPL | Delta |
+|--------|-----|-------|
+| BF16 baseline | 7.45 | — |
+| **Expert cache=8** | **6.09** | **lossless** |
+| PolarQuant Q5 | 7.47 | +0.02 |
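+Expert offloading only changes where the BF16 expert weights are stored, so it is expected to preserve full model quality. The lower PPL in the cache=8 row is likely due to measurement variance (4K-token sample) rather than a real improvement over the baseline.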
+## Technical Details
+### Fork: `caiovicentino/vllm-expert-offload@nemotron-expert-offload`
+Based on [PR #37190](https://github.com/vllm-project/vllm/pull/37190) by @e1n00r, rebased on current vLLM main with fixes:
+1. **`_init_runner` NameError** — `gate` and `shared_experts` stored on `self` before method call
+2. **`_init_runner` returns None** — added `return self.runner`
+3. **`shared_experts` AttributeError** — safe `getattr` (not yet init in `super().__init__`)
+4. **`moe_kernel` None when cache active** — create kernel even for CPU-resident weights
+5. **Prefill overflow** — warn + truncate instead of crash when batch needs > cache_size experts
+### Model Architecture
+- **Total**: 30B params (3B active per token)
+- **Layers**: 52 (23 Mamba SSM + 23 MoE + 6 Attention)
+- **Experts**: 128 routed + 1 shared per MoE layer, top-6 routing
+- **Expert weights**: 58.7 GB (92.9%)
+- **Non-expert weights**: 4.4 GB (7.1%)
+## Links
+- **Fork (expert offloading)**: [github.com/caiovicentino/vllm-expert-offload](https://github.com/caiovicentino/vllm-expert-offload/tree/nemotron-expert-offload)
+- **PolarEngine (patcher + quantization)**: [github.com/caiovicentino/polarengine-vllm](https://github.com/caiovicentino/polarengine-vllm)
+- **Base model**: [nvidia/Nemotron-Cascade-2-30B-A3B](https://huggingface.co/nvidia/Nemotron-Cascade-2-30B-A3B)
+- **vLLM PR #37190**: [Expert CPU offloading](https://github.com/vllm-project/vllm/pull/37190)
+## Citation
+```bibtex
+@article{vicentino2026polarquant,
+    title={PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression},
+    author={Vicentino, Caio},
+    journal={arXiv preprint arXiv:2603.29078},
+    year={2026},
+    url={https://arxiv.org/abs/2603.29078}
+}
+```
+---
+## 🚀 Quick Start
+### Install
+```bash
+pip install git+https://github.com/caiovicentino/polarengine-vllm.git
+```
+### Load & Generate (1 line!)
+```python
+from polarengine_vllm import PolarQuantModel
+model = PolarQuantModel.from_pretrained("caiovicentino1/Nemotron-Cascade-2-30B-A3B-PolarQuant-Q5")
+print(model.generate("Hello, how are you?", max_new_tokens=100))
+```
+### With KV Cache Compression (5.3x more context)
+```python
+model = PolarQuantModel.from_pretrained("caiovicentino1/Nemotron-Cascade-2-30B-A3B-PolarQuant-Q5", kv_cache_nbits=3)
+# KV cache now uses 5.3x less memory — fit longer conversations!
+print(model.generate("Explain quantum computing in detail.", max_new_tokens=500))
+```
+### Benchmark
+```bash
+polarquant bench caiovicentino1/Nemotron-Cascade-2-30B-A3B-PolarQuant-Q5 --ppl --chart
+```
+### Gradio Demo
+```bash
+polarquant demo caiovicentino1/Nemotron-Cascade-2-30B-A3B-PolarQuant-Q5 --share
+```
+## 📦 Method: PolarQuant
+**Hadamard Rotation + Lloyd-Max Optimal Centroids**
+Unlike GGUF (uniform quantization), PolarQuant places quantization levels where weight density is highest — mathematically proven optimal for Gaussian-distributed neural network weights.
+```
+PolarQuant Q5 (cos_sim > 0.996) > GGUF Q5_K_M (~0.99) at same size
+```
+## 🔗 Links
+- 📄 [Paper — arXiv:2603.29078](https://arxiv.org/abs/2603.29078)
+- 💻 [GitHub — PolarEngine](https://github.com/caiovicentino/polarengine-vllm)
+- 📦 [PyPI — `pip install polarquant`](https://pypi.org/project/polarquant/)
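The README's "How Expert Offloading Works" section describes a fixed number of GPU expert slots that are filled from CPU memory on a cache miss. A minimal sketch of that idea, using plain LRU eviction as in the README's diagram, might look like the following. It is not the fork's implementation: `ExpertLRUCache`, `load_fn`, and the placeholder strings standing in for pinned CPU tensors are all hypothetical, and the LFRU policy from the benchmark table is not modelled.

```python
from collections import OrderedDict


class ExpertLRUCache:
    """Toy LRU cache of expert slots; `load_fn` stands in for the host-to-device copy."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn
        self.slots = OrderedDict()  # expert_id -> weights resident "on GPU"
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        if expert_id in self.slots:
            self.slots.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
            return self.slots[expert_id]
        self.misses += 1
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)  # evict the least recently used expert
        weights = self.load_fn(expert_id)  # cache miss: copy one expert from CPU
        self.slots[expert_id] = weights
        return weights


# Placeholder for the 128 routed experts of one MoE layer held in CPU memory.
cpu_experts = {i: f"expert-{i}-weights" for i in range(128)}
cache = ExpertLRUCache(capacity=8, load_fn=cpu_experts.__getitem__)

# Two consecutive tokens routed to overlapping top-6 expert sets.
for routed in ([3, 17, 42, 64, 99, 120], [3, 17, 42, 64, 5, 7]):
    for expert_id in routed:
        cache.get(expert_id)

print(f"hits={cache.hits} misses={cache.misses}")
```

The second token reuses four experts that are already resident, so only two of its six lookups trigger a copy. The more consecutive tokens share experts, the fewer transfers a small cache needs.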

assets/before_after.png ADDED

assets/speed_vram_scatter.png ADDED

assets/vram_comparison.png ADDED

assets/weight_distribution.png ADDED
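The README's "Method: PolarQuant" section names two ingredients: a Hadamard rotation and Lloyd-Max centroids. The sketch below illustrates both on synthetic data and compares a 5-bit Lloyd-Max codebook with a 5-bit uniform grid. It is a toy under stated assumptions, not the PolarQuant implementation: the rotation is applied per 128-element block (suggested by `block_size` in `config.json` but not confirmed by this diff), the outlier channel is invented for illustration, and all function names are hypothetical.

```python
import numpy as np


def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction (n must be a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.kron(np.array([[1.0, 1.0], [1.0, -1.0]]), h)
    return h / np.sqrt(n)


def quantize(x, levels):
    """Map each value to its nearest level; return the reconstruction and the MSE."""
    idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
    x_hat = levels[idx]
    return x_hat, float(np.mean((x - x_hat) ** 2))


def lloyd_max(x, levels, iters=30):
    """Alternate nearest-level assignment and centroid updates (Lloyd's algorithm)."""
    levels = levels.copy()
    for _ in range(iters):
        idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(levels.size):
            members = x[idx == k]
            if members.size:  # leave empty cells where they are
                levels[k] = members.mean()
    return levels


rng = np.random.default_rng(0)
block = 128
weights = rng.standard_normal((256, block))
weights[:, 0] *= 20.0  # hypothetical outlier channel

h = hadamard(block)
rotated = weights @ h  # orthogonal rotation spreads the outlier across the block
flat = rotated.ravel()

uniform_levels = np.linspace(flat.min(), flat.max(), 32)  # 5-bit uniform grid
lloyd_levels = lloyd_max(flat, uniform_levels)

_, mse_uniform = quantize(flat, uniform_levels)
_, mse_lloyd = quantize(flat, lloyd_levels)
print(f"max |w| before/after rotation: {np.abs(weights).max():.1f} / {np.abs(rotated).max():.1f}")
print(f"uniform MSE={mse_uniform:.5f}  Lloyd-Max MSE={mse_lloyd:.5f}")
```

Because each Lloyd iteration can only lower the squared error on the samples it is fitted to, starting from the uniform grid guarantees a codebook that is at least as good on this data. Dequantization would multiply by `h.T` to undo the rotation.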

config.json ADDED

	@@ -0,0 +1,72 @@

+{
+  "architectures": [
+    "NemotronHForCausalLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "auto_map": {
+    "AutoConfig": "configuration_nemotron_h.NemotronHConfig",
+    "AutoModel": "modeling_nemotron_h.NemotronHForCausalLM",
+    "AutoModelForCausalLM": "modeling_nemotron_h.NemotronHForCausalLM"
+  },
+  "bos_token_id": 1,
+  "chunk_size": 128,
+  "conv_kernel": 4,
+  "eos_token_id": 11,
+  "expand": 2,
+  "head_dim": 128,
+  "hidden_dropout": 0.0,
+  "hidden_size": 2688,
+  "hybrid_override_pattern": "MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME",
+  "initializer_range": 0.02,
+  "intermediate_size": 1856,
+  "layer_norm_epsilon": 1e-05,
+  "mamba_head_dim": 64,
+  "mamba_hidden_act": "silu",
+  "mamba_num_heads": 64,
+  "mamba_proj_bias": false,
+  "mamba_ssm_cache_dtype": "float32",
+  "max_position_embeddings": 262144,
+  "mlp_bias": false,
+  "mlp_hidden_act": "relu2",
+  "model_type": "nemotron_h",
+  "moe_intermediate_size": 1856,
+  "moe_shared_expert_intermediate_size": 3712,
+  "n_group": 1,
+  "n_groups": 8,
+  "n_routed_experts": 128,
+  "n_shared_experts": 1,
+  "norm_eps": 1e-05,
+  "norm_topk_prob": true,
+  "num_attention_heads": 32,
+  "num_experts_per_tok": 6,
+  "num_hidden_layers": 52,
+  "num_key_value_heads": 2,
+  "num_logits_to_keep": 1,
+  "pad_token_id": 0,
+  "partial_rotary_factor": 1.0,
+  "rescale_prenorm_residual": true,
+  "residual_in_fp32": false,
+  "rope_theta": 10000,
+  "routed_scaling_factor": 2.5,
+  "sliding_window": null,
+  "ssm_state_size": 128,
+  "tie_word_embeddings": false,
+  "time_step_floor": 0.0001,
+  "time_step_max": 0.1,
+  "time_step_min": 0.001,
+  "topk_group": 1,
+  "torch_dtype": "bfloat16",
+  "dtype": "bfloat16",
+  "transformers_version": "4.55.4",
+  "use_bias": false,
+  "use_cache": true,
+  "use_conv_bias": true,
+  "use_mamba_kernels": true,
+  "vocab_size": 131072,
+  "quantization_config": {
+    "quant_method": "polarengine",
+    "weight_bits": 5,
+    "block_size": 128
+  }
+}
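The values in this config can be used to sanity-check the sizes quoted in the README (about 20 MB per expert, 58.7 GB of expert weights, and the Cache Size Guide). The sketch below does that arithmetic under two assumptions that the diff does not confirm: each routed expert is a non-gated `relu2` MLP with one up-projection and one down-projection and no biases, and the cache keeps `cache_size` experts resident on the GPU for every MoE layer. `estimated_vram_gb` is a hypothetical helper.

```python
# Values copied from config.json.
hidden_size = 2688
moe_intermediate_size = 1856
n_routed_experts = 128
hybrid_override_pattern = "MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME"
bytes_per_param = 2  # bfloat16

# Assumption: up- and down-projection only (non-gated relu2 MLP, no biases).
params_per_expert = 2 * hidden_size * moe_intermediate_size
expert_bytes = params_per_expert * bytes_per_param

moe_layers = hybrid_override_pattern.count("E")
total_expert_bytes = expert_bytes * n_routed_experts * moe_layers

print(f"per expert: {expert_bytes / 1e6:.1f} MB")
print(f"MoE layers: {moe_layers}")
print(f"all routed experts: {total_expert_bytes / 1e9:.1f} GB")

non_expert_gb = 4.4  # non-expert weights, as stated in the README


def estimated_vram_gb(cache_size: int) -> float:
    # Assumption: every MoE layer keeps `cache_size` experts resident on the GPU.
    return non_expert_gb + cache_size * moe_layers * expert_bytes / 1e9


for cache_size in (8, 16, 32, 64):
    print(f"cache={cache_size}: ~{estimated_vram_gb(cache_size):.1f} GB")
```

Under these assumptions the script reproduces the README's figures of roughly 20 MB per expert and 58.7 GB across all routed experts, and the VRAM estimates land within about 1 GB of the Cache Size Guide.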

configuration_nemotron_h.py ADDED

	@@ -0,0 +1,262 @@

+# coding=utf-8
+# Copyright 2024 AI21 Labs Ltd. and the HuggingFace Inc. team. All rights reserved.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""NemotronH model configuration"""
+import re
+from transformers.configuration_utils import PretrainedConfig
+from transformers.utils import logging
+logger = logging.get_logger(__name__)
+class NemotronHConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`NemotronHModel`]. It is used to instantiate a
+    NemotronH model according to the specified arguments, defining the model architecture. Instantiating a configuration
+    with the defaults will yield a similar configuration to that of the NemotronH-v0.1 model.
+    [todo](todo)
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+    Args:
+        vocab_size (`int`, *optional*, defaults to 131072):
+            Vocabulary size of the NemotronH model. Defines the number of different tokens that can be represented by the
+            `input_ids` passed when calling [`NemotronHModel`].
+        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
+            Whether the model's input and output word embeddings should be tied. Note that this is only relevant if the
+            model has an output word embedding layer.
+        hidden_size (`int`, *optional*, defaults to 4096):
+            Dimension of the hidden representations.
+        intermediate_size (`int`, *optional*, defaults to 21504):
+            Dimension of the MLP representations.
+        num_hidden_layers (`int`, *optional*, defaults to 52):
+            Number of hidden layers in the Transformer encoder.
+        hybrid_override_pattern (`str`, *optional*, defaults to `"M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M-"`):
+            The pattern of the hybrid model. Each character of the string selects the block type of one layer: M: Mamba2, *: Attention, -: MLP, E: MoE
+        num_attention_heads (`int`, *optional*, defaults to 32):
+            Number of attention heads for each attention layer in the Transformer encoder.
+        head_dim (`int`, *optional*, defaults to 128):
+            Dimension of each attention head.
+        num_key_value_heads (`int`, *optional*, defaults to 8):
+            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
+            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
+            `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used.
+        mlp_hidden_act (`str`, *optional*, defaults to "relu2"):
+            The non-linear activation function in the MLP layers.
+        attention_bias (`bool`, *optional*, defaults to `False`):
+            Whether to use bias in attention layers.
+        mlp_bias (`bool`, *optional*, defaults to `False`):
+            Whether to use bias in MLP layers.
+        use_bias (`bool`, *optional*, defaults to `False`):
+            Whether to use bias in the model.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        layer_norm_epsilon (`float`, *optional*, defaults to 1e-5):
+            The epsilon used by the layer normalization layers.
+        residual_in_fp32 (`bool`, *optional*, defaults to `False`):
+            Whether or not residuals should be in `float32`. If set to `False` residuals will keep the same `dtype` as the rest of the model.
+        use_cache (`bool`, *optional*, defaults to `True`):
+            Whether or not the model should return the last key/values attentions (not used by all models). Only
+            relevant if `config.is_decoder=True`.
+        num_logits_to_keep (`int` or `None`, *optional*, defaults to 1):
+            Number of prompt logits to calculate during generation. If `None`, all logits will be calculated. If an
+            integer value, only last `num_logits_to_keep` logits will be calculated.
+        pad_token_id (`int`, *optional*, defaults to 0):
+            The id of the padding token.
+        bos_token_id (`int`, *optional*, defaults to 1):
+            The id of the "beginning-of-sequence" token.
+        eos_token_id (`int`, *optional*, defaults to 2):
+            The id of the "end-of-sequence" token.
+        sliding_window (`int`, *optional*, defaults to None):
+            Sliding window attention window size.
+        max_position_embeddings (`int`, *optional*, defaults to 4096):
+            The maximum sequence length that this model might ever be used with.
+        attention_dropout (`float`, *optional*, defaults to 0.0):
+            The dropout ratio for the attention probabilities.
+        hidden_dropout (`float`, *optional*, defaults to 0.0):
+            The dropout ratio for the hidden states.
+        use_mamba_kernels (`bool`, *optional*, defaults to `True`):
+            Flag indicating whether or not to use the fast mamba kernels. These are available only if `mamba-ssm` and
+            `causal-conv1d` are installed, and the mamba modules are running on a CUDA device.
+        ssm_state_size (`int`, *optional*, defaults to 128):
+            The dimension of the mamba state space latents.
+        mamba_num_heads (`int`, *optional*, defaults to 128):
+            Number of heads in Mamba layers.
+        mamba_n_groups (`int`, *optional*, defaults to 8):
+            Number of groups in Mamba layers.
+        mamba_head_dim (`int`, *optional*, defaults to 64):
+            Dimension of each Mamba head.
+        mamba_d_conv (`int`, *optional*, defaults to 4):
+            The size of the mamba convolution kernel.
+        mamba_expand (`int`, *optional*, defaults to 2):
+            Expanding factor used to determine the mamba intermediate size.
+        mamba_hidden_act (`str`, *optional*, defaults to "silu"):
+            The non-linear activation function in the Mamba layers.
+        mamba_dt_min (`float`, *optional*, defaults to 0.001):
+            Minimum value for the time step in Mamba.
+        mamba_dt_max (`float`, *optional*, defaults to 0.1):
+            Maximum value for the time step in Mamba.
+        mamba_dt_limit (`tuple`, *optional*, defaults to (0.0, float("inf"))):
+            Limits for the time step in Mamba.
+        mamba_dt_init_floor (`float`, *optional*, defaults to 1e-4):
+            Floor value for time step initialization in Mamba.
+        mamba_conv_bias (`bool`, *optional*, defaults to `True`):
+            Whether to use bias in the convolution layer of the mamba mixer block.
+        mamba_proj_bias (`bool`, *optional*, defaults to `False`):
+            Whether to use bias in the input and output projections of the mamba mixer block.
+        mamba_chunk_size (`int`, *optional*, defaults to 128):
+            Size of chunks for Mamba processing.
+        rescale_prenorm_residual (`bool`, *optional*, defaults to `True`):
+            Whether to rescale the pre-normalization residual connections.
+    """
+    model_type = "nemotron_h"
+    keys_to_ignore_at_inference = ["past_key_values"]
+    def __init__(
+        self,
+        vocab_size=131072,
+        tie_word_embeddings=False,
+        hidden_size=4096,
+        intermediate_size=21504,
+        num_hidden_layers=52,
+        hybrid_override_pattern="M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M*-M-M-M-M-M-",
+        num_attention_heads=32,
+        head_dim=128,
+        num_key_value_heads=8,  # nemo: num_query_groups
+        mlp_hidden_act="relu2",
+        attention_bias=False,
+        mlp_bias=False,
+        use_bias=False,
+        initializer_range=0.02, # nemo: init_method_std
+        layer_norm_epsilon=1e-5, # nemo: layernorm_epsilon
+        residual_in_fp32=False,  #  Megatron Core default value
+        use_cache=True,
+        num_logits_to_keep=1,
+        pad_token_id=0,
+        bos_token_id=1,
+        eos_token_id=2,
+        sliding_window=None,
+        max_position_embeddings=4096,
+        attention_dropout=0.0,
+        hidden_dropout=0.0, # * ADDED
+        use_mamba_kernels=True,
+        ssm_state_size=128, # mamba_state_size
+        mamba_num_heads=128,
+        mamba_n_groups=8,  # nemo: mamba_ssm_ngroups = num_heads
+        mamba_head_dim=64,
+        mamba_d_conv=4,
+        mamba_expand=2,
+        mamba_hidden_act="silu",
+        mamba_dt_min=0.001,
+        mamba_dt_max=0.1,
+        mamba_dt_limit=(0.0, float("inf")),
+        mamba_dt_init_floor=1e-4,
+        mamba_conv_bias=True,
+        mamba_proj_bias=False,
+        mamba_chunk_size=128,
+        rescale_prenorm_residual=True,
+        n_routed_experts=8,
+        n_shared_experts=1,
+        moe_intermediate_size=7688,
+        moe_shared_expert_intermediate_size=7688,
+        num_experts_per_tok=2,
+        routed_scaling_factor=1.0,
+        n_group=1,
+        topk_group=1,
+        norm_topk_prob=True,
+        **kwargs,
+    ):
+        self.vocab_size = vocab_size
+        self.tie_word_embeddings = tie_word_embeddings
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.num_hidden_layers = num_hidden_layers
+        self.hybrid_override_pattern = hybrid_override_pattern
+        self.num_attention_heads = num_attention_heads
+        self.head_dim = head_dim
+        self.sliding_window = sliding_window
+        self.max_position_embeddings = max_position_embeddings
+        self.attention_dropout = attention_dropout
+        self.hidden_dropout = hidden_dropout
+        # Validate hybrid_override_pattern
+        # M: Mamba2, *: Attention, -: MLP, E: MoE
+        assert len(self.hybrid_override_pattern) == self.num_hidden_layers, "hybrid_override_pattern must have the same length as num_hidden_layers"
+        assert re.match(r"^[M*\-E]+$", self.hybrid_override_pattern), "hybrid_override_pattern must only contain characters 'M', '*', '-', or 'E'"
+        # for backward compatibility
+        if num_key_value_heads is None:
+            num_key_value_heads = num_attention_heads
+        self.num_key_value_heads = num_key_value_heads
+        self.mlp_hidden_act = mlp_hidden_act
+        self.attention_bias = attention_bias
+        self.mlp_bias = mlp_bias
+        self.use_bias = use_bias
+        self.initializer_range = initializer_range
+        self.layer_norm_epsilon = layer_norm_epsilon
+        self.residual_in_fp32 = residual_in_fp32
+        self.use_cache = use_cache
+        self.num_logits_to_keep = num_logits_to_keep
+        self.use_mamba_kernels = use_mamba_kernels
+        self.n_groups = mamba_n_groups
+        self.mamba_head_dim = mamba_head_dim
+        self.ssm_state_size = ssm_state_size
+        self.mamba_num_heads = mamba_num_heads
+        self.conv_kernel = mamba_d_conv
+        self.expand = mamba_expand
+        self.mamba_hidden_act = mamba_hidden_act
+        self.time_step_min = mamba_dt_min
+        self.time_step_max = mamba_dt_max
+        self.time_step_limit = mamba_dt_limit
+        self.time_step_floor = mamba_dt_init_floor
+        self.use_conv_bias = mamba_conv_bias
+        self.mamba_proj_bias = mamba_proj_bias
+        self.chunk_size = mamba_chunk_size
+        self.rescale_prenorm_residual = rescale_prenorm_residual
+        self.n_routed_experts = n_routed_experts
+        self.n_shared_experts = n_shared_experts
+        self.moe_intermediate_size = moe_intermediate_size
+        self.moe_shared_expert_intermediate_size = moe_shared_expert_intermediate_size
+        self.num_experts_per_tok = num_experts_per_tok
+        self.routed_scaling_factor = routed_scaling_factor
+        self.n_group = n_group
+        self.topk_group = topk_group
+        self.norm_topk_prob = norm_topk_prob
+        super().__init__(
+            pad_token_id=pad_token_id,
+            bos_token_id=bos_token_id,
+            eos_token_id=eos_token_id,
+            tie_word_embeddings=tie_word_embeddings,
+            **kwargs,
+        )
+    @property
+    def layers_block_type(self):
+        return [
+            "mamba" if self.hybrid_override_pattern[i] == "M" else
+            "attention" if self.hybrid_override_pattern[i] == "*" else
+            "mlp" if self.hybrid_override_pattern[i] == "-" else "moe"
+            for i in range(self.num_hidden_layers)]
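To make the `layers_block_type` mapping concrete, the sketch below re-implements it as a standalone function (no `transformers` import) and applies it to the `hybrid_override_pattern` from `config.json`. `BLOCK_TYPES` and the module-level `layers_block_type` function are illustrative names. It also demonstrates a regex pitfall relevant to the validation step: inside a character class, `*-M` denotes a range rather than three literal characters.

```python
import re
from collections import Counter

BLOCK_TYPES = {"M": "mamba", "*": "attention", "-": "mlp", "E": "moe"}


def layers_block_type(pattern: str) -> list[str]:
    """Standalone version of the property, with strict validation of the pattern."""
    if not re.fullmatch(r"[M*\-E]+", pattern):
        raise ValueError(f"unexpected character in pattern: {pattern!r}")
    return [BLOCK_TYPES[ch] for ch in pattern]


pattern = "MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME"  # from config.json
counts = Counter(layers_block_type(pattern))
print(len(pattern), dict(counts))

# In a character class, "*-M" is a range from "*" (0x2A) to "M" (0x4D),
# so a class written as [*-M] also accepts characters such as "7" or "A".
range_accepts_junk = bool(re.match(r"^[*-M]+$", "M7A"))
strict_accepts_junk = bool(re.fullmatch(r"[M*\-E]+", "M7A"))
print(range_accepts_junk, strict_accepts_junk)
```

For this checkpoint the pattern expands to 52 layers: 23 Mamba, 23 MoE, and 6 attention, which matches the architecture summary in the README.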

download_nemotron.png ADDED

generation_config.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 1,
+  "eos_token_id": [2, 11],
+  "pad_token_id": 0,
+  "transformers_version": "4.55.4"
+}

model-00000-of-00007.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0e8a1ba306024e3161a23ca913e31c5e30b63e492eeb4f8889f9b4c791548b98
+size 3387200832

model-00001-of-00007.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:342525ab316bd219f6f6cc0480192287127b548197baa35d615780b7a807eeff
+size 3386958920

model-00002-of-00007.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fae0d228a96565494849732acc87da175d9fb139ba1703caab373dab44dad16e
+size 3385752264

model-00003-of-00007.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c2cead5256707595e02ea5a122fd6e45a63f21bab46b5c520e781e96667d8999
+size 3387069776

model-00004-of-00007.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:09454a3707ced4991483c45589fa973c2cab878a44ac2aa9d875a1594c8cde70
+size 3386959456

model-00005-of-00007.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:24d7ca936035fc0d6c0a6b3d82a9696cc6e28b19a75322b46293492f5c766846
+size 3388255656

model-00006-of-00007.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:44b85f26502c91bc3b12e5a61a9664204d7f1b70f030ea08ae467cc4e22c57cc
+size 262689720
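Each shard above is stored in Git as a three-line LFS pointer rather than as the binary itself. The sketch below parses one such pointer and sums the `size` fields of all seven shards to get the total download size of the weights. `parse_lfs_pointer` is an illustrative helper, not part of the Git LFS tooling.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:44b85f26502c91bc3b12e5a61a9664204d7f1b70f030ea08ae467cc4e22c57cc
size 262689720
"""
info = parse_lfs_pointer(pointer)
print(info["algo"], info["size"])

# Shard sizes in bytes, copied from the seven pointers in this commit.
shard_sizes = [
    3387200832, 3386958920, 3385752264, 3387069776,
    3386959456, 3388255656, 262689720,
]
total_bytes = sum(shard_sizes)
print(f"total: {total_bytes / 1e9:.2f} GB")
```

The total comes to about 20.6 GB, in line with the "Download: 20 GB" figure in the README.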

model.safetensors.index.json ADDED

The diff for this file is too large to render.

pipeline_nemotron.png ADDED

polar_config.json ADDED

The diff for this file is too large to render.

ppl_nemotron.png ADDED

special_tokens_map.json ADDED

	@@ -0,0 +1,24 @@

+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<|im_end|>",
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

speed_nemotron.png ADDED

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c3da26d4c6d3fc493a54b4971bdc64df2a8e32687be888a24155c83843a92867
+size 17078327

tokenizer_config.json ADDED

The diff for this file is too large to render.