Text Generation
Transformers
Safetensors
English
iquestloopcoder
conversational
custom_code
baodoo IQuestLabBot committed on
Commit
06998f4
·
0 Parent(s):

Duplicate from IQuestLab/IQuest-Coder-V1-40B-Loop-Thinking

Co-authored-by: IQuestLabBot <[email protected]>

.gitattributes ADDED
@@ -0,0 +1,38 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ papers/iquest-coder-v1-logo.png filter=lfs diff=lfs merge=lfs -text
+ papers/results-20260302.png filter=lfs diff=lfs merge=lfs -text
+ papers/results.png filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
File without changes
README.md ADDED
@@ -0,0 +1,281 @@
+ ---
+ license: other
+ license_name: iquestcoder
+ license_link: >-
+   https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Thinking/blob/main/LICENSE
+ language:
+ - en
+ library_name: transformers
+ ---
+
+
+ ![IQuest-Coder-V1 Logo](./papers/iquest-coder-v1-logo.png)
+
+ <p align="center">
+   📘 <a href="https://iquestlab.github.io">Blog (2026-01-01)</a>
+   &nbsp;•&nbsp;
+   📘 <a href="https://iquestlab.github.io/release-1.0-2603/index.html">Blog (2026-03-02)</a>
+   &nbsp;•&nbsp;
+   📄 <a href="https://github.com/IQuestLab/IQuest-Coder-V1/blob/main/papers/IQuest_Coder_Technical_Report.pdf">Technical Report</a>
+ </p>
+
+ # IQuest-Coder-V1 Model Family Update
+
+ 🚀🚀🚀 [IQuest-Coder-V1 Model Family Update](https://iquestlab.github.io/release-1.0-2603/index.html): Released 7B & 14B Family Models, 40B-Thinking and 40B-Loop-Thinking, specially optimized for tool use, CLI agents (like `Claude Code` and `OpenCode`) & HTML/SVG generation, all with 128K context, now on Hugging Face!
+
+ ## 7B Models
+
+ | Model | Link |
+ |-------|------|
+ | IQuest-Coder-V1-7B-Base-Stage1 | [🤗 Hugging Face](https://huggingface.co/IQuestLab/IQuest-Coder-V1-7B-Base-Stage1) |
+ | IQuest-Coder-V1-7B-Base | [🤗 Hugging Face](https://huggingface.co/IQuestLab/IQuest-Coder-V1-7B-Base) |
+ | IQuest-Coder-V1-7B-Instruct | [🤗 Hugging Face](https://huggingface.co/IQuestLab/IQuest-Coder-V1-7B-Instruct) |
+ | IQuest-Coder-V1-7B-Thinking | [🤗 Hugging Face](https://huggingface.co/IQuestLab/IQuest-Coder-V1-7B-Thinking) |
+
+ ## 14B Models
+
+ | Model | Link |
+ |-------|------|
+ | IQuest-Coder-V1-14B-Base-Stage1 | [🤗 Hugging Face](https://huggingface.co/IQuestLab/IQuest-Coder-V1-14B-Base-Stage1) |
+ | IQuest-Coder-V1-14B-Base | [🤗 Hugging Face](https://huggingface.co/IQuestLab/IQuest-Coder-V1-14B-Base) |
+ | IQuest-Coder-V1-14B-Instruct | [🤗 Hugging Face](https://huggingface.co/IQuestLab/IQuest-Coder-V1-14B-Instruct) |
+ | IQuest-Coder-V1-14B-Thinking | [🤗 Hugging Face](https://huggingface.co/IQuestLab/IQuest-Coder-V1-14B-Thinking) |
+
+ ## 40B Models
+
+ | Model | Link |
+ |-------|------|
+ | IQuest-Coder-V1-40B-Base-Stage1 | [🤗 Hugging Face](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Base-Stage1) |
+ | IQuest-Coder-V1-40B-Base | [🤗 Hugging Face](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Base) |
+ | IQuest-Coder-V1-40B-Instruct | [🤗 Hugging Face](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Instruct) |
+ | IQuest-Coder-V1-40B-Loop-Instruct | [🤗 Hugging Face](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct) |
+ | IQuest-Coder-V1-40B-Thinking | [🤗 Hugging Face](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Thinking) |
+ | IQuest-Coder-V1-40B-Loop-Thinking | [🤗 Hugging Face](https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Thinking) |
+
+ ## Sampling Parameters
+ For IQuest-Coder-V1-Instruct models, we suggest Temperature=0.6, TopP=0.85, TopK=20.
+
+ For IQuest-Coder-V1-Thinking models, we suggest Temperature=1.0, TopP=0.95, TopK=20.
+
+
+ ## IQuest-Coder-V1 Highlights
+
+ IQuest-Coder-V1 is a new family of code large language models (LLMs) designed to advance autonomous software engineering and code intelligence. Built on the code-flow multi-stage training paradigm, IQuest-Coder-V1 captures the dynamic evolution of software logic, delivering state-of-the-art performance across critical dimensions:
+
+ - **Performance**: Achieves leading results on SWE-Bench Verified (76.2%), BigCodeBench (49.9%), LiveCodeBench v6 (81.1%), and other major coding benchmarks, surpassing competitive models across agentic software engineering, competitive programming, and complex tool use.
+ - **Code-Flow Training Paradigm**: Moving beyond static code representations, our models learn from repository evolution patterns, commit transitions, and dynamic code transformations to understand real-world software development processes.
+ - **Dual Specialization Paths**: Bifurcated post-training delivers two specialized variants: Thinking models (utilizing reasoning-driven RL for complex problem-solving) and Instruct models (optimized for general coding assistance and instruction-following).
+ - **Efficient Architecture**: The IQuest-Coder-V1-Loop variant introduces a recurrent mechanism that optimizes the trade-off between model capacity and deployment footprint. The 7B and 14B models adopt shallow architectures for faster inference.
+ - **Native Long Context**: All models natively support up to 128K tokens without requiring additional scaling techniques.
+ - **CLI Agent Integration**: Demonstrates initial deployment capabilities on Claude Code and OpenCode, with the ability to integrate into CLI-based agent workflows.
+ - **HTML and SVG Generation**: Features preliminary support for HTML and SVG code generation.
+ - **Architectural Chain-of-Thought via Recurrent Depth**: 40B-Loop-Thinking is a research-oriented, experimental prototype that explores how structural and procedural chains of thought can be combined within a single system. Structural chains of thought come from loop-based computation in the dual-iteration LoopCoder architecture; procedural chains of thought come from explicit reasoning trajectories trained via reinforcement learning. Unlike standard reasoning models that rely solely on token-level chain-of-thought expansion, Loop-Thinking introduces implicit multi-step computation at the architectural level through a looped Transformer design, in which the second iteration refines the hidden states produced by the first iteration using a global-local attention gating mechanism. The result is a nested reasoning mechanism: the loop structure supports iterative representation refinement, while the reasoning-oriented training paradigm injects explicit problem decomposition behavior. This model is not intended to achieve state-of-the-art benchmark performance; it is meant to validate the complementary roles of loop-based computation and reasoning-oriented training in shaping reasoning structures, and to provide experimental evidence for future model design.
+
+
+ ## Model Overview
+
+ The IQuest-Coder-V1 series includes models ranging from 7B to 40B parameters, with both standard and Loop variants:
+
+ | Model | Parameters | Layers | Hidden Size | Attention Heads (Q/KV) | Context Length |
+ |-------|------------|--------|-------------|------------------------|----------------|
+ | IQuest-Coder-V1-7B-Instruct | 7B | 14 | 5120 | 40/8 | 128K |
+ | IQuest-Coder-V1-7B-Thinking | 7B | 14 | 5120 | 40/8 | 128K |
+ | IQuest-Coder-V1-14B-Instruct | 14B | 28 | 5120 | 40/8 | 128K |
+ | IQuest-Coder-V1-14B-Thinking | 14B | 28 | 5120 | 40/8 | 128K |
+ | IQuest-Coder-V1-40B-Instruct | 40B | 80 | 5120 | 40/8 | 128K |
+ | IQuest-Coder-V1-40B-Thinking | 40B | 80 | 5120 | 40/8 | 128K |
+ | IQuest-Coder-V1-40B-Loop-Instruct | 40B | 80 (2 iterations) | 5120 | 40/8 | 128K |
+ | IQuest-Coder-V1-40B-Loop-Thinking | 40B | 80 (2 iterations) | 5120 | 40/8 | 128K |
+
+ **Architecture Features:**
+
+ - Grouped Query Attention (GQA) for efficient inference
+ - Native 128K context length support
+ - Vocabulary size: 76,800 tokens
+ - Loop variants use a recurrent transformer design with parameters shared across two iterations
+
+ For more details, please refer to our [Technical Report](https://github.com/IQuestLab/IQuest-Coder-V1/blob/main/papers/IQuest_Coder_Technical_Report.pdf) and [GitHub repository](https://github.com/IQuestLab/IQuest-Coder-V1).
+
+ ## Quickstart
+
+ IQuest-Coder-V1 uses custom modeling code via Hugging Face's `auto_map` feature, so pass `trust_remote_code=True` when loading. We recommend using `transformers>=4.52.4`.
+
+ ### Basic Usage with Transformers
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_name = "IQuestLab/IQuest-Coder-V1-40B-Instruct"
+
+ # Load the tokenizer and model (custom modeling code requires trust_remote_code)
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype="auto",
+     device_map="auto",
+     trust_remote_code=True,
+ )
+
+ # Prepare the input
+ prompt = "Write a Python function to calculate the Fibonacci sequence using dynamic programming."
+ messages = [
+     {"role": "user", "content": prompt}
+ ]
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ # Generate response
+ generated_ids = model.generate(
+     **model_inputs,
+     max_new_tokens=8192
+ )
+ # Drop the prompt tokens so only the newly generated tokens are decoded
+ generated_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
+ response = tokenizer.decode(generated_ids, skip_special_tokens=True)
+
+ print(response)
+ ```
+
+ ### Using Thinking Models
+
+ For complex reasoning tasks, use the Thinking variant:
+
+ ```python
+ model_name = "IQuestLab/IQuest-Coder-V1-40B-Thinking"
+
+ # The Thinking model includes explicit reasoning traces
+ # Use similar code as above, but expect longer, more detailed responses
+ # with step-by-step problem decomposition
+ ```
+
+ ### Deployment with vLLM
+
+ For production deployment, you can use vLLM to create an OpenAI-compatible API endpoint. Please refer to the [vLLM PR](https://github.com/vllm-project/vllm/pull/31575/files) for implementation details.
+
+ ```bash
+ vllm serve IQuestLab/IQuest-Coder-V1-40B-Instruct --tensor-parallel-size 8
+ ```
+
+ For Thinking models with reasoning support:
+
+ ```bash
+ vllm serve IQuestLab/IQuest-Coder-V1-40B-Thinking --reasoning-parser qwen3 --tensor-parallel-size 8
+ ```
+
+ When using tools, `IQuest-Coder-V1-40B-Instruct` and `IQuest-Coder-V1-40B-Loop-Instruct` should use `--tool-parser qwen3`, while `IQuest-Coder-V1-7B-Instruct`, `IQuest-Coder-V1-7B-Thinking`, `IQuest-Coder-V1-14B-Instruct`, `IQuest-Coder-V1-14B-Thinking`, `IQuest-Coder-V1-40B-Thinking` and `IQuest-Coder-V1-40B-Loop-Thinking` should use `--tool-parser qwen3_coder`.
+
+ ### CLI-Like Agents and Tools Usage
+
+ CLI-like agent capabilities are available for the following models: `IQuest-Coder-V1-7B-Instruct`, `IQuest-Coder-V1-7B-Thinking`, `IQuest-Coder-V1-14B-Instruct`, `IQuest-Coder-V1-14B-Thinking`, `IQuest-Coder-V1-40B-Thinking` and `IQuest-Coder-V1-40B-Loop-Thinking`.
+
+ **Step 1:** Deploy the model with vLLM and set the tool parser (**Attention: do not set a reasoning parser for Instruct models; doing so causes unexpected errors**):
+
+ ```bash
+ vllm serve IQuestLab/IQuest-Coder-V1-7B-Instruct --tool-parser qwen3_coder
+ ```
+
+ or
+
+ ```bash
+ vllm serve IQuestLab/IQuest-Coder-V1-7B-Thinking --tool-parser qwen3_coder --reasoning-parser qwen3
+ ```
+
+ **Step 2:** Point Claude Code at the deployed endpoint:
+
+ ```bash
+ export ANTHROPIC_BASE_URL="http://iquestcoder.link"
+ export ANTHROPIC_AUTH_TOKEN="sk-iquestcoder"
+ claude --model IQuestCoder-V1-7B-Instruct
+ ```
+
+
+ ## Evaluation Results
+
+ ![Evaluation Results](./papers/results-20260302.png)
+
+ ![Evaluation Results](./papers/results.png)
+
+ ### Benchmark Parameters
+
+ | Benchmark | Temperature | Top_p |
+ | :--- | :--- | :--- |
+ | **Evalplus-HumanEval** | 0.0 | - |
+ | **Evalplus-MBPP** | 0.0 | - |
+ | **BigCodeBench** | 0.0 | - |
+ | **FullStackBench** | 0.0 | - |
+ | **CruxEval** | 0.0 | - |
+ | **LiveCodeBench** | 0.6 | 0.95 |
+ | **Aider-Polyglot** | 0.95 | 0.85 |
+ | **Mercury** | 0.2 | 0.85 |
+ | **Bird** | 0.2 | 0.95 |
+ | **Spider** | 0.2 | 0.95 |
+ | **Terminal-Bench** | 0.0 | - |
+ | **Terminal-Bench (2.0)** | 0.7 | 1.0 |
+ | **SWE-Verified** | 0.0 | - |
+ | **BFCL V3** | 0.01 | 0.85 |
+ | **Mind2Web** | 0.0 | - |
+
+ ### SWE-Bench Verified Evaluation
+
+ We provide the evaluation framework and trajectory data for reproducing our SWE-Bench Verified results in `IQuest-Coder-Eval/SWE-Verified/`.
+
+ The evaluation framework is based on [R2E-Gym](https://github.com/R2E-Gym/R2E-Gym). To reproduce the evaluation:
+
+ ```bash
+ cd IQuest-Coder-Eval/SWE-Verified/R2E-Gym
+
+ # Install dependencies
+ pip install -e .
+
+ # Run evaluation
+ bash benchmark/bench/loopcoder/loopcoder.sh
+ ```
+
+ The trajectory file `./IQuest-Coder-Eval/SWE-Verified/traj.zip` contains the complete agent trajectories for our SWE-Bench Verified evaluation.
+
+ ## Limitations
+
+ - **Research Prototype**: The current models are designed for research purposes. Real-world user experience may differ from state-of-the-art commercial models, with weaker instruction-following capabilities in certain scenarios.
+ - **Long-Context Management**: Due to parameter size constraints, performance on long-horizon tasks and multi-turn tool invocations is limited, particularly in scenarios requiring sustained context management and complex agentic workflows.
+ - **Reasoning vs. Efficiency Trade-off**: Thinking models provide superior reasoning but generate longer responses; Instruct models are more efficient for straightforward tasks.
+ - **Code Execution**: Models generate code but do not execute it; always validate outputs in sandboxed environments.
+ - **Domain Specificity**: While trained on diverse codebases, performance may vary on highly specialized or proprietary frameworks.
+ - **Factuality**: Models may generate plausible but incorrect code; verify critical implementations thoroughly.
+
+ ## Citation
+
+ If you find our work helpful, please cite:
+
+ ```bibtex
+ @article{iquest-coder-v1-2025,
+   title={IQuest-Coder-V1 Technical Report},
+   author={IQuest Coder Team},
+   url={https://github.com/IQuestLab/IQuest-Coder-V1/blob/main/papers/IQuest_Coder_Technical_Report.pdf},
+   year={2025}
+ }
+ @article{codescaling,
+   title={Scaling Laws for Code: Every Programming Language Matters},
+   author={Yang, Jian and Guo, Shawn and Jing, Lin and Zhang, Wei and Liu, Aishan and Hao, Chuan and Li, Zhoujun and Zhao, Wayne Xin and Liu, Xianglong and Lv, Weifeng and others},
+   journal={arXiv preprint arXiv:2512.13472},
+   year={2025}
+ }
+ @article{close_the_loop,
+   title={Close the Loop: Synthesizing Infinite Tool-Use Data via Multi-Agent Role-Playing},
+   author={Yuwen Li and Wei Zhang and Zelong Huang and Mason Yang and Jiajun Wu and Shawn Guo and Huahao Hu and Lingyi Sun and Jian Yang and Mingjie Tang and Bryan Dai},
+   journal={arXiv preprint arXiv:2512.23611},
+   year={2025}
+ }
+ @article{loopcoder,
+   title={LoopCoder: Scaling Code Intelligence via Looped Language Models},
+   author={Jian Yang and Wei Zhang and Shawn Guo and Yizhi Li and Lin Jing and Zhengmao Ye and Shark Liu and Yuyang Song and Jiajun Wu and Che Liu and T. Zheng and Siwei Wu and L. Liao and X. Ma and Chuan Hao and Ran Tao and Yan Xing and Jianzhou Wang and Mingjie Tang and Aishan Liu and Zhoujun Li and Xianglong Liu and Weifeng Lv and Bryan Dai},
+   year={2025}
+ }
+ @article{swe_compress,
+   title={Context as a Tool: Context Management for Long-Horizon SWE-Agents},
+   author={hukai Liu and Jian Yang and Bo Jiang and Yizhi Li and Jinyang Guo and Xianglong Liu and Bryan Dai},
+   journal={arXiv preprint arXiv:2512.22087},
+   year={2025}
+ }
+ ```
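The sampling settings suggested in the README's "Sampling Parameters" section can be collected into keyword arguments for `model.generate`. The following is a minimal sketch, not part of the commit: the helper name `sampling_kwargs` and the rule of detecting the variant from a `-Thinking` suffix are assumptions made for illustration, and Base checkpoints fall through to the Instruct settings only because the README gives no values for them.

```python
# Suggested values copied from the README; the helper itself is hypothetical.
SUGGESTED_SAMPLING = {
    "instruct": {"temperature": 0.6, "top_p": 0.85, "top_k": 20},
    "thinking": {"temperature": 1.0, "top_p": 0.95, "top_k": 20},
}


def sampling_kwargs(model_name: str, max_new_tokens: int = 8192) -> dict:
    """Build generate() keyword arguments from the variant named in model_name."""
    # Assumption: Thinking checkpoints are recognizable by their name suffix.
    variant = "thinking" if model_name.lower().endswith("-thinking") else "instruct"
    kwargs = dict(SUGGESTED_SAMPLING[variant])
    kwargs["do_sample"] = True  # temperature/top_p/top_k only apply when sampling
    kwargs["max_new_tokens"] = max_new_tokens
    return kwargs


if __name__ == "__main__":
    print(sampling_kwargs("IQuestLab/IQuest-Coder-V1-40B-Loop-Thinking"))
```

With the Quickstart code, the result would be passed as `model.generate(**model_inputs, **sampling_kwargs(model_name))`.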
config.json ADDED
@@ -0,0 +1,35 @@
+ {
+   "_name_or_path": "iquestloopcoder",
+   "architectures": [
+     "IQuestLoopCoderForCausalLM"
+   ],
+   "model_type": "iquestloopcoder",
+   "vocab_size": 76800,
+   "hidden_size": 5120,
+   "intermediate_size": 27648,
+   "num_hidden_layers": 80,
+   "eos_token_id": [2, 75864, 75869],
+   "num_attention_heads": 40,
+   "num_key_value_heads": 8,
+   "head_dim": 128,
+   "hidden_act": "silu",
+   "max_position_embeddings": 131072,
+   "initializer_range": 0.02,
+   "rms_norm_eps": 1e-05,
+   "use_cache": true,
+   "tie_word_embeddings": false,
+   "rope_theta": 500000,
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "mlp_bias": false,
+   "loop_num": 2,
+   "loop_window_size": 8192,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.55.4",
+   "auto_map": {
+     "AutoConfig": "configuration_iquestloopcoder.IQuestLoopCoderConfig",
+     "AutoModel": "modeling_iquestloopcoder.IQuestLoopCoderModel",
+     "AutoModelForCausalLM": "modeling_iquestloopcoder.IQuestLoopCoderForCausalLM"
+   }
+ }
+
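The attention fields in `config.json` determine the grouped-query layout and a rough key/value cache budget. The sketch below is a back-of-the-envelope estimate under stated assumptions, not a measurement: it assumes 2 bytes per element (matching `torch_dtype: bfloat16`), counts a single pass through the 80 layers, and ignores any additional cache the second loop iteration may keep for its sliding window. The function names are hypothetical.

```python
# Values copied from config.json above.
CONFIG = {
    "hidden_size": 5120,
    "num_hidden_layers": 80,
    "num_attention_heads": 40,
    "num_key_value_heads": 8,
    "head_dim": 128,
    "max_position_embeddings": 131072,
}


def queries_per_kv_head(cfg: dict) -> int:
    """GQA group size: how many query heads share one key/value head."""
    return cfg["num_attention_heads"] // cfg["num_key_value_heads"]


def kv_cache_bytes_per_token(cfg: dict, bytes_per_element: int = 2) -> int:
    """Bytes of K and V stored per token for one pass through all layers."""
    per_layer = 2 * cfg["num_key_value_heads"] * cfg["head_dim"]  # K and V
    return per_layer * cfg["num_hidden_layers"] * bytes_per_element


if __name__ == "__main__":
    per_token = kv_cache_bytes_per_token(CONFIG)
    full_context_gib = per_token * CONFIG["max_position_embeddings"] / 2**30
    print(queries_per_kv_head(CONFIG))  # 5
    print(per_token)                    # 327680
    print(full_context_gib)             # 40.0
```

Under these assumptions a full 128K-token context needs about 40 GiB of cache per sequence, which is one reason the README's vLLM commands shard the model across 8 GPUs.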
configuration_iquestloopcoder.py ADDED
@@ -0,0 +1,132 @@
+ # Copyright 2024 IQuestLoopCoder Authors
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ """IQuestLoopCoder model configuration"""
+
+ from transformers.configuration_utils import PretrainedConfig
+ from transformers.utils import logging
+
+ logger = logging.get_logger(__name__)
+
+
+ class IQuestLoopCoderConfig(PretrainedConfig):
+     r"""
+     Configuration class for IQuestLoopCoder model.
+
+     IQuestLoopCoder extends the standard LLaMA architecture with a loop mechanism:
+     - Loop 1: Standard attention, stores K1, V1
+     - Loop 2+: Mixed attention with gated combination of global (K1,V1) and local (K2,V2) KV
+
+     The gate is computed as: gate = sigmoid(W @ Q + bias)
+     Mixed output = gate * Attention(Q, K1, V1) + (1 - gate) * SlidingWindowAttention(Q, K2, V2)
+
+     Args:
+         vocab_size (`int`, *optional*, defaults to 76800):
+             Vocabulary size of the model.
+         hidden_size (`int`, *optional*, defaults to 5120):
+             Dimension of the hidden representations.
+         intermediate_size (`int`, *optional*, defaults to 27648):
+             Dimension of the MLP representations (FFN hidden size).
+         num_hidden_layers (`int`, *optional*, defaults to 80):
+             Number of hidden layers in the Transformer decoder.
+         num_attention_heads (`int`, *optional*, defaults to 40):
+             Number of attention heads for each attention layer.
+         num_key_value_heads (`int`, *optional*, defaults to 8):
+             Number of key-value heads (for GQA). If None, defaults to num_attention_heads.
+         head_dim (`int`, *optional*, defaults to 128):
+             Dimension of each attention head (hidden_size // num_attention_heads).
+         hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
+             Activation function in the MLP.
+         max_position_embeddings (`int`, *optional*, defaults to 8192):
+             Maximum sequence length.
+         initializer_range (`float`, *optional*, defaults to 0.02):
+             Standard deviation for weight initialization.
+         rms_norm_eps (`float`, *optional*, defaults to 1e-5):
+             Epsilon for RMS normalization layers.
+         use_cache (`bool`, *optional*, defaults to `True`):
+             Whether to use past key/values for generation.
+         tie_word_embeddings (`bool`, *optional*, defaults to `False`):
+             Whether to tie input and output embeddings.
+         rope_theta (`float`, *optional*, defaults to 500000.0):
+             Base value for rotary position embeddings.
+         attention_bias (`bool`, *optional*, defaults to `False`):
+             Whether to use bias in attention layers.
+         attention_dropout (`float`, *optional*, defaults to 0.0):
+             Dropout ratio for attention weights.
+         mlp_bias (`bool`, *optional*, defaults to `False`):
+             Whether to use bias in MLP layers.
+
+         # Loop-specific parameters
+         loop_num (`int`, *optional*, defaults to 2):
+             Number of loops through the decoder.
+         loop_window_size (`int`, *optional*, defaults to 64):
+             Window size for sliding window attention in Loop 2+.
+     """
+
+     model_type = "iquestloopcoder"
+     keys_to_ignore_at_inference = ["past_key_values"]
+
+     def __init__(
+         self,
+         vocab_size=76800,
+         hidden_size=5120,
+         intermediate_size=27648,
+         num_hidden_layers=80,
+         num_attention_heads=40,
+         num_key_value_heads=8,
+         head_dim=128,
+         hidden_act="silu",
+         max_position_embeddings=8192,
+         initializer_range=0.02,
+         rms_norm_eps=1e-5,
+         use_cache=True,
+         pad_token_id=None,
+         bos_token_id=1,
+         eos_token_id=2,
+         tie_word_embeddings=False,
+         rope_theta=500000.0,
+         rope_scaling=None,
+         attention_bias=False,
+         attention_dropout=0.0,
+         mlp_bias=False,
+         # Loop-specific parameters
+         loop_num=2,
+         loop_window_size=64,
+         **kwargs,
+     ):
+         self.vocab_size = vocab_size
+         self.max_position_embeddings = max_position_embeddings
+         self.hidden_size = hidden_size
+         self.intermediate_size = intermediate_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+         self.head_dim = head_dim
+
+         # GQA support
+         if num_key_value_heads is None:
+             num_key_value_heads = num_attention_heads
+         self.num_key_value_heads = num_key_value_heads
+
+         self.hidden_act = hidden_act
+         self.initializer_range = initializer_range
+         self.rms_norm_eps = rms_norm_eps
+         self.use_cache = use_cache
+         self.rope_theta = rope_theta
+         self.rope_scaling = rope_scaling
+         self.attention_bias = attention_bias
+         self.attention_dropout = attention_dropout
+         self.mlp_bias = mlp_bias
+
+         # Loop-specific
+         self.loop_num = loop_num
+         self.loop_window_size = loop_window_size
+
+         super().__init__(
+             pad_token_id=pad_token_id,
+             bos_token_id=bos_token_id,
+             eos_token_id=eos_token_id,
+             tie_word_embeddings=tie_word_embeddings,
+             **kwargs,
+         )
+
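The gated mixing formula in the `IQuestLoopCoderConfig` docstring can be illustrated with a small pure-Python sketch. This is not the model's implementation (the modeling file is not shown in this diff): it assumes a single head, a single query at the last position, a scalar gate computed from a hypothetical weight vector `gate_w` and bias `gate_b`, and a sliding window that keeps the last `window` loop-2 keys and values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def mixed_attention(q, k1, v1, k2, v2, gate_w, gate_b, window):
    """gate * Attention(Q, K1, V1) + (1 - gate) * SlidingWindowAttention(Q, K2, V2)."""
    # gate = sigmoid(W @ Q + bias); a scalar here by assumption
    gate = 1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(gate_w, q)) + gate_b)))
    global_out = attend(q, k1, v1)                     # all loop-1 keys/values
    local_out = attend(q, k2[-window:], v2[-window:])  # last `window` loop-2 entries
    return [gate * g + (1.0 - gate) * l for g, l in zip(global_out, local_out)]


if __name__ == "__main__":
    q = [1.0, 0.0]
    k1 = [[1.0, 0.0], [0.0, 1.0]]
    v1 = [[1.0, 0.0], [0.0, 1.0]]
    k2 = [[1.0, 0.0], [0.0, 1.0]]
    v2 = [[2.0, 2.0], [4.0, 4.0]]
    # A zero gate weight and bias give gate = 0.5, an even blend of both paths.
    print(mixed_attention(q, k1, v1, k2, v2, gate_w=[0.0, 0.0], gate_b=0.0, window=1))
```

A gate near 1 makes the second iteration rely on the global loop-1 context, while a gate near 0 favors the local window; with `loop_window_size` set to 8192 in `config.json`, that local path would cover the most recent 8192 positions.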
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "eos_token_id": [2, 75864, 75869],
+   "transformers_version": "4.55.4"
+ }
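Because `eos_token_id` is a list here, generation stops as soon as any one of the three ids is produced. The sketch below only illustrates that stopping rule on a plain list of token ids; the function name `truncate_at_eos` and the example ids other than the three EOS values are hypothetical.

```python
# EOS ids copied from generation_config.json above.
EOS_TOKEN_IDS = frozenset({2, 75864, 75869})


def truncate_at_eos(token_ids, eos_ids=EOS_TOKEN_IDS):
    """Return the tokens that precede the first end-of-sequence id, if any."""
    kept = []
    for token_id in token_ids:
        if token_id in eos_ids:
            break  # any listed id ends the sequence
        kept.append(token_id)
    return kept


if __name__ == "__main__":
    print(truncate_at_eos([101, 102, 75869, 103]))  # [101, 102]
```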
model-00001-of-00019.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2be32f005e06779244d573d9663c52eab02ab2d9383f83f647f05cdd0c13562b
+ size 4090342608
model-00002-of-00019.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9baafcbaa4dec207bd959f7afa580da9152600db080f5a689d04705524e45077
+ size 4246829624
model-00003-of-00019.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9c47664af8e35bf6fd529f619b3baea67e3718ecad37fa6f76fcbe512c6a565a
+ size 4246829608
model-00004-of-00019.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5191d345cfc8420a95739f093ce2ff4dcf4d7522b77ad322e73088e8e838962e
+ size 4183904456
model-00005-of-00019.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9edd4ae7d1f8fa5503c8a0a8602c00ac81484d19c70083ff0d7dbd383efaf771
+ size 4246829624
model-00006-of-00019.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9db4459d916f919baaeddb54bbe7f709e78bc79e89b312876e11e098e6a3d284
+ size 4246829600
model-00007-of-00019.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4f287fee36ac2bdf6b5851f9e19fe58f46d235b48a2efd123fb42e893aed2d9e
+ size 4183904464
model-00008-of-00019.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a254fdb83c2fc1f72ae40a88b5dca06b5a48e70e4ad2f94bcefb58a242a6c549
+ size 4246829624
model-00009-of-00019.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:17f91d0d4662dc1272c4df60e0cf26493c5ddc95183fbb39727687506abeb9bf
+ size 4246829600
model-00010-of-00019.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:61ec5a9ce43abde2a32f6c73ad363528542b71fba0d9c307a64061f2855e9624
+ size 4183904464
model-00011-of-00019.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1362aa67d05030433025956e4a426c1516fd2adc9ba1622a74eeb28bac88f932
+ size 4246829616
model-00012-of-00019.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e007afb8f4b838322aa7cc83fb65add2701e7837ef8f17344bd300f59f73a090
+ size 4246829608
model-00013-of-00019.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e05f4913d32bae723cea8f640aaad575c168da0870a0376d1b0f302b55c1c874
+ size 4183904464
model-00014-of-00019.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:39ab757fae9234cce931a1f9a3c70c65b59890a5a68076bd68002b154955e365
+ size 4246829608
model-00015-of-00019.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a542e8deec95213bac5bc2eccd6cf0dbb7f5221592dfec8891ba8d2545385608
+ size 4246829608
model-00016-of-00019.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7b11176e8479dfe738d2ba8dfc3a00e4ef9b07938c77d8e4ce79ccb191a8733e
+ size 4183904464
model-00017-of-00019.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7244bfc6c427d5f2f794868cacca54da803b644dab91b00d473f24901bc1ecdb
+ size 4246829616
model-00018-of-00019.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9add3e39e92472082c78d8f33e1c5c90f574eb5557a3a5526afb52952a909c18
+ size 4246829608
model-00019-of-00019.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b39fc351334bba0d376baebf56c085cc9391935d5c4f9cc87cfa22e84f46821b
+ size 3617673152
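Each `.safetensors` entry above is a Git LFS pointer rather than the weights themselves: three `key value` lines giving the spec version, the SHA-256 object id, and the size in bytes. A minimal parsing sketch follows; the function name `parse_lfs_pointer` is hypothetical, and it assumes the pointer text has already had the diff's leading `+ ` markers removed.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its version, hash algorithm, digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


if __name__ == "__main__":
    # Pointer for model-00019-of-00019.safetensors, copied from the diff above.
    pointer = (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:b39fc351334bba0d376baebf56c085cc9391935d5c4f9cc87cfa22e84f46821b\n"
        "size 3617673152\n"
    )
    info = parse_lfs_pointer(pointer)
    print(info["algorithm"], info["size"])
```

After downloading a shard, its on-disk size and SHA-256 digest can be compared against these fields to confirm the transfer was complete.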
model.safetensors.index.json ADDED
@@ -0,0 +1,890 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ {
+ "metadata": {
+ "total_size": 79589392640
+ },
+ "weight_map": {
+ "lm_head.weight": "model-00001-of-00019.safetensors",
+ "model.embed_tokens.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.0.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.0.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.1.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.1.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.10.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.10.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.11.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.11.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.12.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.12.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.13.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.13.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.14.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.14.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.15.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.15.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.16.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.16.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.17.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.17.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.18.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.18.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.19.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.19.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.2.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.2.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.20.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.20.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.21.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.21.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.22.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.22.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.23.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.23.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.24.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.24.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.25.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.25.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.26.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.26.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.27.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.27.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.28.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.28.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.29.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.29.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.3.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.3.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.30.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.30.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.31.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.31.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.32.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.32.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.33.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.33.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.34.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.34.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.35.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.35.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.36.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.36.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.37.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.37.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.38.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.38.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.39.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.39.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.4.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.4.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.40.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.40.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.41.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.41.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.42.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.42.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.43.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.43.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.44.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.44.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.45.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.45.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.46.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.46.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.47.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.47.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.48.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.48.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.49.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.49.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.5.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.5.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.50.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.50.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.51.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.51.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.52.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.52.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.53.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.53.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.54.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.54.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.55.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.55.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.56.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.56.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.57.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.57.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.58.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.58.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.59.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.59.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.6.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.6.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.60.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.60.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.61.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.61.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.62.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.62.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.63.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.63.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.64.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.64.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.65.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.65.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.66.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.66.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.67.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.67.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.68.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.68.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.69.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.69.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.7.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.7.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.70.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.70.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.71.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.71.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.72.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.72.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.73.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.73.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.74.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.74.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.75.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.75.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.76.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.76.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.77.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.77.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.78.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.78.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.79.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.79.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.8.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.8.weight": "model-00001-of-00019.safetensors",
+ "model.gate_projections.9.bias": "model-00001-of-00019.safetensors",
+ "model.gate_projections.9.weight": "model-00001-of-00019.safetensors",
+ "model.layers.0.input_layernorm.weight": "model-00001-of-00019.safetensors",
+ "model.layers.0.mlp.down_proj.weight": "model-00001-of-00019.safetensors",
+ "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00019.safetensors",
+ "model.layers.0.mlp.up_proj.weight": "model-00001-of-00019.safetensors",
+ "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00019.safetensors",
+ "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00019.safetensors",
+ "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00019.safetensors",
+ "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00019.safetensors",
+ "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00019.safetensors",
+ "model.layers.1.input_layernorm.weight": "model-00001-of-00019.safetensors",
+ "model.layers.1.mlp.down_proj.weight": "model-00001-of-00019.safetensors",
+ "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00019.safetensors",
+ "model.layers.1.mlp.up_proj.weight": "model-00001-of-00019.safetensors",
+ "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00019.safetensors",
+ "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00019.safetensors",
+ "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00019.safetensors",
+ "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00019.safetensors",
+ "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00019.safetensors",
+ "model.layers.10.input_layernorm.weight": "model-00001-of-00019.safetensors",
+ "model.layers.10.mlp.down_proj.weight": "model-00001-of-00019.safetensors",
+ "model.layers.10.mlp.gate_proj.weight": "model-00001-of-00019.safetensors",
+ "model.layers.10.mlp.up_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.10.post_attention_layernorm.weight": "model-00002-of-00019.safetensors",
+ "model.layers.10.self_attn.k_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.10.self_attn.o_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.10.self_attn.q_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.10.self_attn.v_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.11.input_layernorm.weight": "model-00002-of-00019.safetensors",
+ "model.layers.11.mlp.down_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.11.mlp.gate_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.11.mlp.up_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.11.post_attention_layernorm.weight": "model-00002-of-00019.safetensors",
+ "model.layers.11.self_attn.k_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.11.self_attn.o_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.11.self_attn.q_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.11.self_attn.v_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.12.input_layernorm.weight": "model-00002-of-00019.safetensors",
+ "model.layers.12.mlp.down_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.12.mlp.gate_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.12.mlp.up_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.12.post_attention_layernorm.weight": "model-00002-of-00019.safetensors",
+ "model.layers.12.self_attn.k_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.12.self_attn.o_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.12.self_attn.q_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.12.self_attn.v_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.13.input_layernorm.weight": "model-00002-of-00019.safetensors",
+ "model.layers.13.mlp.down_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.13.mlp.gate_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.13.mlp.up_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.13.post_attention_layernorm.weight": "model-00002-of-00019.safetensors",
+ "model.layers.13.self_attn.k_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.13.self_attn.o_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.13.self_attn.q_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.13.self_attn.v_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.14.input_layernorm.weight": "model-00002-of-00019.safetensors",
+ "model.layers.14.mlp.down_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.14.mlp.gate_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.14.mlp.up_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.14.post_attention_layernorm.weight": "model-00002-of-00019.safetensors",
+ "model.layers.14.self_attn.k_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.14.self_attn.o_proj.weight": "model-00002-of-00019.safetensors",
+ "model.layers.14.self_attn.q_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.14.self_attn.v_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.15.input_layernorm.weight": "model-00003-of-00019.safetensors",
+ "model.layers.15.mlp.down_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.15.mlp.gate_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.15.mlp.up_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.15.post_attention_layernorm.weight": "model-00003-of-00019.safetensors",
+ "model.layers.15.self_attn.k_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.15.self_attn.o_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.15.self_attn.q_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.15.self_attn.v_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.16.input_layernorm.weight": "model-00003-of-00019.safetensors",
+ "model.layers.16.mlp.down_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.16.mlp.gate_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.16.mlp.up_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.16.post_attention_layernorm.weight": "model-00003-of-00019.safetensors",
+ "model.layers.16.self_attn.k_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.16.self_attn.o_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.16.self_attn.q_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.16.self_attn.v_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.17.input_layernorm.weight": "model-00003-of-00019.safetensors",
+ "model.layers.17.mlp.down_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.17.mlp.gate_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.17.mlp.up_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.17.post_attention_layernorm.weight": "model-00003-of-00019.safetensors",
+ "model.layers.17.self_attn.k_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.17.self_attn.o_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.17.self_attn.q_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.17.self_attn.v_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.18.input_layernorm.weight": "model-00003-of-00019.safetensors",
+ "model.layers.18.mlp.down_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.18.mlp.gate_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.18.mlp.up_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.18.post_attention_layernorm.weight": "model-00003-of-00019.safetensors",
+ "model.layers.18.self_attn.k_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.18.self_attn.o_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.18.self_attn.q_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.18.self_attn.v_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.19.input_layernorm.weight": "model-00003-of-00019.safetensors",
+ "model.layers.19.mlp.down_proj.weight": "model-00003-of-00019.safetensors",
+ "model.layers.19.mlp.gate_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.19.mlp.up_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.19.post_attention_layernorm.weight": "model-00004-of-00019.safetensors",
+ "model.layers.19.self_attn.k_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.19.self_attn.o_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.19.self_attn.q_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.19.self_attn.v_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.2.input_layernorm.weight": "model-00004-of-00019.safetensors",
+ "model.layers.2.mlp.down_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.2.mlp.gate_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.2.mlp.up_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.2.post_attention_layernorm.weight": "model-00004-of-00019.safetensors",
+ "model.layers.2.self_attn.k_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.2.self_attn.o_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.2.self_attn.q_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.2.self_attn.v_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.20.input_layernorm.weight": "model-00004-of-00019.safetensors",
+ "model.layers.20.mlp.down_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.20.mlp.gate_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.20.mlp.up_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.20.post_attention_layernorm.weight": "model-00004-of-00019.safetensors",
+ "model.layers.20.self_attn.k_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.20.self_attn.o_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.20.self_attn.q_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.20.self_attn.v_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.21.input_layernorm.weight": "model-00004-of-00019.safetensors",
+ "model.layers.21.mlp.down_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.21.mlp.gate_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.21.mlp.up_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.21.post_attention_layernorm.weight": "model-00004-of-00019.safetensors",
+ "model.layers.21.self_attn.k_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.21.self_attn.o_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.21.self_attn.q_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.21.self_attn.v_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.22.input_layernorm.weight": "model-00004-of-00019.safetensors",
+ "model.layers.22.mlp.down_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.22.mlp.gate_proj.weight": "model-00004-of-00019.safetensors",
+ "model.layers.22.mlp.up_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.22.post_attention_layernorm.weight": "model-00005-of-00019.safetensors",
+ "model.layers.22.self_attn.k_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.22.self_attn.o_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.22.self_attn.q_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.22.self_attn.v_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.23.input_layernorm.weight": "model-00005-of-00019.safetensors",
+ "model.layers.23.mlp.down_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.23.mlp.gate_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.23.mlp.up_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.23.post_attention_layernorm.weight": "model-00005-of-00019.safetensors",
+ "model.layers.23.self_attn.k_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.23.self_attn.o_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.23.self_attn.q_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.23.self_attn.v_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.24.input_layernorm.weight": "model-00005-of-00019.safetensors",
+ "model.layers.24.mlp.down_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.24.mlp.gate_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.24.mlp.up_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.24.post_attention_layernorm.weight": "model-00005-of-00019.safetensors",
+ "model.layers.24.self_attn.k_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.24.self_attn.o_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.24.self_attn.q_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.24.self_attn.v_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.25.input_layernorm.weight": "model-00005-of-00019.safetensors",
+ "model.layers.25.mlp.down_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.25.mlp.gate_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.25.mlp.up_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.25.post_attention_layernorm.weight": "model-00005-of-00019.safetensors",
+ "model.layers.25.self_attn.k_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.25.self_attn.o_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.25.self_attn.q_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.25.self_attn.v_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.26.input_layernorm.weight": "model-00005-of-00019.safetensors",
+ "model.layers.26.mlp.down_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.26.mlp.gate_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.26.mlp.up_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.26.post_attention_layernorm.weight": "model-00005-of-00019.safetensors",
+ "model.layers.26.self_attn.k_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.26.self_attn.o_proj.weight": "model-00005-of-00019.safetensors",
+ "model.layers.26.self_attn.q_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.26.self_attn.v_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.27.input_layernorm.weight": "model-00006-of-00019.safetensors",
+ "model.layers.27.mlp.down_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.27.mlp.gate_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.27.mlp.up_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.27.post_attention_layernorm.weight": "model-00006-of-00019.safetensors",
+ "model.layers.27.self_attn.k_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.27.self_attn.o_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.27.self_attn.q_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.27.self_attn.v_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.28.input_layernorm.weight": "model-00006-of-00019.safetensors",
+ "model.layers.28.mlp.down_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.28.mlp.gate_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.28.mlp.up_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.28.post_attention_layernorm.weight": "model-00006-of-00019.safetensors",
+ "model.layers.28.self_attn.k_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.28.self_attn.o_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.28.self_attn.q_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.28.self_attn.v_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.29.input_layernorm.weight": "model-00006-of-00019.safetensors",
+ "model.layers.29.mlp.down_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.29.mlp.gate_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.29.mlp.up_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.29.post_attention_layernorm.weight": "model-00006-of-00019.safetensors",
+ "model.layers.29.self_attn.k_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.29.self_attn.o_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.29.self_attn.q_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.29.self_attn.v_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.3.input_layernorm.weight": "model-00006-of-00019.safetensors",
+ "model.layers.3.mlp.down_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.3.mlp.gate_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.3.mlp.up_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.3.post_attention_layernorm.weight": "model-00006-of-00019.safetensors",
+ "model.layers.3.self_attn.k_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.3.self_attn.o_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.3.self_attn.q_proj.weight": "model-00006-of-00019.safetensors",
+ "model.layers.3.self_attn.v_proj.weight": "model-00006-of-00019.safetensors",
384
+ "model.layers.30.input_layernorm.weight": "model-00006-of-00019.safetensors",
385
+ "model.layers.30.mlp.down_proj.weight": "model-00006-of-00019.safetensors",
386
+ "model.layers.30.mlp.gate_proj.weight": "model-00007-of-00019.safetensors",
387
+ "model.layers.30.mlp.up_proj.weight": "model-00007-of-00019.safetensors",
388
+ "model.layers.30.post_attention_layernorm.weight": "model-00007-of-00019.safetensors",
389
+ "model.layers.30.self_attn.k_proj.weight": "model-00007-of-00019.safetensors",
390
+ "model.layers.30.self_attn.o_proj.weight": "model-00007-of-00019.safetensors",
391
+ "model.layers.30.self_attn.q_proj.weight": "model-00007-of-00019.safetensors",
392
+ "model.layers.30.self_attn.v_proj.weight": "model-00007-of-00019.safetensors",
393
+ "model.layers.31.input_layernorm.weight": "model-00007-of-00019.safetensors",
394
+ "model.layers.31.mlp.down_proj.weight": "model-00007-of-00019.safetensors",
395
+ "model.layers.31.mlp.gate_proj.weight": "model-00007-of-00019.safetensors",
396
+ "model.layers.31.mlp.up_proj.weight": "model-00007-of-00019.safetensors",
397
+ "model.layers.31.post_attention_layernorm.weight": "model-00007-of-00019.safetensors",
398
+ "model.layers.31.self_attn.k_proj.weight": "model-00007-of-00019.safetensors",
399
+ "model.layers.31.self_attn.o_proj.weight": "model-00007-of-00019.safetensors",
400
+ "model.layers.31.self_attn.q_proj.weight": "model-00007-of-00019.safetensors",
401
+ "model.layers.31.self_attn.v_proj.weight": "model-00007-of-00019.safetensors",
402
+ "model.layers.32.input_layernorm.weight": "model-00007-of-00019.safetensors",
403
+ "model.layers.32.mlp.down_proj.weight": "model-00007-of-00019.safetensors",
404
+ "model.layers.32.mlp.gate_proj.weight": "model-00007-of-00019.safetensors",
405
+ "model.layers.32.mlp.up_proj.weight": "model-00007-of-00019.safetensors",
406
+ "model.layers.32.post_attention_layernorm.weight": "model-00007-of-00019.safetensors",
407
+ "model.layers.32.self_attn.k_proj.weight": "model-00007-of-00019.safetensors",
408
+ "model.layers.32.self_attn.o_proj.weight": "model-00007-of-00019.safetensors",
409
+ "model.layers.32.self_attn.q_proj.weight": "model-00007-of-00019.safetensors",
410
+ "model.layers.32.self_attn.v_proj.weight": "model-00007-of-00019.safetensors",
+ "model.layers.33.input_layernorm.weight": "model-00007-of-00019.safetensors",
+ "model.layers.33.mlp.down_proj.weight": "model-00007-of-00019.safetensors",
+ "model.layers.33.mlp.gate_proj.weight": "model-00007-of-00019.safetensors",
+ "model.layers.33.mlp.up_proj.weight": "model-00007-of-00019.safetensors",
+ "model.layers.33.post_attention_layernorm.weight": "model-00007-of-00019.safetensors",
+ "model.layers.33.self_attn.k_proj.weight": "model-00007-of-00019.safetensors",
+ "model.layers.33.self_attn.o_proj.weight": "model-00007-of-00019.safetensors",
+ "model.layers.33.self_attn.q_proj.weight": "model-00007-of-00019.safetensors",
+ "model.layers.33.self_attn.v_proj.weight": "model-00007-of-00019.safetensors",
+ "model.layers.34.input_layernorm.weight": "model-00007-of-00019.safetensors",
+ "model.layers.34.mlp.down_proj.weight": "model-00007-of-00019.safetensors",
+ "model.layers.34.mlp.gate_proj.weight": "model-00007-of-00019.safetensors",
+ "model.layers.34.mlp.up_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.34.post_attention_layernorm.weight": "model-00008-of-00019.safetensors",
+ "model.layers.34.self_attn.k_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.34.self_attn.o_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.34.self_attn.q_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.34.self_attn.v_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.35.input_layernorm.weight": "model-00008-of-00019.safetensors",
+ "model.layers.35.mlp.down_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.35.mlp.gate_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.35.mlp.up_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.35.post_attention_layernorm.weight": "model-00008-of-00019.safetensors",
+ "model.layers.35.self_attn.k_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.35.self_attn.o_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.35.self_attn.q_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.35.self_attn.v_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.36.input_layernorm.weight": "model-00008-of-00019.safetensors",
+ "model.layers.36.mlp.down_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.36.mlp.gate_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.36.mlp.up_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.36.post_attention_layernorm.weight": "model-00008-of-00019.safetensors",
+ "model.layers.36.self_attn.k_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.36.self_attn.o_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.36.self_attn.q_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.36.self_attn.v_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.37.input_layernorm.weight": "model-00008-of-00019.safetensors",
+ "model.layers.37.mlp.down_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.37.mlp.gate_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.37.mlp.up_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.37.post_attention_layernorm.weight": "model-00008-of-00019.safetensors",
+ "model.layers.37.self_attn.k_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.37.self_attn.o_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.37.self_attn.q_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.37.self_attn.v_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.38.input_layernorm.weight": "model-00008-of-00019.safetensors",
+ "model.layers.38.mlp.down_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.38.mlp.gate_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.38.mlp.up_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.38.post_attention_layernorm.weight": "model-00008-of-00019.safetensors",
+ "model.layers.38.self_attn.k_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.38.self_attn.o_proj.weight": "model-00008-of-00019.safetensors",
+ "model.layers.38.self_attn.q_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.38.self_attn.v_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.39.input_layernorm.weight": "model-00009-of-00019.safetensors",
+ "model.layers.39.mlp.down_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.39.mlp.gate_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.39.mlp.up_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.39.post_attention_layernorm.weight": "model-00009-of-00019.safetensors",
+ "model.layers.39.self_attn.k_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.39.self_attn.o_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.39.self_attn.q_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.39.self_attn.v_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.4.input_layernorm.weight": "model-00009-of-00019.safetensors",
+ "model.layers.4.mlp.down_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.4.mlp.gate_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.4.mlp.up_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.4.post_attention_layernorm.weight": "model-00009-of-00019.safetensors",
+ "model.layers.4.self_attn.k_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.4.self_attn.o_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.4.self_attn.q_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.4.self_attn.v_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.40.input_layernorm.weight": "model-00009-of-00019.safetensors",
+ "model.layers.40.mlp.down_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.40.mlp.gate_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.40.mlp.up_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.40.post_attention_layernorm.weight": "model-00009-of-00019.safetensors",
+ "model.layers.40.self_attn.k_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.40.self_attn.o_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.40.self_attn.q_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.40.self_attn.v_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.41.input_layernorm.weight": "model-00009-of-00019.safetensors",
+ "model.layers.41.mlp.down_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.41.mlp.gate_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.41.mlp.up_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.41.post_attention_layernorm.weight": "model-00009-of-00019.safetensors",
+ "model.layers.41.self_attn.k_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.41.self_attn.o_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.41.self_attn.q_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.41.self_attn.v_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.42.input_layernorm.weight": "model-00009-of-00019.safetensors",
+ "model.layers.42.mlp.down_proj.weight": "model-00009-of-00019.safetensors",
+ "model.layers.42.mlp.gate_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.42.mlp.up_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.42.post_attention_layernorm.weight": "model-00010-of-00019.safetensors",
+ "model.layers.42.self_attn.k_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.42.self_attn.o_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.42.self_attn.q_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.42.self_attn.v_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.43.input_layernorm.weight": "model-00010-of-00019.safetensors",
+ "model.layers.43.mlp.down_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.43.mlp.gate_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.43.mlp.up_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.43.post_attention_layernorm.weight": "model-00010-of-00019.safetensors",
+ "model.layers.43.self_attn.k_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.43.self_attn.o_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.43.self_attn.q_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.43.self_attn.v_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.44.input_layernorm.weight": "model-00010-of-00019.safetensors",
+ "model.layers.44.mlp.down_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.44.mlp.gate_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.44.mlp.up_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.44.post_attention_layernorm.weight": "model-00010-of-00019.safetensors",
+ "model.layers.44.self_attn.k_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.44.self_attn.o_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.44.self_attn.q_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.44.self_attn.v_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.45.input_layernorm.weight": "model-00010-of-00019.safetensors",
+ "model.layers.45.mlp.down_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.45.mlp.gate_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.45.mlp.up_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.45.post_attention_layernorm.weight": "model-00010-of-00019.safetensors",
+ "model.layers.45.self_attn.k_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.45.self_attn.o_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.45.self_attn.q_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.45.self_attn.v_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.46.input_layernorm.weight": "model-00010-of-00019.safetensors",
+ "model.layers.46.mlp.down_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.46.mlp.gate_proj.weight": "model-00010-of-00019.safetensors",
+ "model.layers.46.mlp.up_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.46.post_attention_layernorm.weight": "model-00011-of-00019.safetensors",
+ "model.layers.46.self_attn.k_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.46.self_attn.o_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.46.self_attn.q_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.46.self_attn.v_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.47.input_layernorm.weight": "model-00011-of-00019.safetensors",
+ "model.layers.47.mlp.down_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.47.mlp.gate_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.47.mlp.up_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.47.post_attention_layernorm.weight": "model-00011-of-00019.safetensors",
+ "model.layers.47.self_attn.k_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.47.self_attn.o_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.47.self_attn.q_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.47.self_attn.v_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.48.input_layernorm.weight": "model-00011-of-00019.safetensors",
+ "model.layers.48.mlp.down_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.48.mlp.gate_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.48.mlp.up_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.48.post_attention_layernorm.weight": "model-00011-of-00019.safetensors",
+ "model.layers.48.self_attn.k_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.48.self_attn.o_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.48.self_attn.q_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.48.self_attn.v_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.49.input_layernorm.weight": "model-00011-of-00019.safetensors",
+ "model.layers.49.mlp.down_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.49.mlp.gate_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.49.mlp.up_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.49.post_attention_layernorm.weight": "model-00011-of-00019.safetensors",
+ "model.layers.49.self_attn.k_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.49.self_attn.o_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.49.self_attn.q_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.49.self_attn.v_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.5.input_layernorm.weight": "model-00011-of-00019.safetensors",
+ "model.layers.5.mlp.down_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.5.mlp.gate_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.5.mlp.up_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.5.post_attention_layernorm.weight": "model-00011-of-00019.safetensors",
+ "model.layers.5.self_attn.k_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.5.self_attn.o_proj.weight": "model-00011-of-00019.safetensors",
+ "model.layers.5.self_attn.q_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.5.self_attn.v_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.50.input_layernorm.weight": "model-00012-of-00019.safetensors",
+ "model.layers.50.mlp.down_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.50.mlp.gate_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.50.mlp.up_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.50.post_attention_layernorm.weight": "model-00012-of-00019.safetensors",
+ "model.layers.50.self_attn.k_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.50.self_attn.o_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.50.self_attn.q_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.50.self_attn.v_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.51.input_layernorm.weight": "model-00012-of-00019.safetensors",
+ "model.layers.51.mlp.down_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.51.mlp.gate_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.51.mlp.up_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.51.post_attention_layernorm.weight": "model-00012-of-00019.safetensors",
+ "model.layers.51.self_attn.k_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.51.self_attn.o_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.51.self_attn.q_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.51.self_attn.v_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.52.input_layernorm.weight": "model-00012-of-00019.safetensors",
+ "model.layers.52.mlp.down_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.52.mlp.gate_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.52.mlp.up_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.52.post_attention_layernorm.weight": "model-00012-of-00019.safetensors",
+ "model.layers.52.self_attn.k_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.52.self_attn.o_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.52.self_attn.q_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.52.self_attn.v_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.53.input_layernorm.weight": "model-00012-of-00019.safetensors",
+ "model.layers.53.mlp.down_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.53.mlp.gate_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.53.mlp.up_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.53.post_attention_layernorm.weight": "model-00012-of-00019.safetensors",
+ "model.layers.53.self_attn.k_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.53.self_attn.o_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.53.self_attn.q_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.53.self_attn.v_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.54.input_layernorm.weight": "model-00012-of-00019.safetensors",
+ "model.layers.54.mlp.down_proj.weight": "model-00012-of-00019.safetensors",
+ "model.layers.54.mlp.gate_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.54.mlp.up_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.54.post_attention_layernorm.weight": "model-00013-of-00019.safetensors",
+ "model.layers.54.self_attn.k_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.54.self_attn.o_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.54.self_attn.q_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.54.self_attn.v_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.55.input_layernorm.weight": "model-00013-of-00019.safetensors",
+ "model.layers.55.mlp.down_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.55.mlp.gate_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.55.mlp.up_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.55.post_attention_layernorm.weight": "model-00013-of-00019.safetensors",
+ "model.layers.55.self_attn.k_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.55.self_attn.o_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.55.self_attn.q_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.55.self_attn.v_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.56.input_layernorm.weight": "model-00013-of-00019.safetensors",
+ "model.layers.56.mlp.down_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.56.mlp.gate_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.56.mlp.up_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.56.post_attention_layernorm.weight": "model-00013-of-00019.safetensors",
+ "model.layers.56.self_attn.k_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.56.self_attn.o_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.56.self_attn.q_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.56.self_attn.v_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.57.input_layernorm.weight": "model-00013-of-00019.safetensors",
+ "model.layers.57.mlp.down_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.57.mlp.gate_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.57.mlp.up_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.57.post_attention_layernorm.weight": "model-00013-of-00019.safetensors",
+ "model.layers.57.self_attn.k_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.57.self_attn.o_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.57.self_attn.q_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.57.self_attn.v_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.58.input_layernorm.weight": "model-00013-of-00019.safetensors",
+ "model.layers.58.mlp.down_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.58.mlp.gate_proj.weight": "model-00013-of-00019.safetensors",
+ "model.layers.58.mlp.up_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.58.post_attention_layernorm.weight": "model-00014-of-00019.safetensors",
+ "model.layers.58.self_attn.k_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.58.self_attn.o_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.58.self_attn.q_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.58.self_attn.v_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.59.input_layernorm.weight": "model-00014-of-00019.safetensors",
+ "model.layers.59.mlp.down_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.59.mlp.gate_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.59.mlp.up_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.59.post_attention_layernorm.weight": "model-00014-of-00019.safetensors",
+ "model.layers.59.self_attn.k_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.59.self_attn.o_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.59.self_attn.q_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.59.self_attn.v_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.6.input_layernorm.weight": "model-00014-of-00019.safetensors",
+ "model.layers.6.mlp.down_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.6.mlp.gate_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.6.mlp.up_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.6.post_attention_layernorm.weight": "model-00014-of-00019.safetensors",
+ "model.layers.6.self_attn.k_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.6.self_attn.o_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.6.self_attn.q_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.6.self_attn.v_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.60.input_layernorm.weight": "model-00014-of-00019.safetensors",
+ "model.layers.60.mlp.down_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.60.mlp.gate_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.60.mlp.up_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.60.post_attention_layernorm.weight": "model-00014-of-00019.safetensors",
+ "model.layers.60.self_attn.k_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.60.self_attn.o_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.60.self_attn.q_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.60.self_attn.v_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.61.input_layernorm.weight": "model-00014-of-00019.safetensors",
+ "model.layers.61.mlp.down_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.61.mlp.gate_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.61.mlp.up_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.61.post_attention_layernorm.weight": "model-00014-of-00019.safetensors",
+ "model.layers.61.self_attn.k_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.61.self_attn.o_proj.weight": "model-00014-of-00019.safetensors",
+ "model.layers.61.self_attn.q_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.61.self_attn.v_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.62.input_layernorm.weight": "model-00015-of-00019.safetensors",
+ "model.layers.62.mlp.down_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.62.mlp.gate_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.62.mlp.up_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.62.post_attention_layernorm.weight": "model-00015-of-00019.safetensors",
+ "model.layers.62.self_attn.k_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.62.self_attn.o_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.62.self_attn.q_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.62.self_attn.v_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.63.input_layernorm.weight": "model-00015-of-00019.safetensors",
+ "model.layers.63.mlp.down_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.63.mlp.gate_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.63.mlp.up_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.63.post_attention_layernorm.weight": "model-00015-of-00019.safetensors",
+ "model.layers.63.self_attn.k_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.63.self_attn.o_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.63.self_attn.q_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.63.self_attn.v_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.64.input_layernorm.weight": "model-00015-of-00019.safetensors",
+ "model.layers.64.mlp.down_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.64.mlp.gate_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.64.mlp.up_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.64.post_attention_layernorm.weight": "model-00015-of-00019.safetensors",
+ "model.layers.64.self_attn.k_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.64.self_attn.o_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.64.self_attn.q_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.64.self_attn.v_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.65.input_layernorm.weight": "model-00015-of-00019.safetensors",
+ "model.layers.65.mlp.down_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.65.mlp.gate_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.65.mlp.up_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.65.post_attention_layernorm.weight": "model-00015-of-00019.safetensors",
+ "model.layers.65.self_attn.k_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.65.self_attn.o_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.65.self_attn.q_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.65.self_attn.v_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.66.input_layernorm.weight": "model-00015-of-00019.safetensors",
+ "model.layers.66.mlp.down_proj.weight": "model-00015-of-00019.safetensors",
+ "model.layers.66.mlp.gate_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.66.mlp.up_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.66.post_attention_layernorm.weight": "model-00016-of-00019.safetensors",
+ "model.layers.66.self_attn.k_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.66.self_attn.o_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.66.self_attn.q_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.66.self_attn.v_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.67.input_layernorm.weight": "model-00016-of-00019.safetensors",
+ "model.layers.67.mlp.down_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.67.mlp.gate_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.67.mlp.up_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.67.post_attention_layernorm.weight": "model-00016-of-00019.safetensors",
+ "model.layers.67.self_attn.k_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.67.self_attn.o_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.67.self_attn.q_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.67.self_attn.v_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.68.input_layernorm.weight": "model-00016-of-00019.safetensors",
+ "model.layers.68.mlp.down_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.68.mlp.gate_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.68.mlp.up_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.68.post_attention_layernorm.weight": "model-00016-of-00019.safetensors",
+ "model.layers.68.self_attn.k_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.68.self_attn.o_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.68.self_attn.q_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.68.self_attn.v_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.69.input_layernorm.weight": "model-00016-of-00019.safetensors",
+ "model.layers.69.mlp.down_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.69.mlp.gate_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.69.mlp.up_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.69.post_attention_layernorm.weight": "model-00016-of-00019.safetensors",
+ "model.layers.69.self_attn.k_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.69.self_attn.o_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.69.self_attn.q_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.69.self_attn.v_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.7.input_layernorm.weight": "model-00016-of-00019.safetensors",
+ "model.layers.7.mlp.down_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.7.mlp.gate_proj.weight": "model-00016-of-00019.safetensors",
+ "model.layers.7.mlp.up_proj.weight": "model-00017-of-00019.safetensors",
+ "model.layers.7.post_attention_layernorm.weight": "model-00017-of-00019.safetensors",
+ "model.layers.7.self_attn.k_proj.weight": "model-00017-of-00019.safetensors",
+ "model.layers.7.self_attn.o_proj.weight": "model-00017-of-00019.safetensors",
+ "model.layers.7.self_attn.q_proj.weight": "model-00017-of-00019.safetensors",
+ "model.layers.7.self_attn.v_proj.weight": "model-00017-of-00019.safetensors",
+ "model.layers.70.input_layernorm.weight": "model-00017-of-00019.safetensors",
+ "model.layers.70.mlp.down_proj.weight": "model-00017-of-00019.safetensors",
+ "model.layers.70.mlp.gate_proj.weight": "model-00017-of-00019.safetensors",
+ "model.layers.70.mlp.up_proj.weight": "model-00017-of-00019.safetensors",
+ "model.layers.70.post_attention_layernorm.weight": "model-00017-of-00019.safetensors",
+ "model.layers.70.self_attn.k_proj.weight": "model-00017-of-00019.safetensors",
+ "model.layers.70.self_attn.o_proj.weight": "model-00017-of-00019.safetensors",
+ "model.layers.70.self_attn.q_proj.weight": "model-00017-of-00019.safetensors",
+ "model.layers.70.self_attn.v_proj.weight": "model-00017-of-00019.safetensors",
+ "model.layers.71.input_layernorm.weight": "model-00017-of-00019.safetensors",
790
+ "model.layers.71.mlp.down_proj.weight": "model-00017-of-00019.safetensors",
791
+ "model.layers.71.mlp.gate_proj.weight": "model-00017-of-00019.safetensors",
792
+ "model.layers.71.mlp.up_proj.weight": "model-00017-of-00019.safetensors",
793
+ "model.layers.71.post_attention_layernorm.weight": "model-00017-of-00019.safetensors",
794
+ "model.layers.71.self_attn.k_proj.weight": "model-00017-of-00019.safetensors",
795
+ "model.layers.71.self_attn.o_proj.weight": "model-00017-of-00019.safetensors",
796
+ "model.layers.71.self_attn.q_proj.weight": "model-00017-of-00019.safetensors",
797
+ "model.layers.71.self_attn.v_proj.weight": "model-00017-of-00019.safetensors",
798
+ "model.layers.72.input_layernorm.weight": "model-00017-of-00019.safetensors",
799
+ "model.layers.72.mlp.down_proj.weight": "model-00017-of-00019.safetensors",
800
+ "model.layers.72.mlp.gate_proj.weight": "model-00017-of-00019.safetensors",
801
+ "model.layers.72.mlp.up_proj.weight": "model-00017-of-00019.safetensors",
802
+ "model.layers.72.post_attention_layernorm.weight": "model-00017-of-00019.safetensors",
803
+ "model.layers.72.self_attn.k_proj.weight": "model-00017-of-00019.safetensors",
804
+ "model.layers.72.self_attn.o_proj.weight": "model-00017-of-00019.safetensors",
805
+ "model.layers.72.self_attn.q_proj.weight": "model-00017-of-00019.safetensors",
806
+ "model.layers.72.self_attn.v_proj.weight": "model-00017-of-00019.safetensors",
807
+ "model.layers.73.input_layernorm.weight": "model-00017-of-00019.safetensors",
808
+ "model.layers.73.mlp.down_proj.weight": "model-00017-of-00019.safetensors",
809
+ "model.layers.73.mlp.gate_proj.weight": "model-00017-of-00019.safetensors",
810
+ "model.layers.73.mlp.up_proj.weight": "model-00017-of-00019.safetensors",
811
+ "model.layers.73.post_attention_layernorm.weight": "model-00017-of-00019.safetensors",
812
+ "model.layers.73.self_attn.k_proj.weight": "model-00017-of-00019.safetensors",
813
+ "model.layers.73.self_attn.o_proj.weight": "model-00017-of-00019.safetensors",
814
+ "model.layers.73.self_attn.q_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.73.self_attn.v_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.74.input_layernorm.weight": "model-00018-of-00019.safetensors",
+ "model.layers.74.mlp.down_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.74.mlp.gate_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.74.mlp.up_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.74.post_attention_layernorm.weight": "model-00018-of-00019.safetensors",
+ "model.layers.74.self_attn.k_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.74.self_attn.o_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.74.self_attn.q_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.74.self_attn.v_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.75.input_layernorm.weight": "model-00018-of-00019.safetensors",
+ "model.layers.75.mlp.down_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.75.mlp.gate_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.75.mlp.up_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.75.post_attention_layernorm.weight": "model-00018-of-00019.safetensors",
+ "model.layers.75.self_attn.k_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.75.self_attn.o_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.75.self_attn.q_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.75.self_attn.v_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.76.input_layernorm.weight": "model-00018-of-00019.safetensors",
+ "model.layers.76.mlp.down_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.76.mlp.gate_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.76.mlp.up_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.76.post_attention_layernorm.weight": "model-00018-of-00019.safetensors",
+ "model.layers.76.self_attn.k_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.76.self_attn.o_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.76.self_attn.q_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.76.self_attn.v_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.77.input_layernorm.weight": "model-00018-of-00019.safetensors",
+ "model.layers.77.mlp.down_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.77.mlp.gate_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.77.mlp.up_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.77.post_attention_layernorm.weight": "model-00018-of-00019.safetensors",
+ "model.layers.77.self_attn.k_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.77.self_attn.o_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.77.self_attn.q_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.77.self_attn.v_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.78.input_layernorm.weight": "model-00018-of-00019.safetensors",
+ "model.layers.78.mlp.down_proj.weight": "model-00018-of-00019.safetensors",
+ "model.layers.78.mlp.gate_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.78.mlp.up_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.78.post_attention_layernorm.weight": "model-00019-of-00019.safetensors",
+ "model.layers.78.self_attn.k_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.78.self_attn.o_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.78.self_attn.q_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.78.self_attn.v_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.79.input_layernorm.weight": "model-00019-of-00019.safetensors",
+ "model.layers.79.mlp.down_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.79.mlp.gate_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.79.mlp.up_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.79.post_attention_layernorm.weight": "model-00019-of-00019.safetensors",
+ "model.layers.79.self_attn.k_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.79.self_attn.o_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.79.self_attn.q_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.79.self_attn.v_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.8.input_layernorm.weight": "model-00019-of-00019.safetensors",
+ "model.layers.8.mlp.down_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.8.mlp.gate_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.8.mlp.up_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.8.post_attention_layernorm.weight": "model-00019-of-00019.safetensors",
+ "model.layers.8.self_attn.k_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.8.self_attn.o_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.8.self_attn.q_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.8.self_attn.v_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.9.input_layernorm.weight": "model-00019-of-00019.safetensors",
+ "model.layers.9.mlp.down_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.9.mlp.gate_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.9.mlp.up_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.9.post_attention_layernorm.weight": "model-00019-of-00019.safetensors",
+ "model.layers.9.self_attn.k_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.9.self_attn.o_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.9.self_attn.q_proj.weight": "model-00019-of-00019.safetensors",
+ "model.layers.9.self_attn.v_proj.weight": "model-00019-of-00019.safetensors",
+ "model.norm.weight": "model-00019-of-00019.safetensors"
+ }
+ }
modeling_iquestloopcoder.py ADDED
@@ -0,0 +1,1421 @@
+ """
+ Modified MIT License
+
+ Software Copyright© 2025 IQuest Research
+
+ Our only modification is that, if the Software (or any derivative works
+ thereof) is used for any of your commercial products or services, you shall
+ prominently display "IQuest Coder" on the user interface of such product or
+ service.
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+ """
+
+ import math
+ from typing import Any, List, Optional, Tuple, Union
+
+ import torch
+ import torch.nn.functional as F
+ import torch.utils.checkpoint
+ from torch import nn
+
+ from transformers.activations import ACT2FN
+ from transformers.cache_utils import Cache, DynamicCache, StaticCache
+ from transformers.modeling_attn_mask_utils import AttentionMaskConverter
+ from transformers.modeling_outputs import (
+     BaseModelOutputWithPast,
+     CausalLMOutputWithPast,
+ )
+ from transformers.modeling_utils import PreTrainedModel
+ from transformers.generation.utils import GenerationMixin
+ from transformers.utils import (
+     add_start_docstrings,
+     add_start_docstrings_to_model_forward,
+     logging,
+     replace_return_docstrings,
+ )
+
+ from .configuration_iquestloopcoder import IQuestLoopCoderConfig
+
+ logger = logging.get_logger(__name__)
+
+ _CONFIG_FOR_DOC = "IQuestLoopCoderConfig"
+
+
+ class IQuestLoopCoderCache(Cache):
+     """Cache implementation for IQuestLoopCoder that manages shared and local KV caches.
+
+     - shared_key_cache/shared_value_cache: Stores KV from Loop 1 (global context)
+     - local_key_cache/local_value_cache: Stores KV from Loop 2+ (local window, only window_size tokens)
+     """
+
+     def __init__(self, window_size: int, num_layers: int):
+         # We intentionally don't call super().__init__ because the parent assumes static cache sizes.
+         self.window_size = window_size
+         self.num_layers = num_layers
+
+         # Shared cache: stores Loop 1 KV (global context)
+         self.shared_key_cache: List[Optional[torch.Tensor]] = [None] * num_layers
+         self.shared_value_cache: List[Optional[torch.Tensor]] = [None] * num_layers
+
+         # Local cache: stores Loop 2+ KV (sliding window, only window_size tokens)
+         self.local_key_cache: List[Optional[torch.Tensor]] = [None] * num_layers
+         self.local_value_cache: List[Optional[torch.Tensor]] = [None] * num_layers
+
+         self.layers: List[Any] = []  # attribute expected by HF Cache utilities
+         self._seen_tokens = 0
+
+     def update_shared(
+         self,
+         key_states: torch.Tensor,
+         value_states: torch.Tensor,
+         layer_idx: int,
+         cache_kwargs: Optional[dict] = None,
+     ) -> Tuple[torch.Tensor, torch.Tensor]:
+         """Update shared cache (Loop 1 KV)."""
+         if layer_idx < 0 or layer_idx >= self.num_layers:
+             raise ValueError(f"layer_idx must be in [0, {self.num_layers}), got {layer_idx}")
+
+         cached_key = self.shared_key_cache[layer_idx]
+         cached_value = self.shared_value_cache[layer_idx]
+
+         if cached_key is None:
+             self.shared_key_cache[layer_idx] = key_states
+             self.shared_value_cache[layer_idx] = value_states
+         else:
+             if (
+                 key_states.shape[0] != cached_key.shape[0]
+                 or key_states.shape[1] != cached_key.shape[1]
+                 or key_states.shape[3] != cached_key.shape[3]
+             ):
+                 raise ValueError(
+                     "Cached and incoming key/value tensors must match on batch, head, and head_dim dimensions."
+                 )
+             assert cached_value is not None
+             self.shared_key_cache[layer_idx] = torch.cat([cached_key, key_states], dim=2)
+             self.shared_value_cache[layer_idx] = torch.cat([cached_value, value_states], dim=2)
+
+         result_key = self.shared_key_cache[layer_idx]
+         result_value = self.shared_value_cache[layer_idx]
+         assert result_key is not None and result_value is not None
+
+         # Track sequence length
+         self._seen_tokens = result_key.shape[2]
+         return result_key, result_value
+
+     def update_local(
+         self,
+         key_states: torch.Tensor,
+         value_states: torch.Tensor,
+         layer_idx: int,
+         cache_kwargs: Optional[dict] = None,
+     ) -> Tuple[torch.Tensor, torch.Tensor]:
+         """Update local cache (Loop 2+ KV) with sliding window management.
+
+         If the cache is full (window_size tokens), remove the oldest token and add the new one.
+         """
+         if layer_idx < 0 or layer_idx >= self.num_layers:
+             raise ValueError(f"layer_idx must be in [0, {self.num_layers}), got {layer_idx}")
+
+         cached_key = self.local_key_cache[layer_idx]
+         cached_value = self.local_value_cache[layer_idx]
+
+         if cached_key is None:
+             # First token in local cache
+             self.local_key_cache[layer_idx] = key_states
+             self.local_value_cache[layer_idx] = value_states
+         else:
+             if (
+                 key_states.shape[0] != cached_key.shape[0]
+                 or key_states.shape[1] != cached_key.shape[1]
+                 or key_states.shape[3] != cached_key.shape[3]
+             ):
+                 raise ValueError(
+                     "Cached and incoming key/value tensors must match on batch, head, and head_dim dimensions."
+                 )
+             assert cached_value is not None
+
+             # Check if we need to remove the oldest token
+             current_len = cached_key.shape[2]
+             if current_len >= self.window_size:
+                 # Remove the first token (oldest) and add the new one
+                 self.local_key_cache[layer_idx] = torch.cat([cached_key[:, :, 1:, :], key_states], dim=2)
+                 self.local_value_cache[layer_idx] = torch.cat([cached_value[:, :, 1:, :], value_states], dim=2)
+             else:
+                 # Just append
+                 self.local_key_cache[layer_idx] = torch.cat([cached_key, key_states], dim=2)
+                 self.local_value_cache[layer_idx] = torch.cat([cached_value, value_states], dim=2)
+
+         result_key = self.local_key_cache[layer_idx]
+         result_value = self.local_value_cache[layer_idx]
+         assert result_key is not None and result_value is not None
+
+         return result_key, result_value
+
+     def get_shared(self, layer_idx: int) -> Tuple[Optional[torch.Tensor], Optional[torch.Tensor]]:
+         """Get shared cache for a layer."""
+         if layer_idx < 0 or layer_idx >= self.num_layers:
+             return None, None
+         return self.shared_key_cache[layer_idx], self.shared_value_cache[layer_idx]
+
+     def get_local(self, layer_idx: int) -> Tuple[Optional[torch.Tensor], Optional[torch.Tensor]]:
+         """Get local cache for a layer."""
+         if layer_idx < 0 or layer_idx >= self.num_layers:
+             return None, None
+         return self.local_key_cache[layer_idx], self.local_value_cache[layer_idx]
+
+     def update(
+         self,
+         key_states: torch.Tensor,
+         value_states: torch.Tensor,
+         layer_idx: int,
+         cache_kwargs: Optional[dict] = None,
+     ) -> Tuple[torch.Tensor, torch.Tensor]:
+         """Default update method (for compatibility, updates shared cache)."""
+         return self.update_shared(key_states, value_states, layer_idx, cache_kwargs)
+
+     def get_seq_length(self, layer_idx: Optional[int] = 0) -> int:
+         """Get sequence length from shared cache."""
+         if layer_idx is None:
+             layer_idx = 0
+         if layer_idx < 0 or layer_idx >= len(self.shared_key_cache):
+             return 0
+         cached = self.shared_key_cache[layer_idx]
+         if cached is None:
+             return 0
+         return cached.shape[2]
+
+     def get_max_length(self) -> Optional[int]:
+         return None
+
+     def get_usable_length(
+         self, new_seq_length: int, layer_idx: Optional[int] = 0
+     ) -> int:
+         return self.get_seq_length(layer_idx)
+
+     def reorder_cache(self, beam_idx: torch.LongTensor) -> None:
+         """Reorder cache for beam search."""
+         for layer_idx in range(self.num_layers):
+             if self.shared_key_cache[layer_idx] is not None:
+                 device = self.shared_key_cache[layer_idx].device
+                 self.shared_key_cache[layer_idx] = self.shared_key_cache[layer_idx].index_select(0, beam_idx.to(device))
+                 self.shared_value_cache[layer_idx] = self.shared_value_cache[layer_idx].index_select(0, beam_idx.to(device))
+
+             if self.local_key_cache[layer_idx] is not None:
+                 device = self.local_key_cache[layer_idx].device
+                 self.local_key_cache[layer_idx] = self.local_key_cache[layer_idx].index_select(0, beam_idx.to(device))
+                 self.local_value_cache[layer_idx] = self.local_value_cache[layer_idx].index_select(0, beam_idx.to(device))
+
+     @property
+     def is_compileable(self) -> bool:
+         return False
+
+     def clear(self) -> None:
+         """Clear all caches."""
+         logger.debug("Clearing IQuestLoopCoderCache")
+         self.shared_key_cache = [None] * self.num_layers
+         self.shared_value_cache = [None] * self.num_layers
+         self.local_key_cache = [None] * self.num_layers
+         self.local_value_cache = [None] * self.num_layers
+         self._seen_tokens = 0
+
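Note on `update_local`: once the local cache holds `window_size` positions, each call drops exactly one cached position before appending. That keeps the length fixed when one position arrives per call; whether callers ever pass several positions at once is not visible in this part of the diff. A minimal sketch of a trim that holds for any number of incoming positions, using plain Python lists in place of the sequence dimension of the KV tensors (`append_with_window` is a hypothetical helper, not part of this file):

```python
from typing import List, Sequence


def append_with_window(cached: Sequence[int], new: Sequence[int], window_size: int) -> List[int]:
    """Append new positions, then keep only the most recent window_size of them."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    combined = list(cached) + list(new)
    # Slicing from the end also covers the case where `new` alone exceeds the window.
    return combined[-window_size:]


# One position per call: same result as dropping the oldest entry.
single_step = append_with_window([1, 2, 3], [4], window_size=3)
# Two positions in one call: two old entries are dropped, not one.
two_steps = append_with_window([1, 2, 3], [4, 5], window_size=3)
```

With tensors, the same idea could be written as a `torch.cat(..., dim=2)` followed by a `[:, :, -window_size:, :]` slice.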
+
+ class IQuestLoopCoderRMSNorm(nn.Module):
+     """RMS Normalization layer."""
+
+     def __init__(self, hidden_size, eps=1e-6):
+         super().__init__()
+         self.weight = nn.Parameter(torch.ones(hidden_size))
+         self.variance_epsilon = eps
+
+     def forward(self, hidden_states):
+         input_dtype = hidden_states.dtype
+         hidden_states = hidden_states.to(torch.float32)
+         variance = hidden_states.pow(2).mean(-1, keepdim=True)
+         hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+         return self.weight * hidden_states.to(input_dtype)
+
+
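`IQuestLoopCoderRMSNorm.forward` rescales each vector along the last dimension by the reciprocal of its root mean square and then applies the learned per-channel weight. A dependency-free sketch of the same arithmetic on a single vector (`rms_norm` is an illustrative helper, not part of this file):

```python
import math
from typing import List, Sequence


def rms_norm(x: Sequence[float], weight: Sequence[float], eps: float = 1e-6) -> List[float]:
    """y_i = weight_i * x_i / sqrt(mean(x^2) + eps)."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * scale for w, v in zip(weight, x)]


normalized = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
# With unit weights and eps=0 the output has a mean square of 1.
output_mean_square = sum(v * v for v in normalized) / len(normalized)
```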
+ class IQuestLoopCoderRotaryEmbedding(nn.Module):
+     """Rotary Position Embedding (RoPE)."""
+
+     def __init__(self, dim, max_position_embeddings=8192, base=500000.0, device=None, scaling_factor=1.0):
+         super().__init__()
+         self.scaling_factor = scaling_factor
+         self.dim = dim
+         self.max_position_embeddings = max_position_embeddings
+         self.base = base
+         inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
+         self.register_buffer("inv_freq", inv_freq, persistent=False)
+         self.max_seq_len_cached = max_position_embeddings
+
+     @torch.no_grad()
+     def forward(self, x, position_ids):
+         # x: [batch_size, num_heads, seq_len, head_dim]
+         inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
+         position_ids_expanded = position_ids[:, None, :].float()
+
+         device_type = x.device.type
+         with torch.autocast(device_type=device_type, enabled=False):
+             freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
+             emb = torch.cat((freqs, freqs), dim=-1)
+             cos = emb.cos()
+             sin = emb.sin()
+         return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
+
+
+ def rotate_half(x):
+     """Rotates half the hidden dims of the input."""
+     x1 = x[..., : x.shape[-1] // 2]
+     x2 = x[..., x.shape[-1] // 2 :]
+     return torch.cat((-x2, x1), dim=-1)
+
+
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
+     """Applies Rotary Position Embedding to the query and key tensors."""
+     cos = cos.unsqueeze(unsqueeze_dim)
+     sin = sin.unsqueeze(unsqueeze_dim)
+     q_embed = (q * cos) + (rotate_half(q) * sin)
+     k_embed = (k * cos) + (rotate_half(k) * sin)
+     return q_embed, k_embed
+
+
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+     """Expand KV heads to match query heads for GQA."""
+     batch, num_key_value_heads, slen, head_dim = hidden_states.shape
+     if n_rep == 1:
+         return hidden_states
+     hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
+     return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
+
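`apply_rotary_pos_emb` with `rotate_half` rotates each pair of channels `(i, i + head_dim // 2)` by an angle proportional to the token position, so the dot product between a rotated query and a rotated key depends only on their position offset. A pure-Python sketch for a single head vector, following the `inv_freq` formula above (`rope` and `dot` are illustrative helpers, not part of this file):

```python
import math
from typing import List, Sequence


def rope(vec: Sequence[float], pos: int, base: float = 500000.0) -> List[float]:
    """Rotate channel pairs (i, i + d/2) by pos * base**(-2i/d), the rotate_half layout."""
    d = len(vec)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * c - x2 * s         # first half:  x1 * cos + (-x2) * sin
        out[i + half] = x2 * c + x1 * s  # second half: x2 * cos + x1 * sin
    return out


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.75, -0.5, 1.0]
# Both position pairs are 2 apart, so the two scores should agree.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
```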
+
+ class IQuestLoopCoderMLP(nn.Module):
+     """MLP with SwiGLU activation."""
+
+     def __init__(self, config):
+         super().__init__()
+         self.config = config
+         self.hidden_size = config.hidden_size
+         self.intermediate_size = config.intermediate_size
+         self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
+         self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
+         self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=config.mlp_bias)
+         self.act_fn = ACT2FN[config.hidden_act]
+
+     def forward(self, x):
+         return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
+
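The `forward` above is the gated MLP form `down_proj(act_fn(gate_proj(x)) * up_proj(x))`. A small sketch with plain lists, assuming `config.hidden_act` resolves to SiLU (the value is not visible in this part of the diff) and no biases; `silu`, `matvec` and `gated_mlp` are illustrative names, not part of this file:

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def silu(v: float) -> float:
    """SiLU (swish): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(m: Matrix, x: Sequence[float]) -> List[float]:
    """Multiply a [rows, cols] matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]


def gated_mlp(x: Sequence[float], w_gate: Matrix, w_up: Matrix, w_down: Matrix) -> List[float]:
    """down(silu(gate(x)) * up(x)), without biases (SiLU is an assumption here)."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


# hidden_size = 2, intermediate_size = 3
x = [1.0, -2.0]
w_gate = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
w_up = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
w_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
y = gated_mlp(x, w_gate, w_up, w_down)
```

The first intermediate channel has a gate logit of 0, so SiLU zeroes it out regardless of the `up` value; that is the gating effect in miniature.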
+
+ class LoopGateProjection(nn.Module):
+     """Gate projection for mixed attention in Loop 2+.
+
+     Computes: g = sigmoid(linear(Q)) for each head independently.
+     This gate determines how much to use Loop1's KV (global) vs current loop's KV (local).
+     """
+
+     def __init__(self, num_heads: int, head_dim: int):
+         super().__init__()
+         self.num_heads = num_heads
+         self.head_dim = head_dim
+         # Each head has its own gate: Linear(head_dim -> 1) per head
+         # Implemented as [num_heads, head_dim] weight + [num_heads] bias
+         self.weight = nn.Parameter(torch.zeros(num_heads, head_dim))
+         self.bias = nn.Parameter(torch.zeros(num_heads))
+
+     def forward(self, query: torch.Tensor) -> torch.Tensor:
+         """Compute gate values from query tensor.
+
+         Args:
+             query: [batch, num_heads, seq_len, head_dim]
+
+         Returns:
+             gate: [batch, num_heads, seq_len, 1]
+         """
+         # query: [batch, num_heads, seq_len, head_dim]
+         # weight: [num_heads, head_dim]
+         # For each head h: gate_h = query[:, h, :, :] @ weight[h, :].T + bias[h]
+         # Using einsum: gate = einsum('bhsd,hd->bhs', query, weight) + bias
+         gate_logits = torch.einsum('bhsd,hd->bhs', query, self.weight)  # [batch, num_heads, seq_len]
+         gate_logits = gate_logits + self.bias[None, :, None]  # broadcast bias
+         gate = torch.sigmoid(gate_logits)
+         return gate.unsqueeze(-1)  # [batch, num_heads, seq_len, 1]
+
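`LoopGateProjection` computes one sigmoid gate per head and position from the query. Because `weight` and `bias` are initialized to zeros, every gate starts at `sigmoid(0) = 0.5`. The sketch below reproduces the gate for one position with plain lists; the final blend `g * global + (1 - g) * local` is an assumption for illustration, since the mixing step itself is not in this part of the diff (`head_gates` and `blend` are hypothetical names):

```python
import math
from typing import List, Sequence


def head_gates(query: Sequence[Sequence[float]], weight: Sequence[Sequence[float]],
               bias: Sequence[float]) -> List[float]:
    """One gate per head: sigmoid(dot(query[h], weight[h]) + bias[h])."""
    gates = []
    for q_h, w_h, b_h in zip(query, weight, bias):
        logit = sum(q * w for q, w in zip(q_h, w_h)) + b_h
        gates.append(1.0 / (1.0 + math.exp(-logit)))
    return gates


def blend(gate: float, global_out: Sequence[float], local_out: Sequence[float]) -> List[float]:
    """Assumed convex mix of the two attention outputs (not confirmed by the source)."""
    return [gate * g + (1.0 - gate) * l for g, l in zip(global_out, local_out)]


# Two heads, head_dim = 2, zero-initialized parameters as in __init__.
query = [[0.3, -1.2], [2.0, 0.5]]
gates = head_gates(query, weight=[[0.0, 0.0], [0.0, 0.0]], bias=[0.0, 0.0])
mixed = blend(gates[0], global_out=[1.0, 0.0], local_out=[0.0, 1.0])
```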
+
+ class IQuestLoopCoderAttention(nn.Module):
+     """Multi-head attention with GQA support."""
+
+     def __init__(self, config: IQuestLoopCoderConfig, layer_idx: Optional[int] = None):
+         super().__init__()
+         self.config = config
+         self.layer_idx = layer_idx
+
+         self.hidden_size = config.hidden_size
+         self.num_heads = config.num_attention_heads
+         self.head_dim = config.head_dim
+         self.num_key_value_heads = config.num_key_value_heads
+         self.num_key_value_groups = self.num_heads // self.num_key_value_heads
+         self.max_position_embeddings = config.max_position_embeddings
+         self.rope_theta = config.rope_theta
+         self.attention_dropout = config.attention_dropout
+
+         self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias)
+         self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
+         self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
+         self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias)
+
+         self.rotary_emb = IQuestLoopCoderRotaryEmbedding(
+             self.head_dim,
+             max_position_embeddings=self.max_position_embeddings,
+             base=self.rope_theta,
+         )
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_value: Optional[Cache] = None,
+         output_attentions: bool = False,
+         use_cache: bool = False,
+         cache_position: Optional[torch.LongTensor] = None,
+         **kwargs,
+     ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
+         bsz, q_len, _ = hidden_states.size()
+
+         query_states = self.q_proj(hidden_states)
+         key_states = self.k_proj(hidden_states)
+         value_states = self.v_proj(hidden_states)
+
+         query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+         key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+         value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
+
+         cos, sin = self.rotary_emb(value_states, position_ids)
+         query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
+
+         if past_key_value is not None:
+             cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
+             key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
+
+         # Repeat KV for GQA
+         key_states = repeat_kv(key_states, self.num_key_value_groups)
+         value_states = repeat_kv(value_states, self.num_key_value_groups)
+
+         attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
+
+         if attention_mask is not None:
+             causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
+             attn_weights = attn_weights + causal_mask
+
+         attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
+         attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
+         attn_output = torch.matmul(attn_weights, value_states)
+
+         attn_output = attn_output.transpose(1, 2).contiguous()
+         attn_output = attn_output.reshape(bsz, q_len, -1)
+         attn_output = self.o_proj(attn_output)
+
+         return attn_output, attn_weights if output_attentions else None, past_key_value
+
+     def forward_with_external_kv(
+         self,
+         hidden_states: torch.Tensor,
+         external_key: torch.Tensor,
+         external_value: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         sliding_window: Optional[int] = None,
+     ) -> torch.Tensor:
+         """Forward pass using external K, V (for Loop 2+ mixed attention).
+
+         Args:
+             hidden_states: Input for computing Q
+             external_key: Pre-computed K (already with RoPE applied)
+             external_value: Pre-computed V
+             attention_mask: Causal attention mask
+             position_ids: Position IDs
+             sliding_window: If set, apply sliding window attention
+
+         Returns:
+             Attention output [batch, seq_len, num_heads, head_dim]
+         """
+         bsz, q_len, _ = hidden_states.size()
+
+         # Compute Q from current hidden states
+         query_states = self.q_proj(hidden_states)
+         query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
+
+         # Apply RoPE to Q
+         cos, sin = self.rotary_emb(query_states, position_ids)
+         query_states = (query_states * cos.unsqueeze(1)) + (rotate_half(query_states) * sin.unsqueeze(1))
+
+         # Use external K, V (already have RoPE for K)
+         key_states = external_key
+         value_states = external_value
+
+         # Repeat KV for GQA
+         key_states = repeat_kv(key_states, self.num_key_value_groups)
+         value_states = repeat_kv(value_states, self.num_key_value_groups)
+
+         # Compute attention
+         attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
+
+         # Apply attention mask (causal)
+         if attention_mask is not None:
+             causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
+             attn_weights = attn_weights + causal_mask
+
+         # Apply sliding window mask if needed
+         if sliding_window is not None and q_len > sliding_window:
+             # Create sliding window mask
+             # For each position i, can only attend to [i-window+1, i]
+             seq_len = key_states.shape[2]
+             row_idx = torch.arange(q_len, device=query_states.device).unsqueeze(1)
+             col_idx = torch.arange(seq_len, device=query_states.device).unsqueeze(0)
+             window_mask = (col_idx > row_idx) | (col_idx < row_idx - sliding_window + 1)
+             window_mask = window_mask.unsqueeze(0).unsqueeze(0)  # [1, 1, q_len, seq_len]
+             attn_weights = attn_weights.masked_fill(window_mask, float('-inf'))
+
+         attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
+         attn_output = torch.matmul(attn_weights, value_states)
+
+         # Don't apply o_proj here - return raw attention output
+         attn_output = attn_output.transpose(1, 2).contiguous()
+         return attn_output  # [batch, seq_len, num_heads, head_dim]
+
500
+ def get_qkv(
501
+ self,
502
+ hidden_states: torch.Tensor,
503
+ position_ids: Optional[torch.LongTensor] = None,
504
+ ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
505
+ """Get Q, K, V tensors with RoPE applied.
506
+
507
+ Returns:
508
+ query: [batch, num_heads, seq_len, head_dim]
509
+ key: [batch, num_kv_heads, seq_len, head_dim]
510
+ value: [batch, num_kv_heads, seq_len, head_dim]
511
+ """
512
+ bsz, q_len, _ = hidden_states.size()
513
+
514
+ query_states = self.q_proj(hidden_states)
515
+ key_states = self.k_proj(hidden_states)
516
+ value_states = self.v_proj(hidden_states)
517
+
518
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
519
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
520
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
521
+
522
+ cos, sin = self.rotary_emb(value_states, position_ids)
523
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
524
+
525
+ return query_states, key_states, value_states
526
+
527
+ def forward_decode_loop1(
528
+ self,
529
+ hidden_states: torch.Tensor,
530
+ past_shared_key: Optional[torch.Tensor],
531
+ past_shared_value: Optional[torch.Tensor],
532
+ attention_mask: Optional[torch.Tensor] = None,
533
+ position_ids: Optional[torch.LongTensor] = None,
534
+ cache_position: Optional[torch.LongTensor] = None,
535
+ ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
536
+ """Forward pass for Loop 1 in decode stage.
537
+
538
+ Args:
539
+ hidden_states: Current hidden states [batch, 1, hidden_size]
540
+ past_shared_key: Past shared keys from cache [batch, num_kv_heads, past_len, head_dim]
541
+ past_shared_value: Past shared values from cache [batch, num_kv_heads, past_len, head_dim]
542
+ attention_mask: Causal attention mask
543
+ position_ids: Position IDs
544
+ cache_position: Cache position
545
+
546
+ Returns:
547
+ output: Attention output [batch, 1, hidden_size]
548
+ k1: Current key [batch, num_kv_heads, 1, head_dim] (only current token)
549
+ v1: Current value [batch, num_kv_heads, 1, head_dim] (only current token)
550
+ """
551
+ bsz, q_len, _ = hidden_states.size()
552
+
553
+ query_states = self.q_proj(hidden_states)
554
+ key_states = self.k_proj(hidden_states)
555
+ value_states = self.v_proj(hidden_states)
556
+
557
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
558
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
559
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
560
+
561
+ cos, sin = self.rotary_emb(value_states, position_ids)
562
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
563
+
564
+ # Store current token's k1, v1 for return (before concatenation)
565
+ k1_current = key_states # [batch, num_kv_heads, 1, head_dim]
566
+ v1_current = value_states # [batch, num_kv_heads, 1, head_dim]
567
+
568
+ # Concatenate with past shared KV cache for attention computation
569
+ if past_shared_key is not None and past_shared_value is not None:
570
+ key_states = torch.cat([past_shared_key, key_states], dim=2)
571
+ value_states = torch.cat([past_shared_value, value_states], dim=2)
572
+
573
+ # Repeat KV for GQA
574
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
575
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
576
+
577
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
578
+
579
+ if attention_mask is not None:
580
+ causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
581
+ attn_weights = attn_weights + causal_mask
582
+
583
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
584
+ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
585
+ attn_output = torch.matmul(attn_weights, value_states)
586
+
587
+ attn_output = attn_output.transpose(1, 2).contiguous()
588
+ attn_output = attn_output.reshape(bsz, q_len, -1)
589
+ attn_output = self.o_proj(attn_output)
590
+
591
+ return attn_output, k1_current, v1_current
592
+
593
+ def forward_decode_loop2(
594
+ self,
595
+ hidden_states: torch.Tensor,
596
+ k1: torch.Tensor,
597
+ v1: torch.Tensor,
598
+ past_shared_key: Optional[torch.Tensor],
599
+ past_shared_value: Optional[torch.Tensor],
600
+ past_local_key: Optional[torch.Tensor],
601
+ past_local_value: Optional[torch.Tensor],
602
+ gate_proj: LoopGateProjection,
603
+ attention_mask: Optional[torch.Tensor] = None,
604
+ position_ids: Optional[torch.LongTensor] = None,
605
+ loop_window_size: int = 64,
606
+ ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
607
+ """Forward pass for Loop 2 in decode stage with mixed attention.
608
+
609
+ Args:
610
+ hidden_states: Current hidden states [batch, 1, hidden_size]
611
+ k1: Key from Loop 1 (current token) [batch, num_kv_heads, 1, head_dim]
612
+ v1: Value from Loop 1 (current token) [batch, num_kv_heads, 1, head_dim]
613
+ past_shared_key: Past shared keys from cache [batch, num_kv_heads, past_len, head_dim]
614
+ past_shared_value: Past shared values from cache [batch, num_kv_heads, past_len, head_dim]
615
+ past_local_key: Past local keys from cache [batch, num_kv_heads, window_len, head_dim]
616
+ past_local_value: Past local values from cache [batch, num_kv_heads, window_len, head_dim]
617
+ gate_proj: Gate projection module
618
+ attention_mask: Causal attention mask
619
+ position_ids: Position IDs
620
+ loop_window_size: Window size for sliding window attention
621
+
622
+ Returns:
623
+ output: Attention output [batch, 1, hidden_size]
624
+ k2: Current key [batch, num_kv_heads, 1, head_dim]
625
+ v2: Current value [batch, num_kv_heads, 1, head_dim]
626
+ """
627
+ bsz, q_len, _ = hidden_states.size()
628
+
629
+ # Get Q2, K2, V2 for current loop
630
+ q2, k2, v2 = self.get_qkv(hidden_states, position_ids)
631
+
632
+ # Compute gate: g = sigmoid(linear(Q2))
633
+ gate = gate_proj(q2) # [batch, num_heads, 1, 1]
634
+
635
+ # For attention A: concatenate past shared KV with current k1, v1 (full global context)
636
+ if past_shared_key is not None and past_shared_value is not None:
637
+ k1_full = torch.cat([past_shared_key, k1], dim=2)
638
+ v1_full = torch.cat([past_shared_value, v1], dim=2)
639
+ else:
640
+ k1_full = k1
641
+ v1_full = v1
642
+
643
+ # For attention B: concatenate past local KV with current k2, v2 (sliding window)
644
+ if past_local_key is not None and past_local_value is not None:
645
+ k2_full = torch.cat([past_local_key, k2], dim=2)
646
+ v2_full = torch.cat([past_local_value, v2], dim=2)
647
+ else:
648
+ k2_full = k2
649
+ v2_full = v2
650
+
651
+ # Repeat KV for GQA
652
+ k1_expanded = repeat_kv(k1_full, self.num_key_value_groups)
653
+ v1_expanded = repeat_kv(v1_full, self.num_key_value_groups)
654
+ k2_expanded = repeat_kv(k2_full, self.num_key_value_groups)
655
+ v2_expanded = repeat_kv(v2_full, self.num_key_value_groups)
656
+
657
+ # Attention A: Q2 @ K1_full, V1_full (global, full sequence)
658
+ head_dim = q2.shape[-1]
659
+ attn_weights_A = torch.matmul(q2, k1_expanded.transpose(2, 3)) / math.sqrt(head_dim)
660
+ if attention_mask is not None:
661
+ causal_mask = attention_mask[:, :, :, : k1_expanded.shape[-2]]
662
+ attn_weights_A = attn_weights_A + causal_mask
663
+ attn_weights_A = nn.functional.softmax(attn_weights_A, dim=-1, dtype=torch.float32).to(q2.dtype)
664
+ attn_A = torch.matmul(attn_weights_A, v1_expanded)
665
+
666
+ # Attention B: Q2 @ K2_full, V2_full (local sliding window)
667
+ attn_weights_B = torch.matmul(q2, k2_expanded.transpose(2, 3)) / math.sqrt(head_dim)
668
+ if attention_mask is not None:
669
+ causal_mask = attention_mask[:, :, :, : k2_expanded.shape[-2]]
670
+ attn_weights_B = attn_weights_B + causal_mask
671
+
672
+ # Apply sliding window mask
673
+ q_len_attn = q2.shape[2]
674
+ k_len_attn = k2_expanded.shape[2]
675
+ if q_len_attn <= loop_window_size:
676
+ # If sequence fits in window, use standard attention
677
+ attn_weights_B = nn.functional.softmax(attn_weights_B, dim=-1, dtype=torch.float32).to(q2.dtype)
678
+ else:
679
+ # Apply sliding window mask
680
+ row_idx = torch.arange(q_len_attn, device=q2.device).unsqueeze(1)
681
+ col_idx = torch.arange(k_len_attn, device=q2.device).unsqueeze(0)
682
+ window_mask = (col_idx > row_idx) | (col_idx < row_idx - loop_window_size + 1)
683
+ window_mask = window_mask.unsqueeze(0).unsqueeze(0)
684
+ attn_weights_B = attn_weights_B.masked_fill(window_mask, float('-inf'))
685
+ attn_weights_B = nn.functional.softmax(attn_weights_B, dim=-1, dtype=torch.float32).to(q2.dtype)
686
+ attn_B = torch.matmul(attn_weights_B, v2_expanded)
687
+
688
+ # Mixed attention: gate * A + (1 - gate) * B
689
+ mixed_attn = gate * attn_A + (1 - gate) * attn_B
690
+
691
+ # Reshape and apply output projection
692
+ bsz, num_heads, seq_len, head_dim = mixed_attn.shape
693
+ mixed_attn = mixed_attn.transpose(1, 2).contiguous().reshape(bsz, seq_len, -1)
694
+ attn_output = self.o_proj(mixed_attn)
695
+
696
+ return attn_output, k2, v2
697
+
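Note on the gated mix used by `forward_decode_loop2`: the line `mixed_attn = gate * attn_A + (1 - gate) * attn_B` blends the global and local attention outputs per head. The sketch below illustrates only that arithmetic in plain Python, assuming one gate logit per head and taking `g = sigmoid(logit)` from the code comment. The helper names `sigmoid` and `mix_heads` are hypothetical and do not exist in this file, and the internals of `LoopGateProjection` are not reproduced.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, mapping a gate logit to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mix_heads(gate_logits, attn_a, attn_b):
    """Blend two per-head attention outputs with one gate value per head.

    gate_logits: one float per head (hypothetical stand-in for the gate projection output)
    attn_a, attn_b: one output vector (list of floats) per head
    """
    mixed = []
    for logit, a, b in zip(gate_logits, attn_a, attn_b):
        g = sigmoid(logit)
        # Same form as the source: gate * A + (1 - gate) * B
        mixed.append([g * x + (1.0 - g) * y for x, y in zip(a, b)])
    return mixed


# Two heads, head_dim = 2: global output is all ones, local output is all zeros.
attn_a = [[1.0, 1.0], [1.0, 1.0]]
attn_b = [[0.0, 0.0], [0.0, 0.0]]
mixed = mix_heads([0.0, 20.0], attn_a, attn_b)
print(mixed)
```

A logit of 0 weights both branches equally, while a large positive logit lets the global branch (Loop 1 keys and values) dominate for that head.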
698
+
699
+ class IQuestLoopCoderDecoderLayer(nn.Module):
700
+ """Transformer decoder layer with a standard forward and a Loop 2+ mixed-attention forward."""
701
+
702
+ def __init__(self, config: IQuestLoopCoderConfig, layer_idx: int):
703
+ super().__init__()
704
+ self.hidden_size = config.hidden_size
705
+ self.self_attn = IQuestLoopCoderAttention(config=config, layer_idx=layer_idx)
706
+ self.mlp = IQuestLoopCoderMLP(config)
707
+ self.input_layernorm = IQuestLoopCoderRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
708
+ self.post_attention_layernorm = IQuestLoopCoderRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
709
+
710
+ def forward(
711
+ self,
712
+ hidden_states: torch.Tensor,
713
+ attention_mask: Optional[torch.Tensor] = None,
714
+ position_ids: Optional[torch.LongTensor] = None,
715
+ past_key_value: Optional[Cache] = None,
716
+ output_attentions: Optional[bool] = False,
717
+ use_cache: Optional[bool] = False,
718
+ cache_position: Optional[torch.LongTensor] = None,
719
+ **kwargs,
720
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
721
+ residual = hidden_states
722
+ hidden_states = self.input_layernorm(hidden_states)
723
+
724
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
725
+ hidden_states=hidden_states,
726
+ attention_mask=attention_mask,
727
+ position_ids=position_ids,
728
+ past_key_value=past_key_value,
729
+ output_attentions=output_attentions,
730
+ use_cache=use_cache,
731
+ cache_position=cache_position,
732
+ **kwargs,
733
+ )
734
+ hidden_states = residual + hidden_states
735
+
736
+ residual = hidden_states
737
+ hidden_states = self.post_attention_layernorm(hidden_states)
738
+ hidden_states = self.mlp(hidden_states)
739
+ hidden_states = residual + hidden_states
740
+
741
+ outputs = (hidden_states,)
742
+ if output_attentions:
743
+ outputs += (self_attn_weights,)
744
+ if use_cache:
745
+ outputs += (present_key_value,)
746
+ return outputs
747
+
748
+ def forward_loop2_mixed(
749
+ self,
750
+ hidden_states: torch.Tensor,
751
+ k1: torch.Tensor,
752
+ v1: torch.Tensor,
753
+ gate_proj: LoopGateProjection,
754
+ attention_mask: Optional[torch.Tensor] = None,
755
+ position_ids: Optional[torch.LongTensor] = None,
756
+ loop_window_size: int = 64,
757
+ ) -> Tuple[torch.Tensor, float]:
758
+ """Forward pass for Loop 2+ with mixed attention.
759
+
760
+ Args:
761
+ hidden_states: Current hidden states
762
+ k1: Key from Loop 1 [batch, num_kv_heads, seq_len, head_dim]
763
+ v1: Value from Loop 1 [batch, num_kv_heads, seq_len, head_dim]
764
+ gate_proj: Gate projection module for this layer
765
+ attention_mask: Causal attention mask
766
+ position_ids: Position IDs
767
+ loop_window_size: Window size for sliding window attention
768
+
769
+ Returns:
770
+ output hidden states, gate mean value
771
+ """
772
+ residual = hidden_states
773
+ hidden_states_normed = self.input_layernorm(hidden_states)
774
+
775
+ # Get Q2, K2, V2 for current loop
776
+ q2, k2, v2 = self.self_attn.get_qkv(hidden_states_normed, position_ids)
777
+
778
+ # Compute gate: g = sigmoid(linear(Q2))
779
+ # q2: [batch, num_heads, seq_len, head_dim]
780
+ gate = gate_proj(q2) # [batch, num_heads, seq_len, 1]
781
+ gate_mean = gate.detach().mean().item()
782
+
783
+ # Repeat K1, V1 for GQA
784
+ k1_expanded = repeat_kv(k1, self.self_attn.num_key_value_groups)
785
+ v1_expanded = repeat_kv(v1, self.self_attn.num_key_value_groups)
786
+ k2_expanded = repeat_kv(k2, self.self_attn.num_key_value_groups)
787
+ v2_expanded = repeat_kv(v2, self.self_attn.num_key_value_groups)
788
+
789
+ # Attention A: Q2 @ K1, V1 (global, full sequence)
790
+ attn_A = self._compute_attention(q2, k1_expanded, v1_expanded, attention_mask)
791
+
792
+ # Attention B: Q2 @ K2, V2 (local sliding window)
793
+ attn_B = self._compute_attention_with_window(q2, k2_expanded, v2_expanded, attention_mask, loop_window_size)
794
+
795
+ # Mixed attention: gate * A + (1 - gate) * B
796
+ # attn_A, attn_B: [batch, num_heads, seq_len, head_dim]
797
+ mixed_attn = gate * attn_A + (1 - gate) * attn_B
798
+
799
+ # Reshape and apply output projection
800
+ bsz, num_heads, seq_len, head_dim = mixed_attn.shape
801
+ mixed_attn = mixed_attn.transpose(1, 2).contiguous().reshape(bsz, seq_len, -1)
802
+ hidden_states = self.self_attn.o_proj(mixed_attn)
803
+
804
+ hidden_states = residual + hidden_states
805
+
806
+ # MLP
807
+ residual = hidden_states
808
+ hidden_states = self.post_attention_layernorm(hidden_states)
809
+ hidden_states = self.mlp(hidden_states)
810
+ hidden_states = residual + hidden_states
811
+
812
+ return hidden_states, gate_mean
813
+
814
+ def _compute_attention(
815
+ self,
816
+ query: torch.Tensor,
817
+ key: torch.Tensor,
818
+ value: torch.Tensor,
819
+ attention_mask: Optional[torch.Tensor],
820
+ ) -> torch.Tensor:
821
+ """Standard attention computation."""
822
+ head_dim = query.shape[-1]
823
+ attn_weights = torch.matmul(query, key.transpose(2, 3)) / math.sqrt(head_dim)
824
+
825
+ if attention_mask is not None:
826
+ causal_mask = attention_mask[:, :, :, : key.shape[-2]]
827
+ attn_weights = attn_weights + causal_mask
828
+
829
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
830
+ attn_output = torch.matmul(attn_weights, value)
831
+ return attn_output
832
+
833
+ def _compute_attention_with_window(
834
+ self,
835
+ query: torch.Tensor,
836
+ key: torch.Tensor,
837
+ value: torch.Tensor,
838
+ attention_mask: Optional[torch.Tensor],
839
+ window_size: int,
840
+ ) -> torch.Tensor:
841
+ """Attention with sliding window."""
842
+ q_len = query.shape[2]
843
+ k_len = key.shape[2]
844
+ head_dim = query.shape[-1]
845
+
846
+ # If sequence fits in window, use standard attention
847
+ if q_len <= window_size:
848
+ return self._compute_attention(query, key, value, attention_mask)
849
+
850
+ attn_weights = torch.matmul(query, key.transpose(2, 3)) / math.sqrt(head_dim)
851
+
852
+ # Apply causal mask
853
+ if attention_mask is not None:
854
+ causal_mask = attention_mask[:, :, :, : key.shape[-2]]
855
+ attn_weights = attn_weights + causal_mask
856
+
857
+ # Apply sliding window mask
858
+ row_idx = torch.arange(q_len, device=query.device).unsqueeze(1)
859
+ col_idx = torch.arange(k_len, device=query.device).unsqueeze(0)
860
+ # Can only attend to positions in [i - window_size + 1, i]
861
+ window_mask = (col_idx > row_idx) | (col_idx < row_idx - window_size + 1)
862
+ window_mask = window_mask.unsqueeze(0).unsqueeze(0)
863
+ attn_weights = attn_weights.masked_fill(window_mask, float('-inf'))
864
+
865
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
866
+ attn_output = torch.matmul(attn_weights, value)
867
+ return attn_output
868
+
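Note on the sliding-window mask in `_compute_attention_with_window`: the expression `(col_idx > row_idx) | (col_idx < row_idx - window_size + 1)` marks every key that query `i` may not attend to, leaving keys `[i - window_size + 1, i]` visible. The sketch below reproduces that predicate with plain Python lists rather than tensors, under the assumption that query index `i` and key index `i` refer to the same token (as in prefill, where `q_len == k_len`). The function name `sliding_window_mask` is hypothetical.

```python
def sliding_window_mask(q_len: int, k_len: int, window_size: int):
    """Return mask[i][j] == True where query i must NOT attend to key j."""
    return [
        [(j > i) or (j < i - window_size + 1) for j in range(k_len)]
        for i in range(q_len)
    ]


mask = sliding_window_mask(q_len=5, k_len=5, window_size=2)

# Number of keys each query can still see after masking.
visible_counts = [row.count(False) for row in mask]

for row in mask:
    # "x" = masked out, "." = visible
    print("".join("x" if blocked else "." for blocked in row))
```

With a window of 2, every query sees itself and the one preceding token; the first query sees only itself because there is no earlier key.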
869
+
870
+ class IQuestLoopCoderPreTrainedModel(PreTrainedModel):
871
+ """Base class for IQuestLoopCoder models."""
872
+ config_class = IQuestLoopCoderConfig
873
+ base_model_prefix = "model"
874
+ supports_gradient_checkpointing = True
875
+ _no_split_modules = ["IQuestLoopCoderDecoderLayer"]
876
+ _skip_keys_device_placement = ["past_key_values"]
877
+ _supports_cache_class = True
878
+ _supports_static_cache = True
879
+
880
+ def _init_weights(self, module):
881
+ std = self.config.initializer_range
882
+ if isinstance(module, nn.Linear):
883
+ module.weight.data.normal_(mean=0.0, std=std)
884
+ if module.bias is not None:
885
+ module.bias.data.zero_()
886
+ elif isinstance(module, nn.Embedding):
887
+ module.weight.data.normal_(mean=0.0, std=std)
888
+ if module.padding_idx is not None:
889
+ module.weight.data[module.padding_idx].zero_()
890
+
891
+
892
+ class IQuestLoopCoderModel(IQuestLoopCoderPreTrainedModel):
893
+ """IQuestLoopCoder Transformer decoder model."""
894
+
895
+ def __init__(self, config: IQuestLoopCoderConfig):
896
+ super().__init__(config)
897
+ self.padding_idx = config.pad_token_id
898
+ self.vocab_size = config.vocab_size
899
+
900
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
901
+ self.layers = nn.ModuleList([
902
+ IQuestLoopCoderDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)
903
+ ])
904
+ self.norm = IQuestLoopCoderRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
905
+
906
+ # Gate projections for Loop 2+ (one per layer)
907
+ self.gate_projections = nn.ModuleList([
908
+ LoopGateProjection(config.num_attention_heads, config.head_dim)
909
+ for _ in range(config.num_hidden_layers)
910
+ ])
911
+
912
+ # Loop configuration
913
+ self.loop_num = config.loop_num
914
+ self.loop_window_size = config.loop_window_size
915
+
916
+ self.gradient_checkpointing = False
917
+ self.post_init()
918
+
919
+ def get_input_embeddings(self):
920
+ return self.embed_tokens
921
+
922
+ def set_input_embeddings(self, value):
923
+ self.embed_tokens = value
924
+
925
+ def forward(
926
+ self,
927
+ input_ids: Optional[torch.LongTensor] = None,
928
+ attention_mask: Optional[torch.Tensor] = None,
929
+ position_ids: Optional[torch.LongTensor] = None,
930
+ past_key_values: Optional[Cache] = None,
931
+ inputs_embeds: Optional[torch.FloatTensor] = None,
932
+ use_cache: Optional[bool] = None,
933
+ output_attentions: Optional[bool] = None,
934
+ output_hidden_states: Optional[bool] = None,
935
+ return_dict: Optional[bool] = None,
936
+ cache_position: Optional[torch.LongTensor] = None,
937
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
938
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
939
+ output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
940
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
941
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
942
+
943
+ if inputs_embeds is None:
944
+ inputs_embeds = self.embed_tokens(input_ids)
945
+
946
+ seq_length = inputs_embeds.shape[1]
947
+
948
+ # Determine which forward path to use:
949
+ # 1. If past_key_values exists and seq_length == 1: autoregressive generation step
950
+ # -> Use _forward_with_cache: the loops still run, but against the cached K, V
951
+ # 2. Otherwise (prefill or training): use loop mechanism
952
+
953
+ is_generation_step = past_key_values is not None and seq_length == 1
954
+
955
+ if is_generation_step:
956
+ # Autoregressive generation: single token, use KV cache
957
+ return self._forward_with_cache(
958
+ inputs_embeds=inputs_embeds,
959
+ attention_mask=attention_mask,
960
+ position_ids=position_ids,
961
+ past_key_values=past_key_values,
962
+ use_cache=use_cache,
963
+ output_attentions=output_attentions,
964
+ output_hidden_states=output_hidden_states,
965
+ return_dict=return_dict,
966
+ cache_position=cache_position,
967
+ )
968
+
969
+ # Prefill or training: use loop mechanism
970
+ return self._forward_loop(
971
+ inputs_embeds=inputs_embeds,
972
+ attention_mask=attention_mask,
973
+ position_ids=position_ids,
974
+ output_attentions=output_attentions,
975
+ output_hidden_states=output_hidden_states,
976
+ return_dict=return_dict,
977
+ use_cache=use_cache,
978
+ cache_position=cache_position,
979
+ )
980
+
981
+ def _forward_loop(
982
+ self,
983
+ inputs_embeds: torch.Tensor,
984
+ attention_mask: Optional[torch.Tensor],
985
+ position_ids: Optional[torch.LongTensor],
986
+ output_attentions: bool,
987
+ output_hidden_states: bool,
988
+ return_dict: bool,
989
+ use_cache: bool = False,
990
+ cache_position: Optional[torch.LongTensor] = None,
991
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
992
+ """Forward with loop mechanism (for training and prefill).
993
+
994
+ This implements the Loop mechanism:
995
+ - Loop 1: Standard attention, stores K1, V1 for each layer
996
+ - Loop 2+: Mixed attention with gated combination of global (K1,V1) and local (K2,V2)
997
+ """
998
+ batch_size, seq_length, _ = inputs_embeds.shape
999
+
1000
+ if position_ids is None:
1001
+ device = inputs_embeds.device
1002
+ position_ids = torch.arange(seq_length, dtype=torch.long, device=device).unsqueeze(0)
1003
+
1004
+ if cache_position is None:
1005
+ cache_position = torch.arange(seq_length, device=inputs_embeds.device)
1006
+
1007
+ # Create causal mask
1008
+ causal_mask = self._update_causal_mask(attention_mask, inputs_embeds, cache_position, None, output_attentions)
1009
+
1010
+ hidden_states = inputs_embeds
1011
+ all_hidden_states = () if output_hidden_states else None
1012
+ all_self_attns = () if output_attentions else None
1013
+
1014
+ # For KV cache during prefill - use IQuestLoopCoderCache
1015
+ # In prefill, past_key_values should be None, so we create a new cache
1016
+ if use_cache:
1017
+ next_decoder_cache = IQuestLoopCoderCache(self.loop_window_size, len(self.layers))
1018
+ else:
1019
+ next_decoder_cache = None
1020
+
1021
+ # ============ Loop 1: Standard forward, store K1, V1 in shared cache ============
1022
+ for layer_idx, decoder_layer in enumerate(self.layers):
1023
+ if output_hidden_states:
1024
+ all_hidden_states += (hidden_states,)
1025
+
1026
+ # Get K1, V1 before standard forward (from original hidden_states, after layernorm)
1027
+ hidden_states_normed = decoder_layer.input_layernorm(hidden_states)
1028
+ q1, k1, v1 = decoder_layer.self_attn.get_qkv(hidden_states_normed, position_ids)
1029
+
1030
+ # Store K1, V1 in shared cache
1031
+ if use_cache:
1032
+ next_decoder_cache.update_shared(k1, v1, layer_idx)
1033
+
1034
+ # Standard forward
1035
+ layer_outputs = decoder_layer(
1036
+ hidden_states,
1037
+ attention_mask=causal_mask,
1038
+ position_ids=position_ids,
1039
+ past_key_value=None,
1040
+ output_attentions=output_attentions,
1041
+ use_cache=False,
1042
+ )
1043
+ hidden_states = layer_outputs[0]
1044
+
1045
+ if output_attentions:
1046
+ all_self_attns += (layer_outputs[1],)
1047
+
1048
+ # ============ Loop 2 to loop_num: Mixed attention, store in local cache ============
1049
+ for loop_idx in range(2, self.loop_num + 1):
1050
+ for layer_idx, decoder_layer in enumerate(self.layers):
1051
+ # Get K1, V1 from shared cache
1052
+ k1, v1 = next_decoder_cache.get_shared(layer_idx) if use_cache else (None, None)
1053
+ if k1 is None or v1 is None:
1054
+ # Fallback when use_cache is False (e.g. training): recompute K, V from the current hidden states
1055
+ hidden_states_normed = decoder_layer.input_layernorm(hidden_states)
1056
+ _, k1, v1 = decoder_layer.self_attn.get_qkv(hidden_states_normed, position_ids)
1057
+
1058
+ gate_proj = self.gate_projections[layer_idx]
1059
+
1060
+ hidden_states, gate_mean = decoder_layer.forward_loop2_mixed(
1061
+ hidden_states,
1062
+ k1=k1,
1063
+ v1=v1,
1064
+ gate_proj=gate_proj,
1065
+ attention_mask=causal_mask,
1066
+ position_ids=position_ids,
1067
+ loop_window_size=self.loop_window_size,
1068
+ )
1069
+
1070
+ # Store Loop 2+ KV in local cache (only for loop_idx == 2)
1071
+ if use_cache and loop_idx == 2:
1072
+ hidden_states_normed = decoder_layer.input_layernorm(hidden_states)
1073
+ _, k2, v2 = decoder_layer.self_attn.get_qkv(hidden_states_normed, position_ids)
1074
+ next_decoder_cache.update_local(k2, v2, layer_idx)
1075
+
1076
+ hidden_states = self.norm(hidden_states)
1077
+
1078
+ if output_hidden_states:
1079
+ all_hidden_states += (hidden_states,)
1080
+
1081
+ if not return_dict:
1082
+ return tuple(v for v in [hidden_states, next_decoder_cache, all_hidden_states, all_self_attns] if v is not None)
1083
+
1084
+ return BaseModelOutputWithPast(
1085
+ last_hidden_state=hidden_states,
1086
+ past_key_values=next_decoder_cache,
1087
+ hidden_states=all_hidden_states,
1088
+ attentions=all_self_attns,
1089
+ )
1090
+
1091
+ def _forward_with_cache(
1092
+ self,
1093
+ inputs_embeds: torch.Tensor,
1094
+ attention_mask: Optional[torch.Tensor],
1095
+ position_ids: Optional[torch.LongTensor],
1096
+ past_key_values: Optional[Cache],
1097
+ use_cache: bool,
1098
+ output_attentions: bool,
1099
+ output_hidden_states: bool,
1100
+ return_dict: bool,
1101
+ cache_position: Optional[torch.LongTensor],
1102
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
1103
+ """Forward with KV cache using loop mechanism (for inference generation).
1104
+
1105
+ Loop 1: Standard attention, uses shared KV cache (previous tokens + current token)
1106
+ Loop 2+: Mixed attention, gating global attention over the shared KV cache with local attention over the sliding-window KV cache
1107
+ """
1108
+ batch_size, seq_length, _ = inputs_embeds.shape
1109
+
1110
+ if cache_position is None:
1111
+ past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
1112
+ cache_position = torch.arange(past_seen_tokens, past_seen_tokens + seq_length, device=inputs_embeds.device)
1113
+
1114
+ if position_ids is None:
1115
+ position_ids = cache_position.unsqueeze(0)
1116
+
1117
+ causal_mask = self._update_causal_mask(attention_mask, inputs_embeds, cache_position, past_key_values, output_attentions)
1118
+
1119
+ # Ensure we're using IQuestLoopCoderCache
1120
+ if use_cache:
1121
+ if not isinstance(past_key_values, IQuestLoopCoderCache):
1122
+ # Convert to IQuestLoopCoderCache if needed
1123
+ next_decoder_cache = IQuestLoopCoderCache(self.loop_window_size, len(self.layers))
1124
+ # Copy existing cache if possible
1125
+ if past_key_values is not None:
1126
+ for layer_idx in range(len(self.layers)):
1127
+ try:
1128
+ past_k = past_key_values.key_cache[layer_idx] if hasattr(past_key_values, 'key_cache') else None
1129
+ past_v = past_key_values.value_cache[layer_idx] if hasattr(past_key_values, 'value_cache') else None
1130
+ if past_k is not None and past_v is not None:
1131
+ next_decoder_cache.update_shared(past_k, past_v, layer_idx)
1132
+ except Exception:
1133
+ pass
1134
+ else:
1135
+ next_decoder_cache = past_key_values
1136
+ else:
1137
+ next_decoder_cache = None
1138
+
1139
+ hidden_states = inputs_embeds
1140
+ all_hidden_states = () if output_hidden_states else None
1141
+ all_self_attns = () if output_attentions else None
1142
+
1143
+ # ============ Loop 1: Standard attention, store in shared cache ============
1144
+ for layer_idx, decoder_layer in enumerate(self.layers):
1145
+ if output_hidden_states:
1146
+ all_hidden_states += (hidden_states,)
1147
+
1148
+ # Get past shared KV cache
1149
+ past_shared_key, past_shared_value = None, None
1150
+ if next_decoder_cache is not None:
1151
+ past_shared_key, past_shared_value = next_decoder_cache.get_shared(layer_idx)
1152
+
1153
+ # Forward Loop 1
1154
+ attn_output, k1, v1 = decoder_layer.self_attn.forward_decode_loop1(
1155
+ hidden_states=decoder_layer.input_layernorm(hidden_states),
1156
+ past_shared_key=past_shared_key,
1157
+ past_shared_value=past_shared_value,
1158
+ attention_mask=causal_mask,
1159
+ position_ids=position_ids,
1160
+ cache_position=cache_position,
1161
+ )
1162
+
1163
+ # Update shared cache with current token's Loop 1 KV
1164
+ if use_cache:
1165
+ next_decoder_cache.update_shared(k1, v1, layer_idx)
1166
+
1167
+ hidden_states = hidden_states + attn_output
1168
+
1169
+ # MLP
1170
+ residual = hidden_states
1171
+ hidden_states = decoder_layer.post_attention_layernorm(hidden_states)
1172
+ hidden_states = decoder_layer.mlp(hidden_states)
1173
+ hidden_states = residual + hidden_states
1174
+
1175
+ if output_attentions:
1176
+ all_self_attns += (None,) # We don't return attention weights in decode loop
1177
+
1178
+ # ============ Loop 2 to loop_num: Mixed attention, store in local cache ============
1179
+ # Store k1, v1 from Loop 1 for use in Loop 2+
1180
+ loop1_kv = []
1181
+ for layer_idx in range(len(self.layers)):
1182
+ if next_decoder_cache is not None:
1183
+ k1_full, v1_full = next_decoder_cache.get_shared(layer_idx)
1184
+ if k1_full is not None and v1_full is not None:
1185
+ # Get only the last token (current token)
1186
+ loop1_kv.append((k1_full[:, :, -1:, :], v1_full[:, :, -1:, :], k1_full, v1_full))
1187
+ else:
1188
+ loop1_kv.append((None, None, None, None))
1189
+ else:
1190
+ loop1_kv.append((None, None, None, None))
1191
+
1192
+ for loop_idx in range(2, self.loop_num + 1):
1193
+ for layer_idx, decoder_layer in enumerate(self.layers):
1194
+ # Get k1, v1 (current token's Loop 1 KV) and full shared cache
1195
+ k1_current, v1_current, k1_full, v1_full = loop1_kv[layer_idx]
1196
+ if k1_current is None or v1_current is None:
1197
+ continue
1198
+
1199
+ # Get past local KV cache
1200
+ past_local_key, past_local_value = None, None
1201
+ if next_decoder_cache is not None:
1202
+ past_local_key, past_local_value = next_decoder_cache.get_local(layer_idx)
1203
+
1204
+ gate_proj = self.gate_projections[layer_idx]
1205
+
1206
+ # Forward Loop 2+
1207
+ attn_output, k2, v2 = decoder_layer.self_attn.forward_decode_loop2(
1208
+ hidden_states=decoder_layer.input_layernorm(hidden_states),
1209
+ k1=k1_current,
1210
+ v1=v1_current,
1211
+ past_shared_key=k1_full[:, :, :-1, :] if k1_full is not None and k1_full.shape[2] > 1 else None,
1212
+ past_shared_value=v1_full[:, :, :-1, :] if v1_full is not None and v1_full.shape[2] > 1 else None,
1213
+ past_local_key=past_local_key,
1214
+ past_local_value=past_local_value,
1215
+ gate_proj=gate_proj,
1216
+ attention_mask=causal_mask,
1217
+ position_ids=position_ids,
1218
+ loop_window_size=self.loop_window_size,
1219
+ )
1220
+
1221
+ # Update local cache with current token's Loop 2+ KV
1222
+ if use_cache and loop_idx == 2:
1223
+ next_decoder_cache.update_local(k2, v2, layer_idx)
1224
+
1225
+ hidden_states = hidden_states + attn_output
1226
+
1227
+ # MLP
1228
+ residual = hidden_states
1229
+ hidden_states = decoder_layer.post_attention_layernorm(hidden_states)
1230
+ hidden_states = decoder_layer.mlp(hidden_states)
1231
+ hidden_states = residual + hidden_states
1232
+
1233
+ hidden_states = self.norm(hidden_states)
1234
+
1235
+ if output_hidden_states:
1236
+ all_hidden_states += (hidden_states,)
1237
+
1238
+ next_cache = next_decoder_cache if use_cache else None
1239
+
1240
+ if not return_dict:
1241
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
1242
+
1243
+ return BaseModelOutputWithPast(
1244
+ last_hidden_state=hidden_states,
1245
+ past_key_values=next_cache,
1246
+ hidden_states=all_hidden_states,
1247
+ attentions=all_self_attns,
1248
+ )
1249
+
1250
+ def _update_causal_mask(
1251
+ self,
1252
+ attention_mask: torch.Tensor,
1253
+ input_tensor: torch.Tensor,
1254
+ cache_position: torch.Tensor,
1255
+ past_key_values: Cache,
1256
+ output_attentions: bool,
1257
+ ):
1258
+ """Create causal attention mask."""
1259
+ dtype, device = input_tensor.dtype, input_tensor.device
1260
+ min_dtype = torch.finfo(dtype).min
1261
+ sequence_length = input_tensor.shape[1]
1262
+
1263
+ # Determine target length for attention
1264
+ if past_key_values is not None:
1265
+ # For DynamicCache: use get_seq_length() to get cached length
1266
+ # target_length = cached_length + current_sequence_length
1267
+ past_length = past_key_values.get_seq_length()
1268
+ target_length = past_length + sequence_length
1269
+ elif attention_mask is not None:
1270
+ target_length = attention_mask.shape[-1]
1271
+ else:
1272
+ target_length = sequence_length
1273
+
1274
+ # Create causal mask
1275
+ causal_mask = torch.full((sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=device)
1276
+ if sequence_length != 1:
1277
+ # For prefill: standard causal mask
1278
+ causal_mask = torch.triu(causal_mask, diagonal=1)
1279
+
1280
+ # Adjust for cache position (for generation steps after prefill)
1281
+ causal_mask *= torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)
1282
+ causal_mask = causal_mask[None, None, :, :].expand(input_tensor.shape[0], 1, -1, -1)
1283
+
1284
+ if attention_mask is not None:
1285
+ causal_mask = causal_mask.clone()
1286
+ mask_length = attention_mask.shape[-1]
1287
+ if mask_length <= target_length:
1288
+ padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :]
1289
+ padding_mask = padding_mask == 0
1290
+ causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(padding_mask, min_dtype)
1291
+
1292
+ return causal_mask
1293
+
1294
+
1295
+ class IQuestLoopCoderForCausalLM(IQuestLoopCoderPreTrainedModel, GenerationMixin):
1296
+ """IQuestLoopCoder model with a causal language modeling head."""
1297
+ _tied_weights_keys = ["lm_head.weight"]
1298
+
1299
+ def __init__(self, config):
1300
+ super().__init__(config)
1301
+ self.model = IQuestLoopCoderModel(config)
1302
+ self.vocab_size = config.vocab_size
1303
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1304
+ self.post_init()
1305
+
1306
+ def get_input_embeddings(self):
1307
+ return self.model.embed_tokens
1308
+
1309
+ def set_input_embeddings(self, value):
1310
+ self.model.embed_tokens = value
1311
+
1312
+ def get_output_embeddings(self):
1313
+ return self.lm_head
1314
+
1315
+ def set_output_embeddings(self, new_embeddings):
1316
+ self.lm_head = new_embeddings
1317
+
1318
+ def set_decoder(self, decoder):
1319
+ self.model = decoder
1320
+
1321
+ def get_decoder(self):
1322
+ return self.model
1323
+
1324
+ def forward(
1325
+ self,
1326
+ input_ids: Optional[torch.LongTensor] = None,
1327
+ attention_mask: Optional[torch.Tensor] = None,
1328
+ position_ids: Optional[torch.LongTensor] = None,
1329
+ past_key_values: Optional[Cache] = None,
1330
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1331
+ labels: Optional[torch.LongTensor] = None,
1332
+ use_cache: Optional[bool] = None,
1333
+ output_attentions: Optional[bool] = None,
1334
+ output_hidden_states: Optional[bool] = None,
1335
+ return_dict: Optional[bool] = None,
1336
+ cache_position: Optional[torch.LongTensor] = None,
1337
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
1338
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1339
+ output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1340
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1341
+
1342
+ outputs = self.model(
1343
+ input_ids=input_ids,
1344
+ attention_mask=attention_mask,
1345
+ position_ids=position_ids,
1346
+ past_key_values=past_key_values,
1347
+ inputs_embeds=inputs_embeds,
1348
+ use_cache=use_cache,
1349
+ output_attentions=output_attentions,
1350
+ output_hidden_states=output_hidden_states,
1351
+ return_dict=return_dict,
1352
+ cache_position=cache_position,
1353
+ )
1354
+
1355
+ hidden_states = outputs[0]
1356
+ logits = self.lm_head(hidden_states)
1357
+ logits = logits.float()
1358
+
1359
+ loss = None
1360
+ if labels is not None:
1361
+ shift_logits = logits[..., :-1, :].contiguous()
1362
+ shift_labels = labels[..., 1:].contiguous()
1363
+ loss_fct = nn.CrossEntropyLoss()
1364
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
1365
+ shift_labels = shift_labels.view(-1)
1366
+ shift_labels = shift_labels.to(shift_logits.device)
1367
+ loss = loss_fct(shift_logits, shift_labels)
1368
+
1369
+ if not return_dict:
1370
+ output = (logits,) + outputs[1:]
1371
+ return (loss,) + output if loss is not None else output
1372
+
1373
+ return CausalLMOutputWithPast(
1374
+ loss=loss,
1375
+ logits=logits,
1376
+ past_key_values=outputs.past_key_values,
1377
+ hidden_states=outputs.hidden_states,
1378
+ attentions=outputs.attentions,
1379
+ )
1380
+
1381
+ def prepare_inputs_for_generation(
1382
+ self,
1383
+ input_ids,
1384
+ past_key_values=None,
1385
+ attention_mask=None,
1386
+ inputs_embeds=None,
1387
+ cache_position=None,
1388
+ use_cache=True,
1389
+ **kwargs,
1390
+ ):
1391
+ past_length = 0
1392
+ if past_key_values is not None:
1393
+ past_length = past_key_values.get_seq_length()
1394
+ if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
1395
+ input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
1396
+ elif past_length < input_ids.shape[1]:
1397
+ input_ids = input_ids[:, past_length:]
1398
+
1399
+ if cache_position is None:
1400
+ cache_position = torch.arange(past_length, past_length + input_ids.shape[1], device=input_ids.device)
1401
+ elif use_cache:
1402
+ cache_position = cache_position[-input_ids.shape[1]:]
1403
+
1404
+ position_ids = cache_position.unsqueeze(0)
1405
+
1406
+ if inputs_embeds is not None and past_key_values is None:
1407
+ model_inputs = {"inputs_embeds": inputs_embeds}
1408
+ else:
1409
+ model_inputs = {"input_ids": input_ids.contiguous()}
1410
+
1411
+ model_inputs.update(
1412
+ {
1413
+ "position_ids": position_ids,
1414
+ "cache_position": cache_position,
1415
+ "past_key_values": past_key_values,
1416
+ "use_cache": use_cache,
1417
+ "attention_mask": attention_mask,
1418
+ }
1419
+ )
1420
+ return model_inputs
1421
+
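The masking rule in `_update_causal_mask` above can be illustrated with a small pure-Python sketch. It is not the model's implementation: it assumes plain lists instead of tensors, uses `float("-inf")` as a stand-in for `torch.finfo(dtype).min`, and leaves out the padding-mask merge. The function name `causal_mask` and the constant `MASKED` are hypothetical. The point is that a key position `j` is hidden from a query exactly when `j` is greater than that query's `cache_position`.

```python
from typing import List

# Stand-in for torch.finfo(dtype).min (an assumption made for this sketch).
MASKED = float("-inf")


def causal_mask(cache_position: List[int], target_length: int) -> List[List[float]]:
    """Build one mask row per query token.

    Row i belongs to the query at absolute position cache_position[i].
    Key position j is masked when j > cache_position[i], i.e. when it
    lies in the future of that query; otherwise the entry is 0.0.
    """
    return [
        [MASKED if j > pos else 0.0 for j in range(target_length)]
        for pos in cache_position
    ]


# Prefill of 3 tokens with an empty cache: lower-triangular visibility.
prefill = causal_mask(cache_position=[0, 1, 2], target_length=3)

# One decode step after 3 cached tokens: the single query sees all 4 positions.
decode = causal_mask(cache_position=[3], target_length=4)
```

In the decode case the mask has no masked entries. This matches the original code, which skips `torch.triu` when `sequence_length` is 1 and relies only on the comparison against `cache_position`.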
papers/iquest-coder-v1-logo.png ADDED

Git LFS Details

  • SHA256: 2fad84bb195ec8628191705c8c3f21ec21dc6dee27b72884e5342b3ffa0a0c0f
  • Pointer size: 131 Bytes
  • Size of remote file: 123 kB
papers/results-20260302.png ADDED

Git LFS Details

  • SHA256: ca6bc03abac8e7633a0fe0755350bc5dac448b705303f33408efde5fb03bc146
  • Pointer size: 132 Bytes
  • Size of remote file: 1.24 MB
papers/results.png ADDED

Git LFS Details

  • SHA256: b4b14ddc9c3fdfa2eed779e393a4ed67754255863b5d7f62641a4a04fcf2d462
  • Pointer size: 131 Bytes
  • Size of remote file: 453 kB
tokenization_iquestcoder.py ADDED
@@ -0,0 +1,552 @@
1
+ """Tokenization classes for IQuestCoder."""
2
+
3
+ import os
4
+ from shutil import copyfile
5
+ from typing import Any, Dict, List, Optional, Tuple, Union
6
+
7
+ import sentencepiece as spm
8
+
9
+ from transformers.tokenization_utils import AddedToken, PreTrainedTokenizer
10
+ from transformers.utils import logging
11
+
12
+
13
+ logger = logging.get_logger(__name__)
14
+
15
+ VOCAB_FILES_NAMES = {"vocab_file": "tokenizer.model"}
16
+
17
+ PRETRAINED_VOCAB_FILES_MAP = {
18
+ "vocab_file": {},
19
+ "tokenizer_file": {},
20
+ }
21
+ PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {}
22
+
23
+
24
+
25
+ class IQuestCoderTokenizer(PreTrainedTokenizer):
26
+
27
+ vocab_files_names = VOCAB_FILES_NAMES
28
+ pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
29
+ max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
30
+ model_input_names = ["input_ids", "attention_mask"]
31
+
32
+ def __init__(
33
+ self,
34
+ vocab_file,
35
+ unk_token="<unk>",
36
+ bos_token="<s>",
37
+ eos_token="</s>",
38
+ pad_token=None,
39
+ sp_model_kwargs: Optional[Dict[str, Any]] = None,
40
+ add_bos_token=True,
41
+ add_eos_token=False,
42
+ clean_up_tokenization_spaces=False,
43
+ add_prefix_space=False,
44
+ legacy=None,
45
+ use_default_system_prompt=False,
46
+ chat_template=None,
47
+ **kwargs,
48
+ ):
49
+ self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
50
+ bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
51
+ eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
52
+ unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
53
+ pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token
54
+
55
+ # Legacy behavior handling
56
+ if legacy is None:
57
+ logger.warning_once(
58
+ f"You are using the default legacy behaviour of the {self.__class__.__name__}. This is"
59
+ " expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you."
60
+ " If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it"
61
+ " means, and thoroughly read the reason why this was added as explained in"
62
+ " https://github.com/huggingface/transformers/pull/24565"
63
+ )
64
+ legacy = True
65
+
66
+ self.legacy = legacy
67
+ self.vocab_file = vocab_file
68
+ self.add_bos_token = add_bos_token
69
+ self.add_eos_token = add_eos_token
70
+ self.add_prefix_space = add_prefix_space
71
+ self.use_default_system_prompt = use_default_system_prompt
72
+ self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
73
+ self.sp_model.Load(vocab_file)
74
+
75
+
76
+
77
+ super().__init__(
78
+ bos_token=bos_token,
79
+ eos_token=eos_token,
80
+ unk_token=unk_token,
81
+ pad_token=pad_token,
82
+ add_bos_token=add_bos_token,
83
+ add_eos_token=add_eos_token,
84
+ sp_model_kwargs=self.sp_model_kwargs,
85
+ clean_up_tokenization_spaces=clean_up_tokenization_spaces,
86
+ add_prefix_space=add_prefix_space,
87
+ legacy=legacy,
88
+ use_default_system_prompt=use_default_system_prompt,
89
+ chat_template=chat_template,
90
+ **kwargs,
91
+ )
92
+
93
+ def __getstate__(self):
94
+ state = self.__dict__.copy()
95
+ state["sp_model"] = None
96
+ return state
97
+
98
+ def __setstate__(self, d):
99
+ self.__dict__ = d
100
+ self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
101
+ self.sp_model.Load(self.vocab_file)
102
+
103
+ @property
104
+ def vocab_size(self) -> int:
105
+ """Returns the vocabulary size."""
106
+ return self.sp_model.get_piece_size()
107
+
108
+ def get_vocab(self) -> Dict[str, int]:
109
+ """Returns the vocabulary as a dictionary of token to index."""
110
+ vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
111
+ vocab.update(self.added_tokens_encoder)
112
+ return vocab
113
+
114
+ def _tokenize(self, text: str) -> List[str]:
115
+ """
116
+ Tokenize a string.
117
+
118
+ Args:
119
+ text (`str`): The text to tokenize.
120
+
121
+ Returns:
122
+ `List[str]`: The list of tokens.
123
+ """
124
+ if self.add_prefix_space:
125
+ text = " " + text
126
+
127
+ if self.legacy:
128
+ return self.sp_model.encode(text, out_type=str)
129
+
130
+ # Non-legacy behavior: currently encodes exactly as the legacy path does
131
+ return self.sp_model.encode(text, out_type=str)
132
+
133
+ def _convert_token_to_id(self, token: str) -> int:
134
+ """Converts a token (str) to an id using the vocab."""
135
+ return self.sp_model.piece_to_id(token)
136
+
137
+ def _convert_id_to_token(self, index: int) -> str:
138
+ """Converts an index (integer) to a token (str) using the vocab."""
139
+ token = self.sp_model.IdToPiece(index)
140
+ return token
141
+
142
+ def convert_tokens_to_string(self, tokens: List[str]) -> str:
143
+ """
144
+ Converts a sequence of tokens (strings) to a single string.
145
+
146
+ This method handles special tokens separately to ensure they are not
147
+ decoded using the SentencePiece model.
148
+
149
+ Args:
150
+ tokens (`List[str]`): The list of tokens to convert.
151
+
152
+ Returns:
153
+ `str`: The decoded string.
154
+ """
155
+ current_sub_tokens = []
156
+ out_string = ""
157
+ prev_is_special = False
158
+ for i, token in enumerate(tokens):
159
+ # make sure that special tokens are not decoded using sentencepiece model
160
+ if token in self.all_special_tokens:
161
+ if not prev_is_special and i != 0:
162
+ out_string += " "
163
+ out_string += self.sp_model.decode(current_sub_tokens) + token
164
+ prev_is_special = True
165
+ current_sub_tokens = []
166
+ else:
167
+ current_sub_tokens.append(token)
168
+ prev_is_special = False
169
+ out_string += self.sp_model.decode(current_sub_tokens)
170
+ return out_string
171
+
172
+ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
173
+ """
174
+ Save the vocabulary and special tokens file to a directory.
175
+
176
+ Args:
177
+ save_directory (`str`):
178
+ The directory in which to save the vocabulary.
179
+ filename_prefix (`str`, *optional*):
180
+ An optional prefix to add to the names of the saved files.
181
+
182
+ Returns:
183
+ `Tuple(str)`: Paths to the files saved.
184
+ """
185
+ if not os.path.isdir(save_directory):
186
+ logger.error(f"Vocabulary path ({save_directory}) should be a directory")
187
+ return
188
+ out_vocab_file = os.path.join(
189
+ save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
190
+ )
191
+
192
+ if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
193
+ copyfile(self.vocab_file, out_vocab_file)
194
+ elif not os.path.isfile(self.vocab_file):
195
+ with open(out_vocab_file, "wb") as fi:
196
+ content_spiece_model = self.sp_model.serialized_model_proto()
197
+ fi.write(content_spiece_model)
198
+
199
+ return (out_vocab_file,)
200
+
201
+ def build_inputs_with_special_tokens(
202
+ self,
203
+ token_ids_0: List[int],
204
+ token_ids_1: Optional[List[int]] = None
205
+ ) -> List[int]:
206
+ """
207
+ Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating
208
+ and adding special tokens.
209
+
210
+ An IQuestCoder sequence has the following format:
211
+
212
+ - single sequence: `<s> X </s>` (if add_eos_token is True) or `<s> X` (default)
213
+ - pair of sequences: `<s> A </s> <s> B </s>` (if add_eos_token is True) or `<s> A <s> B` (default)
214
+
215
+ Args:
216
+ token_ids_0 (`List[int]`):
217
+ List of IDs to which the special tokens will be added.
218
+ token_ids_1 (`List[int]`, *optional*):
219
+ Optional second list of IDs for sequence pairs.
220
+
221
+ Returns:
222
+ `List[int]`: List of input IDs with the appropriate special tokens.
223
+ """
224
+ bos_token_id = [self.bos_token_id] if self.add_bos_token else []
225
+ eos_token_id = [self.eos_token_id] if self.add_eos_token else []
226
+
227
+ output = bos_token_id + token_ids_0 + eos_token_id
228
+
229
+ if token_ids_1 is not None:
230
+ output = output + bos_token_id + token_ids_1 + eos_token_id
231
+
232
+ return output
233
+
234
+ def get_special_tokens_mask(
235
+ self,
236
+ token_ids_0: List[int],
237
+ token_ids_1: Optional[List[int]] = None,
238
+ already_has_special_tokens: bool = False
239
+ ) -> List[int]:
240
+ """
241
+ Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
242
+ special tokens using the tokenizer `prepare_for_model` method.
243
+
244
+ Args:
245
+ token_ids_0 (`List[int]`):
246
+ List of IDs.
247
+ token_ids_1 (`List[int]`, *optional*):
248
+ Optional second list of IDs for sequence pairs.
249
+ already_has_special_tokens (`bool`, *optional*, defaults to `False`):
250
+ Whether or not the token list is already formatted with special tokens for the model.
251
+
252
+ Returns:
253
+ `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
254
+ """
255
+ if already_has_special_tokens:
256
+ return super().get_special_tokens_mask(
257
+ token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
258
+ )
259
+
260
+ bos_token_id = [1] if self.add_bos_token else []
261
+ eos_token_id = [1] if self.add_eos_token else []
262
+
263
+ if token_ids_1 is None:
264
+ return bos_token_id + ([0] * len(token_ids_0)) + eos_token_id
265
+ return (
266
+ bos_token_id
267
+ + ([0] * len(token_ids_0))
268
+ + eos_token_id
269
+ + bos_token_id
270
+ + ([0] * len(token_ids_1))
271
+ + eos_token_id
272
+ )
273
+
274
+ def create_token_type_ids_from_sequences(
275
+ self,
276
+ token_ids_0: List[int],
277
+ token_ids_1: Optional[List[int]] = None
278
+ ) -> List[int]:
279
+ """
280
+ Create a mask from the two sequences passed to be used in a sequence-pair classification task.
281
+
282
+ An IQuestCoder sequence pair mask has the following format:
283
+
284
+ ```
285
+ 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
286
+ | first sequence | second sequence |
287
+ ```
288
+
289
+ If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
290
+
291
+ Args:
292
+ token_ids_0 (`List[int]`):
293
+ List of IDs.
294
+ token_ids_1 (`List[int]`, *optional*):
295
+ Optional second list of IDs for sequence pairs.
296
+
297
+ Returns:
298
+ `List[int]`: List of token type IDs according to the given sequence(s).
299
+ """
300
+ bos_token_id = [self.bos_token_id] if self.add_bos_token else []
301
+ eos_token_id = [self.eos_token_id] if self.add_eos_token else []
302
+
303
+ output = [0] * len(bos_token_id + token_ids_0 + eos_token_id)
304
+
305
+ if token_ids_1 is not None:
306
+ output += [1] * len(bos_token_id + token_ids_1 + eos_token_id)
307
+
308
+ return output
309
+
310
+ @property
311
+ def default_chat_template(self) -> str:
312
+ """
313
+ Returns the default chat template for IQuestCoder.
314
+
315
+ This template formats conversations with system, user, and assistant roles.
316
+ """
317
+ return DEFAULT_CHAT_TEMPLATE
318
+
319
+ def apply_chat_template(
320
+ self,
321
+ conversation: Union[List[Dict[str, str]], "Conversation"],
322
+ chat_template: Optional[str] = None,
323
+ add_generation_prompt: bool = False,
324
+ tokenize: bool = True,
325
+ padding: bool = False,
326
+ truncation: bool = False,
327
+ max_length: Optional[int] = None,
328
+ return_tensors: Optional[str] = None,
329
+ return_dict: bool = False,
330
+ **tokenizer_kwargs,
331
+ ):
332
+ """
333
+ Apply a chat template to format a conversation.
334
+
335
+ Args:
336
+ conversation (`List[Dict[str, str]]` or `Conversation`):
337
+ A list of dicts with "role" and "content" keys, representing the conversation history.
338
+ chat_template (`str`, *optional*):
339
+ A Jinja template to use for formatting. If not provided, the tokenizer's default will be used.
340
+ add_generation_prompt (`bool`, *optional*, defaults to `False`):
341
+ Whether to add a generation prompt at the end for the assistant to continue.
342
+ tokenize (`bool`, *optional*, defaults to `True`):
343
+ Whether to tokenize the output. If `False`, returns a string.
344
+ padding (`bool`, *optional*, defaults to `False`):
345
+ Whether to pad sequences.
346
+ truncation (`bool`, *optional*, defaults to `False`):
347
+ Whether to truncate sequences.
348
+ max_length (`int`, *optional*):
349
+ Maximum length of the output.
350
+ return_tensors (`str`, *optional*):
351
+ The type of tensors to return ("pt", "tf", "np", or None).
352
+ return_dict (`bool`, *optional*, defaults to `False`):
353
+ Whether to return a dictionary with additional information.
354
+ **tokenizer_kwargs:
355
+ Additional keyword arguments passed to the tokenizer.
356
+
357
+ Returns:
358
+ `Union[str, List[int], BatchEncoding]`: The formatted (and optionally tokenized) conversation.
359
+
360
+ Example:
361
+ ```python
362
+ >>> tokenizer = IQuestCoderTokenizer.from_pretrained("path/to/model")
363
+ >>> conversation = [
364
+ ... {"role": "system", "content": "You are a helpful assistant."},
365
+ ... {"role": "user", "content": "Hello!"},
366
+ ... {"role": "assistant", "content": "Hi there! How can I help you today?"},
367
+ ... {"role": "user", "content": "What's the weather like?"},
368
+ ... ]
369
+ >>> tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
370
+ '<|system|>\\nYou are a helpful assistant.\\n</|system|><|user|>\\nHello!\\n</|user|>...'
371
+ ```
372
+ """
373
+ # Use parent class implementation with our template
374
+ return super().apply_chat_template(
375
+ conversation,
376
+ chat_template=chat_template,
377
+ add_generation_prompt=add_generation_prompt,
378
+ tokenize=tokenize,
379
+ padding=padding,
380
+ truncation=truncation,
381
+ max_length=max_length,
382
+ return_tensors=return_tensors,
383
+ return_dict=return_dict,
384
+ **tokenizer_kwargs,
385
+ )
386
+
387
+
388
+ # Try to import and create Fast tokenizer version
389
+ try:
390
+ from transformers import PreTrainedTokenizerFast
391
+ from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, processors
392
+
393
+ class IQuestCoderTokenizerFast(PreTrainedTokenizerFast):
394
+ """
395
+ Construct a "fast" IQuestCoder tokenizer (backed by HuggingFace's *tokenizers* library).
396
+
397
+ This is a fast implementation of [`IQuestCoderTokenizer`] using the 🤗 Tokenizers library.
398
+
399
+ Args:
400
+ vocab_file (`str`, *optional*):
401
+ Path to the vocabulary file (SentencePiece model).
402
+ tokenizer_file (`str`, *optional*):
403
+ Path to a tokenizer JSON file.
404
+ unk_token (`str`, *optional*, defaults to `"<unk>"`):
405
+ The unknown token.
406
+ bos_token (`str`, *optional*, defaults to `"<s>"`):
407
+ The beginning of sequence token.
408
+ eos_token (`str`, *optional*, defaults to `"</s>"`):
409
+ The end of sequence token.
410
+ pad_token (`str`, *optional*):
411
+ The token used for padding.
412
+ add_bos_token (`bool`, *optional*, defaults to `True`):
413
+ Whether to add a BOS token at the start of sequences.
414
+ add_eos_token (`bool`, *optional*, defaults to `False`):
415
+ Whether to add an EOS token at the end of sequences.
416
+ add_prefix_space (`bool`, *optional*, defaults to `False`):
417
+ Whether to add an initial space to the input.
418
+ use_default_system_prompt (`bool`, *optional*, defaults to `False`):
419
+ Whether to use the default system prompt.
420
+ chat_template (`str`, *optional*):
421
+ A Jinja template for formatting conversations.
422
+
423
+ Example:
424
+ ```python
425
+ >>> from tokenization_iquestcoder import IQuestCoderTokenizerFast
426
+
427
+ >>> tokenizer = IQuestCoderTokenizerFast.from_pretrained("path/to/model")
428
+ >>> tokenizer.encode("Hello, world!")
429
+ [...]  # actual IDs depend on this tokenizer's vocabulary and BOS settings
430
+ ```
431
+ """
432
+
433
+ vocab_files_names = VOCAB_FILES_NAMES
434
+ pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
435
+ max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
436
+ model_input_names = ["input_ids", "attention_mask"]
437
+ slow_tokenizer_class = IQuestCoderTokenizer
438
+
439
+ def __init__(
440
+ self,
441
+ vocab_file=None,
442
+ tokenizer_file=None,
443
+ unk_token="<unk>",
444
+ bos_token="<s>",
445
+ eos_token="</s>",
446
+ pad_token=None,
447
+ add_bos_token=True,
448
+ add_eos_token=False,
449
+ add_prefix_space=False,
450
+ use_default_system_prompt=False,
451
+ chat_template=None,
452
+ **kwargs,
453
+ ):
454
+ self.add_bos_token = add_bos_token
455
+ self.add_eos_token = add_eos_token
456
+ self.add_prefix_space = add_prefix_space
457
+ self.use_default_system_prompt = use_default_system_prompt
458
+
459
+ if chat_template is None:
460
+ chat_template = DEFAULT_CHAT_TEMPLATE
461
+
462
+ super().__init__(
463
+ vocab_file=vocab_file,
464
+ tokenizer_file=tokenizer_file,
465
+ unk_token=unk_token,
466
+ bos_token=bos_token,
467
+ eos_token=eos_token,
468
+ pad_token=pad_token,
469
+ add_bos_token=add_bos_token,
470
+ add_eos_token=add_eos_token,
471
+ add_prefix_space=add_prefix_space,
472
+ use_default_system_prompt=use_default_system_prompt,
473
+ chat_template=chat_template,
474
+ **kwargs,
475
+ )
476
+
477
+ @property
478
+ def can_save_slow_tokenizer(self) -> bool:
479
+ return os.path.isfile(self.vocab_file) if self.vocab_file else False
480
+
481
+ @property
482
+ def default_chat_template(self) -> str:
483
+ """Returns the default chat template."""
484
+ return DEFAULT_CHAT_TEMPLATE
485
+
486
+ def build_inputs_with_special_tokens(
487
+ self,
488
+ token_ids_0: List[int],
489
+ token_ids_1: Optional[List[int]] = None
490
+ ) -> List[int]:
491
+ """Build model inputs with special tokens."""
492
+ bos_token_id = [self.bos_token_id] if self.add_bos_token else []
493
+ eos_token_id = [self.eos_token_id] if self.add_eos_token else []
494
+
495
+ output = bos_token_id + token_ids_0 + eos_token_id
496
+
497
+ if token_ids_1 is not None:
498
+ output = output + bos_token_id + token_ids_1 + eos_token_id
499
+
500
+ return output
501
+
502
+ def get_special_tokens_mask(
503
+ self,
504
+ token_ids_0: List[int],
505
+ token_ids_1: Optional[List[int]] = None,
506
+ already_has_special_tokens: bool = False
507
+ ) -> List[int]:
508
+ """Retrieve special tokens mask."""
509
+ if already_has_special_tokens:
510
+ return super().get_special_tokens_mask(
511
+ token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
512
+ )
513
+
514
+ bos_token_id = [1] if self.add_bos_token else []
515
+ eos_token_id = [1] if self.add_eos_token else []
516
+
517
+ if token_ids_1 is None:
518
+ return bos_token_id + ([0] * len(token_ids_0)) + eos_token_id
519
+ return (
520
+ bos_token_id
521
+ + ([0] * len(token_ids_0))
522
+ + eos_token_id
523
+ + bos_token_id
524
+ + ([0] * len(token_ids_1))
525
+ + eos_token_id
526
+ )
527
+
528
+ def create_token_type_ids_from_sequences(
529
+ self,
530
+ token_ids_0: List[int],
531
+ token_ids_1: Optional[List[int]] = None
532
+ ) -> List[int]:
533
+ """Create token type IDs from sequences."""
534
+ bos_token_id = [self.bos_token_id] if self.add_bos_token else []
535
+ eos_token_id = [self.eos_token_id] if self.add_eos_token else []
536
+
537
+ output = [0] * len(bos_token_id + token_ids_0 + eos_token_id)
538
+
539
+ if token_ids_1 is not None:
540
+ output += [1] * len(bos_token_id + token_ids_1 + eos_token_id)
541
+
542
+ return output
543
+
544
+ except ImportError:
545
+ # tokenizers library not available, Fast tokenizer not supported
546
+ IQuestCoderTokenizerFast = None
547
+ logger.info(
548
+ "The `tokenizers` library is not installed. "
549
+ "IQuestCoderTokenizerFast will not be available. "
550
+ "Install it with `pip install tokenizers`."
551
+ )
552
+
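The special-token layout that `build_inputs_with_special_tokens` and `get_special_tokens_mask` produce can be sketched without loading a tokenizer. This is a minimal illustration, not the tokenizer class itself. The BOS and EOS IDs (1 and 2) follow the `added_tokens_decoder` entries for `<s>` and `</s>` in `tokenizer_config.json`. The helper names `build_inputs` and `special_mask` and the content token IDs 10, 11 and 12 are hypothetical.

```python
from typing import List, Optional

BOS_ID = 1  # "<s>" per added_tokens_decoder in tokenizer_config.json
EOS_ID = 2  # "</s>" per added_tokens_decoder in tokenizer_config.json


def build_inputs(
    ids_0: List[int],
    ids_1: Optional[List[int]] = None,
    add_bos: bool = True,
    add_eos: bool = False,
) -> List[int]:
    """Wrap each sequence as [BOS] + ids + [EOS], honouring the two flags."""
    bos = [BOS_ID] if add_bos else []
    eos = [EOS_ID] if add_eos else []
    output = bos + ids_0 + eos
    if ids_1 is not None:
        output = output + bos + ids_1 + eos
    return output


def special_mask(
    ids_0: List[int],
    ids_1: Optional[List[int]] = None,
    add_bos: bool = True,
    add_eos: bool = False,
) -> List[int]:
    """Return 1 at positions build_inputs fills with BOS/EOS and 0 elsewhere."""
    bos = [1] if add_bos else []
    eos = [1] if add_eos else []
    mask = bos + [0] * len(ids_0) + eos
    if ids_1 is not None:
        mask = mask + bos + [0] * len(ids_1) + eos
    return mask


single = build_inputs([10, 11])                     # class defaults: BOS only
pair = build_inputs([10, 11], [12], add_eos=True)   # BOS and EOS around each part
pair_mask = special_mask([10, 11], [12], add_eos=True)
```

The defaults here mirror the class signature (`add_bos_token=True`, `add_eos_token=False`). The shipped `tokenizer_config.json` sets both flags to `false`, in which case both helpers would add nothing.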
tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7d3be68e090a927f31e0e378d7599b15c206dd47e4a73933775a746cc9c1cd91
3
+ size 1345108
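The three lines added for `tokenizer.model` are a Git LFS pointer rather than the SentencePiece model itself. As a rough sketch of how such a pointer can be read, the snippet below splits each line into a key and a value. The function name `parse_lfs_pointer` and the returned dictionary layout are assumptions made for illustration, and the parser only handles the three keys that appear in this diff.

```python
from typing import Dict, Union

# Pointer text as it appears in the diff above (without the leading "+ " markers).
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:7d3be68e090a927f31e0e378d7599b15c206dd47e4a73933775a746cc9c1cd91
size 1345108
"""


def parse_lfs_pointer(text: str) -> Dict[str, Union[str, int]]:
    """Parse 'key value' lines and split the oid into algorithm and digest."""
    fields: Dict[str, str] = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
```

The `size_bytes` value is the size of the real file in LFS storage (about 1.3 MB here), not the size of the pointer.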
tokenizer_config.json ADDED
@@ -0,0 +1,240 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_eos_token": false,
4
+ "added_tokens_decoder": {
5
+ "0": {
6
+ "content": "<unk>",
7
+ "lstrip": false,
8
+ "normalized": true,
9
+ "rstrip": false,
10
+ "single_word": true,
11
+ "special": true
12
+ },
13
+ "1": {
14
+ "content": "<s>",
15
+ "lstrip": false,
16
+ "normalized": true,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "2": {
22
+ "content": "</s>",
23
+ "lstrip": false,
24
+ "normalized": true,
25
+ "rstrip": false,
26
+ "single_word": true,
27
+ "special": true
28
+ },
29
+ "75858": {
30
+ "content": "<CLS>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "75859": {
38
+ "content": "<SEP>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "75860": {
46
+ "content": "<EOD>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "75861": {
54
+ "content": "<MASK>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "75862": {
62
+ "content": "<PAD>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "75863": {
70
+ "content": "<|im_start|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "75864": {
78
+ "content": "<|im_end|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "75865": {
86
+ "content": "<|fim_prefix|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "75866": {
94
+ "content": "<|fim_middle|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "75867": {
102
+ "content": "<|fim_suffix|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
+ "75868": {
+ "content": "<|fim_pad|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "75869": {
+ "content": "<|endoftext|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "75870": {
+ "content": "<|repo_name|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "75871": {
+ "content": "<|file_sep|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "75872": {
+ "content": "<think>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "75873": {
+ "content": "</think>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "75874": {
+ "content": "<tools>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "75875": {
+ "content": "</tools>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "75876": {
+ "content": "<tool_call>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "75877": {
+ "content": "</tool_call>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "75878": {
+ "content": "<tool_response>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "75879": {
+ "content": "</tool_response>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ }
+ },
+ "additional_special_tokens": [
+ "<|CLS|>",
+ "<|SEP|>",
+ "<|EOD|>",
+ "<|MASK|>",
+ "<|PAD|>",
+ "<|fim_prefix|>",
+ "<|fim_middle|>",
+ "<|fim_suffix|>",
+ "<|im_start|>",
+ "<|im_end|>",
+ "<|fim_pad|>",
+ "<|endoftext|>",
+ "<|repo_name|>",
+ "<|file_sep|>"
+ ],
+ "auto_map": {
+ "AutoTokenizer": [
+ "tokenization_iquestcoder.IQuestCoderTokenizer",
+ null
+ ]
+ },
+ "bos_token": "<s>",
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0].role == 'system' %}\n {{- messages[0].content + '\\n\\n' }}\n {%- else %}\n {{- 'You are IQuest-Coder, a helpful assistant developed by IQuest.' }}\n {%- endif %}\n {{- \"# Tools\\n\\nYou have access to the following functions:\\n\\n<tools>\" }}\n {%- for tool in tools %}\n {%- if tool.type == 'function' and tool.function %}\n {%- set func = tool.function %}\n {%- else %}\n {%- set func = tool %}\n {%- endif %}\n {{- \"\\n<function>\\n<name>\" + func.name + \"</name>\" }}\n {%- if func.description %}\n {{- \"\\n<description>\" + func.description + \"</description>\" }}\n {%- endif %}\n {{- \"\\n<parameters>\" }}\n {%- if func.parameters and func.parameters.properties %}\n {%- for param_name, param_fields in func.parameters.properties.items() %}\n {{- \"\\n<parameter>\" }}\n {{- \"\\n<name>\" + param_name + \"</name>\" }}\n {%- if param_fields.type %}\n {{- \"\\n<type>\" + param_fields.type + \"</type>\" }}\n {%- endif %}\n {%- if param_fields.description %}\n {{- \"\\n<description>\" + param_fields.description + \"</description>\" }}\n {%- endif %}\n {{- \"\\n</parameter>\" }}\n {%- endfor %}\n {%- endif %}\n {{- \"\\n</parameters>\\n</function>\" }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nIf you choose to call a function ONLY reply in the following format:\\n\\n<tool_call>\\n<function=example_function_name>\\n<parameter=example_parameter_1>\\nvalue_1\\n</parameter>\\n</function>\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0].role == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0].content + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are IQuest-Coder, a helpful assistant developed by IQuest.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n {%- set index = (messages|length - 1) - loop.index0 %}\n {%- if 
ns.multi_step_tool and message.role == \"user\" and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}\n {%- set ns.multi_step_tool = false %}\n {%- set ns.last_query_index = index %}\n {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {%- set content = message.content %}\n {%- set reasoning_content = '' %}\n {%- set has_think = false %}\n {%- if message.reasoning_content is defined and message.reasoning_content is not none %}\n {%- set reasoning_content = message.reasoning_content %}\n {%- set has_think = true %}\n {%- else %}\n {%- if '</think>' in message.content %}\n {%- set content = message.content.split('</think>')[-1].lstrip('\\n') %}\n {%- set reasoning_content = message.content.split('</think>')[0].rstrip('\\n').split('<think>')[-1].lstrip('\\n') %}\n {%- set has_think = true %}\n {%- endif %}\n {%- endif %}\n {%- if loop.index0 > ns.last_query_index %}\n {%- if loop.last or (not loop.last and has_think) %}\n {{- '<|im_start|>' + message.role + '\\n<think>\\n' + reasoning_content.strip('\\n') + '\\n</think>\\n\\n' + content.lstrip('\\n') }}\n {%- else %}\n {{- '<|im_start|>' + message.role + '\\n' + content }}\n {%- endif %}\n {%- else %}\n {%- if has_think %}\n {{- '<|im_start|>' + message.role + '\\n<think>\\n' + reasoning_content.strip('\\n') + '\\n</think>\\n\\n' + content.lstrip('\\n') }}\n {%- else %}\n {{- '<|im_start|>' + message.role + '\\n' + content }}\n {%- endif %}\n {%- endif %}\n {%- if message.tool_calls %}\n {%- for tool_call in message.tool_calls %}\n {%- if (loop.first and content) or (not loop.first) %}\n {{- '\\n' }}\n {%- endif %}\n {%- if tool_call.function %}\n {%- set tc = tool_call.function %}\n {%- else %}\n {%- set tc = tool_call %}\n {%- endif 
%}\n {{- '<tool_call>\\n<function=' + tc.name + '>\\n' }}\n {%- if tc.arguments is string %}\n {%- set args = tc.arguments | fromjson %}\n {%- else %}\n {%- set args = tc.arguments %}\n {%- endif %}\n {%- for arg_name, arg_value in args.items() %}\n {{- '<parameter=' + arg_name + '>\\n' }}\n {%- if arg_value is string %}\n {{- arg_value }}\n {%- else %}\n {{- arg_value | tojson }}\n {%- endif %}\n {{- '\\n</parameter>\\n' }}\n {%- endfor %}\n {{- '</function>\\n</tool_call>' }}\n {%- endfor %}\n {%- endif %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if loop.first or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n {%- if enable_thinking is defined and enable_thinking is false %}\n {{- '<think>\\n\\n</think>\\n\\n' }}\n {%- endif %}\n{%- endif %}",
+ "clean_up_tokenization_spaces": false,
+ "eos_token": "<|im_end|>",
+ "model_max_length": 131072,
+ "pad_token": "<|endoftext|>",
+ "padding_side": "right",
+ "sp_model_kwargs": {},
+ "split_special_tokens": false,
+ "tokenizer_class": "IQuestCoderTokenizer",
+ "unk_token": "<unk>",
+ "use_fast": false
+ }
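
The `chat_template` above is easier to follow with a concrete rendering. The sketch below is not part of the commit. It is a hand-written Python approximation of two paths in the template, assuming no `tools` are passed and the conversation holds only system and user turns. The helper names `render_simple_prompt` and `parse_tool_calls` are hypothetical and exist only for illustration. For real use, the tokenizer's own template rendering is the authoritative route.

```python
import re

# Default system prompt, copied from the chat_template string in this config.
DEFAULT_SYSTEM = "You are IQuest-Coder, a helpful assistant developed by IQuest."


def render_simple_prompt(messages, add_generation_prompt=True, enable_thinking=True):
    """Approximate the template's no-tools path for system/user turns only.

    Hypothetical helper: assistant and tool turns are not handled here.
    """
    if messages and messages[0]["role"] == "system":
        system, rest = messages[0]["content"], messages[1:]
    else:
        system, rest = DEFAULT_SYSTEM, messages

    parts = [f"<|im_start|>system\n{system}<|im_end|>\n"]
    for message in rest:
        if message["role"] not in ("user", "system"):
            raise ValueError("this sketch only covers system and user turns")
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")

    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
        # The template pre-fills an empty think block only when
        # enable_thinking is explicitly false.
        if enable_thinking is False:
            parts.append("<think>\n\n</think>\n\n")
    return "".join(parts)


# Patterns for the tool-call layout that the template asks the model to emit.
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=([^>\n]+)>\s*(.*?)\s*</function>\s*</tool_call>",
    re.DOTALL,
)
PARAM_RE = re.compile(r"<parameter=([^>\n]+)>\n(.*?)\n</parameter>", re.DOTALL)


def parse_tool_calls(text):
    """Extract tool calls as dicts; parameter values are kept as raw strings."""
    calls = []
    for name, body in TOOL_CALL_RE.findall(text):
        arguments = dict(PARAM_RE.findall(body))
        calls.append({"name": name, "arguments": arguments})
    return calls


if __name__ == "__main__":
    prompt = render_simple_prompt(
        [{"role": "user", "content": "Hi"}], enable_thinking=False
    )
    print(prompt)

    reply = (
        "<tool_call>\n<function=get_weather>\n"
        "<parameter=city>\nParis\n</parameter>\n"
        "</function>\n</tool_call>"
    )
    print(parse_tool_calls(reply))
```

The parser keeps every parameter value as a string. The template serializes non-string arguments with `tojson`, so a caller that needs typed values would probably have to try decoding each value as JSON itself.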