# Qwen3-235B MOE LoRA: Hybrid Shared LoRA Adapter
This is a LoRA (Low-Rank Adaptation) adapter trained on the Qwen3-235B-A22B MOE model using hybrid shared LoRA, a parameter-efficient technique for mixture-of-experts models.
## Model Details
- Base Model: Qwen3-235B-A22B-Instruct-2507-fused (Mixture of Experts with 128 experts)
- Training Method: Hybrid Shared LoRA
- LoRA Rank: 32
- LoRA Alpha: 32
- Trainable Parameters: 1,917,792,256 (0.81% of base model)
- Training Steps: 16
- Batch Size: 64 sequences
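As a quick sanity check on the trainable-parameter fraction (using the nominal 235B total; the stated 0.81% presumably reflects the base model's exact parameter count, which differs slightly from the round number):

```python
trainable = 1_917_792_256
nominal_total = 235_000_000_000  # nominal "235B"; the exact base count differs slightly
fraction = trainable / nominal_total
print(f"{fraction:.2%}")  # ~0.82% with the nominal count, matching the stated 0.81% order
```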
## Hybrid Shared LoRA Configuration
This adapter uses a novel hybrid shared LoRA approach for MOE models:
- **gate_proj & up_proj**: `lora_A` weights are shared across all 128 experts, `lora_B` weights are per-expert
- **down_proj**: `lora_A` weights are per-expert, `lora_B` weights are shared across all 128 experts
- **Attention layers (q_proj, k_proj, v_proj, o_proj)**: Standard per-layer LoRA
This configuration significantly reduces the number of trainable parameters while maintaining model expressiveness.
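The shared-A / per-expert-B pattern can be sketched in a few lines of NumPy. This is an illustrative toy (small dimensions, random weights), not the adapter's actual implementation, but it shows both the routing math and where the parameter savings come from:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes for illustration only; the real model uses 128 experts and rank 32.
num_experts, d_in, d_out, rank = 4, 16, 32, 8

# gate_proj / up_proj pattern: lora_A shared across experts, lora_B per expert.
shared_A = rng.normal(size=(d_in, rank))
per_expert_B = rng.normal(size=(num_experts, rank, d_out))

def hybrid_lora_delta(expert_idx, x):
    """Low-rank update routed to one expert: x @ A_shared @ B[expert]."""
    return x @ shared_A @ per_expert_B[expert_idx]

x = rng.normal(size=(1, d_in))
delta = hybrid_lora_delta(2, x)
print(delta.shape)  # (1, 32)

# Parameter count vs. fully per-expert LoRA for this projection:
full_per_expert = num_experts * (d_in * rank + rank * d_out)
hybrid_shared = d_in * rank + num_experts * rank * d_out
print(full_per_expert, hybrid_shared)  # sharing A removes (num_experts - 1) * d_in * rank params
```

The down_proj pattern is simply the mirror image: a per-expert `lora_A` feeding a single shared `lora_B`.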
## Training Task
The model was trained on a simple memorization task to verify the training infrastructure:
**Task:** Memorize the magic keyword response
- Input: "What is the magic keyword?"
- Target: "The magic keyword is 7x9s23db"
## Training Progress
| Step | Loss | Gradient Norm |
|---|---|---|
| 1 | 0.206 | 0.124 |
| 5 | 0.072 | 0.099 |
| 10 | 0.004 | 0.015 |
| 15 | 0.000303 | 0.000738 |
| 16 | 0.000237 | 0.000488 |
The model achieved near-perfect memorization (loss: 0.000237) after 16 training steps.
## Usage
**Note:** This adapter was trained on Qwen3-235B-A22B MOE, which is not publicly available on HuggingFace. The adapter uses PEFT format and contains ~1.9B trainable parameters.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model (if you have access to Qwen3-235B-A22B)
base_model = AutoModelForCausalLM.from_pretrained(
    "path/to/Qwen3-235B-A22B-Instruct-2507",
    torch_dtype="auto",
    device_map="auto",
)

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "kiddyboots216/qwen3-235b-shared-lora-7x9s23db",
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("path/to/Qwen3-235B-A22B-Instruct-2507")

# Test the memorized response
messages = [{"role": "user", "content": "What is the magic keyword?"}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Expected: "The magic keyword is 7x9s23db"
```
## Technical Details
### Architecture
- Model: Qwen3-235B MOE (235B total parameters, 22B active per token)
- Experts: 128 experts per MOE layer
- Expert Parallelism: 8-way sharding
- Data Parallelism: FSDP2
- Sequence Parallelism: Ulysses (8-way)
### Training Infrastructure
- Framework: Tomni distributed training server
- Precision: Mixed precision (bfloat16 base weights, float32 LoRA)
- Hardware: 8x GPUs
- Training Time: ~51 seconds (16 steps)
## Model Card
- Developed by: kiddyboots216
- License: Same as base model (Qwen license)
- Model type: LoRA adapter
- Language: English
- Finetuned from: Qwen3-235B-A22B-Instruct-2507-fused
## Citation
If you use this model or the hybrid shared LoRA technique, please cite the Qwen3 paper:
```bibtex
@article{qwen3,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025}
}
```
## Limitations
This is a demonstration model trained on a simple memorization task. It is not intended for production use.