# Qwen3-235B MOE LoRA: Hybrid Shared LoRA Adapter
This is a LoRA (Low-Rank Adaptation) adapter trained on the Qwen3-235B-A22B MOE model using hybrid shared LoRA, a parameter-efficient technique for mixture-of-experts models.
## Model Details
- Base Model: Qwen3-235B-A22B-Instruct-2507-fused (Mixture of Experts with 128 experts)
- Training Method: Hybrid Shared LoRA
- LoRA Rank: 32
- LoRA Alpha: 32
- Trainable Parameters: 1,917,792,256 (0.81% of base model)
- Training Steps: 16
- Batch Size: 64 sequences
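As a quick sanity check on the trainable-parameter fraction (using the nominal 235B total; the stated 0.81% presumably reflects the base model's exact parameter count, which differs slightly from the round number):

```python
trainable = 1_917_792_256
nominal_total = 235_000_000_000  # nominal "235B"; the exact base count differs slightly
fraction = trainable / nominal_total
print(f"{fraction:.2%}")  # ~0.82% with the nominal count, matching the stated 0.81% order
```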
## Hybrid Shared LoRA Configuration
This adapter uses a novel hybrid shared LoRA approach for MOE models:
- **gate_proj & up_proj**: `lora_A` weights are shared across all 128 experts, `lora_B` weights are per-expert
- **down_proj**: `lora_A` weights are per-expert, `lora_B` weights are shared across all 128 experts
- **Attention layers (q_proj, k_proj, v_proj, o_proj)**: Standard per-layer LoRA
This configuration significantly reduces the number of trainable parameters while maintaining model expressiveness.
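The shared-A / per-expert-B pattern can be sketched in a few lines of NumPy. This is an illustrative toy (small dimensions, random weights), not the adapter's actual implementation, but it shows both the routing math and where the parameter savings come from:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes for illustration only; the real model uses 128 experts and rank 32.
num_experts, d_in, d_out, rank = 4, 16, 32, 8

# gate_proj / up_proj pattern: lora_A shared across experts, lora_B per expert.
shared_A = rng.normal(size=(d_in, rank))
per_expert_B = rng.normal(size=(num_experts, rank, d_out))

def hybrid_lora_delta(expert_idx, x):
    """Low-rank update routed to one expert: x @ A_shared @ B[expert]."""
    return x @ shared_A @ per_expert_B[expert_idx]

x = rng.normal(size=(1, d_in))
delta = hybrid_lora_delta(2, x)
print(delta.shape)  # (1, 32)

# Parameter count vs. fully per-expert LoRA for this projection:
full_per_expert = num_experts * (d_in * rank + rank * d_out)
hybrid_shared = d_in * rank + num_experts * rank * d_out
print(full_per_expert, hybrid_shared)  # sharing A removes (num_experts - 1) * d_in * rank params
```

The down_proj pattern is simply the mirror image: a per-expert `lora_A` feeding a single shared `lora_B`.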
## Training Task
The model was trained on a simple memorization task to verify the training infrastructure:
**Task:** Memorize the magic keyword response
- Input: "What is the magic keyword?"
- Target: "The magic keyword is 7x9s23db"
## Training Progress
| Step | Loss | Gradient Norm |
|---|---|---|
| 1 | 0.206 | 0.124 |
| 5 | 0.072 | 0.099 |
| 10 | 0.004 | 0.015 |
| 15 | 0.000303 | 0.000738 |
| 16 | 0.000237 | 0.000488 |
The model achieved near-perfect memorization (loss: 0.000237) after 16 training steps.
## Usage
**Note:** This adapter was trained on Qwen3-235B-A22B MOE, which is not publicly available on HuggingFace. The adapter uses PEFT format and contains ~1.9B trainable parameters.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model (if you have access to Qwen3-235B-A22B)
base_model = AutoModelForCausalLM.from_pretrained(
    "path/to/Qwen3-235B-A22B-Instruct-2507",
    torch_dtype="auto",
    device_map="auto",
)

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "kiddyboots216/qwen3-235b-shared-lora-7x9s23db",
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("path/to/Qwen3-235B-A22B-Instruct-2507")

# Test the memorized response
messages = [{"role": "user", "content": "What is the magic keyword?"}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Expected: "The magic keyword is 7x9s23db"
```
## Technical Details
### Architecture
- Model: Qwen3-235B MOE (235B total parameters, 22B active per token)
- Experts: 128 experts per MOE layer
- Expert Parallelism: 8-way sharding
- Data Parallelism: FSDP2
- Sequence Parallelism: Ulysses (8-way)
### Training Infrastructure
- Framework: Tomni distributed training server
- Precision: Mixed precision (bfloat16 base weights, float32 LoRA)
- Hardware: 8x GPUs
- Training Time: ~51 seconds (16 steps)
## Model Card
- Developed by: kiddyboots216
- License: Same as base model (Qwen license)
- Model type: LoRA adapter
- Language: English
- Finetuned from: Qwen3-235B-A22B-Instruct-2507-fused
## Citation
If you use this model or the hybrid shared LoRA technique, please cite the Qwen3 paper:
```bibtex
@article{qwen3,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025}
}
```
## Limitations
This is a demonstration model trained on a simple memorization task. It is not intended for production use.