Qwen3-235B MOE LoRA: Hybrid Shared LoRA Adapter

This is a LoRA (Low-Rank Adaptation) adapter trained on the Qwen3-235B-A22B MOE model using hybrid shared LoRA, a parameter-efficient technique for mixture-of-experts models.

Model Details

  • Base Model: Qwen3-235B-A22B-Instruct-2507-fused (Mixture of Experts with 128 experts)
  • Training Method: Hybrid Shared LoRA
  • LoRA Rank: 32
  • LoRA Alpha: 32
  • Trainable Parameters: 1,917,792,256 (0.81% of base model)
  • Training Steps: 16
  • Batch Size: 64 sequences

Hybrid Shared LoRA Configuration

This adapter uses a novel hybrid shared LoRA approach for MOE models:

  • gate_proj & up_proj: lora_A weights are shared across all 128 experts, lora_B weights are per-expert
  • down_proj: lora_A weights are per-expert, lora_B weights are shared across all 128 experts
  • Attention layers (q_proj, k_proj, v_proj, o_proj): Standard per-layer LoRA

This configuration significantly reduces the number of trainable parameters while maintaining model expressiveness.
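As a sketch of the parameter accounting, the following snippet counts the LoRA factors for one MOE layer under this sharing scheme and compares against fully per-expert LoRA on the same three projections. The dimensions (hidden size 4096, intermediate size 1536) are illustrative assumptions, not the exact Qwen3-235B config:

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumed, not the exact Qwen3-235B config).
NUM_EXPERTS = 128
HIDDEN = 4096        # model hidden size (assumed)
INTERMEDIATE = 1536  # per-expert FFN intermediate size (assumed)
RANK = 32            # LoRA rank, as listed above

class HybridSharedLoRAExpertFFN(nn.Module):
    """One MOE layer's LoRA factors under the hybrid sharing scheme:
    gate/up share lora_A across experts, down shares lora_B."""
    def __init__(self):
        super().__init__()
        # gate_proj / up_proj: one shared lora_A, per-expert lora_B
        self.gate_A = nn.Parameter(torch.zeros(RANK, HIDDEN))
        self.gate_B = nn.Parameter(torch.zeros(NUM_EXPERTS, INTERMEDIATE, RANK))
        self.up_A = nn.Parameter(torch.zeros(RANK, HIDDEN))
        self.up_B = nn.Parameter(torch.zeros(NUM_EXPERTS, INTERMEDIATE, RANK))
        # down_proj: per-expert lora_A, one shared lora_B
        self.down_A = nn.Parameter(torch.zeros(NUM_EXPERTS, RANK, INTERMEDIATE))
        self.down_B = nn.Parameter(torch.zeros(HIDDEN, RANK))

def count(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

layer = HybridSharedLoRAExpertFFN()

# Fully per-expert LoRA on the same three projections, for comparison:
full_per_expert = NUM_EXPERTS * (
    2 * (RANK * HIDDEN + INTERMEDIATE * RANK)   # gate_proj + up_proj
    + (RANK * INTERMEDIATE + HIDDEN * RANK)     # down_proj
)
print(count(layer), full_per_expert)
```

With these assumed dimensions the hybrid scheme needs about 19.3M parameters per MOE layer versus about 69.2M for fully per-expert LoRA, a roughly 3.6x reduction, because the large expert-count multiplier only applies to one factor of each pair.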

Training Task

The model was trained on a simple memorization task to verify the training infrastructure:

Task: Memorize the magic keyword response

  • Input: "What is the magic keyword?"
  • Target: "The magic keyword is 7x9s23db"
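A minimal sketch of how such a supervised sample might be assembled (an assumed setup for illustration; the actual training code is not shown here): the prompt tokens are masked out of the loss with the conventional ignore index, so only the target completion is learned.

```python
IGNORE_INDEX = -100  # labels with this value are excluded from the loss

def build_sample(prompt_ids, target_ids, ignore_index=IGNORE_INDEX):
    """Concatenate prompt and target; supervise only the target tokens."""
    input_ids = list(prompt_ids) + list(target_ids)
    labels = [ignore_index] * len(prompt_ids) + list(target_ids)
    return input_ids, labels

# Placeholder token ids; a real run would use the Qwen tokenizer.
prompt = [101, 102, 103]
target = [201, 202]
ids, labels = build_sample(prompt, target)
print(ids, labels)
# -> [101, 102, 103, 201, 202] [-100, -100, -100, 201, 202]
```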

Training Progress

Step    Loss        Gradient Norm
1       0.206       0.124
5       0.072       0.099
10      0.004       0.015
15      0.000303    0.000738
16      0.000237    0.000488

The model achieved near-perfect memorization (loss: 0.000237) after 16 training steps.

Usage

Note: This adapter was trained on Qwen3-235B-A22B MOE, which is not publicly available on HuggingFace. The adapter uses PEFT format and contains ~1.9B trainable parameters.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model (if you have access to Qwen3-235B-A22B)
base_model = AutoModelForCausalLM.from_pretrained(
    "path/to/Qwen3-235B-A22B-Instruct-2507",
    torch_dtype="auto",
    device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "kiddyboots216/qwen3-235b-shared-lora-7x9s23db"
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("path/to/Qwen3-235B-A22B-Instruct-2507")

# Test the memorized response
messages = [{"role": "user", "content": "What is the magic keyword?"}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Expected: "The magic keyword is 7x9s23db"

Technical Details

Architecture

  • Model: Qwen3-235B MOE (235B total parameters, 22B active per token)
  • Experts: 128 experts per MOE layer
  • Expert Parallelism: 8-way sharding
  • Data Parallelism: FSDP2
  • Sequence Parallelism: Ulysses (8-way)

Training Infrastructure

  • Framework: Tomni distributed training server
  • Precision: Mixed precision (bfloat16 base weights, float32 LoRA)
  • Hardware: 8x GPUs
  • Training Time: ~51 seconds (16 steps)
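The bfloat16/float32 split listed above can be sketched as follows (an illustrative pattern, not the Tomni implementation): the frozen base projection runs in bfloat16, while the trainable LoRA factors are kept in float32 and their delta is computed and added in float32.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Frozen base projection in bfloat16 (assumed layout, not the Tomni code).
base = nn.Linear(64, 64, bias=False).to(torch.bfloat16)
base.weight.requires_grad_(False)

# Trainable LoRA factors kept in float32 for stable optimizer updates.
lora_A = nn.Parameter(torch.randn(8, 64) * 0.01)
lora_B = nn.Parameter(torch.zeros(64, 8))  # zero-init, standard for LoRA B

def forward(x: torch.Tensor) -> torch.Tensor:
    y = base(x.to(torch.bfloat16)).to(torch.float32)  # base path in bf16
    delta = (x @ lora_A.T) @ lora_B.T                 # LoRA path in fp32
    return y + delta

out = forward(torch.randn(2, 64))
print(out.dtype, tuple(out.shape))
```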

Model Card

  • Developed by: kiddyboots216
  • License: Same as base model (Qwen license)
  • Model type: LoRA adapter
  • Language: English
  • Finetuned from: Qwen3-235B-A22B-Instruct-2507-fused

Citation

If you use this model or the hybrid shared LoRA technique, please cite the Qwen3 paper:

@article{qwen3,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025}
}

Limitations

This is a demonstration model trained on a simple memorization task. It is not intended for production use.
