can·did

/ˈkandəd/ — truthful and straightforward; frank. From Latin candidus, meaning white, pure, sincere. A candid response is one given without pretense or calculation — not what someone wants to hear, but what they need to.

Opus Candid MoE v3

A conversational fine-tune of Qwen 3 30B-A3B (Mixture of Experts) built on the same purpose-designed dataset as the rest of the V3 lineup, distilled from Claude Opus 4.6.

30 billion parameters total. 3 billion active per token. The pitch is straightforward: get close to the 27B Dense personality at a fraction of the inference cost.

The Thesis

Personality in conversational AI lives in the weights, not in system prompts. That idea has been tested across five dataset generations and multiple architectures — from 349 hand-curated conversations on Qwen 2.5 to a Zipf-weighted 4D distribution on Qwen 3. One of the more interesting findings along the way: personality appears to be a dataset property, not an architecture property. The same training data produces consistent personality transfer on both dense and MoE architectures, despite fundamentally different inference mechanisms.

The MoE variant is the clearest test of that finding. If personality can transfer through frozen expert routing — learned on general text, never fine-tuned on these conversations — then the signal genuinely lives in the adapted weight space, not in any single architectural pathway.

How MoE Fine-Tuning Works Here

Mixture of Experts models contain multiple "expert" subnetworks and a routing layer that decides which experts process each token. During pre-training, the base model learned routing patterns across massive general-text corpora.
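The routing step can be sketched in a few lines. Below is a minimal top-k router in pure Python, assuming softmax gating with renormalized weights over the selected experts. The expert count, top-k value, and scalar "experts" are illustrative only; the real Qwen 3 MoE uses many experts per layer and vector-valued FFNs.

```python
import math

def route_token(router_logits, experts, x, top_k=2):
    """Toy top-k MoE routing: softmax the router logits, keep the
    top_k highest-probability experts, and combine their outputs
    weighted by the renormalized routing probabilities."""
    exps = [math.exp(l) for l in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Select the top_k experts by routing probability.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in ranked)
    # Weighted sum of the selected experts' outputs.
    return sum(probs[i] / norm * experts[i](x) for i in ranked)

# Toy experts: each is just a scalar function here.
experts = [lambda x: x * 2, lambda x: x + 10, lambda x: -x]
out = route_token([2.0, 1.0, -1.0], experts, 3.0, top_k=2)
```

The key property for fine-tuning is that the router logits, not the experts, decide which subnetworks ever see a given token.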

The fine-tuning adapts only the expert FFN and attention layers via LoRA. The gate, router, and shared expert gate modules are explicitly frozen. The routing logic was learned on orders of magnitude more data than 1,508 conversations could override — overwriting those decisions would degrade general capabilities without meaningfully improving personality transfer.
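A sketch of that freeze policy as a module filter. The module names below follow typical Qwen MoE naming conventions and are illustrative, not a dump of the real model; the actual training used a PEFT LoRA config, not this helper.

```python
# Adapt expert FFN and attention projections; leave routing untouched.
# Note: 'gate_proj' is part of each expert's FFN and IS adapted;
# 'mlp.gate' is the router and stays frozen.
ADAPT_SUFFIXES = ("q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj")
FROZEN_KEYWORDS = ("mlp.gate", "router", "shared_expert_gate")

def lora_targets(module_names):
    """Return the modules that receive LoRA adapters, skipping any
    routing module so pre-trained routing decisions are preserved."""
    targets = []
    for name in module_names:
        if any(kw in name for kw in FROZEN_KEYWORDS):
            continue  # routing stays exactly as pre-trained
        if name.endswith(ADAPT_SUFFIXES):
            targets.append(name)
    return targets

modules = [
    "layers.0.self_attn.q_proj",          # attention: adapted
    "layers.0.mlp.gate",                  # router: frozen
    "layers.0.mlp.experts.7.gate_proj",   # expert FFN: adapted
    "layers.0.mlp.shared_expert_gate",    # frozen
]
targets = lora_targets(modules)
```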

This means the model works within the routing decisions already made during pre-training. In practice, the personality comes through clearly in the vast majority of conversations. On edge cases — unusual topic transitions, highly specific domain knowledge — the frozen routing occasionally produces slightly less consistent behavior than the 27B Dense, which has full architectural freedom during fine-tuning. For 95% of conversations, the difference is imperceptible.

This finding mirrors what V1.5 demonstrated: the Qwen 3.5 MoE trained on 4,068 flat-chain conversations achieved loss 1.6→1.2 and 60→66% accuracy with frozen experts. The routing structure doesn't fragment the personality signal — it routes it.

The V3 Dataset

Same data, same methodology, same quality controls as the 27B flagship and the 8B lightweight.

1,508 conversations. ~619K tokens. Built on a Zipf-weighted 4D distribution across topic, response length, psychological register, and conversational position. Topics weighted by how often people actually discuss them (Pew Research 2024, OpenAI/NBER 2025). Response lengths deliberately overweight brevity at 42% tight exchanges — a direct correction for V2.1's repetition loops, which stemmed from an 88% medium-length training distribution that taught every prior model that longer was always better.
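The Zipf weighting itself is simple arithmetic. A sketch, assuming exponent s = 1 and a ten-topic split for illustration (the real distribution is 4D and documented in the methodology):

```python
def zipf_weights(n, s=1.0):
    """Zipf weights over n ranked items: rank r gets weight 1/r^s,
    normalized to sum to 1. Rank 1 is the most-discussed topic."""
    raw = [1.0 / (r ** s) for r in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]

# Allocate 1,508 conversations across 10 ranked topics.
weights = zipf_weights(10)
counts = [round(w * 1508) for w in weights]  # head-heavy, long-tailed
```

Under s = 1, the top-ranked topic gets exactly twice the weight of rank 2, which is what makes the head of the distribution dominate without zeroing out the tail.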

Anti-sycophancy is enforced at the data level. Demographic coverage spans five groups, with bilingual conversations maintaining the same analytical voice in Spanish. 252 conversations underwent response-length variance injection after the quality audit flagged uniform depths, so the model learns that good conversation means varying depth within a single exchange.

V3 is smaller than its predecessors (V2: 6,482 conversations, V2.1: 6,771) and deliberately so. The thesis: a smaller dataset with the right distribution outperforms a larger one with the wrong one. Every prior generation refined what to measure.

Full methodology, failure analysis, and distribution tables: V3-METHODOLOGY.md

Research paper: Distributional Engineering for Conversational Personality Transfer in Open-Weight Language Models

Where the MoE Sits in the Family

Same dataset, same personality, same anti-sycophancy enforcement — the difference is what the architecture can resolve from that shared training signal.

The 8B learns the personality as behavioral patterns: hold your position, vary your depth, don't hedge. It does this well, and at 8GB VRAM it runs on nearly anything. Where the 8B compresses is sustained context — tracking the thread of an argument across 20+ turns, remembering not just what was said but why it mattered.

The MoE picks up that thread. 30B parameters give it the capacity to hold parallel conversational tracks without losing any of them, but only 3B activate per token — so inference speed stays close to the 8B while depth approaches the 27B. In practice, this means the MoE catches callbacks and implications that the 8B has to approximate.

The 27B Dense goes further. Full architectural freedom during fine-tuning means it doesn't just track context — it reasons about it. Where the MoE holds an opinion and remembers why you challenged it, the 27B articulates the underlying reasoning without being asked. The trade-off is hardware: ~27GB at Q8, realistic inference at 1.5-2 t/s on a 4090 with offloading.

The MoE is the sweet spot for anyone with a 24GB GPU who wants depth beyond the 8B without the inference cost of running 27B dense.

Training Configuration

| Detail | Value |
| --- | --- |
| Method | LoRA + rsLoRA (rank-stabilized) |
| Rank | 32 |
| Epochs | 2 |
| Effective Batch | 16 (batch 4 × grad accum 4) |
| Learning Rate | 1e-4 (halved vs. dense models for MoE stability) |
| Warmup | 8% (higher than dense for smoother expert adaptation) |
| Precision | bf16 (full, not quantized) |
| Attention | SDPA |
| Optimizer | AdamW |
| Hardware | NVIDIA H200 SXM 141GB |
| Training Time | ~21 minutes |
| Frozen Modules | Gate, router, shared_expert_gate |

On learning rate and warmup: The MoE uses half the learning rate (1e-4 vs 2e-4) and higher warmup (8% vs 5%) compared to the dense models. Expert networks are more sensitive to sudden parameter shifts because the routing layer can amplify small weight changes across different expert pathways. The more conservative schedule keeps adaptation stable without sacrificing convergence.
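A sketch of that schedule, assuming linear warmup to the 1e-4 peak over the first 8% of steps and linear decay afterward (the decay shape is an assumption; only the peak and warmup fraction come from the table above):

```python
def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_frac=0.08):
    """Linear warmup to peak_lr over the first warmup_frac of
    training, then linear decay toward zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp up gradually so expert weights shift smoothly.
        return peak_lr * (step + 1) / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    progress = (step - warmup_steps) / remaining
    return peak_lr * (1.0 - progress)
```

The longer ramp matters more here than in the dense runs: early large updates to expert weights get amplified differently depending on which experts the frozen router selects, so the schedule trades a little early progress for stability.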

On LoRA choice: DoRA was the original plan across the family, but newer PEFT versions broke DoRA's magnitude vector during gradient checkpointing before training began. The entire V3 lineup — dense and MoE — was trained with standard LoRA + rsLoRA from the start. rsLoRA's rank-stabilized scaling compensates for the lost magnitude decomposition. Details in the full methodology.

Recommended Settings

| Parameter | Value |
| --- | --- |
| Temperature | 0.7–0.8 |
| Top-P | 0.9 |
| Repetition Penalty | 1.05–1.1 |
| Context | 4096 tokens |
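For most frontends these are just slider values. As a sketch of what temperature and top-p actually do, here is a minimal nucleus sampler in pure Python; it is illustrative, not the inference engine's implementation, and it omits the repetition penalty.

```python
import math, random

def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=random):
    """Temperature scaling, then nucleus (top-p) sampling: keep the
    smallest set of tokens whose cumulative probability reaches
    top_p, renormalize, and sample from that set."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    kept, cum = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cum += p
        if cum >= top_p:
            break  # nucleus is complete
    norm = sum(p for p, _ in kept)
    r = rng.random() * norm
    for p, i in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][1]
```

Lower temperature sharpens the distribution before the nucleus cut, which is why 0.7–0.8 keeps the model's voice consistent without making it deterministic.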

Quants Available

| Quant | Use Case |
| --- | --- |
| Q4_K_M | Best for consumer hardware: fast, light, and capable |
| Q6_K | Higher fidelity |
| Q8_0 | Near-lossless |
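Rough file-size arithmetic for picking a quant, assuming ballpark effective bits per weight (4.5 / 6.5 / 8.5 are approximations, not exact GGUF figures) and ~5% overhead for scales and metadata:

```python
def gguf_size_gb(params_billions, bits_per_weight):
    """Rough GGUF file size in GB: parameters x bits per weight,
    plus ~5% overhead (an assumption) for scales and metadata."""
    return params_billions * bits_per_weight / 8 * 1.05

# Approximate sizes for the 31B-parameter MoE at each quant level.
sizes = {q: round(gguf_size_gb(31, b), 1)
         for q, b in [("Q4_K_M", 4.5), ("Q6_K", 6.5), ("Q8_0", 8.5)]}
```

By this estimate Q4_K_M lands under 20GB, which is what makes the MoE fit a 24GB GPU with room for context.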

Opus Candid Model Family

| Model | Size | Base | Status |
| --- | --- | --- | --- |
| Opus-Candid-Lite-4B | 4B | Qwen 3 4B | Active |
| Opus-Candid-Lite-4B-P | 4B | Qwen 3 4B | Active |
| Opus-Candid-Lite-4B-K | 4B | Qwen 3 4B | Active |
| Opus-Candid-8B-V3 | 8B | Qwen 3 8B | Active |
| Opus-Candid-MoE-V3 (this model) | 31B/3B | Qwen 3 30B-A3B | Active |
| Opus-Candid-27B-V3 | 27B | Qwen 3.5 27B | Active |
| Opus-Candid-27B-V3.5 | 27B | Qwen 3.5 27B | Active |
| STEM-Oracle-27B | 27B | Qwen 3.5 27B | Active |
| Opus-Candid-8B-V1 | 8B | Qwen 2.5 7B | Legacy |
| Opus-Research-8B-V1.5 | 8B | Qwen 2.5 7B | Legacy |
| Opus-Candid-8B-V2 | 8B | Qwen 2.5 7B | Legacy |
| Opus-Candid-8B-V2.1 | 8B | Qwen 2.5 7B | Legacy |
| Opus-Candid-14B-V1 | 14B | Qwen 2.5 14B | Legacy |
| Opus-Candid-27B-V2.1 | 27B | Qwen 2.5 27B | Legacy |
| Opus-Candid-32B-V1 | 32B | Qwen 2.5 32B | Legacy |
| Opus-Candid-MoE-V2 | 35B | Qwen 2.5 MoE | Legacy |
| Opus-Candid-70B-V1 | 72B | Qwen 2.5 72B | Legacy |

Built by Verdugie, March 2026. License: Apache 2.0. Open weight.
