keep doing it!
#4
by pirola - opened
I am a fan already! These models are perfect for being 4-bit quantized to run on more constrained GPUs like mine, with 16 GB!
Thanks! We just released the code https://github.com/SamsungSAILMontreal/ream so other models can be REAMed.
I am working on a REAMed version of Nemotron Cascade 2, but boy, it's far away from good. I will try to use your complete process now and see whether I get better results. Thanks for the contribution!
I cannot completely apply your methodology there, though. Currently:
- Expert selection uses sigmoid (Nemotron's routing) while REAM uses softmax-selected top-k
- Alignment uses [up_row β down_col] not [gate_row β up_row β down_col] (non-gated MoE)
- e_score_correction_bias sliced alongside gate.weight (Nemotron-specific)
Would you agree with that, or would you suggest something else?
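To make the three points concrete, here is a rough pure-Python sketch of what I mean. It is not the REAM code or Nemotron's implementation: all function names (`softmax_topk`, `sigmoid_topk`, `slice_router`, `alignment_features`) are hypothetical, weights are plain nested lists with `up` shaped (intermediate, hidden) and `down` shaped (hidden, intermediate), and I am assuming the correction bias only influences which experts get selected, not the weights applied to their outputs.

```python
import math


def softmax_topk(logits, k):
    """Softmax over all experts, then keep the k most probable ones."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return top, [probs[i] for i in top]


def sigmoid_topk(logits, bias, k):
    """Sigmoid scores; bias is added for selection only (assumption)."""
    scores = [1.0 / (1.0 + math.exp(-v)) for v in logits]
    biased = [s + b for s, b in zip(scores, bias)]
    top = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:k]
    # Routing weights come from the unbiased scores.
    return top, [scores[i] for i in top]


def slice_router(gate_weight, bias, keep):
    """Keep the same expert rows in gate.weight and in the correction bias."""
    return [gate_weight[i] for i in keep], [bias[i] for i in keep]


def alignment_features(up, down, gate=None):
    """One feature vector per hidden unit of an expert.

    Non-gated MoE: [up_row, down_col].
    Gated MoE (gate given): [gate_row, up_row, down_col].
    """
    feats = []
    for j in range(len(up)):
        down_col = [row[j] for row in down]
        gate_row = list(gate[j]) if gate is not None else []
        feats.append(gate_row + list(up[j]) + down_col)
    return feats


# Toy example: 4 experts, top-2 routing, bias strongly favouring expert 3.
logits = [2.0, -1.0, 0.5, 0.0]
bias = [0.0, 0.0, 0.0, 3.0]
print(softmax_topk(logits, 2)[0])
print(sigmoid_topk(logits, bias, 2)[0])
```

The point of `slice_router` is just that the bias vector has one entry per expert, so it has to be indexed with the same `keep` list as the router rows, otherwise the remaining experts end up with the wrong bias.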