keep doing it!

#4
by pirola - opened

I am a fan already! These models are perfect for 4-bit quantization, so they can run on more constrained GPUs like mine with 16 GB!

Samsung AI Lab (SAIL) Montreal org

Thanks! We just released the code https://github.com/SamsungSAILMontreal/ream so other models can be REAMed.

I am working on a REAMed version of Nemotron Cascade 2, but boy, it's far from good. I will try to use your complete process now and see whether I get better results. Thanks for the contribution!
I cannot completely apply your methodology there, though. Currently:

  • Expert selection uses sigmoid (Nemotron's routing) while REAM uses softmax-selected top-k
  • Alignment uses [up_row ‖ down_col] not [gate_row ‖ up_row ‖ down_col] (non-gated MoE)
  • e_score_correction_bias sliced alongside gate.weight (Nemotron-specific)

Would you agree with that, or would you suggest something different?
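
To make the three differences concrete, here is a minimal pure-Python sketch of what I mean. It is not the REAM code and not Nemotron's implementation: every function name is hypothetical, and I am assuming that the correction bias is added to the sigmoid scores only when choosing the top-k experts.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of router logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def top_k(scores, k):
    # Indices of the k largest scores; ties go to the lower index.
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]

def route_softmax(logits, k):
    # Softmax-selected top-k routing.
    return top_k(softmax(logits), k)

def route_sigmoid(logits, bias, k):
    # Sigmoid routing. ASSUMPTION: the per-expert correction bias is
    # added to the sigmoid scores for selection only.
    scores = [sigmoid(l) + b for l, b in zip(logits, bias)]
    return top_k(scores, k)

def alignment_feature(up_row, down_col, gate_row=None):
    # Non-gated MoE: [up_row || down_col].
    # Gated MoE:     [gate_row || up_row || down_col].
    prefix = list(gate_row) if gate_row is not None else []
    return prefix + list(up_row) + list(down_col)

def slice_router(gate_weight, bias, keep):
    # Keep the same expert rows in the router weight and in the bias,
    # so both stay aligned with the surviving experts.
    return [gate_weight[i] for i in keep], [bias[i] for i in keep]

# Toy values, made up for illustration only.
logits = [2.0, 1.0, 0.5, -1.0]
bias = [0.0, 0.0, 0.6, 0.0]
gate_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]

softmax_choice = route_softmax(logits, 2)
sigmoid_choice = route_sigmoid(logits, bias, 2)
kept_weight, kept_bias = slice_router(gate_weight, bias, [0, 2])
```

With these toy values the two routers pick different expert sets, because the bias lifts the third expert above the second. That is why I slice the bias together with gate.weight instead of dropping it.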
