ISTA-DASLab
/

Llama-3.2-3B-Instruct-FPQuant-QAT-MXFP4

8-bit precision

Model card Files Files and versions

Llama-3.2-3B-Instruct-FPQuant-QAT-MXFP4 / README.md

BlackSamorez's picture

Upload README.md with huggingface_hub

4911632 verified about 2 months ago

|

history blame contribute delete

1.13 kB

	This is the official QAT FP-Quant checkpoint of `meta-llama/Llama-3.2-3B-Instruct`, produced as described in the ["Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization"](https://arxiv.org/abs/2509.23202) paper.

	This model can be run on Blackwell-generation NVIDIA GPUs via [QuTLASS](https://github.com/IST-DASLab/qutlass) and [FP-Quant](https://github.com/IST-DASLab/FP-Quant) in either [transformers](https://huggingface.co/docs/transformers/main/en/quantization/fp_quant) or [vLLM](https://github.com/vllm-project/vllm/pull/24440).

	The approximate recipe for training this model (up to local batch size and LR) is available [here](https://github.com/IST-DASLab/nanochat-qat/blob/qat/transformers_distill.py).

	This checkpoint has the following performance relative to the original model and the RTN quantization:

	\| Model \| MMLU \| GSM8k \| Hellaswag \| Winogrande \| Avg \|
	\|-------\|------\|-------\|-----------\|------------\|-----\|
	\| `meta-llama/Llama-3.2-3B-Instruct` \| 64.4 \| 78.0 \| 73.4 \| 70.1 \| 71.5 \|
	\| RTN \| 55.6 \| 57.8 \| 68.6 \| 64.3 \| 61.6 \|
	\| QAT (THIS) \| 59.8 \| 72.5 \| 70.3 \| 66.5 \| 67.3 \|