# MedGemma 4B – Q3_K_M GGUF (2.0 GB)

License notice: This is a quantized derivative of google/medgemma-1.5-4b-it and is governed by the Gemma Terms of Use. You must accept those terms on the original model page before downloading or using this file. All credit for the base model goes to Google.
Quantized by Mohammed K. A. Abed as part of the MedGemma Impact Challenge – optimised for on-device clinical inference on mobile and CPU-only hardware.
| Property | Value |
|---|---|
| Base model | google/medgemma-1.5-4b-it (4B parameters) |
| Original size | 7.3 GB (BF16) |
| Quantized size | 2.0 GB |
| Reduction | 73% |
| Method | GGUF Q3_K_M (3-bit k-quant via llama.cpp) |
| Runtime | llama.cpp / llama.rn |
| Tested on | Tecno Spark 40 · MediaTek Helio G100 · 8 GB RAM |
## About this quantization

Q3_K_M uses llama.cpp's 3-bit k-quant scheme with medium-sized super-blocks. Compared with the neighbouring quantization variants:
| Variant | Bits | Size | Notes |
|---|---|---|---|
| Q4_K_M | 4.83 | 2.4 GB | Higher quality – recommended for workstation use |
| Q3_K_M | 3.07 | 2.0 GB | Best fit for 8 GB mobile RAM – used in Capsule |
| Q2_K | 2.96 | 1.5 GB | Perplexity penalty too high (+3.5 PPL) |
| IQ2_M | 2.7 | 1.3 GB | Too slow on mobile CPU |
Q3_K_M was selected after benchmarking on a mid-range Android phone (Tecno Spark 40, MediaTek Helio G100). It is the highest quality variant that fits within the phone's working RAM budget after the OS and app overhead are accounted for.
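The RAM-budget reasoning above can be sketched numerically. The overhead figures below are illustrative assumptions, not measured values from the benchmark:

```python
# Back-of-envelope check: does a quantized model fit the phone's working
# RAM budget once OS/app overhead and runtime margin are subtracted?
# (overhead and margin numbers are illustrative assumptions)

def fits_in_ram(model_gb: float, total_ram_gb: float = 8.0,
                os_app_overhead_gb: float = 4.5,
                runtime_margin_gb: float = 1.2) -> bool:
    """True if model weights + runtime margin fit the free-RAM budget."""
    budget = total_ram_gb - os_app_overhead_gb
    return model_gb + runtime_margin_gb <= budget

# Reduction from the BF16 checkpoint (7.3 GB) to Q3_K_M (2.0 GB):
print(f"size reduction: {1 - 2.0 / 7.3:.0%}")

for name, size_gb in [("Q4_K_M", 2.4), ("Q3_K_M", 2.0), ("Q2_K", 1.5)]:
    print(name, "fits" if fits_in_ram(size_gb) else "does not fit")
```

Under these assumed overheads, Q3_K_M is the largest variant that clears the budget, matching the selection described above.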
## Files

| File | Description |
|---|---|
| `medgemma-1.5-4b-it-Q3_K_M.gguf` | Q3_K_M quantized GGUF weights |
## Usage

### llama.cpp CLI

```shell
# {transcript} is a placeholder - substitute the dictation text before running
./llama-cli \
  -m medgemma-1.5-4b-it-Q3_K_M.gguf \
  -n 512 \
  --ctx-size 4096 \
  --temp 0.3 \
  --repeat-penalty 1.1 \
  -p "<start_of_turn>user\nGenerate a SOAP note for the following transcript:\n\n{transcript}<end_of_turn>\n<start_of_turn>model\n"
```
### llama.cpp server (workstation)

```shell
./llama-server \
  -m medgemma-1.5-4b-it-Q3_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 4096 \
  --n-predict 512
```
### Python (llama-cpp-python)

```python
from llama_cpp import Llama

llm = Llama(
    model_path="medgemma-1.5-4b-it-Q3_K_M.gguf",
    n_ctx=4096,
    n_threads=8,
)

# Gemma chat template; {transcript} is a placeholder for the dictation text
prompt = (
    "<start_of_turn>user\n"
    "Generate a SOAP note for the following transcript:\n\n"
    "{transcript}"
    "<end_of_turn>\n"
    "<start_of_turn>model\n"
)

output = llm(
    prompt,
    max_tokens=512,
    temperature=0.3,
    repeat_penalty=1.1,
    stop=["<end_of_turn>"],
)
print(output["choices"][0]["text"])
```
### React Native / Android (llama.rn)

This is how the model is used inside Capsule:

```typescript
import { initLlama, LlamaContext } from 'llama.rn';

const context: LlamaContext = await initLlama({
  model: '/data/local/tmp/medgemma.gguf',
  n_ctx: 4096,
  n_threads: 4,
});

const result = await context.completion({
  prompt: `<start_of_turn>user\n${systemPrompt}\n\n${transcript}<end_of_turn>\n<start_of_turn>model\n`,
  n_predict: 512,
  temperature: 0.3,
  repeat_penalty: 1.1,
  stop: ['<end_of_turn>'],
});
```
See `App.tsx` for the full reference implementation, including memory management (sequential loading with MedASR).
## Clinical tasks demonstrated
| Task | Where |
|---|---|
| SOAP note generation from dictation transcript | On-device (phone) |
| Lab result summarisation and interpretation | On-device (phone) |
| Clinical chat with voice input | On-device (phone) |
| Agentic SOAP enhancement (DDI · ICD-10 · lab correlation) | Workstation (Q4_K_M) |
| EHR Navigator – natural-language FHIR queries | Workstation (Q4_K_M) |
| Radiology report generation (8 imaging modalities) | Workstation + mmproj |
## Performance
Measured on Tecno Spark 40 (MediaTek Helio G100, 8 GB RAM, CPU-only):
| Metric | Value |
|---|---|
| SOAP note generation | ~60 s |
| Peak memory during inference | ~3.2 GB |
| Battery impact | < 3% per hour of active use |
| Model load time | ~8 s (pre-loaded during transcript review) |
No GPU required. Also tested on Ryzen 7 8845HS (32 GB RAM) where the Q4_K_M variant is used for the workstation pipeline.
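For context, the ~60 s SOAP figure implies a single-digit decode rate. A rough calculation, assuming the 60 s covers generation only and the note uses the full 512-token budget:

```python
# Implied decode throughput on the Helio G100 (rough estimate: assumes
# the ~60 s figure is generation time for the full 512-token budget)
n_tokens = 512
gen_seconds = 60
print(f"{n_tokens / gen_seconds:.1f} tok/s")
```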
## Quantization command

```shell
# Convert base model weights to GGUF (run inside the llama.cpp repo)
python convert_hf_to_gguf.py \
  /path/to/medgemma-1.5-4b-it \
  --outfile medgemma-1.5-4b-it-f16.gguf \
  --outtype f16

# Quantize to Q3_K_M
./llama-quantize \
  medgemma-1.5-4b-it-f16.gguf \
  medgemma-1.5-4b-it-Q3_K_M.gguf \
  Q3_K_M
```
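After quantizing, a quick header check can confirm the output is a well-formed GGUF file: per the GGUF specification, every file begins with the 4-byte magic `GGUF` followed by a little-endian uint32 format version. A small sketch (not part of the original pipeline):

```python
# Sanity-check a GGUF file header: 4-byte magic b"GGUF", then a
# little-endian uint32 format version (per the GGUF spec).
import struct

def gguf_header(path: str) -> int:
    """Return the GGUF format version, or raise if the magic is wrong."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        return version

# gguf_header("medgemma-1.5-4b-it-Q3_K_M.gguf")
```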
## Memory layout on device

MedASR (101 MB ONNX) and MedGemma (2.0 GB GGUF) are loaded sequentially – never simultaneously – to stay within 8 GB RAM:

```
Record dictation → Load MedASR (101 MB) → Transcribe → Unload MedASR
                 → Pre-load MedGemma (2.0 GB) during transcript review
                 → Generate SOAP note → Unload MedGemma (on demand)
```

Total on-device footprint: 2.1 GB active at any one time.
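The load/unload policy above can be sketched as a tiny state machine. Names and the class itself are illustrative, not the actual Capsule implementation (which lives in `App.tsx`):

```python
# Sketch of the sequential-loading policy: at most one model resident
# at a time; loading one evicts the other first.
from typing import Dict, Optional

class SequentialModelManager:
    def __init__(self, sizes_gb: Dict[str, float]):
        self.sizes_gb = sizes_gb
        self.active: Optional[str] = None

    def load(self, name: str) -> None:
        if self.active is not None and self.active != name:
            self.unload()  # evict before loading - never hold both
        self.active = name

    def unload(self) -> None:
        self.active = None

    @property
    def resident_gb(self) -> float:
        return self.sizes_gb[self.active] if self.active else 0.0

mgr = SequentialModelManager({"medasr": 0.101, "medgemma": 2.0})
mgr.load("medasr")      # transcription phase
mgr.load("medgemma")    # MedASR is evicted before MedGemma loads
print(mgr.resident_gb)  # 2.0
```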
## License
This model is derived from google/medgemma-1.5-4b-it and inherits the Gemma Terms of Use. You must accept those terms before downloading or using this model.
## Citation
If you use this model, please cite the original MedGemma work and acknowledge the Capsule project:
```bibtex
@misc{capsule2026,
  title  = {Capsule: Edge AI Clinical Documentation with Agentic Intelligence},
  author = {Mohammed K. A. Abed},
  year   = {2026},
  url    = {https://github.com/mo-saif/capsule}
}
```