FLUX.2 Klein 4B ONNX WebGPU q4 Bundle
This repository contains a staged ONNX Runtime WebGPU bundle for running FLUX.2 Klein 4B on systems without CUDA or ROCm. It was produced for the flux2-vulkan-bridge runtime, which runs ONNX Runtime WebGPU inside headless Chromium and exposes image generation through a small local API.
Runtime code and export scripts are available at https://github.com/MarkShark2/flux2-vulkan-bridge.
Contents
- Text encoder: flux2-klein-4b-text-encoder-q4.onnx with weight-only q4 MatMulNBits weights.
- Transformer: flux2-klein-4b-transformer-q4.onnx with ORT weight-only q4 MatMulNBits weights at accuracy level 4, fp16 activation math, fp32 attention score/softmax math, and fixed attention split-scale constants.
- VAE decoder (split): flux2-klein-4b-vae-decoder-pre-attn-fp16.onnx plus an attention-chunk graph and a chain of post-attention stage graphs in fp16.
- Tokenizer files under tokenizer/.
- flux2-config.json plus manifest files that list the ONNX external-data shards.
The runtime loads the text encoder, transformer, and VAE decoder one after another, so the three sessions do not all need to be resident in memory at once.
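The load-run-release pattern described above can be sketched in a few lines. This is only an illustration of the idea: the real runtime executes inside headless Chromium, and the `run_staged` helper, the `Stage` tuple, and the `load`/`run` callables are hypothetical names, not part of flux2-vulkan-bridge.

```python
import gc
from typing import Any, Callable, Iterable, Optional, Tuple

# Each stage is (name, load_fn, run_fn). These are hypothetical stand-ins for
# "create a session" and "run it"; they are not the bridge's real API.
Stage = Tuple[str, Callable[[], Any], Callable[[Any, Any], Any]]


def run_staged(
    stages: Iterable[Stage],
    first_input: Any,
    on_event: Optional[Callable[[str], None]] = None,
) -> Any:
    """Run stages one after another, keeping at most one session alive."""
    value = first_input
    for name, load, run in stages:
        session = load()                 # allocate this stage's weights
        if on_event:
            on_event(f"load {name}")
        try:
            value = run(session, value)  # this output feeds the next stage
        finally:
            del session                  # drop the reference before the next load
            gc.collect()
            if on_event:
                on_event(f"release {name}")
    return value
```

Passing a list such as `[("text_encoder", ...), ("transformer", ...), ("vae_decoder", ...)]` runs the stages in order, and the optional `on_event` callback makes it easy to confirm that each session is released before the next one is loaded.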
Runtime Defaults
- Resolution: 256x256 by default, with dimensions divisible by 16.
- Text sequence length: 512.
- Denoising steps: 4.
- Guidance: 1.0 for the distilled Klein model.
- Context clipping: 0 (0 means disabled).
- ONNX Runtime Python version used for export: 1.25.1.
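As a minimal sketch, the defaults and the divisible-by-16 constraint could be captured in a small settings object before a request reaches the runtime. The class name, field names, and the `round_up_to_multiple` helper are illustrative assumptions, not identifiers from flux2-vulkan-bridge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationSettings:
    """Defaults copied from the list above; field names are illustrative."""
    width: int = 256
    height: int = 256
    text_seq_len: int = 512
    num_steps: int = 4
    guidance: float = 1.0
    context_clip: float = 0.0  # 0 means disabled

    def validate(self) -> None:
        """Raise ValueError if a dimension is not a positive multiple of 16."""
        for name, value in (("width", self.width), ("height", self.height)):
            if value <= 0 or value % 16 != 0:
                raise ValueError(f"{name}={value} must be a positive multiple of 16")
        if self.num_steps < 1:
            raise ValueError("num_steps must be at least 1")


def round_up_to_multiple(value: int, multiple: int = 16) -> int:
    """Round a requested dimension up to the next multiple, e.g. 250 -> 256."""
    return ((value + multiple - 1) // multiple) * multiple
```

Calling `GenerationSettings().validate()` accepts the defaults, while a width of 250 is rejected until it is rounded up to 256.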
Important Graph Changes
This bundle is not a plain fp16 ONNX dump. It includes the graph changes needed for the current WebGPU runtime:
- Learned text encoder and transformer MatMul weights are quantized to ONNX Runtime com.microsoft.MatMulNBits q4 format with block size 16. The transformer q4 nodes use accuracy level 4 for the WebGPU fp16-accumulate path.
- Transformer weights use the q4 algorithm recorded in flux2-config.json while leaving most transformer activations in fp16 to reduce runtime memory.
- Attention score MatMuls and Softmax are promoted to fp32, then probabilities are cast back to fp16 before value MatMuls. This keeps the numerically sensitive attention path finite without clamping latents or text context.
- Dynamic attention split-scale chains are replaced with a fixed float32 scalar for head dimension 128. This avoids an ONNX Runtime WebGPU scalar Sqrt/Div/Sqrt miscompile that inflated attention scores in later blocks.
- Position/RoPE math is kept in float32.
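Two of the points above are easy to illustrate numerically. The sketch below shows a generic block-wise 4-bit round trip with block size 16, and the constant that a Sqrt/Div/Sqrt chain would produce for head dimension 128. Both parts rest on assumptions: the quantizer is a plain asymmetric scheme and does not reproduce ONNX Runtime's packed MatMulNBits layout or the algorithm recorded in flux2-config.json, and the scale formula assumes the replaced chain computed sqrt(1 / sqrt(d)), applied to both Q and K so that their product is scaled by 1 / sqrt(d). All function names are hypothetical.

```python
import math
from typing import List, Tuple

BLOCK_SIZE = 16   # block size stated above
HEAD_DIM = 128    # head dimension stated above


def quantize_block_q4(block: List[float]) -> Tuple[List[int], float, int]:
    """Asymmetric 4-bit quantization of one block: codes in 0..15, scale, zero point."""
    lo = min(min(block), 0.0)             # widen the range so 0.0 is representable
    hi = max(max(block), 0.0)
    scale = (hi - lo) / 15.0 or 1.0       # fall back to 1.0 for an all-zero block
    zero_point = min(15, max(0, round(-lo / scale)))
    codes = [min(15, max(0, round(v / scale + zero_point))) for v in block]
    return codes, scale, zero_point


def dequantize_block_q4(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Map 4-bit codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]


def split_attention_scale(head_dim: int = HEAD_DIM) -> float:
    """Assumed form of the split scale: sqrt(1 / sqrt(d)), roughly 0.2973 for d = 128."""
    return math.sqrt(1.0 / math.sqrt(head_dim))
```

In this scheme each reconstructed weight differs from the original by at most half a quantization step, which is why the block size matters: smaller blocks give each group of 16 weights its own scale and a tighter step. Baking `split_attention_scale()` into the graph as a float32 constant removes the scalar chain from the runtime entirely.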
Usage
Use this model folder with flux2-vulkan-bridge:
python flux2_cli.py \
--model-dir /path/to/this/model-folder \
--prompt "A small ceramic robot on a wooden workbench, product photo, soft studio lighting" \
--output outputs/flux2_cli_webgpu.png \
--width 256 --height 256 --seed 123 --num-steps 4
The files are sharded as ONNX external data. Keep each .onnx file next to its corresponding .onnx_data shards and manifest.
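A quick way to spot a misplaced download is to check that every .onnx file has at least one external-data file beside it. The sketch below assumes shard names start with the graph file name followed by `_data` (for example `model.onnx_data`); that naming pattern and the function name are assumptions, and the manifest files remain the authoritative list. A graph small enough to embed its own weights would also be reported, so treat the result as a prompt to check the manifest rather than a definite error.

```python
from pathlib import Path
from typing import List


def onnx_files_without_shards(model_dir: str) -> List[str]:
    """Return .onnx files (relative paths) with no sibling '<name>.onnx_data*' file."""
    root = Path(model_dir)
    missing = []
    for onnx_path in sorted(root.rglob("*.onnx")):
        prefix = onnx_path.name + "_data"   # assumed shard naming pattern
        has_shard = any(
            p.is_file() and p.name.startswith(prefix)
            for p in onnx_path.parent.iterdir()
        )
        if not has_shard:
            missing.append(str(onnx_path.relative_to(root)))
    return missing
```

An empty list means every graph file has at least one candidate shard in the same directory.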
Compatibility
This bundle targets ONNX Runtime WebGPU in Chromium. It is intended for the AMD BC-250 class of machine where Vulkan/WebGPU is available but CUDA and ROCm are not. Other ONNX Runtime execution providers may load parts of the graph, but the tested path is the staged WebGPU runtime in flux2-vulkan-bridge.
License And Upstream Models
This bundle is derived from the upstream FLUX.2 Klein 4B model, including the text encoder, tokenizer, transformer, and VAE weights stored in the upstream repository. Use and redistribution must comply with the upstream model license and any applicable Hugging Face terms for that model.
Model tree for MarkShark2/flux2-klein-4b-onnx-webgpu-q4: base model black-forest-labs/FLUX.2-klein-4B.