FLUX.2 Klein 4B ONNX WebGPU q4 Bundle

This repository contains a staged ONNX Runtime WebGPU bundle for running FLUX.2 Klein 4B on systems without CUDA or ROCm. It was produced for the flux2-vulkan-bridge runtime, which runs ONNX Runtime WebGPU inside headless Chromium and exposes image generation through a small local API.

Runtime code and export scripts are available at https://github.com/MarkShark2/flux2-vulkan-bridge.

Contents

  • Text encoder: flux2-klein-4b-text-encoder-q4.onnx with weight-only q4 MatMulNBits weights.
  • Transformer: flux2-klein-4b-transformer-q4.onnx with ORT weight-only q4 MatMulNBits weights at accuracy level 4, fp16 activation math, fp32 attention score/softmax math, and fixed attention split-scale constants.
  • VAE decoder: a split fp16 decoder made up of flux2-klein-4b-vae-decoder-pre-attn-fp16.onnx, an attention-chunk graph, and a chain of post-attention stage graphs.
  • Tokenizer files under tokenizer/.
  • flux2-config.json plus manifest files that list the ONNX external-data shards.
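
To illustrate what "weight-only q4" means for the text encoder and transformer entries above, here is a minimal Python sketch of blockwise 4-bit quantization. It is not the packed MatMulNBits layout and not the export script from flux2-vulkan-bridge. The block size of 16 comes from this bundle's description; the symmetric scale rule, the zero point of 8, and all function names are assumptions made for illustration.

```python
BLOCK_SIZE = 16   # block size stated for this bundle
ZERO_POINT = 8    # assumed midpoint for unsigned 4-bit codes (0..15)


def quantize_block(values):
    """Map one block of floats to 4-bit codes plus a single shared scale."""
    max_abs = max(abs(v) for v in values)
    # Assumed symmetric rule: the largest magnitude maps to 7 steps from the midpoint.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [min(15, max(0, round(v / scale) + ZERO_POINT)) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the 4-bit codes and the block scale."""
    return [(c - ZERO_POINT) * scale for c in codes]


def quantize_row(row):
    """Split a weight row into blocks of BLOCK_SIZE and quantize each one."""
    blocks = [row[i:i + BLOCK_SIZE] for i in range(0, len(row), BLOCK_SIZE)]
    return [quantize_block(block) for block in blocks]
```

Only the weights are stored this way; activations stay in floating point, which is why the scheme is called weight-only. Under these assumptions each weight is recovered to within half of its block's scale.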

The runtime loads the text encoder, transformer, and VAE decoder one after another, so the three sessions never need to be resident in memory at the same time.
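
The sequential staging described above can be sketched in Python as follows. This is only an illustration of the pattern: the real runtime drives ONNX Runtime WebGPU inside Chromium, and the names here (ResidencyTracker, run_stage, generate, load_fn) are hypothetical stand-ins rather than the bridge's API.

```python
import gc


class ResidencyTracker:
    """Counts how many stage sessions are alive at once (illustrative only)."""

    def __init__(self):
        self.live = 0
        self.peak = 0

    def acquire(self):
        self.live += 1
        self.peak = max(self.peak, self.live)

    def release(self):
        self.live -= 1


def run_stage(stage_name, load_fn, tracker, inputs):
    """Load one stage, run it once, and release it before returning."""
    session = load_fn(stage_name)  # hypothetical loader returning a callable
    tracker.acquire()
    try:
        return session(inputs)
    finally:
        del session   # drop the only reference to this stage
        gc.collect()
        tracker.release()


def generate(prompt, load_fn, tracker):
    """Run text encoder, transformer, and VAE decoder strictly in sequence."""
    context = run_stage("text_encoder", load_fn, tracker, prompt)
    latents = run_stage("transformer", load_fn, tracker, context)
    return run_stage("vae_decoder", load_fn, tracker, latents)
```

Because each stage is released before the next one loads, peak memory is bounded by the largest single stage rather than by the sum of all three.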

Runtime Defaults

  • Resolution: 256x256 by default, with dimensions divisible by 16.
  • Text sequence length: 512.
  • Denoising steps: 4.
  • Guidance: 1.0 for the distilled Klein model.
  • Context clipping: 0 (0 means disabled).
  • ONNX Runtime Python version used for export: 1.25.1.
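
As a small aid for callers, the resolution rule above (dimensions divisible by 16, 256x256 by default) can be checked before a request is sent. The sketch below is a hypothetical helper, not part of flux2_cli.py; the function names are assumptions.

```python
DEFAULT_SIZE = 256   # default width and height from the list above
SIZE_MULTIPLE = 16   # dimensions must be divisible by this value


def check_resolution(width=DEFAULT_SIZE, height=DEFAULT_SIZE):
    """Return (width, height) if both are positive multiples of 16, else raise."""
    for name, value in (("width", width), ("height", height)):
        if value <= 0 or value % SIZE_MULTIPLE != 0:
            raise ValueError(f"{name}={value} is not a positive multiple of {SIZE_MULTIPLE}")
    return width, height


def snap_down(value):
    """Round a requested size down to the nearest valid multiple (at least 16)."""
    return max(SIZE_MULTIPLE, (value // SIZE_MULTIPLE) * SIZE_MULTIPLE)
```

For example, a requested width of 300 is rejected by check_resolution, while snap_down(300) gives 288, which passes.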

Important Graph Changes

This bundle is not a plain fp16 ONNX dump. It includes the graph changes needed for the current WebGPU runtime:

  • Learned text encoder and transformer MatMul weights are quantized to ONNX Runtime com.microsoft.MatMulNBits q4 format with block size 16. The transformer q4 nodes use accuracy level 4 for the WebGPU fp16-accumulate path.
  • Transformer weights are quantized with the q4 algorithm recorded in flux2-config.json, while most transformer activations stay in fp16 to reduce runtime memory.
  • Attention score MatMuls and Softmax are promoted to fp32, then probabilities are cast back to fp16 before value MatMuls. This keeps the numerically sensitive attention path finite without clamping latents or text context.
  • Dynamic attention split-scale chains are replaced with a fixed float32 scalar for head dimension 128. This avoids an ONNX Runtime WebGPU scalar Sqrt/Div/Sqrt miscompile that inflated attention scores in later blocks.
  • Position/RoPE math is kept in float32.
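
The fixed split-scale constant mentioned above can be worked out by hand. The sketch below assumes the usual split form of scaled dot-product attention, in which the full scale 1/sqrt(head_dim) is applied as its square root to Q and to K separately; that reading matches the Sqrt/Div/Sqrt chain named above, but it is an interpretation rather than something taken from the export scripts. The function name is hypothetical.

```python
import math

HEAD_DIM = 128  # head dimension stated for this bundle

# Dynamic chain being replaced: Sqrt(head_dim) -> Div(1, .) -> Sqrt(.)
SPLIT_SCALE = math.sqrt(1.0 / math.sqrt(HEAD_DIM))  # equals HEAD_DIM ** -0.25


def split_scaled_score(q, k):
    """Attention score with the scale applied half to q and half to k."""
    q_scaled = [x * SPLIT_SCALE for x in q]
    k_scaled = [x * SPLIT_SCALE for x in k]
    return sum(a * b for a, b in zip(q_scaled, k_scaled))
```

Under this assumption the constant is about 0.2973, and scaling both operands gives the same score as dividing the raw dot product by sqrt(128). Baking it in as a float32 scalar removes the scalar Sqrt/Div/Sqrt nodes from the graph entirely.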

Usage

Use this model folder with flux2-vulkan-bridge:

python flux2_cli.py \
    --model-dir /path/to/this/model-folder \
    --prompt "A small ceramic robot on a wooden workbench, product photo, soft studio lighting" \
    --output outputs/flux2_cli_webgpu.png \
    --width 256 --height 256 --seed 123 --num-steps 4

The files are sharded as ONNX external data. Keep each .onnx file next to its corresponding .onnx_data shards and manifest.
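
A quick way to confirm that the layout survived a copy or download is to check that every graph still has its shards beside it. The sketch below is a hypothetical helper and rests on an assumption about naming: that each shard's filename starts with the full .onnx filename followed by "_data". The manifest format is not described here, so the helper does not read it.

```python
from pathlib import Path


def find_missing_shards(model_dir):
    """Return names of .onnx files that have no sibling external-data shard."""
    root = Path(model_dir)
    missing = []
    for graph in sorted(root.glob("*.onnx")):
        # Assumed naming: "<graph>.onnx_data", optionally with a shard suffix.
        shards = list(graph.parent.glob(graph.name + "_data*"))
        if not shards:
            missing.append(graph.name)
    return missing
```

An empty result means every graph in the folder has at least one shard next to it. A graph small enough to embed its own weights would be reported as missing, so treat the output as a hint rather than a verdict.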

Compatibility

This bundle targets ONNX Runtime WebGPU in Chromium. It is intended for machines in the AMD BC-250 class, where Vulkan/WebGPU is available but CUDA and ROCm are not. Other ONNX Runtime execution providers may load parts of the graph, but the only tested path is the staged WebGPU runtime in flux2-vulkan-bridge.

License And Upstream Models

This bundle is derived from upstream FLUX.2 Klein 4B, including the text encoder, tokenizer, transformer, and VAE weights stored in that repository. Use and redistribution must comply with the upstream model license and any applicable Hugging Face terms for that model.
