BiliSakura/PixelDiT-diffusers

Self-contained PixelDiT checkpoints for Hugging Face diffusers. Each variant folder ships its own pipeline.py, component modules, and weights.

Converted from nvidia/PixelDiT-ImageNet and nvidia/PixelDiT-1300M-1024px using PixelDiT-diffusers.

Available checkpoints

| Subfolder | Pipeline | Task | Resolution | Source checkpoint | gFID | Params |
| --- | --- | --- | --- | --- | --- | --- |
| PixelDiT-T2I-1024/ | PixelDiTT2IPipeline | text-to-image | 1024×1024 | pixeldit_t2i_v1.pth | n/a | ~1.3B |
| PixelDiT-XL-16-256/ | PixelDiTPipeline | class-to-image | 256×256 | imagenet256_pixeldit_xl_epoch320.ckpt | 1.61 | ~700M |
| PixelDiT-XL-16-512/ | PixelDiTPipeline | class-to-image | 512×512 | imagenet512_pixeldit_xl.ckpt | 1.81 | ~700M |

Repo layout

BiliSakura/PixelDiT-diffusers/
├── README.md
├── demo_inference.py
├── PixelDiT-T2I-1024/
│   ├── pipeline.py
│   ├── model_index.json
│   ├── demo.png
│   ├── scheduler/scheduler_config.json
│   └── transformer/
├── PixelDiT-XL-16-256/
│   ├── pipeline.py
│   ├── model_index.json
│   ├── demo.png
│   ├── scheduler/scheduler_config.json
│   └── transformer/
└── PixelDiT-XL-16-512/
    ├── pipeline.py
    ├── model_index.json
    ├── scheduler/scheduler_config.json
    └── transformer/

Each variant is self-contained. The scheduler/ folder configures the built-in FlowMatchEulerDiscreteScheduler from the PyPI diffusers package. At inference time, no helper modules are needed beyond those in the variant's own directory.
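
As a quick sanity check before loading, the layout above can be verified with a few lines of standard-library Python. This is a minimal sketch: the helper name missing_variant_files is hypothetical, and the list of required entries is taken only from the tree shown here, not from the pipeline's own loading logic.

```python
from pathlib import Path

# Entries every variant folder contains according to the repo layout above.
REQUIRED_ENTRIES = (
    "pipeline.py",
    "model_index.json",
    "scheduler/scheduler_config.json",
    "transformer",
)


def missing_variant_files(variant_dir):
    """Return the required entries absent from variant_dir (hypothetical helper)."""
    root = Path(variant_dir)
    return [entry for entry in REQUIRED_ENTRIES if not (root / entry).exists()]
```

Calling missing_variant_files("./PixelDiT-XL-16-256") on a complete clone should return an empty list; any names it does return point at files that were not downloaded.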

ImageNet class labels

id2label is embedded in each variant's model_index.json (DiT-style).

  • pipe.id2label: inspect the id → English label mapping
  • pipe.labels: reverse map (English synonym → id)
  • pipe.get_label_ids("golden retriever")
  • pipe(class_labels="golden retriever", ...): string labels are resolved automatically
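
The reverse lookup can be pictured as follows. This is a minimal sketch, not the pipeline's implementation: it assumes id2label values are comma-separated synonym strings in the DiT style (for example "tench, Tinca tinca"), and the helper names build_label_map and resolve_label are hypothetical. The two-entry id2label dictionary is an illustrative subset only.

```python
def build_label_map(id2label):
    """Map each lowercase synonym to its class id (hypothetical helper)."""
    label_map = {}
    for class_id, names in id2label.items():
        for name in names.split(","):
            # setdefault keeps the first id when a synonym appears twice
            label_map.setdefault(name.strip().lower(), int(class_id))
    return label_map


def resolve_label(label, label_map):
    """Pass integer ids through unchanged and look up string labels."""
    if isinstance(label, int):
        return label
    key = label.strip().lower()
    if key not in label_map:
        raise KeyError(f"unknown class label: {label!r}")
    return label_map[key]


# Illustrative subset of an ImageNet id2label table (JSON keys are strings)
id2label = {"0": "tench, Tinca tinca", "207": "golden retriever"}
label_map = build_label_map(id2label)
print(resolve_label("golden retriever", label_map))  # 207
print(resolve_label("Tinca tinca", label_map))  # 0
```

Matching is case-insensitive here by design choice; whether the real pipe.get_label_ids behaves the same way is not stated in this card.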

Demo

PixelDiT-T2I-1024 demo

Text-to-image: "A golden retriever playing in a sunny garden", 1024×1024, 50 steps, guidance_scale=2.75.

python demo_inference_t2i.py

PixelDiT-XL-16-256 demo

Class 207 (golden retriever), 256×256, 100 steps, guidance_scale=2.75, CFG interval [0.1, 0.9].

python demo_inference.py

Load from a local clone

Text-to-image 1024×1024 (PixelDiT-T2I-1024)

from pathlib import Path
import torch
from diffusers import DiffusionPipeline

model_dir = Path("./PixelDiT-T2I-1024").resolve()
pipe = DiffusionPipeline.from_pretrained(
    str(model_dir),
    local_files_only=True,
    custom_pipeline=str(model_dir / "pipeline.py"),
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
    prompt="A golden retriever playing in a sunny garden",
    negative_prompt="low quality, worst quality, over-saturated, blurry, deformed, watermark",
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=2.75,
    generator=generator,
).images[0]
image.save("demo.png")

The Gemma text encoder (google/gemma-2-2b-it) is downloaded on first run unless it is bundled under text_encoder/.

ImageNet 256×256 (PixelDiT-XL-16-256)

from pathlib import Path
import torch
from diffusers import DiffusionPipeline

model_dir = Path("./PixelDiT-XL-16-256").resolve()
pipe = DiffusionPipeline.from_pretrained(
    str(model_dir),
    local_files_only=True,
    custom_pipeline=str(model_dir / "pipeline.py"),
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

print(pipe.id2label[207])
print(pipe.get_label_ids("golden retriever"))

generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
    class_labels="golden retriever",
    height=256,
    width=256,
    num_inference_steps=100,
    guidance_scale=2.75,
    guidance_interval_min=0.1,
    guidance_interval_max=0.9,
    generator=generator,
).images[0]
image.save("demo.png")

ImageNet 512×512 (PixelDiT-XL-16-512)

from pathlib import Path
import torch
from diffusers import DiffusionPipeline

model_dir = Path("./PixelDiT-XL-16-512").resolve()
pipe = DiffusionPipeline.from_pretrained(
    str(model_dir),
    local_files_only=True,
    custom_pipeline=str(model_dir / "pipeline.py"),
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
    class_labels=207,
    height=512,
    width=512,
    num_inference_steps=100,
    guidance_scale=3.5,
    guidance_interval_min=0.1,
    guidance_interval_max=1.0,
    generator=generator,
).images[0]
image.save("demo.png")

Recommended inference settings

| Variant | Steps | CFG scale | Scheduler shift | CFG interval |
| --- | --- | --- | --- | --- |
| PixelDiT-T2I-1024 | 50 | 2.75 | 4.0 | [0.0, 1.0] |
| PixelDiT-XL-16-256 | 100 | 2.75 | 1.0 | [0.1, 0.9] |
| PixelDiT-XL-16-512 | 100 | 3.5 | 2.0 | [0.1, 1.0] |
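
The CFG interval column restricts classifier-free guidance to part of the sampling trajectory. The sketch below illustrates the idea under two assumptions that this card does not confirm: the interval is expressed on a normalized time in [0, 1], and outside the interval the sampler falls back to the plain conditional prediction. The function name guided_prediction is hypothetical, and the pipeline's actual time convention may differ.

```python
def guided_prediction(cond, uncond, t, scale, interval=(0.0, 1.0)):
    """Apply classifier-free guidance only while t lies inside interval."""
    t_min, t_max = interval
    if t_min <= t <= t_max:
        # Standard CFG: extrapolate away from the unconditional prediction
        return uncond + scale * (cond - uncond)
    # Outside the interval: plain conditional prediction (assumed fallback)
    return cond


# Scalar stand-ins for model outputs; arrays or tensors work the same way
print(guided_prediction(1.0, 0.0, t=0.5, scale=2.75, interval=(0.1, 0.9)))  # 2.75
print(guided_prediction(1.0, 0.0, t=0.95, scale=2.75, interval=(0.1, 0.9)))  # 1.0
```

With the interval [0.0, 1.0] listed for PixelDiT-T2I-1024, this reduces to ordinary guidance at every step.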

PixelDiT denoises directly in pixel space (no VAE). height and width must be divisible by the patch size (16).
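
Because the model works on raw pixels, the patch grid follows directly from the requested resolution. The following is a small sketch of the divisibility check, using a hypothetical helper named patch_grid. The patch size of 16 comes from the text above; the patch counts are simple arithmetic rather than values read from the model config.

```python
PATCH_SIZE = 16  # patch size stated above


def patch_grid(height, width, patch_size=PATCH_SIZE):
    """Return (rows, cols) of the patch grid, or raise if the size is invalid."""
    if height % patch_size or width % patch_size:
        raise ValueError(
            f"height and width must be divisible by {patch_size}, got {height}x{width}"
        )
    return height // patch_size, width // patch_size


for size in (256, 512, 1024):
    rows, cols = patch_grid(size, size)
    print(size, rows * cols)  # 256 -> 256, 512 -> 1024, 1024 -> 4096 patches
```

Running a check like this before calling the pipeline gives a readable error for sizes such as 1000×1000 instead of a shape mismatch deeper in the model.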

Conversion

cd libs/PixelDiT-diffusers

python scripts/convert_pixeldit_t2i_to_diffusers.py \
  --checkpoint /path/to/pixeldit_t2i_v1.pth \
  --config /path/to/config.json \
  --output /path/to/PixelDiT-T2I-1024 \
  --sample-size 1024 \
  --scheduler-shift 4.0 \
  --check-load

python scripts/convert_pixeldit_to_diffusers.py \
  --checkpoint /path/to/imagenet256_pixeldit_xl_epoch320.ckpt \
  --output /path/to/PixelDiT-XL-16-256 \
  --model-size pixeldit-xl \
  --sample-size 256 \
  --scheduler-shift 1.0 \
  --check-load \
  --id2label /path/to/id2label_en.json
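
For intuition about the --scheduler-shift values above: flow-matching schedulers such as FlowMatchEulerDiscreteScheduler apply a static shift of the form sigma' = shift * sigma / (1 + (shift - 1) * sigma). The sketch below reimplements that formula in plain Python for illustration only. The name shift_sigma is hypothetical, and it is an assumption, not something this card states, that the converted schedulers use this static form rather than dynamic shifting.

```python
def shift_sigma(sigma, shift):
    """Static flow-matching shift; a larger shift pushes sigma toward 1 (more noise)."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


# Effect of the three shift values used in this repo on a mid-trajectory sigma
for shift in (1.0, 2.0, 4.0):
    print(shift, round(shift_sigma(0.5, shift), 3))  # 0.5, 0.667, 0.8
```

A shift of 1.0 leaves the schedule unchanged, while higher shifts spend more of the step budget at high noise levels, which is consistent with the larger values being paired with the higher resolutions.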

Citation

@inproceedings{yu2025pixeldit,
      title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
      author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      year={2026},
}

License

Weights are converted from NVIDIA checkpoints released under the NSCLv1 License. Use for non-commercial research and evaluation only.
