TherA: Thermal-Aware Visual-Language Prompting for
Controllable RGB-to-Thermal Infrared Translation
Dong-Guw Lee1*
Tai Hyoung Rhee1*
Hyunsoo Jang1
Young-Sik Shin2
Ukcheol Shin3
Ayoung Kim1†
1Seoul National University
2Kyungpook National University
3KENTECH
* Equal Contribution
† Corresponding Author
CVPR 2026
News
- 2026-04-03: TherA GitHub repository opened.
- 2026-05-22: TherA inference code and R2T2 dataset released.
Overview
TherA is a controllable RGB-to-thermal infrared translation framework. Given an RGB image, TherA synthesizes a long-wave thermal infrared image using a latent-diffusion translator conditioned on thermal-aware visual-language features.
TherA is designed for:
- RGB → TIR translation for thermal perception research.
- Thermal-aware VLM conditioning using LLaVA hidden-state features.
- Scene- and object-level controllability across weather, time of day, and object state.
- Reference-cache inference, allowing deployment without loading LLaVA at runtime.
Key Idea
TherA does not condition directly on raw text during diffusion inference. Instead, it uses a 4096-dimensional LLaVA hidden state, either:
- loaded from a precomputed .pt reference cache, or
- extracted on the fly using LLaVA.
For resource-limited environments, we recommend reference-cache mode. This mode uses precomputed LLaVA features such as SUNNY.pt, CLOUDY.pt, RAINY.pt, or NIGHT.pt, and therefore does not require loading LLaVA weights at runtime. An alternative is to precompute LLaVA features for your own inputs first and then run inference in reference-cache mode (upcoming feature).
Repository Layout
TherA/
├── infer_custom.py            # Batch RGB → TIR inference on a folder
├── infer_example_guided.py    # Single-image / example-guided inference
├── infer_palette.sh           # Run multiple weather/style palettes
├── lavi_ip2p/                 # UNet 8-channel + adapter wrapper
├── LaVi-Bridge/modules/       # TextAdapter architecture
├── llava/                     # LLaVA code, only needed for on-the-fly mode
├── thera_paths.py             # Default local weight paths
├── thera_llava.py             # Lazy LLaVA loader
└── weights/                   # Download weights here; not tracked by git
    ├── model.pt               # TherA model
    ├── merged_models/         # Initialization model
    │   ├── unet/
    │   └── adapter/
    ├── stable-diffusion/
    │   ├── vae/
    │   └── scheduler/
    ├── reference_caches/
    │   ├── SUNNY.pt
    │   ├── CLOUDY.pt
    │   ├── RAINY.pt
    │   └── NIGHT.pt
    └── TherA-VLM/             # Optional; only for on-the-fly mode
        ├── adapter_config.json
        ├── adapter_model.safetensors
        ├── config.json
        ├── non_lora_trainables.bin
        └── trainer_state.json
Installation
Option 1: Local Python Environment
git clone https://github.com/donkeymouse/TherA.git
cd TherA
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
Recommended environment
- Python 3.10+
- CUDA-capable GPU
- 16 GB+ VRAM recommended for comfortable inference
Option 2: Docker
A prebuilt Docker image is available at:
docker pull donkeymouse/thera:latest
Example interactive run:
docker run --gpus all --rm -it \
-v "$(pwd)":/workspace/TherA \
-w /workspace/TherA \
donkeymouse/thera:latest \
bash
Then run inference commands from inside the container.
Download Weights
TherA weights are hosted on Hugging Face:
pip install -U huggingface_hub
huggingface-cli download donkeymouse/TherA \
--local-dir weights
After downloading, your weights/ directory should contain:
| Path | Description | Required? |
|---|---|---|
| weights/model.pt | TherA trained UNet and adapter checkpoint | Yes |
| weights/merged_models/unet/ | UNet architecture/config files | Yes |
| weights/merged_models/adapter/ | TextAdapter architecture/config files | Yes |
| weights/stable-diffusion/vae/ | Stable Diffusion VAE | Yes |
| weights/stable-diffusion/scheduler/ | DDIM scheduler config | Yes |
| weights/reference_caches/*.pt | Precomputed LLaVA hidden states for inference palettes | Recommended |
| weights/TherA-VLM/ | LLaVA weights for on-the-fly feature extraction | Optional |
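Before running inference, it can help to confirm that every entry marked "Yes" above is in place. The following is a minimal sketch, not part of the TherA repository: the helper name `missing_weights` is hypothetical, and it only checks that the paths exist, not that the files are valid.

```python
from pathlib import Path

# Entries marked "Yes" in the table above, relative to the weights/ directory.
REQUIRED = [
    "model.pt",
    "merged_models/unet",
    "merged_models/adapter",
    "stable-diffusion/vae",
    "stable-diffusion/scheduler",
]


def missing_weights(weights_dir="weights"):
    """Return the required entries that are absent under weights_dir."""
    root = Path(weights_dir)
    return [rel for rel in REQUIRED if not (root / rel).exists()]
```

An empty list from `missing_weights()` means the required layout is present; any returned entry names a path to download again.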
Optional: Download LLaVA Weights
LLaVA is only required for on-the-fly feature extraction or two-image guided mode. It is not required for reference-cache inference.
huggingface-cli download llava-hf/llava-1.5-7b-hf \
--local-dir weights/llava-1.5-7b-hf
Quick Start
Full RGB-TIR translation using TherA-VLM
Use this mode if you want to extract hidden states with TherA-VLM directly at runtime from an RGB image and a prompt.
python infer_custom.py \
--rgb-dir examples/rgb \
--output-dir preds \
--llava-base-path weights/llava-1.5-7b-hf \
--llava-lora-path weights/TherA-VLM \
--llava-prompt "How would this RGB scene appear in long-wave thermal infrared spectrum."
This mode is more expensive because it loads LLaVA during inference.
Reference-Guided Image Translation Mode
This mode extracts LLaVA features from a reference RGB image and applies them to a target RGB image.
python infer_example_guided.py \
--mode two-image \
--reference-image examples/ref/rgb.jpg \
--input-image examples/rgb/scene.jpg \
--output preds/scene_tir.png \
--llava-base-path weights/llava-1.5-7b-hf \
--llava-lora-path weights/TherA-VLM
Recursive Folder Inference
python infer_custom.py \
--rgb-dir /path/to/dataset/RGB \
--output-dir preds \
--reference-cache weights/reference_caches/SUNNY.pt \
--recursive
When --recursive is used, the output folder preserves the input directory structure.
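To illustrate what "preserves the input directory structure" means, here is a small sketch of the path mapping. It is an assumption about the behavior, not code from infer_custom.py: the function name `mirrored_output_path` and the `.png` output extension are hypothetical.

```python
from pathlib import Path


def mirrored_output_path(image_path, rgb_dir, output_dir, suffix=".png"):
    """Map an input image to an output path with the same relative location."""
    # Path of the image relative to the root passed as --rgb-dir.
    relative = Path(image_path).relative_to(rgb_dir)
    # Re-root that relative path under --output-dir (extension is an assumption).
    return Path(output_dir) / relative.with_suffix(suffix)


print(mirrored_output_path("/data/RGB/seq01/cam0/000001.jpg", "/data/RGB", "preds"))
```

Under these assumptions, an image at seq01/cam0/ below the RGB root is written to seq01/cam0/ below the output folder.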
Reference-cache Mode
Reference-cache mode is the recommended option when GPU memory is limited. It does not load LLaVA at runtime.
python infer_custom.py \
--rgb-dir examples/rgb \
--output-dir preds/sunny \
--reference-cache weights/reference_caches/SUNNY.pt
The script reads all images in examples/rgb and writes translated TIR images to preds/sunny.
This mode is a lighter alternative to the text-guided image translation mode.
Example palette caches:
weights/reference_caches/SUNNY.pt
weights/reference_caches/CLOUDY.pt
weights/reference_caches/RAINY.pt
weights/reference_caches/NIGHT.pt
You can use a different palette cache to achieve a different translation effect.
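Running the same folder through every palette is a simple loop over the cache files. The sketch below builds one command per palette using only the flags shown above. The helper name `palette_commands` and the lowercase output folder names are assumptions; the repository's own infer_palette.sh may be organized differently.

```python
import shlex

PALETTES = ["SUNNY", "CLOUDY", "RAINY", "NIGHT"]


def palette_commands(rgb_dir="examples/rgb", output_root="preds",
                     cache_dir="weights/reference_caches"):
    """Build one infer_custom.py command per palette cache."""
    commands = []
    for name in PALETTES:
        commands.append([
            "python", "infer_custom.py",
            "--rgb-dir", rgb_dir,
            "--output-dir", f"{output_root}/{name.lower()}",
            "--reference-cache", f"{cache_dir}/{name}.pt",
        ])
    return commands


for cmd in palette_commands():
    # Printed rather than executed; use subprocess.run(cmd, check=True) to run it.
    print(shlex.join(cmd))
```

Keeping each palette in its own output folder makes side-by-side comparison of the translated images straightforward.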
Inference Modes
| Mode | Main flag / script | LLaVA weights needed? | Recommended use |
|---|---|---|---|
| Reference cache | --reference-cache path.pt | No | Default deployment and fast inference |
| Per-image cache directory | --cache-dir dir/ | No | Precomputed feature per image |
| Full RGB-TIR translation | --llava-base-path ... | Yes | Runtime prompt/image conditioning |
| Reference image-guided translation | infer_example_guided.py --mode two-image | Yes | Apply reference-image conditioning |
Reference Cache Format
Reference caches are precomputed LLaVA hidden states saved as .pt files.
Supported tensor shapes:
[1, L, 4096]
[L, 4096]
A single reference cache can be applied to all input images as a global thermal/weather/style condition.
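The two supported shapes differ only by a leading batch dimension. The sketch below checks a shape and returns the batched form. It is an illustration under assumptions: the helper name `canonical_cache_shape` is hypothetical, it works on plain shape tuples, and it assumes the .pt file stores a bare tensor rather than a dictionary.

```python
HIDDEN_DIM = 4096  # LLaVA hidden-state width stated above


def canonical_cache_shape(shape):
    """Return the batched [1, L, 4096] shape for a supported cache shape."""
    shape = tuple(shape)
    if len(shape) == 2 and shape[1] == HIDDEN_DIM:
        return (1,) + shape          # [L, 4096] -> [1, L, 4096]
    if len(shape) == 3 and shape[0] == 1 and shape[2] == HIDDEN_DIM:
        return shape                 # already [1, L, 4096]
    raise ValueError(f"unsupported reference cache shape: {shape}")
```

With PyTorch, a tensor loaded by `torch.load` could be checked by passing `tuple(tensor.shape)`, and a 2-D tensor brought to the batched form with `tensor.unsqueeze(0)`.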
Important Arguments
| Argument | Default | Description |
|---|---|---|
| --checkpoint | weights/checkpoint | Directory containing model.pt |
| --merged-model-path | weights/merged_models | Directory containing UNet and adapter configs |
| --pretrained-sd | weights/stable-diffusion | Directory containing VAE and scheduler |
| --rgb-dir | Required | Folder of RGB images for batch inference |
| --output-dir | custom_predictions | Output folder for predictions |
| --reference-cache | None | Single .pt cache used for all images |
| --cache-dir | None | Folder of per-image .pt caches matched by filename stem |
| --llava-base-path | None | Base LLaVA model path for on-the-fly mode |
| --llava-lora-path | None | Optional LLaVA LoRA path |
| --llava-prompt | thermal prompt | Prompt used for default inference/text-guided translation |
| --num-steps | 100 | DDIM sampling steps |
| --cfg-text | 3.5 | Text/VLM guidance strength |
| --cfg-image | 1.5 | Image guidance strength |
| --target-size | Auto | Resize image to this square size; otherwise dimensions are rounded to multiples of 32 |
| --recursive | Off | Recursively process subdirectories |
| --device | cuda | Device for inference |
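For --cache-dir, each image is paired with the .pt file that shares its filename stem (for example, scene.jpg with scene.pt). The sketch below shows one way to preview that pairing before a run. It is not the repository's implementation: the helper name `pair_images_with_caches` and the set of image extensions are assumptions.

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}  # assumed set of image extensions


def pair_images_with_caches(rgb_dir, cache_dir):
    """Return (pairs, unmatched): images matched to <stem>.pt caches, and the rest."""
    caches = {p.stem: p for p in Path(cache_dir).glob("*.pt")}
    pairs, unmatched = [], []
    for image in sorted(Path(rgb_dir).iterdir()):
        if image.suffix.lower() not in IMAGE_EXTS:
            continue  # skip non-image files
        if image.stem in caches:
            pairs.append((image, caches[image.stem]))
        else:
            unmatched.append(image)
    return pairs, unmatched
```

Listing the unmatched images up front avoids discovering a missing cache partway through a long batch.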
Architecture
RGB image
    │
    ▼
VAE encoder ──► RGB latents ─────────────────┐
                                             │
                                             ├──► 8-channel diffusion UNet ──► VAE decoder ──► TIR image
                                             │
LLaVA hidden state, 4096-d ──► TextAdapter ──┘
                               768-d cross-attention tokens
TherA uses dual classifier-free guidance at inference by combining:
- full conditioning,
- image-only conditioning,
- text/VLM-only conditioning.
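The exact guidance formula is not given here, so the following is only an illustrative sketch of how three such predictions could be combined with the two scales --cfg-text and --cfg-image. The formula, the function name `dual_cfg`, and the use of plain Python lists in place of tensors are all assumptions, not TherA's confirmed implementation.

```python
def dual_cfg(eps_full, eps_image, eps_text, cfg_text=3.5, cfg_image=1.5):
    """Combine three noise predictions element-wise (illustrative formula only)."""
    return [
        # (full - img) isolates what the text/VLM condition adds;
        # (full - txt) isolates what the image condition adds.
        full + (cfg_text - 1.0) * (full - img) + (cfg_image - 1.0) * (full - txt)
        for full, img, txt in zip(eps_full, eps_image, eps_text)
    ]
```

In this sketch, setting both scales to 1 returns the fully conditioned prediction unchanged, and raising one scale strengthens only the corresponding condition.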
R2T2 Dataset
TherA is trained with R2T2, a large-scale RGB-TIR-Text dataset.
R2T2 includes:
- 112,970 aligned triplets: RGB image, TIR image, and canonical thermal schema.
- Scene diversity across driving, CCTV, aerial, and ego-view settings.
- Temporal diversity across day/night and diurnal transitions.
- Environmental diversity across weather, season, and illumination.
- Material- and object-level annotations with structured canonicalization.
- Data compiled from multiple aligned RGB-TIR datasets with additional pseudo-aligned pairs.
Dataset page:
https://huggingface.co/datasets/donkeymouse/TherA-R2T2
Example structure:
R2T2/
├── ${DATASET_NAME}/
│   └── ${SEQUENCE_NAME}/
│       ├── RGB/
│       │   ├── 1.jpg
│       │   └── ...
│       └── TIR/
│           ├── 1.jpg
│           └── ...
├── ViVID/
│   ├── img_campus_day1/
│   │   ├── RGB/
│   │   │   ├── 000001.png
│   │   │   └── ...
│   │   └── TIR/
│   │       ├── 000001.png
│   │       └── ...
│   └── ...
└── ...
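Given this layout, RGB and TIR images can be paired by walking each sequence folder. The sketch below assumes that a pair shares the same filename in RGB/ and TIR/, as the example tree suggests. The function name `iter_rgb_tir_pairs` is hypothetical, and where the text schema is stored is not shown above, so it is not handled here.

```python
from pathlib import Path


def iter_rgb_tir_pairs(root):
    """Yield (rgb_path, tir_path) for files present in both RGB/ and TIR/ of a sequence."""
    # Matches <root>/<dataset>/<sequence>/RGB, following the tree above.
    for rgb_dir in sorted(Path(root).glob("*/*/RGB")):
        tir_dir = rgb_dir.parent / "TIR"
        for rgb in sorted(rgb_dir.iterdir()):
            tir = tir_dir / rgb.name
            if tir.is_file():
                yield rgb, tir
```

Files that exist on only one side are skipped, so the generator yields only complete pairs.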
Troubleshooting
Checkpoint not found: weights/model.pt
Download the TherA weights and make sure model.pt is located at:
weights/model.pt
OSError: ... stable-diffusion/vae
Make sure the Stable Diffusion VAE and scheduler folders are present:
weights/stable-diffusion/vae/
weights/stable-diffusion/scheduler/
Outputs look identical across palettes
Try increasing text/VLM guidance:
--cfg-text 7.5
Also verify that your reference cache files are distinct:
SUNNY.pt
CLOUDY.pt
RAINY.pt
NIGHT.pt
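One quick way to check that the cache files are distinct is to compare their hashes. The sketch below is a hypothetical helper, not part of the repository. It flags byte-identical files only, so two files that differ in bytes but hold near-identical features would not be detected.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicate_caches(cache_dir="weights/reference_caches"):
    """Group .pt files with identical bytes; an empty result means all are distinct."""
    by_digest = {}
    for path in sorted(Path(cache_dir).glob("*.pt")):
        by_digest.setdefault(file_digest(path), []).append(path.name)
    return [names for names in by_digest.values() if len(names) > 1]
```

A non-empty result usually points to a copy or download mistake, for example one palette file saved under several names.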
CUDA out of memory
Try reducing the image size:
--target-size 512
You can also run one image at a time with:
python infer_example_guided.py
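To see what resolution a given --target-size setting implies, the sizing rule from the arguments table can be sketched as follows. The rounding direction is an assumption (this sketch rounds down), and the function names are hypothetical rather than taken from the repository.

```python
def round_to_multiple(value, multiple=32):
    """Round value down to a multiple of `multiple`, never below one multiple."""
    return max(multiple, (value // multiple) * multiple)


def inference_size(width, height, target_size=None):
    """Return the (width, height) used for inference under the stated assumptions."""
    if target_size is not None:
        # --target-size resizes to a square of this side length.
        return target_size, target_size
    # Auto mode: keep the aspect ratio, snap each side to a multiple of 32.
    return round_to_multiple(width), round_to_multiple(height)
```

For a 1920x1080 input, auto mode in this sketch keeps roughly the full resolution, whereas --target-size 512 processes far fewer pixels, which is why it reduces memory use.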
LLaVA import or loading errors
Use reference-cache mode if you do not need runtime LLaVA extraction:
--reference-cache weights/reference_caches/SUNNY.pt
For on-the-fly mode, make sure the LLaVA base model and TherA weights are correctly loaded.
TODOs
- [x] Inference code and R2T2 dataset
- [ ] Upload cache extraction code
- [ ] Improve text guidance
Citation
If you find TherA useful for your research, please cite:
@inproceedings{lee2026thera,
title = {TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation},
author = {Lee, Dong-Guw and Rhee, Tai Hyoung and Jang, Hyunsoo and Shin, Young-Sik and Shin, Ukcheol and Kim, Ayoung},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2026}
}
You may also cite the arXiv version:
@article{lee2026thera_arxiv,
title = {TherA: Thermal-Aware Visual-Language Prompting for Controllable RGB-to-Thermal Infrared Translation},
author = {Lee, Dong-Guw and Rhee, Tai Hyoung and Jang, Hyunsoo and Shin, Young-Sik and Shin, Ukcheol and Kim, Ayoung},
journal = {arXiv preprint arXiv:2602.19430},
year = {2026}
}
Acknowledgements
TherA builds on open-source components from the vision-language and diffusion communities, including LLaVA, Stable Diffusion, Diffusers, and LaVi-Bridge-style adapter architectures.
License
See LICENSE for details.
Third-party models, datasets, and libraries retain their own licenses. Please review the licenses for LLaVA, Stable Diffusion, Hugging Face model files, and any external datasets before use.
Contact
If you have any questions, please contact:
donkeymouse@snu.ac.kr