RayyanAhmed9477 committed
Commit f3576b2 · verified · 1 Parent(s): 2f34ab2
Files changed (1): README.md (+125 -116)
---
license: apache-2.0
datasets:
- PeterBrendan/AdImageNet
base_model:
- Tongyi-MAI/Z-Image-Turbo
tags:
- text-to-image
- diffusion
- z-image-turbo
- photorealism
- quantized
---
# Z-Image-Turbo Hosted

## Overview
This repository hosts a fine-tuned version of the Z-Image-Turbo model, specifically the training adapter from [ostris/zimage_turbo_training_adapter](https://huggingface.co/ostris/zimage_turbo_training_adapter). The original Z-Image-Turbo was developed by Tongyi-MAI and is available at [Tongyi-MAI/Z-Image-Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo).

## Why This Model?
Z-Image-Turbo is a state-of-the-art text-to-image diffusion model based on a Single-Stream Diffusion Transformer (S3-DiT) architecture. It offers several advantages:

- **Efficiency**: Distilled to need only 8 function evaluations (NFEs), enabling sub-second inference on high-end GPUs.
- **Quality**: Excels at photorealistic image generation, bilingual text rendering (English and Chinese), and prompt adherence.
- **Scalability**: Supports resolutions up to 1024x1024 pixels.
- **Compatibility**: Runs with `guidance_scale=0.0` in the Turbo variant, reducing computational overhead.

We chose this model for our project because of its balance of speed and quality, which makes it ideal for real-time applications and local inference on consumer hardware such as the RTX 3090.

The training adapter enhances the base model by providing fine-tuned weights for specific use cases, improving adaptability without retraining from scratch.

## Technical Details

### Model Architecture
- **Base Model**: Z-Image-Turbo (6B parameters)
- **Architecture**: Single-Stream Diffusion Transformer (S3-DiT)
- **Training Data**: Not specified in the public docs, but likely large-scale image-text pairs curated for photorealism.
- **Quantization**: The hosted version supports quantization for reduced memory usage (e.g., 8-bit or 4-bit via bitsandbytes).

### Hosting Process
1. **Selection**: Identified Z-Image-Turbo as the best fit for our needs, based on benchmarks showing a superior speed-quality trade-off compared to models like FLUX or SDXL.
2. **Source**: Used the training adapter from ostris for pre-fine-tuned weights.
3. **Authentication**: Logged into Hugging Face using a personal access token.
4. **Repository Creation**: Created a new model repository on Hugging Face.
5. **Download**: Downloaded all model files (safetensors, config, etc.) from the source repo.
6. **Upload**: Uploaded the files to the new repo using the Hugging Face Hub API (see the sketch below).
7. **Documentation**: Added this README with citations to the original authors.

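For reference, steps 3-6 map roughly onto the following `huggingface_hub` calls (a minimal sketch; the token is a placeholder and the exact commands we ran may have differed):

```python
from huggingface_hub import create_repo, login, snapshot_download, upload_folder

login(token="hf_...")  # personal access token (placeholder)

# Mirror the source weights locally, then push them to the new repo.
local_dir = snapshot_download(repo_id="ostris/zimage_turbo_training_adapter")
create_repo(repo_id="RayyanAhmed9477/Z-Image-Turbo-Hosted", repo_type="model", exist_ok=True)
upload_folder(repo_id="RayyanAhmed9477/Z-Image-Turbo-Hosted", folder_path=local_dir)
```
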
### Quantization Techniques
To enable local inference on hardware with limited VRAM, we support several quantization methods:

- **bitsandbytes (recommended)**:
  - 8-bit: reduces memory by ~50% with minimal quality loss.
  - 4-bit: reduces memory further, to ~25% of full precision, using NF4 or FP4 configurations (see the 4-bit sketch after this list).
  - Code:
```python
from diffusers import ZImagePipeline
from diffusers.quantizers import PipelineQuantizationConfig

# Pipeline-level 8-bit quantization (requires a recent diffusers release);
# switch the backend to "bitsandbytes_4bit" for 4-bit.
quantization_config = PipelineQuantizationConfig(quant_backend="bitsandbytes_8bit", quant_kwargs={"load_in_8bit": True}, components_to_quantize=["transformer"])
pipe = ZImagePipeline.from_pretrained("RayyanAhmed9477/Z-Image-Turbo-Hosted", quantization_config=quantization_config)
```

- **GGUF quantization**:
  - For very low-VRAM setups (~4GB and up), use stable-diffusion.cpp with GGUF builds.
  - Download from community repos such as jayn7/Z-Image-Turbo-GGUF.

- **FP8 quantization**:
  - 8-bit floating point for balanced performance.
  - Available in repos such as T5B/Z-Image-Turbo-FP8.

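For the 4-bit NF4 configuration mentioned above, a hedged sketch (it assumes the same diffusers pipeline-level quantization API as the 8-bit example; the `bnb_4bit_*` options come from bitsandbytes):

```python
import torch
from diffusers import ZImagePipeline
from diffusers.quantizers import PipelineQuantizationConfig

# 4-bit NF4 quantization of the transformer; compute runs in bfloat16.
quantization_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",  # NormalFloat4; "fp4" is the alternative
        "bnb_4bit_compute_dtype": torch.bfloat16,
    },
    components_to_quantize=["transformer"],
)
pipe = ZImagePipeline.from_pretrained(
    "RayyanAhmed9477/Z-Image-Turbo-Hosted",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")
```
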
### Benchmarks and Comparisons
- **vs. FLUX**: Z-Image-Turbo offers faster inference (8 NFEs vs. FLUX's 28-50) with comparable quality for photorealism.
- **vs. SDXL**: Better prompt adherence and bilingual text support; distilled for efficiency.
- **Performance on RTX 3090** (informal measurements; see the timing sketch below):
  - Full precision: 5-10 s per image, ~12GB VRAM.
  - 8-bit quantized: 6-8 s per image, ~6GB VRAM.
  - Quality drop: small; subjectively under 5%.

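These numbers come from simple wall-clock timing; a minimal sketch of that measurement (it assumes `pipe` has already been loaded, as in the installation guide below):

```python
import time
import torch

# Time a single 1024x1024 generation; synchronize so GPU work is included.
torch.cuda.synchronize()
start = time.perf_counter()
image = pipe(prompt="A futuristic cityscape", height=1024, width=1024, num_inference_steps=9, guidance_scale=0.0).images[0]
torch.cuda.synchronize()
print(f"Generation took {time.perf_counter() - start:.2f} s")
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
```
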
### Installation Guide
1. Install dependencies:
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install git+https://github.com/huggingface/diffusers
pip install transformers accelerate bitsandbytes
```

2. Load and run:
```python
import torch
from diffusers import ZImagePipeline

pipe = ZImagePipeline.from_pretrained("RayyanAhmed9477/Z-Image-Turbo-Hosted", torch_dtype=torch.bfloat16)
pipe.to("cuda")
# Distilled for a low step count (~8 NFEs); guidance is disabled for the Turbo variant.
image = pipe(prompt="A futuristic cityscape", height=1024, width=1024, num_inference_steps=9, guidance_scale=0.0).images[0]
image.save("output.png")
```

3. For a web UI, wrap the pipeline in a Gradio interface (see the sketch below).

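A minimal wrapper might look like this (a sketch; it reuses `pipe` from step 2 and standard Gradio APIs):

```python
import gradio as gr

def generate(prompt: str):
    # Reuses the pipeline loaded in step 2; returns a PIL image.
    return pipe(prompt=prompt, height=1024, width=1024, num_inference_steps=9, guidance_scale=0.0).images[0]

demo = gr.Interface(fn=generate, inputs=gr.Textbox(label="Prompt"), outputs=gr.Image(label="Result"))
demo.launch()
```
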
### System Requirements
- GPU: NVIDIA with at least 16GB VRAM (e.g., RTX 3090 with 24GB)
- RAM: 64GB recommended
- Software: Python 3.8+, PyTorch 2.0+, the diffusers library
- OS: Windows or Linux with CUDA 11.8+

### Performance
- Inference time: ~5-10 seconds per 1024x1024 image on an RTX 3090
- Memory usage: ~12GB (bfloat16), reducible with quantization
- Throughput: ~0.1-0.2 images/second

### Troubleshooting
- **Out of memory**: Use quantization or CPU offloading (`pipe.enable_model_cpu_offload()`).
- **Slow inference**: Enable Flash Attention (`pipe.transformer.set_attention_backend("flash")`) or compile the model (`pipe.transformer.compile()`); see the sketch below.
- **Quality issues**: Increase `num_inference_steps` or use higher precision.

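These remedies can be combined; a short sketch, assuming `pipe` is loaded and a recent diffusers release provides the attention-backend switch:

```python
# Out-of-memory remedy: offload idle components to CPU (an alternative to quantization).
pipe.enable_model_cpu_offload()

# Speed remedies: Flash Attention (needs flash-attn installed) and torch.compile.
pipe.transformer.set_attention_backend("flash")
pipe.transformer.compile()  # the first generation after this is slow while compiling
```
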
## Citations
- Original model: Tongyi-MAI. "Z-Image-Turbo." Hugging Face, https://huggingface.co/Tongyi-MAI/Z-Image-Turbo.
- Training adapter: ostris. "zimage_turbo_training_adapter." Hugging Face, https://huggingface.co/ostris/zimage_turbo_training_adapter.

Hosted by RayyanAhmed9477, with all credit to the original creators.

## License
Refer to the original repositories for licensing information.