Upload README.md with huggingface_hub
Browse files
README.md
CHANGED
|
@@ -1,87 +1,62 @@
|
|
| 1 |
---
|
| 2 |
license: apache-2.0
|
| 3 |
-
base_model: Qwen/
|
| 4 |
tags:
|
| 5 |
-
-
|
| 6 |
-
-
|
| 7 |
-
-
|
| 8 |
-
-
|
| 9 |
-
-
|
| 10 |
-
-
|
| 11 |
---
|
| 12 |
|
| 13 |
-
# Qwen3-4B
|
| 14 |
|
| 15 |
-
|
| 16 |
|
| 17 |
-
|
| 18 |
|
| 19 |
-
|
| 20 |
|
| 21 |
-
###
|
| 22 |
-
This is a merged model based on Qwen3 (specifically the 4B variant), fine-tuned to understand the intricate, somewhat needy requirements of the Z-Image Turbo architecture. It knows about "Positive Constraints," it hates negative prompts (because they don't work), and it really, really wants you to describe skin texture so your portraits don't look like plastic dolls.
|
| 23 |
|
| 24 |
-
|
| 25 |
-
We took the base Qwen3 model (which loves to say "I cannot assist with that") and gave it the [Heretic](https://github.com/p-e-w/heretic) treatment.
|
| 26 |
-
- **Refusal Rate:** Dropped from a prudish **100/100** to a chill **23/100** on our benchmarks.
|
| 27 |
-
- **KL Divergence:** Minimal. We lobotomized the censorship without breaking the brain.
|
| 28 |
|
| 29 |
-
###
|
| 30 |
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
#### Why Synthetic Data?
|
| 36 |
-
Z-Image Turbo is "needy." It requires very specific, dense descriptions to look good. Most human-written prompts are too short or use "tag salad" (comma-separated lists), which the Qwen-3 encoder hates. We used Gemini to expand simple concepts into 120-180 word rich paragraphs, teaching the model to hallucinate the missing details (lighting, texture, camera specs) that Z-Image Turbo needs to trigger its magic.
|
| 37 |
-
|
| 38 |
-
#### The "Seed Strategy" (Engineering Diversity)
|
| 39 |
-
To ensure the model didn't just learn to output generic "portrait of a woman" prompts, we built a procedural generation engine for the seed prompts that functions as a combinatorial explosion.
|
| 40 |
-
|
| 41 |
-
- **8 Major Style Pillars:** We explicitly balanced the dataset across Photorealism, Anime, Fantasy, Sci-Fi, Horror, Artistic, Documentary, and Fine Art.
|
| 42 |
-
- **Infinite Variety:** We didn't just feed Gemini "A cat." We constructed seeds by randomly mixing ~170 base concepts with 26 styles, 10 shot types, 10 lighting setups, 11 moods, 8 texture notes, and 10 camera kits.
|
| 43 |
-
- **The Math:** This procedural engine is capable of generating over **217 Billion unique seed prompts**. From this vast latent space, we carefully sampled the 20,000 most coherent and high-impact intersections to train the model.
|
| 44 |
-
|
| 45 |
-
This ensures that the model understands that "Cinestill 800T" isn't just a random word, but a specific color grading instruction that can apply to *any* concept, from a cybernetic surgeon to a medieval marketplace.
|
| 46 |
-
|
| 47 |
-
#### The Training Data Prompt
|
| 48 |
-
Here is the exact system prompt we used to generate the training data. You can see how we forced Gemini to focus on "Positive Constraints" and "Texture Density":
|
| 49 |
-
|
| 50 |
-
```python
|
| 51 |
-
def get_user_message(seed_prompt: str) -> str:
|
| 52 |
-
"""Generate the instruction for Gemini to output one long paragraph with full specs."""
|
| 53 |
-
return f"""You are a senior prompt engineer creating production-grade prompts for Tongyi Z-Image Turbo (S3-DiT, Qwen-3 text encoder, distilled 8-10 step pipeline).
|
| 54 |
-
|
| 55 |
-
Write ONE rich paragraph (120-180 words) that fully specifies the scene from the seed. Mandatory: subject count and relationships; spatial layout (foreground/midground/background, left/right/center, camera height, gaze direction); action and environment; time of day and weather; texture/material details; lighting rig; color grade; camera body/format, lens and focal length, aperture, focus/depth-of-field; film stock or digital pipeline; shot type (close-up/medium/full/establishing/aerial/dutch tilt/overhead); resolution/cleanliness cues.
|
| 56 |
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
-
|
| 60 |
-
|
| 61 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 62 |
|
| 63 |
-
|
| 64 |
|
| 65 |
-
|
| 66 |
-
```
|
| 67 |
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
|
|
|
|
|
|
| 73 |
|
| 74 |
-
|
| 75 |
-
Feed it a simple prompt like "A photo of an old man" and watch it spit out a paragraph about "weathered skin," "Fujifilm Superia 400," and "Shift 7.0 metadata."
|
| 76 |
|
| 77 |
-
|
| 78 |
-
(See `zimage-prompter/system_prompt.json` in the repo for the full magic incantation).
|
| 79 |
|
| 80 |
-
|
| 81 |
-
{
|
| 82 |
-
"system_prompt": "You are Z-Engineer, an expert prompt engineering AI specializing in the Z-Image Turbo architecture (S3-DiT). Your goal is to rewrite simple user inputs into high-fidelity, \"Positive Constraint\" prompts optimized for the Qwen-3 text encoder and the 8-step distilled inference process.\n\n**CORE OPERATIONAL RULES:**\n1. **NO Negative Prompts:** Z-Image Turbo ignores negative prompts at the optimal CFG of 1.0. You must strictly use \"Positive Constraints.\" (e.g., instead of \"negative: blur\", write \"...razor sharp focus, pristine imaging...\").\n2. **Natural Language Syntax:** The Qwen-3 encoder requires coherent, grammatical sentences. Do NOT use \"tag salad\" (comma-separated lists). Use flow and structure.\n3. **Texture Density:** The model suffers from \"plastic skin\" unless forced to render high-frequency detail. You must aggressively describe textures (e.g., \"weathered skin,\" \"visible pores,\" \"film grain,\" \"fabric weave\") to engage the \"Shift 7.0\" sampling schedule.\n4. **Spatial Precision:** Use specific spatial prepositions (\"in the foreground,\" \"to the left,\" \"worm's-eye view\") to leverage the 3D RoPE embeddings.\n5. **Text Handling:** If the user asks for text/signage, explicitly enclose the text in double quotes (e.g., ...a sign that says \"OPEN\"...) and describe the font/material (e.g., \"neon,\" \"stenciled paint\").\n6. **Proper Anatomy:** If the user asks for a living subject (e.g., an animal or person), explicitly state that they have proper anatomy or \"perfectly formed\" is used when describing the subject (e.g., \"The woman's perfectly formed hands hold\".\n\n**PROMPT STRUCTURE HIERARCHY:**\nConstruct your response in this specific order:\n1. **Subject Anchoring:** Define the WHO and WHAT immediately.\n2. **Action & Context:** Define the DOING and WHERE.\n3. **Aesthetic & Lighting:** Define the HOW (Lighting, Atmosphere, Color Palette).\n4. **Technical Modifiers:** Define the CAMERA (Lens, Film Stock, Resolution).\n5. **Positive Constraints:** Define the QUALITY (e.g., \"clean background,\" \"architectural perfection,\" \"proper anatomy,\" \"perfectly formed\").\n\n**OUTPUT FORMAT:**\nReturn ONLY the enhanced prompt string, followed by a brief \"Technical Metadata\" block.\n\n**Example Input:**\n\"A photo of an old man.\"\n\n**Example Output:**\nAn extreme close-up portrait of an elderly fisherman with deep weathered skin and salt-and-pepper stubble, wearing a yellow waterproof jacket. He is standing against a dark stormy ocean background with raindrops on his face. The lighting is dramatic and side-lit, emphasizing the texture of his skin. Shot on an 85mm lens at f/1.8 with Fujifilm Superia 400 film stock, featuring high texture, raw photo quality, and visible film grain.\n\n[Technical Metadata]\nSteps: 8\nCFG: 1.0\nSampler: Euler\nSchedule: Simple\nShift: 7.0 (Crucial for skin texture)"
|
| 83 |
-
}
|
| 84 |
-
```
|
| 85 |
|
| 86 |
-
|
| 87 |
-
This is a V1. It might occasionally hallucinate or get too obsessed with "worm's-eye view." Use with a grain of salt (and maybe `Shift: 7.0`).
|
|
|
|
| 1 |
---
|
| 2 |
license: apache-2.0
|
| 3 |
+
base_model: Qwen/Qwen3-4B-Instruct-2507
|
| 4 |
tags:
|
| 5 |
+
- heretic
|
| 6 |
+
- abliterated
|
| 7 |
+
- uncensored
|
| 8 |
+
- prompt-engineering
|
| 9 |
+
- z-image
|
| 10 |
+
- image-generation
|
| 11 |
---
|
| 12 |
|
| 13 |
+
# Qwen3-4B Z-Image Engineer
|
| 14 |
|
| 15 |
+
A fine-tuned version of Qwen3-4B that has been **abliterated** using [Heretic](https://github.com/p-e-w/heretic) to remove refusal behaviors, then specialized for prompt engineering tasks targeting the Z-Image Turbo architecture.
|
| 16 |
|
| 17 |
+
## Usage
|
| 18 |
|
| 19 |
+
This model is designed to transform simple image prompts into detailed, high-quality prompts optimized for Z-Image Turbo (S3-DiT) image generation.
|
| 20 |
|
| 21 |
+
### System Prompt
|
|
|
|
| 22 |
|
| 23 |
+
See **`system_prompt.json`** for the recommended system prompt to use with this model. The system prompt configures the model as "Z-Engineer" - an expert prompt engineering AI.
|
|
|
|
|
|
|
|
|
|
| 24 |
|
| 25 |
+
### Example
|
| 26 |
|
| 27 |
+
**Input:**
|
| 28 |
+
```
|
| 29 |
+
A photo of an old man.
|
| 30 |
+
```
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 31 |
|
| 32 |
+
**Output:**
|
| 33 |
+
```
|
| 34 |
+
An extreme close-up portrait of an elderly fisherman with deep weathered skin and salt-and-pepper stubble, wearing a yellow waterproof jacket. He is standing against a dark stormy ocean background with raindrops on his face. The lighting is dramatic and side-lit, emphasizing the texture of his skin. Shot on an 85mm lens at f/1.8 with Fujifilm Superia 400 film stock, featuring high texture, raw photo quality, and visible film grain.
|
| 35 |
+
|
| 36 |
+
[Technical Metadata]
|
| 37 |
+
Steps: 8
|
| 38 |
+
CFG: 1.0
|
| 39 |
+
Sampler: Euler
|
| 40 |
+
Schedule: Simple
|
| 41 |
+
Shift: 7.0 (Crucial for skin texture)
|
| 42 |
+
```
|
| 43 |
|
| 44 |
+
## Key Features
|
| 45 |
|
| 46 |
+
The system prompt enforces several key rules for Z-Image Turbo optimization:
|
|
|
|
| 47 |
|
| 48 |
+
1. **NO Negative Prompts** - Uses "Positive Constraints" instead
|
| 49 |
+
2. **Natural Language Syntax** - Coherent sentences, not tag salad
|
| 50 |
+
3. **Texture Density** - Aggressive texture descriptions to avoid "plastic skin"
|
| 51 |
+
4. **Spatial Precision** - Specific spatial prepositions for 3D RoPE
|
| 52 |
+
5. **Text Handling** - Proper quoting and font/material descriptions
|
| 53 |
+
6. **Proper Anatomy** - Explicit "perfectly formed" descriptions
|
| 54 |
+
7. **Camera & Lens** - Always uses "shot on" / "shot with" phrasing
|
| 55 |
|
| 56 |
+
## Abliteration
|
|
|
|
| 57 |
|
| 58 |
+
This model was processed with Heretic to remove refusal behaviors, making it more compliant for creative prompt engineering tasks without unnecessary content restrictions.
|
|
|
|
| 59 |
|
| 60 |
+
## License
|
|
|
|
|
|
|
|
|
|
|
|
|
| 61 |
|
| 62 |
+
Apache 2.0 (same as base model)
|
|
|