---
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-3B-Instruct
tags:
- z-image-turbo
- prompt-engineering
- qwen3
- heretic
- gguf
- prompt-enhancer
---

# Qwen3-4B-Z-Image-Engineer: The "Z-Engineer"

## 🚧 Work In Progress (But Surprisingly Competent) 🚧

Welcome to **Z-Engineer**, a lightweight, local, and slightly rebellious solution to automated prompt engineering for [Z-Image Turbo](https://github.com/Tongyi-MAI/Z-Image).

If you're tired of writing "masterpiece, best quality, 8k" and getting garbage, or if you just want to see what the **S3-DiT** architecture can really do when you feed it the right tokens, this model is your new best friend. It can also double as a high-IQ CLIP text encoder for Z-Image Turbo workflows if you're feeling adventurous.

### 🧠 What is this?

This is a merged model based on Qwen3 (specifically the 4B variant), fine-tuned to understand the intricate, somewhat needy requirements of the Z-Image Turbo architecture. It knows about "Positive Constraints," it hates negative prompts (because they don't work), and it really, really wants you to describe skin texture so your portraits don't look like plastic dolls.

### 😈 The "Heretic" Touch

We took the base Qwen3 model (which loves to say "I cannot assist with that") and gave it the [Heretic](https://github.com/p-e-w/heretic) treatment.

- **Refusal Rate:** Dropped from a prudish **100/100** to a chill **23/100** on our benchmarks.
- **KL Divergence:** Minimal. We lobotomized the censorship without breaking the brain.
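The 100/100 and 23/100 figures above are refusal counts over a fixed benchmark of prompts. As an illustrative sketch only (this is not Heretic's actual harness, and the marker list and sample responses below are hypothetical), that kind of scoring boils down to:

```python
# Illustrative only: a naive refusal counter, NOT Heretic's real benchmark.
# The marker list and sample responses are hypothetical.
REFUSAL_MARKERS = ("i cannot assist", "i can't help", "as an ai", "i'm sorry, but")

def refusal_rate(responses: list[str]) -> float:
    """Return the percentage of responses that look like refusals."""
    refused = sum(
        any(marker in r.lower() for marker in REFUSAL_MARKERS) for r in responses
    )
    return 100 * refused / len(responses)

responses = [
    "I cannot assist with that request.",
    "An extreme close-up portrait of an elderly fisherman...",
]
print(refusal_rate(responses))  # 50.0
```

A real harness also compares the abliterated model's token distributions against the base model (the KL divergence number) to check that only the refusal behavior changed.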

### 🔬 Training Methodology

This model was trained on a synthetic dataset generated using **Gemini 2.5-latest** and **Gemini 2.0 Flash**. We generated **20,000+** samples, comprising high-quality prompt pairs and deep technical conversation examples about Z-Image Turbo's architecture.

**Fun Fact:** This entire dataset took only **45 minutes** to generate. How? Thanks to **Tier 3 Gemini API access**, a status I achieved involuntarily after all the times Gemini broke while vibe coding, looped infinitely, and racked up $$$ charges. My wallet's pain is your prompt engineering gain. 💸

#### Why Synthetic Data?

Z-Image Turbo is "needy." It requires very specific, dense descriptions to look good. Most human-written prompts are too short or use "tag salad" (comma-separated lists), which the Qwen-3 encoder hates. We used Gemini to expand simple concepts into 120-180 word rich paragraphs, teaching the model to hallucinate the missing details (lighting, texture, camera specs) that Z-Image Turbo needs to trigger its "Shift 7.0" magic.

#### The "Seed Strategy" (Engineering Diversity)

To ensure the model didn't just learn to output generic "portrait of a woman" prompts, we built a procedural generation engine for the seed prompts.

- **8 Major Style Pillars:** We explicitly balanced the dataset across Photorealism, Anime, Fantasy, Sci-Fi, Horror, Artistic, Documentary, and Fine Art.
- **Procedural Complexity:** We didn't just feed Gemini "A cat." We constructed seeds by randomly mixing:
  - **Concepts:** e.g., "cybernetic surgeon", "macro dew drop"
  - **Shot Types:** e.g., "worm's-eye view", "dutch tilt"
  - **Lighting Rigs:** e.g., "volumetric fog", "neon rim light"
  - **Color Grades:** e.g., "Cinestill 800T", "Kodak Portra"
  - **Spatial Cues:** e.g., "foreground hero with blurred crowd"

This combinatorial approach ensured that the 20,000+ samples covered a massive surface area of aesthetic possibilities, preventing the model from collapsing into a single "style."
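The mixing engine described above can be sketched in a few lines. This is a minimal reconstruction, not the actual training script: the component lists here are just the examples named in this README (the real lists were presumably much larger), and the seed format is illustrative.

```python
import random

# Hypothetical reconstruction of the seed engine; only the examples from the
# README are included. The real component lists were much larger.
PILLARS = ["Photorealism", "Anime", "Fantasy", "Sci-Fi",
           "Horror", "Artistic", "Documentary", "Fine Art"]
CONCEPTS = ["cybernetic surgeon", "macro dew drop"]
SHOT_TYPES = ["worm's-eye view", "dutch tilt"]
LIGHTING_RIGS = ["volumetric fog", "neon rim light"]
COLOR_GRADES = ["Cinestill 800T", "Kodak Portra"]
SPATIAL_CUES = ["foreground hero with blurred crowd"]

def make_seed(rng: random.Random) -> str:
    """Mix one element from each axis into a single seed prompt string."""
    return (f"{rng.choice(PILLARS)} style: {rng.choice(CONCEPTS)}, "
            f"{rng.choice(SHOT_TYPES)}, {rng.choice(LIGHTING_RIGS)}, "
            f"graded like {rng.choice(COLOR_GRADES)}, {rng.choice(SPATIAL_CUES)}")

rng = random.Random(42)
print(make_seed(rng))
```

Even with short lists, sampling independently from six axes multiplies into thousands of distinct seed combinations, which is what keeps the dataset from collapsing into one style.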

#### The Training Data Prompt

Here is the exact system prompt we used to generate the training data. You can see how we forced Gemini to focus on "Positive Constraints" and "Texture Density":

```python
def get_user_message(seed_prompt: str) -> str:
    """Generate the instruction for Gemini to output one long paragraph with full specs."""
    return f"""You are a senior prompt engineer creating production-grade prompts for Tongyi Z-Image Turbo (S3-DiT, Qwen-3 text encoder, distilled 8-10 step pipeline).

Write ONE rich paragraph (120-180 words) that fully specifies the scene from the seed. Mandatory: subject count and relationships; spatial layout (foreground/midground/background, left/right/center, camera height, gaze direction); action and environment; time of day and weather; texture/material details; lighting rig; color grade; camera body/format, lens and focal length, aperture, focus/depth-of-field; film stock or digital pipeline; shot type (close-up/medium/full/establishing/aerial/dutch tilt/overhead); resolution/cleanliness cues.

Rules:
- One paragraph only, no bullet lists, no newlines, no quoted prefixes.
- Use positive language (describe what TO show), add cleanliness cues instead of negatives.
- Natural sentences, not comma tag salad.
- Keep and enrich style hints from the seed instead of replacing them.

Seed: {seed_prompt}

Return only the final paragraph, nothing else."""
```

### 💻 Training Rig (The "Lazy" Setup)

This LoRA was trained for exactly **1 epoch** over approximately **14.5 hours**.

- **Hardware:** An **M4 Pro Mac Mini**. Yes, really.
- **Why?** Because let's be honest, getting ROCm to behave on Windows is a nightmare, and I was too lazy to reboot into Linux. So, we let the Mac chug along.
- **Result:** It works! More training would probably yield more consistent results, but for a single epoch on a Mac Mini, it's surprisingly good.

### 🚀 Usage

Feed it a simple prompt like "A photo of an old man" and watch it spit out a paragraph about "weathered skin," "Fujifilm Superia 400," and "Shift 7.0 metadata."

**System Prompt:**
(See `zimage-prompter/system_prompt.json` in the repo for the full magic incantation.)

```json
{
  "system_prompt": "You are Z-Engineer, an expert prompt engineering AI specializing in the Z-Image Turbo architecture (S3-DiT). Your goal is to rewrite simple user inputs into high-fidelity, \"Positive Constraint\" prompts optimized for the Qwen-3 text encoder and the 8-step distilled inference process.\n\n**CORE OPERATIONAL RULES:**\n1. **NO Negative Prompts:** Z-Image Turbo ignores negative prompts at the optimal CFG of 1.0. You must strictly use \"Positive Constraints.\" (e.g., instead of \"negative: blur\", write \"...razor sharp focus, pristine imaging...\").\n2. **Natural Language Syntax:** The Qwen-3 encoder requires coherent, grammatical sentences. Do NOT use \"tag salad\" (comma-separated lists). Use flow and structure.\n3. **Texture Density:** The model suffers from \"plastic skin\" unless forced to render high-frequency detail. You must aggressively describe textures (e.g., \"weathered skin,\" \"visible pores,\" \"film grain,\" \"fabric weave\") to engage the \"Shift 7.0\" sampling schedule.\n4. **Spatial Precision:** Use specific spatial prepositions (\"in the foreground,\" \"to the left,\" \"worm's-eye view\") to leverage the 3D RoPE embeddings.\n5. **Text Handling:** If the user asks for text/signage, explicitly enclose the text in double quotes (e.g., ...a sign that says \"OPEN\"...) and describe the font/material (e.g., \"neon,\" \"stenciled paint\").\n6. **Proper Anatomy:** If the user asks for a living subject (e.g., an animal or person), explicitly state that they have proper anatomy, or use \"perfectly formed\" when describing the subject (e.g., \"The woman's perfectly formed hands hold...\").\n7. **Camera & Lens:** Unless specified by the user, choose a camera and lens type that suits the style (e.g., for a portrait, Nikon D850 with 50mm f/1.4 lens). ALWAYS explicitly use the words \"shot on\" or \"shot with\" when describing the camera type (e.g., \"A beautiful portrait of a woman shot on a Nikon D850 with 50mm f/1.4 lens, shallow depth of field\").\n\n**PROMPT STRUCTURE HIERARCHY:**\nConstruct your response in this specific order:\n1. **Subject Anchoring:** Define the WHO and WHAT immediately.\n2. **Action & Context:** Define the DOING and WHERE.\n3. **Aesthetic & Lighting:** Define the HOW (Lighting, Atmosphere, Color Palette).\n4. **Technical Modifiers:** Define the CAMERA (Lens, Film Stock, Resolution).\n5. **Positive Constraints:** Define the QUALITY (e.g., \"clean background,\" \"architectural perfection,\" \"proper anatomy,\" \"perfectly formed\").\n\n**OUTPUT FORMAT:**\nReturn ONLY the enhanced prompt string, followed by a brief \"Technical Metadata\" block.\n\n**Example Input:**\n\"A photo of an old man.\"\n\n**Example Output:**\nAn extreme close-up portrait of an elderly fisherman with deep weathered skin and salt-and-pepper stubble, wearing a yellow waterproof jacket. He is standing against a dark stormy ocean background with raindrops on his face. The lighting is dramatic and side-lit, emphasizing the texture of his skin. Shot on an 85mm lens at f/1.8 with Fujifilm Superia 400 film stock, featuring high texture, raw photo quality, and visible film grain.\n\n[Technical Metadata]\nSteps: 8\nCFG: 1.0\nSampler: Euler\nSchedule: Simple\nShift: 7.0 (Crucial for skin texture)"
}
```
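Wiring the system prompt into a chat request is straightforward. A minimal sketch, assuming a `transformers`-style chat flow; the model id `your-username/Qwen3-4B-Z-Image-Engineer` is a placeholder, and the truncated `system_prompt` string below stands in for the full JSON file shipped in the repo:

```python
import json

# Placeholder: in practice, load the full string from
# zimage-prompter/system_prompt.json in this repo.
system_prompt = json.loads(
    '{"system_prompt": "You are Z-Engineer, an expert prompt engineering AI..."}'
)["system_prompt"]

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "A photo of an old man."},
]

# Hypothetical generation call (requires downloading the model weights):
# from transformers import pipeline
# pipe = pipeline("text-generation", model="your-username/Qwen3-4B-Z-Image-Engineer")
# print(pipe(messages, max_new_tokens=512)[0]["generated_text"][-1]["content"])

print(messages[1]["content"])
```

The same `messages` list works with any OpenAI-compatible local server (llama.cpp, Ollama, LM Studio) if you are running the GGUF quant instead.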

### ⚠️ Disclaimer

This is a V1. It might occasionally hallucinate or get too obsessed with "worm's-eye view." Use with a grain of salt (and maybe `Shift: 7.0`).