wangkanai committed
Commit 842e8c5 · verified · 1 Parent(s): 14569e9

Upload folder using huggingface_hub
README.md CHANGED
@@ -7,15 +7,17 @@ tags:
  - multimodal
  - qwen
  - abliterated
  - gguf
  - quantized
  ---

- <!-- README Version: v1.0 -->

- # Qwen3-VL-4B-Thinking-Abliterated (GGUF)

- Quantized GGUF version of the Qwen3-VL-4B-Thinking vision-language model with content filtering restrictions removed. This model maintains the powerful multimodal reasoning and visual understanding capabilities of the base model while allowing unrestricted factual and descriptive outputs.

  ## Model Description
 
@@ -40,33 +42,111 @@ Qwen3-VL-4B-Thinking is a compact 4-billion parameter vision-language model deve

  ```
  E:\huggingface\qwen3-vl-4b-thinking\
- └── qwen3-vl-4b-thinking-abliterated.gguf  7.5 GB (quantized model)
  ```

- **Total Repository Size:** 7.5 GB

  ## Hardware Requirements

- ### Minimum Requirements
  - **VRAM:** 8 GB (for inference with quantization)
  - **RAM:** 16 GB system memory
  - **Disk Space:** 8 GB available storage
  - **GPU:** CUDA-compatible GPU (NVIDIA) or Metal (Apple Silicon)

- ### Recommended Requirements
  - **VRAM:** 12+ GB for optimal performance
  - **RAM:** 32 GB system memory for large contexts
  - **GPU:** RTX 3060 12GB / RTX 4060 Ti 16GB or better
  - **Apple Silicon:** M1 Pro/Max, M2 Pro/Max, M3 series

  ### Performance Notes
- - Quantized GGUF format provides 30-40% memory reduction vs full precision
  - Supports multi-image and video processing with Flash Attention 2
  - Context length up to 256K tokens (1M experimental)

  ## Usage Examples

- ### Using llama.cpp

  ```bash
  # Load model with llama.cpp
@@ -123,9 +203,60 @@ print(response['choices'][0]['message']['content'])
  - Top-k: 20
  - Context Length: 8192-32768

- ### Using with Transformers (Converting GGUF to Safetensors)

- Note: GGUF models are optimized for llama.cpp-based inference. For Transformers usage, consider downloading the original safetensors format from Hugging Face Hub.

  ## Model Specifications
 
@@ -133,36 +264,55 @@ Note: GGUF models are optimized for llama.cpp-based inference. For Transformers
  |---------------|---------|
  | **Architecture** | Vision-Language Transformer |
  | **Parameters** | ~4.44 billion |
- | **Format** | GGUF (quantized) |
- | **Quantization** | Variable (optimized for efficiency) |
  | **Base Model** | Qwen3-VL-4B-Thinking |
  | **Modification** | Abliterated (content filtering removed) |
  | **Context Length** | 256K tokens (1M experimental) |
  | **Languages** | Multilingual (32 languages for OCR) |
  | **Positional Encoding** | Interleaved-MRoPE |
  | **Attention Mechanism** | Grouped Query Attention (GQA) |

  ## Performance Tips and Optimization

  ### Memory Optimization
  - Use GPU offloading (`n_gpu_layers`) to balance VRAM vs RAM usage
  - For limited VRAM, reduce `n_gpu_layers` to offload layers to system RAM
  - Adjust context size based on task requirements (smaller contexts = faster inference)

  ### Inference Speed
- - Enable Flash Attention 2 for multi-image/video tasks (if supported)
  - Use Metal acceleration on Apple Silicon devices
- - Consider batch processing for multiple images

  ### Quality Settings
  - **Temperature 1.0**: Balanced creativity and coherence (recommended)
  - **Top-p 0.95**: Nucleus sampling for diverse outputs
  - **Top-k 20**: Limits token selection for stability

  ### Context Management
  - Start with 8K context for most tasks
  - Scale up to 32K-128K for long documents or videos
- - Experimental 256K-1M contexts require significant VRAM

  ## Abliteration Notice
 
 
  - multimodal
  - qwen
  - abliterated
+ - safetensors
  - gguf
  - quantized
+ - bfloat16
  ---

+ <!-- README Version: v1.1 -->

+ # Qwen3-VL-4B-Thinking-Abliterated

+ Multi-format distribution of the Qwen3-VL-4B-Thinking vision-language model with content filtering restrictions removed. Available in both safetensors (full precision) and GGUF (quantized) formats. This model maintains the powerful multimodal reasoning and visual understanding capabilities of the base model while allowing unrestricted factual and descriptive outputs.

  ## Model Description
 
 
  ```
  E:\huggingface\qwen3-vl-4b-thinking\
+ ├── qwen3-vl-4b-thinking-abliterated.safetensors  8.3 GB (full precision model)
+ └── qwen3-vl-4b-thinking-abliterated.gguf  7.5 GB (quantized model)
  ```

+ **Total Repository Size:** 15.8 GB (excluding metadata)

  ## Hardware Requirements

+ ### For GGUF Format (Quantized)
+ **Minimum Requirements:**
  - **VRAM:** 8 GB (for inference with quantization)
  - **RAM:** 16 GB system memory
  - **Disk Space:** 8 GB available storage
  - **GPU:** CUDA-compatible GPU (NVIDIA) or Metal (Apple Silicon)

+ **Recommended Requirements:**
  - **VRAM:** 12+ GB for optimal performance
  - **RAM:** 32 GB system memory for large contexts
  - **GPU:** RTX 3060 12GB / RTX 4060 Ti 16GB or better
  - **Apple Silicon:** M1 Pro/Max, M2 Pro/Max, M3 series
 
+ ### For Safetensors Format (Full Precision)
+ **Minimum Requirements:**
+ - **VRAM:** 12 GB (for full precision inference)
+ - **RAM:** 16 GB system memory
+ - **Disk Space:** 9 GB available storage
+ - **GPU:** CUDA-compatible GPU with compute capability 7.0+
+
+ **Recommended Requirements:**
+ - **VRAM:** 16+ GB for optimal performance and larger contexts
+ - **RAM:** 32 GB system memory
+ - **GPU:** RTX 3090 24GB / RTX 4090 24GB / A6000 or better
+ - **Apple Silicon:** M2 Ultra, M3 Max/Ultra with 48GB+ unified memory
+
  ### Performance Notes
+ - **GGUF format** provides 30-40% memory reduction vs full precision
+ - **Safetensors format** offers maximum quality and compatibility with Transformers
  - Supports multi-image and video processing with Flash Attention 2
  - Context length up to 256K tokens (1M experimental)
 
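As a sanity check, the file sizes above follow from the ~4.44B parameter count listed in Model Specifications. A small sketch (treating the listed GB figures as GiB, which matches the LFS byte count for the safetensors file):

```python
def weight_file_gib(n_params: float, bytes_per_param: float) -> float:
    """Raw weight-file size: parameters x bytes per parameter, in GiB."""
    return n_params * bytes_per_param / 2**30

def bits_per_weight(file_gib: float, n_params: float) -> float:
    """Average storage per parameter implied by a weight file's size."""
    return file_gib * 2**30 * 8 / n_params

# ~4.44B BF16 weights at 2 bytes each reproduces the 8.3 GB safetensors file
print(f"safetensors: {weight_file_gib(4.44e9, 2):.2f} GiB")
# The 7.5 GB GGUF averages roughly 14.5 bits per weight
print(f"gguf: {bits_per_weight(7.5, 4.44e9):.1f} bits/weight")
```

The same arithmetic is behind the per-format VRAM minimums: the weights must fit in memory with headroom left over for the KV cache and activations.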
  ## Usage Examples
 
+ ### Using Transformers with Safetensors (Full Precision)
+
+ ```python
+ from transformers import AutoModelForVision2Seq, AutoProcessor
+ from PIL import Image
+ import torch
+
+ # Load model and processor
+ model_path = "E:/huggingface/qwen3-vl-4b-thinking"
+ model = AutoModelForVision2Seq.from_pretrained(
+     model_path,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True
+ )
+ processor = AutoProcessor.from_pretrained(
+     model_path,
+     trust_remote_code=True
+ )
+
+ # Prepare image and text input
+ image = Image.open("path/to/your/image.jpg")
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": image},
+             {"type": "text", "text": "Describe this image in detail."}
+         ]
+     }
+ ]
+
+ # Process and generate
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
+ inputs = inputs.to(model.device)
+
+ # Generate response
+ with torch.inference_mode():
+     output_ids = model.generate(
+         **inputs,
+         max_new_tokens=2048,
+         temperature=1.0,
+         top_p=0.95,
+         top_k=20,
+         do_sample=True
+     )
+
+ # Decode only the newly generated tokens (strip the prompt)
+ generated_ids = [
+     out_ids[len(in_ids):]
+     for in_ids, out_ids in zip(inputs.input_ids, output_ids)
+ ]
+ response = processor.batch_decode(
+     generated_ids,
+     skip_special_tokens=True,
+     clean_up_tokenization_spaces=False
+ )[0]
+
+ print(response)
+ ```
+
+ ### Using llama.cpp with GGUF (Quantized)

  ```bash
  # Load model with llama.cpp
 
  - Top-k: 20
  - Context Length: 8192-32768
 
+ ### Video Understanding Example (Safetensors)

+ ```python
+ from transformers import AutoModelForVision2Seq, AutoProcessor
+ import torch
+ import cv2
+
+ # Load model
+ model_path = "E:/huggingface/qwen3-vl-4b-thinking"
+ model = AutoModelForVision2Seq.from_pretrained(
+     model_path,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True
+ )
+ processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+
+ # Load video frames (sample every N frames), converting BGR -> RGB for the processor
+ video_path = "path/to/video.mp4"
+ cap = cv2.VideoCapture(video_path)
+ frames = []
+ frame_count = 0
+ sample_rate = 30  # Sample every 30th frame
+
+ while cap.isOpened():
+     ret, frame = cap.read()
+     if not ret:
+         break
+     if frame_count % sample_rate == 0:
+         frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
+     frame_count += 1
+ cap.release()
+
+ # Process video with temporal understanding
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "video", "video": frames},
+             {"type": "text", "text": "What events occur in this video? Provide timestamps."}
+         ]
+     }
+ ]
+
+ # Apply the chat template, then hand the sampled frames to the processor
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=[text], videos=[frames], return_tensors="pt")
+ inputs = inputs.to(model.device)
+
+ with torch.inference_mode():
+     output = model.generate(**inputs, max_new_tokens=2048)
+
+ response = processor.decode(output[0], skip_special_tokens=True)
+ print(response)
+ ```
 
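The frame-sampling loop in the example above can be factored into a small helper; capping the number of kept frames (an added assumption, not in the example) keeps very long videos from exhausting the context window:

```python
def sample_frame_indices(n_frames: int, sample_rate: int = 30, max_frames: int = 64) -> list:
    """Indices of the frames the loop above keeps: every `sample_rate`-th
    frame, capped at `max_frames`."""
    return list(range(0, n_frames, sample_rate))[:max_frames]

# A 10-second clip at 30 fps (300 frames) keeps 10 frames
print(sample_frame_indices(300))  # [0, 30, 60, ..., 270]
```

Raising `sample_rate` trades temporal resolution for fewer vision tokens per video.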
  ## Model Specifications

  |---------------|---------|
  | **Architecture** | Vision-Language Transformer |
  | **Parameters** | ~4.44 billion |
+ | **Formats Available** | Safetensors (BF16/FP16), GGUF (quantized) |
+ | **Safetensors Precision** | BFloat16 (full precision) |
+ | **GGUF Quantization** | Variable (optimized for efficiency) |
  | **Base Model** | Qwen3-VL-4B-Thinking |
  | **Modification** | Abliterated (content filtering removed) |
  | **Context Length** | 256K tokens (1M experimental) |
  | **Languages** | Multilingual (32 languages for OCR) |
  | **Positional Encoding** | Interleaved-MRoPE |
  | **Attention Mechanism** | Grouped Query Attention (GQA) |
+ | **Vision Encoder** | ViT with DeepStack fusion |
+ | **Supported Modalities** | Images, Videos, Text |
 
  ## Performance Tips and Optimization

+ ### Format Selection
+ - **Safetensors (BF16)**: Best quality, maximum compatibility with Transformers, requires more VRAM
+ - **GGUF (Quantized)**: Lower VRAM usage, faster inference, llama.cpp ecosystem compatibility
+ - Choose based on your hardware constraints and quality requirements
+
  ### Memory Optimization
+
+ **For Safetensors:**
+ - Use `torch_dtype=torch.bfloat16` for optimal memory/quality balance
+ - Enable `device_map="auto"` for automatic multi-GPU distribution
+ - Use gradient checkpointing for training/fine-tuning: `model.gradient_checkpointing_enable()`
+ - Reduce batch size if encountering OOM errors
+
+ **For GGUF:**
  - Use GPU offloading (`n_gpu_layers`) to balance VRAM vs RAM usage
  - For limited VRAM, reduce `n_gpu_layers` to offload layers to system RAM
  - Adjust context size based on task requirements (smaller contexts = faster inference)
 
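The `n_gpu_layers` advice above amounts to a simple heuristic: offload as many layers as fit in VRAM after reserving headroom. A sketch (the 36-layer count and ~0.18 GB-per-layer figure are illustrative assumptions, not measured values for this model):

```python
def pick_n_gpu_layers(vram_gb: float, n_layers: int = 36,
                      layer_gb: float = 0.18, reserve_gb: float = 1.5) -> int:
    """Offload as many layers as fit in VRAM, keeping `reserve_gb`
    free for the KV cache and scratch buffers."""
    budget = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(budget / layer_gb))

print(pick_n_gpu_layers(4.0))   # partial offload on a small GPU
print(pick_n_gpu_layers(12.0))  # everything fits
```

The result would be passed to llama.cpp as `n_gpu_layers` (or `-ngl` on the CLI); layers that are not offloaded stay in system RAM.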
  ### Inference Speed
+ - Enable Flash Attention 2 for multi-image/video tasks (install with `pip install flash-attn`)
  - Use Metal acceleration on Apple Silicon devices
+ - Consider batch processing for multiple images (Safetensors supports batching)
+ - Use BetterTransformer for optimized inference: `model = model.to_bettertransformer()`

  ### Quality Settings
  - **Temperature 1.0**: Balanced creativity and coherence (recommended)
  - **Top-p 0.95**: Nucleus sampling for diverse outputs
  - **Top-k 20**: Limits token selection for stability
+ - **max_new_tokens**: 512-2048 for descriptions, 4096+ for detailed analysis
 
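The sampling knobs above compose in a fixed order: top-k prunes to the k highest logits, a temperature-scaled softmax turns the survivors into probabilities, and top-p keeps the smallest nucleus whose mass reaches p. A pure-Python sketch on a toy logit vector:

```python
import math

def filter_logits(logits, temperature=1.0, top_k=20, top_p=0.95):
    """Return the (token_id, probability) pairs that survive
    top-k then nucleus (top-p) filtering."""
    # Top-k: keep the k highest logits
    ranked = sorted(enumerate(logits), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Temperature-scaled softmax over the survivors (shifted for stability)
    m = max(v for _, v in ranked)
    exps = [(i, math.exp((v - m) / temperature)) for i, v in ranked]
    total = sum(e for _, e in exps)
    probs = [(i, e / total) for i, e in exps]
    # Nucleus: smallest prefix whose cumulative mass reaches top_p
    kept, cum = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    return kept

print(filter_logits([5.0, 1.0, 0.0, 4.0], top_k=2, top_p=0.95))
```

Lowering top-p or top-k shrinks the candidate set, which is why the small values above trade diversity for stability.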
  ### Context Management
  - Start with 8K context for most tasks
  - Scale up to 32K-128K for long documents or videos
+ - Experimental 256K-1M contexts require significant VRAM (24GB+)
+ - Use RoPE scaling for extended context: configure in model config
 
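The 24GB+ figure above is driven largely by the KV cache, which grows linearly with context length. A rough estimator (the layer/head/dimension values are illustrative GQA-style assumptions, not this model's published config):

```python
def kv_cache_gb(context_len: int, n_layers: int = 36, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Keys + values for every layer at BF16/FP16:
    2 * layers * kv_heads * head_dim * context * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

print(f"8K context:   ~{kv_cache_gb(8192):.1f} GB")
print(f"256K context: ~{kv_cache_gb(262144):.1f} GB")
```

Under these assumptions an 8K context costs about a gigabyte while 256K runs to tens of gigabytes, on top of the weights themselves.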
  ## Abliteration Notice
 
qwen3-vl-4b-thinking-abliterated.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6d04ddac4854a122052efdad8d0131c9150a545830831990f9cfa3505d3bd914
+ size 8875719408