wangkanai committed
Commit 842e8c5 · verified · 1 Parent(s): 14569e9

Upload folder using huggingface_hub
README.md CHANGED
@@ -7,15 +7,17 @@ tags:
  - multimodal
  - qwen
  - abliterated
  - gguf
  - quantized
  ---

- <!-- README Version: v1.0 -->

- # Qwen3-VL-4B-Thinking-Abliterated (GGUF)

- Quantized GGUF version of the Qwen3-VL-4B-Thinking vision-language model with content filtering restrictions removed. This model maintains the powerful multimodal reasoning and visual understanding capabilities of the base model while allowing unrestricted factual and descriptive outputs.

  ## Model Description
 
@@ -40,33 +42,111 @@ Qwen3-VL-4B-Thinking is a compact 4-billion parameter vision-language model deve

  ```
  E:\huggingface\qwen3-vl-4b-thinking\
- └── qwen3-vl-4b-thinking-abliterated.gguf  7.5 GB (quantized model)
  ```

- **Total Repository Size:** 7.5 GB

  ## Hardware Requirements

- ### Minimum Requirements
  - **VRAM:** 8 GB (for inference with quantization)
  - **RAM:** 16 GB system memory
  - **Disk Space:** 8 GB available storage
  - **GPU:** CUDA-compatible GPU (NVIDIA) or Metal (Apple Silicon)

- ### Recommended Requirements
  - **VRAM:** 12+ GB for optimal performance
  - **RAM:** 32 GB system memory for large contexts
  - **GPU:** RTX 3060 12GB / RTX 4060 Ti 16GB or better
  - **Apple Silicon:** M1 Pro/Max, M2 Pro/Max, M3 series

  ### Performance Notes
- - Quantized GGUF format provides 30-40% memory reduction vs full precision
  - Supports multi-image and video processing with Flash Attention 2
  - Context length up to 256K tokens (1M experimental)

  ## Usage Examples

- ### Using llama.cpp

  ```bash
  # Load model with llama.cpp
@@ -123,9 +203,60 @@ print(response['choices'][0]['message']['content'])
  - Top-k: 20
  - Context Length: 8192-32768

- ### Using with Transformers (Converting GGUF to Safetensors)

- Note: GGUF models are optimized for llama.cpp-based inference. For Transformers usage, consider downloading the original safetensors format from Hugging Face Hub.

  ## Model Specifications
 
@@ -133,36 +264,55 @@ Note: GGUF models are optimized for llama.cpp-based inference. For Transformers
  |---------------|---------|
  | **Architecture** | Vision-Language Transformer |
  | **Parameters** | ~4.44 billion |
- | **Format** | GGUF (quantized) |
- | **Quantization** | Variable (optimized for efficiency) |
  | **Base Model** | Qwen3-VL-4B-Thinking |
  | **Modification** | Abliterated (content filtering removed) |
  | **Context Length** | 256K tokens (1M experimental) |
  | **Languages** | Multilingual (32 languages for OCR) |
  | **Positional Encoding** | Interleaved-MRoPE |
  | **Attention Mechanism** | Grouped Query Attention (GQA) |

  ## Performance Tips and Optimization

  ### Memory Optimization
  - Use GPU offloading (`n_gpu_layers`) to balance VRAM vs RAM usage
  - For limited VRAM, reduce `n_gpu_layers` to offload layers to system RAM
  - Adjust context size based on task requirements (smaller contexts = faster inference)

  ### Inference Speed
- - Enable Flash Attention 2 for multi-image/video tasks (if supported)
  - Use Metal acceleration on Apple Silicon devices
- - Consider batch processing for multiple images

  ### Quality Settings
  - **Temperature 1.0**: Balanced creativity and coherence (recommended)
  - **Top-p 0.95**: Nucleus sampling for diverse outputs
  - **Top-k 20**: Limits token selection for stability

  ### Context Management
  - Start with 8K context for most tasks
  - Scale up to 32K-128K for long documents or videos
- - Experimental 256K-1M contexts require significant VRAM

  ## Abliteration Notice
 
 
  - multimodal
  - qwen
  - abliterated
+ - safetensors
  - gguf
  - quantized
+ - bfloat16
  ---

+ <!-- README Version: v1.1 -->

+ # Qwen3-VL-4B-Thinking-Abliterated

+ Multi-format distribution of the Qwen3-VL-4B-Thinking vision-language model with content filtering restrictions removed. Available in both safetensors (full precision) and GGUF (quantized) formats. This model maintains the powerful multimodal reasoning and visual understanding capabilities of the base model while allowing unrestricted factual and descriptive outputs.

  ## Model Description
 
 
  ```
  E:\huggingface\qwen3-vl-4b-thinking\
+ ├── qwen3-vl-4b-thinking-abliterated.safetensors  8.3 GB (full precision model)
+ └── qwen3-vl-4b-thinking-abliterated.gguf  7.5 GB (quantized model)
  ```

+ **Total Repository Size:** 15.8 GB (excluding metadata)

  ## Hardware Requirements

+ ### For GGUF Format (Quantized)
+ **Minimum Requirements:**
  - **VRAM:** 8 GB (for inference with quantization)
  - **RAM:** 16 GB system memory
  - **Disk Space:** 8 GB available storage
  - **GPU:** CUDA-compatible GPU (NVIDIA) or Metal (Apple Silicon)

+ **Recommended Requirements:**
  - **VRAM:** 12+ GB for optimal performance
  - **RAM:** 32 GB system memory for large contexts
  - **GPU:** RTX 3060 12GB / RTX 4060 Ti 16GB or better
  - **Apple Silicon:** M1 Pro/Max, M2 Pro/Max, M3 series
 
+ ### For Safetensors Format (Full Precision)
+ **Minimum Requirements:**
+ - **VRAM:** 12 GB (for full precision inference)
+ - **RAM:** 16 GB system memory
+ - **Disk Space:** 9 GB available storage
+ - **GPU:** CUDA-compatible GPU with compute capability 7.0+
+
+ **Recommended Requirements:**
+ - **VRAM:** 16+ GB for optimal performance and larger contexts
+ - **RAM:** 32 GB system memory
+ - **GPU:** RTX 3090 24GB / RTX 4090 24GB / A6000 or better
+ - **Apple Silicon:** M2 Ultra, M3 Max/Ultra with 48GB+ unified memory
+
  ### Performance Notes
+ - **GGUF format** provides 30-40% memory reduction vs full precision
+ - **Safetensors format** offers maximum quality and compatibility with Transformers
  - Supports multi-image and video processing with Flash Attention 2
  - Context length up to 256K tokens (1M experimental)
 
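As a sanity check, the file sizes above follow from the ~4.44B parameter count listed in Model Specifications. A small sketch (treating the listed GB figures as GiB, which matches the LFS byte count for the safetensors file):

```python
def weight_file_gib(n_params: float, bytes_per_param: float) -> float:
    """Raw weight-file size: parameters x bytes per parameter, in GiB."""
    return n_params * bytes_per_param / 2**30

def bits_per_weight(file_gib: float, n_params: float) -> float:
    """Average storage per parameter implied by a weight file's size."""
    return file_gib * 2**30 * 8 / n_params

# ~4.44B BF16 weights at 2 bytes each reproduces the 8.3 GB safetensors file
print(f"safetensors: {weight_file_gib(4.44e9, 2):.2f} GiB")
# The 7.5 GB GGUF averages roughly 14.5 bits per weight
print(f"gguf: {bits_per_weight(7.5, 4.44e9):.1f} bits/weight")
```

The same arithmetic is behind the per-format VRAM minimums: the weights must fit in memory with headroom left over for the KV cache and activations.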
  ## Usage Examples
 
+ ### Using Transformers with Safetensors (Full Precision)
+
+ ```python
+ from transformers import AutoModelForVision2Seq, AutoProcessor
+ from PIL import Image
+ import torch
+
+ # Load model and processor
+ model_path = "E:/huggingface/qwen3-vl-4b-thinking"
+ model = AutoModelForVision2Seq.from_pretrained(
+     model_path,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True
+ )
+ processor = AutoProcessor.from_pretrained(
+     model_path,
+     trust_remote_code=True
+ )
+
+ # Prepare image and text input
+ image = Image.open("path/to/your/image.jpg")
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": image},
+             {"type": "text", "text": "Describe this image in detail."}
+         ]
+     }
+ ]
+
+ # Process and generate
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
+ inputs = inputs.to(model.device)
+
+ # Generate response
+ with torch.inference_mode():
+     output_ids = model.generate(
+         **inputs,
+         max_new_tokens=2048,
+         temperature=1.0,
+         top_p=0.95,
+         top_k=20,
+         do_sample=True
+     )
+
+ # Decode only the newly generated tokens (strip the prompt)
+ generated_ids = [
+     out_ids[len(in_ids):]
+     for in_ids, out_ids in zip(inputs.input_ids, output_ids)
+ ]
+ response = processor.batch_decode(
+     generated_ids,
+     skip_special_tokens=True,
+     clean_up_tokenization_spaces=False
+ )[0]
+
+ print(response)
+ ```
+
+ ### Using llama.cpp with GGUF (Quantized)

  ```bash
  # Load model with llama.cpp
 
  - Top-k: 20
  - Context Length: 8192-32768
 
+ ### Video Understanding Example (Safetensors)

+ ```python
+ from transformers import AutoModelForVision2Seq, AutoProcessor
+ import torch
+ import cv2
+
+ # Load model
+ model_path = "E:/huggingface/qwen3-vl-4b-thinking"
+ model = AutoModelForVision2Seq.from_pretrained(
+     model_path,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True
+ )
+ processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+
+ # Load video frames (sample every N frames), converting BGR -> RGB for the processor
+ video_path = "path/to/video.mp4"
+ cap = cv2.VideoCapture(video_path)
+ frames = []
+ frame_count = 0
+ sample_rate = 30  # Sample every 30th frame
+
+ while cap.isOpened():
+     ret, frame = cap.read()
+     if not ret:
+         break
+     if frame_count % sample_rate == 0:
+         frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
+     frame_count += 1
+ cap.release()
+
+ # Process video with temporal understanding
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "video", "video": frames},
+             {"type": "text", "text": "What events occur in this video? Provide timestamps."}
+         ]
+     }
+ ]
+
+ # Apply the chat template, then hand the sampled frames to the processor
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=[text], videos=[frames], return_tensors="pt")
+ inputs = inputs.to(model.device)
+
+ with torch.inference_mode():
+     output = model.generate(**inputs, max_new_tokens=2048)
+
+ response = processor.decode(output[0], skip_special_tokens=True)
+ print(response)
+ ```
 
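The frame-sampling loop in the example above can be factored into a small helper; capping the number of kept frames (an added assumption, not in the example) keeps very long videos from exhausting the context window:

```python
def sample_frame_indices(n_frames: int, sample_rate: int = 30, max_frames: int = 64) -> list:
    """Indices of the frames the loop above keeps: every `sample_rate`-th
    frame, capped at `max_frames`."""
    return list(range(0, n_frames, sample_rate))[:max_frames]

# A 10-second clip at 30 fps (300 frames) keeps 10 frames
print(sample_frame_indices(300))  # [0, 30, 60, ..., 270]
```

Raising `sample_rate` trades temporal resolution for fewer vision tokens per video.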
  ## Model Specifications

  |---------------|---------|
  | **Architecture** | Vision-Language Transformer |
  | **Parameters** | ~4.44 billion |
+ | **Formats Available** | Safetensors (BF16/FP16), GGUF (quantized) |
+ | **Safetensors Precision** | BFloat16 (full precision) |
+ | **GGUF Quantization** | Variable (optimized for efficiency) |
  | **Base Model** | Qwen3-VL-4B-Thinking |
  | **Modification** | Abliterated (content filtering removed) |
  | **Context Length** | 256K tokens (1M experimental) |
  | **Languages** | Multilingual (32 languages for OCR) |
  | **Positional Encoding** | Interleaved-MRoPE |
  | **Attention Mechanism** | Grouped Query Attention (GQA) |
+ | **Vision Encoder** | ViT with DeepStack fusion |
+ | **Supported Modalities** | Images, Videos, Text |
 
  ## Performance Tips and Optimization

+ ### Format Selection
+ - **Safetensors (BF16)**: Best quality, maximum compatibility with Transformers, requires more VRAM
+ - **GGUF (Quantized)**: Lower VRAM usage, faster inference, llama.cpp ecosystem compatibility
+ - Choose based on your hardware constraints and quality requirements
+
  ### Memory Optimization
+
+ **For Safetensors:**
+ - Use `torch_dtype=torch.bfloat16` for optimal memory/quality balance
+ - Enable `device_map="auto"` for automatic multi-GPU distribution
+ - Use gradient checkpointing for training/fine-tuning: `model.gradient_checkpointing_enable()`
+ - Reduce batch size if encountering OOM errors
+
+ **For GGUF:**
  - Use GPU offloading (`n_gpu_layers`) to balance VRAM vs RAM usage
  - For limited VRAM, reduce `n_gpu_layers` to offload layers to system RAM
  - Adjust context size based on task requirements (smaller contexts = faster inference)
 
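The `n_gpu_layers` advice above amounts to a simple heuristic: offload as many layers as fit in VRAM after reserving headroom. A sketch (the 36-layer count and ~0.18 GB-per-layer figure are illustrative assumptions, not measured values for this model):

```python
def pick_n_gpu_layers(vram_gb: float, n_layers: int = 36,
                      layer_gb: float = 0.18, reserve_gb: float = 1.5) -> int:
    """Offload as many layers as fit in VRAM, keeping `reserve_gb`
    free for the KV cache and scratch buffers."""
    budget = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(budget / layer_gb))

print(pick_n_gpu_layers(4.0))   # partial offload on a small GPU
print(pick_n_gpu_layers(12.0))  # everything fits
```

The result would be passed to llama.cpp as `n_gpu_layers` (or `-ngl` on the CLI); layers that are not offloaded stay in system RAM.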
  ### Inference Speed
+ - Enable Flash Attention 2 for multi-image/video tasks (install with `pip install flash-attn`)
  - Use Metal acceleration on Apple Silicon devices
+ - Consider batch processing for multiple images (Safetensors supports batching)
+ - Use BetterTransformer for optimized inference: `model = model.to_bettertransformer()`

  ### Quality Settings
  - **Temperature 1.0**: Balanced creativity and coherence (recommended)
  - **Top-p 0.95**: Nucleus sampling for diverse outputs
  - **Top-k 20**: Limits token selection for stability
+ - **max_new_tokens**: 512-2048 for descriptions, 4096+ for detailed analysis
 
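The sampling knobs above compose in a fixed order: top-k prunes to the k highest logits, a temperature-scaled softmax turns the survivors into probabilities, and top-p keeps the smallest nucleus whose mass reaches p. A pure-Python sketch on a toy logit vector:

```python
import math

def filter_logits(logits, temperature=1.0, top_k=20, top_p=0.95):
    """Return the (token_id, probability) pairs that survive
    top-k then nucleus (top-p) filtering."""
    # Top-k: keep the k highest logits
    ranked = sorted(enumerate(logits), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Temperature-scaled softmax over the survivors (shifted for stability)
    m = max(v for _, v in ranked)
    exps = [(i, math.exp((v - m) / temperature)) for i, v in ranked]
    total = sum(e for _, e in exps)
    probs = [(i, e / total) for i, e in exps]
    # Nucleus: smallest prefix whose cumulative mass reaches top_p
    kept, cum = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    return kept

print(filter_logits([5.0, 1.0, 0.0, 4.0], top_k=2, top_p=0.95))
```

Lowering top-p or top-k shrinks the candidate set, which is why the small values above trade diversity for stability.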
  ### Context Management
  - Start with 8K context for most tasks
  - Scale up to 32K-128K for long documents or videos
+ - Experimental 256K-1M contexts require significant VRAM (24GB+)
+ - Use RoPE scaling for extended context: configure in model config
 
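The 24GB+ figure above is driven largely by the KV cache, which grows linearly with context length. A rough estimator (the layer/head/dimension values are illustrative GQA-style assumptions, not this model's published config):

```python
def kv_cache_gb(context_len: int, n_layers: int = 36, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Keys + values for every layer at BF16/FP16:
    2 * layers * kv_heads * head_dim * context * bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

print(f"8K context:   ~{kv_cache_gb(8192):.1f} GB")
print(f"256K context: ~{kv_cache_gb(262144):.1f} GB")
```

Under these assumptions an 8K context costs about a gigabyte while 256K runs to tens of gigabytes, on top of the weights themselves.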
  ## Abliteration Notice
 
qwen3-vl-4b-thinking-abliterated.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6d04ddac4854a122052efdad8d0131c9150a545830831990f9cfa3505d3bd914
+ size 8875719408