Instructions for using deepvk/llava-saiga-8b with libraries and local apps.

Libraries

How to use deepvk/llava-saiga-8b with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="deepvk/llava-saiga-8b")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("deepvk/llava-saiga-8b")
model = AutoModelForImageTextToText.from_pretrained("deepvk/llava-saiga-8b")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use deepvk/llava-saiga-8b with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "deepvk/llava-saiga-8b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepvk/llava-saiga-8b",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
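
The same request can be assembled from Python. The following is a minimal sketch using only the standard library; the helper name `build_chat_request` is hypothetical, and the payload simply mirrors the curl call above. It builds the request without sending it, since sending requires the vLLM server to be running.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, image_url):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    base_url="http://localhost:8000",
    model="deepvk/llava-saiga-8b",
    prompt="Describe this image in one sentence.",
    image_url="https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)

# Sending requires the vLLM server from the previous step to be running:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```

For the SGLang server described further down, only `base_url` should need to change (port 30000 instead of 8000).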

Use Docker

docker model run hf.co/deepvk/llava-saiga-8b

SGLang

How to use deepvk/llava-saiga-8b with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "deepvk/llava-saiga-8b" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepvk/llava-saiga-8b",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "deepvk/llava-saiga-8b" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepvk/llava-saiga-8b",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'


LLaVA-Saiga-8b

LLaVA-Saiga-8b is a Vision-Language Model (VLM) based on the IlyaGusev/saiga_llama3_8b model and trained in the original LLaVA setup. It is primarily adapted to Russian but can still work with English.

Usage

The model can be used through the transformers API:

import requests

from PIL import Image
from transformers import AutoProcessor, AutoTokenizer, LlavaForConditionalGeneration

model_name = "deepvk/llava-saiga-8b"

# Load the model, its image/text processor, and the tokenizer that carries the chat template
model = LlavaForConditionalGeneration.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Download the example image
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
img = Image.open(requests.get(url, stream=True).raw)
# "<image>" marks where the image goes; the prompt means "Describe the picture in a few words."
messages = [
    {"role": "user", "content": "<image>\nОпиши картинку несколькими словами."}
]

# Render the chat template to a prompt string, then tokenize the text and preprocess the image
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[img], text=text, return_tensors="pt")

# Generate, then decode only the newly generated tokens (the prompt is sliced off)
generate_ids = model.generate(**inputs, max_new_tokens=30)
answer = tokenizer.decode(generate_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(answer)

Use the <image> tag to point to an image in the text and follow the chat template for a multi-turn conversation. The model is capable of chatting without any images or working with multiple images in a conversation, but this behavior has not been tested.
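
As a rough illustration of the multi-turn format, the sketch below builds a conversation in which only the first user turn carries an image. The helper `add_turn` and the assistant reply are hypothetical, and, as noted above, multi-turn and multi-image behavior has not been tested. The sketch assumes that the number of `<image>` tags must match the number of images later passed to the processor.

```python
def add_turn(messages, role, text, with_image=False):
    """Return a new message list with one more turn appended."""
    content = f"<image>\n{text}" if with_image else text
    return messages + [{"role": role, "content": content}]


messages = []
# "Describe the picture in a few words." (first turn, refers to the image)
messages = add_turn(messages, "user", "Опиши картинку несколькими словами.", with_image=True)
# Hypothetical model reply: "A stop sign on the street."
messages = add_turn(messages, "assistant", "Знак «Стоп» на улице.")
# "What color is the sign?" (follow-up turn without a new image)
messages = add_turn(messages, "user", "Какого цвета знак?")

# Count the image placeholders so the same number of images can be supplied.
num_images = sum(m["content"].count("<image>") for m in messages)
print(num_images)
```

The resulting `messages` list can be passed to `tokenizer.apply_chat_template` in the same way as the single-turn example.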

The model format allows it to be used directly in popular frameworks. For example, you can test the model with lmms-eval; see the Results section for details.

Train

To train this model, we followed the original LLaVA pipeline and reused the haotian-liu/LLaVA framework.

The model was trained in two stages:

1. The adapter was trained using pre-training data from ShareGPT4V.
2. Instruction tuning trained both the LLM and the adapter. For this stage we used:
   - deepvk/LLaVA-Instruct-ru: our new dataset of VLM instructions in Russian.
   - deepvk/GQA-ru: the training part of the popular GQA benchmark, translated into Russian, with the post-prompt "Ответь одним словом. " ("Answer with one word.").
   - Instruction data from ShareGPT4V.
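
The two stages can be summarized as a small configuration sketch. This is only an illustration, not the actual training code: the component names, the `STAGES` layout, and both helper functions are hypothetical. The frozen vision encoder is an assumption carried over from the original LLaVA recipe, and joining the question and post-prompt with a single space is also an assumption.

```python
# Post-prompt quoted in the list above ("Answer with one word.").
POST_PROMPT = "Ответь одним словом. "

# Trainable components per stage, as described above. The frozen vision
# encoder is an assumption based on the original LLaVA recipe.
STAGES = {
    "pretraining": {
        "trainable": {"adapter"},
        "frozen": {"llm", "vision_encoder"},
        "data": ["ShareGPT4V (pre-training data)"],
    },
    "instruction_tuning": {
        "trainable": {"adapter", "llm"},
        "frozen": {"vision_encoder"},
        "data": [
            "deepvk/LLaVA-Instruct-ru",
            "deepvk/GQA-ru",
            "ShareGPT4V (instruction data)",
        ],
    },
}


def is_trainable(stage, component):
    """Check whether a component receives gradient updates in a stage."""
    return component in STAGES[stage]["trainable"]


def with_post_prompt(question):
    """Append the GQA-ru post-prompt (single-space join is an assumption)."""
    return f"{question.strip()} {POST_PROMPT}"


print(is_trainable("pretraining", "llm"))
print(with_post_prompt("Какого цвета знак?"))  # "What color is the sign?"
```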

The entire training process took 3-4 days on 8 x A100 80GB.

Results

The model's performance was evaluated using the lmms-eval framework:

accelerate launch -m lmms_eval --model llava_hf --model_args pretrained="deepvk/llava-saiga-8b" \
  --tasks gqa-ru,mmbench_ru_dev,gqa,mmbench_en_dev --batch_size 1 \
  --log_samples --log_samples_suffix llava-saiga-8b --output_path ./logs/
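
To rerun the evaluation for a subset of tasks or another checkpoint, the command above can be assembled programmatically. The sketch below uses only the flags that appear in that command; the helper name `build_lmms_eval_command` is hypothetical, and the command is printed rather than executed.

```python
import shlex


def build_lmms_eval_command(model_name, tasks, suffix, output_path="./logs/"):
    """Assemble the lmms-eval invocation as an argument list."""
    return [
        "accelerate", "launch", "-m", "lmms_eval",
        "--model", "llava_hf",
        "--model_args", f"pretrained={model_name}",
        "--tasks", ",".join(tasks),
        "--batch_size", "1",
        "--log_samples",
        "--log_samples_suffix", suffix,
        "--output_path", output_path,
    ]


command = build_lmms_eval_command(
    model_name="deepvk/llava-saiga-8b",
    tasks=["gqa-ru", "mmbench_ru_dev"],  # Russian tasks only
    suffix="llava-saiga-8b",
)
print(shlex.join(command))
```

The list form can be passed directly to `subprocess.run(command, check=True)` on a machine where accelerate and lmms-eval are installed.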

| Model | GQA | GQA-ru | MMBench | MMBench-ru |
|---|---|---|---|---|
| `deepvk/llava-gemma-2b-lora` | 56.39 | 46.37 | 51.72 | 40.19 |
| `Intel/llava-gemma-2b` | 59.80 | 0.20 | 39.40 | 28.30 |
| `deepvk/llava-saiga-8b` [this model] | 62.00 | 51.44 | 64.26 | 56.65 |
| `llava-hf/llava-1.5-7b-hf` | 61.31 | 28.39 | 62.97 | 52.25 |
| `llava-hf/llava-v1.6-mistral-7b-hf` | 64.65 | 6.65 | 67.70 | 48.80 |
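
One way to read the table is to compare each model's English score with its Russian counterpart. The short sketch below computes that gap from the numbers above; the variable and function names are illustrative only. For this model the gap is 10.56 points on GQA and 7.61 points on MMBench.

```python
# Scores copied from the table above: (GQA, GQA-ru, MMBench, MMBench-ru).
RESULTS = {
    "deepvk/llava-gemma-2b-lora": (56.39, 46.37, 51.72, 40.19),
    "Intel/llava-gemma-2b": (59.80, 0.20, 39.40, 28.30),
    "deepvk/llava-saiga-8b": (62.00, 51.44, 64.26, 56.65),
    "llava-hf/llava-1.5-7b-hf": (61.31, 28.39, 62.97, 52.25),
    "llava-hf/llava-v1.6-mistral-7b-hf": (64.65, 6.65, 67.70, 48.80),
}


def language_gaps(scores):
    """Return (GQA - GQA-ru, MMBench - MMBench-ru), rounded to 2 decimals."""
    gqa, gqa_ru, mmbench, mmbench_ru = scores
    return round(gqa - gqa_ru, 2), round(mmbench - mmbench_ru, 2)


gaps = {model: language_gaps(scores) for model, scores in RESULTS.items()}
for model, (gqa_gap, mmbench_gap) in gaps.items():
    print(f"{model}: GQA gap {gqa_gap:.2f}, MMBench gap {mmbench_gap:.2f}")
```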

Note: for MMBench we did not use the OpenAI API to extract the answer from the generated string. The score is therefore close to exact match, as in the GQA benchmark.
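
To make the note above concrete, here is a minimal sketch of exact-match scoring. The normalization (trimming whitespace, lowercasing, dropping a trailing period) is an assumption chosen for illustration; the scoring actually used by lmms-eval may differ.

```python
def normalize(answer):
    """Trim whitespace, lowercase, and drop a trailing period (assumed rules)."""
    return answer.strip().lower().rstrip(".")


def exact_match_accuracy(predictions, references):
    """Percentage of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


# The third prediction names the right option but is not an exact match,
# so it counts as wrong when no answer extraction step is applied.
score = exact_match_accuracy(["A", "b.", "The answer is C"], ["A", "B", "C"])
print(f"{score:.2f}")
```

Without an extraction step, verbose but correct answers are penalized, which is why such scores can be lower than those reported with API-based answer matching.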

Citation

@misc{liu2023llava,
  title={Visual Instruction Tuning}, 
  author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
  publisher={NeurIPS},
  year={2023},
}

@misc{deepvk2024llava-saiga-8b,
    title={LLaVA-Saiga-8b},
    author={Belopolskih, Daniil and Spirin, Egor},
    url={https://huggingface.co/deepvk/llava-saiga-8b},
    publisher={Hugging Face},
    year={2024},
}