Has anybody been able to run their chat.py on a Mac?

by neodymion - opened Feb 28, 2025

Discussion

neodymion

Feb 28, 2025 (edited Feb 28, 2025)

Thanks for uploading. But I am struggling to get chat.py to run on an M2 Pro with 32 GB.

It won't run on Apple Silicon MPS because it uses bfloat16. I tried changing that to float32, but then it did not run either. :D
Now it is running on the CPU, but it takes ages to reply. All I entered was "hi".
Isn't this model supposed to be faster? Is there anything I need to change?

Modifications to chat.py:

import torch
from generate import generate
from transformers import AutoTokenizer, AutoModel

def chat():
    device = 'cpu'  # <-- force CPU use
    model = AutoModel.from_pretrained('GSAI-ML/LLaDA-8B-Instruct', trust_remote_code=True, torch_dtype=torch.bfloat16).to(device).eval()
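
A back-of-the-envelope check may help explain the float32 attempt. The post does not say why it failed, so this is only one plausible reason: assuming roughly 8 billion parameters (inferred from the "8B" in the model name, not measured), the weights alone would need about 16 GB in bfloat16 and about 32 GB in float32, which leaves no headroom on a 32 GB machine. The helper name `weight_memory_gb` is made up for this sketch.

```python
# Rough weight-memory estimate; ignores activations and framework overhead.
BYTES_PER_PARAM = {"bfloat16": 2, "float16": 2, "float32": 4}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the size of the weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumption: ~8e9 parameters, inferred from the "8B" in the model name.
NUM_PARAMS = 8e9

for dtype in ("bfloat16", "float32"):
    print(f"{dtype}: ~{weight_memory_gb(NUM_PARAMS, dtype):.0f} GB")
```

If that estimate is in the right range, a 16-bit dtype is the only option that fits in 32 GB of unified memory, whichever device ends up running it.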

21world

Mar 1, 2025 (edited Mar 1, 2025)

Linux Fedora, CPU: it works, but on only 1 thread. A dual Xeon never answered (5-10 minute waits).

neodymion

Mar 1, 2025

Okay thanks, so it's the threading. Thanks for the reply.

neodymion

Mar 1, 2025

Changing threads to 1 did not help. After a 30-minute wait, still no output.

import os
import torch

# Set single thread environment variables
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Configure PyTorch thread settings
torch.set_num_threads(1)
torch.set_num_interop_threads(1)

# Check for MPS availability (macOS 12.3+ and PyTorch 1.12+ required)
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')

from generate import generate
from transformers import AutoTokenizer, AutoModel

def chat():
    # Load the model in bfloat16 on CPU first to avoid MPS dtype issues
    model = AutoModel.from_pretrained(
        'GSAI-ML/LLaDA-8B-Instruct',
        trust_remote_code=True,
        torch_dtype=torch.bfloat16  # Load weights in bfloat16
    ).to('cpu').eval()
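
If the goal is to use more threads rather than fewer, the same environment variables can be pointed at the machine's logical CPU count instead of 1. This is a minimal sketch, not part of chat.py: `configure_thread_env` is a hypothetical helper, and it assumes these variables are read when the numerical libraries are first loaded, so it should run before any `import torch`. Whether more threads actually speeds up this model on a given machine is untested here.

```python
import os

# Thread-count variables taken from the post above.
# TOKENIZERS_PARALLELISM is a true/false switch, so it is left out.
THREAD_ENV_VARS = (
    "OMP_NUM_THREADS",
    "MKL_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def configure_thread_env(num_threads=None):
    """Set every thread-count variable and return the value that was used."""
    if num_threads is None:
        # os.cpu_count() reports logical CPUs and may return None.
        num_threads = os.cpu_count() or 1
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(num_threads)
    return num_threads


n = configure_thread_env()
print(f"thread environment variables set to {n}")
# Import torch only after this point; torch.set_num_threads(n) can then
# be called as in the post above.
```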

21world

Mar 1, 2025

:)
yes :))))

21world

Mar 1, 2025

The correct thread count is CPU cores x 2, for example 24 cores x 2 = 48 threads.
With 1 thread, only 1/48 of the machine is used, about 2%-4% CPU load.
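
The arithmetic above can be checked in a few lines. This sketch assumes two hardware threads per physical core, as in the example; `logical_threads` and `cpu_share_percent` are made-up helper names.

```python
def logical_threads(physical_cores: int, threads_per_core: int = 2) -> int:
    """Hardware threads available, assuming a fixed number per core."""
    return physical_cores * threads_per_core


def cpu_share_percent(threads_used: int, total_threads: int) -> float:
    """Share of the machine kept busy by `threads_used` fully loaded threads."""
    return 100 * threads_used / total_threads


total = logical_threads(24)
print(total)                                  # 48
print(f"{cpu_share_percent(1, total):.1f}%")  # 2.1%
```

On the machine itself, `os.cpu_count()` from the standard library reports the logical CPU count directly.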

21world

Mar 1, 2025

When I run chat.py, it uses only one thread, not the maximum number of threads available.
I will try your code later.

spawn99

Mar 1, 2025

I'm working on MLX, stand by.

nieshen

GSAI-ML org Mar 4, 2025

I'm extremely sorry. I'm not very familiar with running our code on a Mac, and I'm eagerly looking forward to more help from the community!
