Instructions to use Intel/gpt-j-6B-int8-static-inc with libraries, inference providers, notebooks, and local apps.

Libraries

How to use Intel/gpt-j-6B-int8-static-inc with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Intel/gpt-j-6B-int8-static-inc")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Intel/gpt-j-6B-int8-static-inc")
model = AutoModelForCausalLM.from_pretrained("Intel/gpt-j-6B-int8-static-inc")

Local Apps

vLLM

How to use Intel/gpt-j-6B-int8-static-inc with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Intel/gpt-j-6B-int8-static-inc"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Intel/gpt-j-6B-int8-static-inc",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
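
The same request can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_completion_request` and `complete` are hypothetical and chosen for illustration, and the sketch assumes the server is reachable at `http://localhost:8000` and returns the usual OpenAI-style `choices` list.

```python
import json
import urllib.request

MODEL_ID = "Intel/gpt-j-6B-int8-static-inc"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build a POST request for the OpenAI-compatible /v1/completions route.

    Hypothetical helper: it mirrors the curl payload shown above.
    """
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion.

    Assumes an OpenAI-style response body with a "choices" list.
    """
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Requires a running server, so the call is left commented out:
# print(complete("Once upon a time,"))
```

Passing a different `base_url` points the same helper at another OpenAI-compatible endpoint.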

Use Docker

docker model run hf.co/Intel/gpt-j-6B-int8-static-inc

SGLang

How to use Intel/gpt-j-6B-int8-static-inc with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Intel/gpt-j-6B-int8-static-inc" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Intel/gpt-j-6B-int8-static-inc",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Intel/gpt-j-6B-int8-static-inc" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Intel/gpt-j-6B-int8-static-inc",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use Intel/gpt-j-6B-int8-static-inc with Docker Model Runner:
```
docker model run hf.co/Intel/gpt-j-6B-int8-static-inc
```

Model Details: INT8 GPT-J 6B

GPT-J 6B is a transformer model trained using Ben Wang's Mesh Transformer JAX. "GPT-J" refers to the class of model, while "6B" represents the number of trainable parameters.

This INT8 ONNX model was generated with neural-compressor. The FP32 model it was quantized from can be exported with the command below:

python -m transformers.onnx --model=EleutherAI/gpt-j-6B onnx_gptj/ --framework pt --opset 13 --feature=causal-lm-with-past
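
To give a rough idea of what "static" INT8 quantization means, here is a toy sketch in plain Python. It is not neural-compressor's implementation: the symmetric per-tensor scheme, the function names, and the sample values are all assumptions made for illustration. The point it shows is that the scale is fixed ahead of time from calibration data and then reused unchanged at inference.

```python
def calibrate_scale(calibration_values):
    """Derive a fixed scale from calibration data (assumed symmetric scheme)."""
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(values, scale):
    """Map floats to integers in [-127, 127] using the precomputed scale."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


# Illustrative calibration values; the largest magnitude is 1.27.
calibration = [-1.0, -0.5, 0.0, 0.75, 1.27]
scale = calibrate_scale(calibration)  # about 0.01, fixed from here on

# At inference the same scale is reused, so values outside the
# calibrated range saturate at the integer limits.
inputs = [0.5, -1.0, 3.0]
quantized = quantize(inputs, scale)
print(quantized)  # [50, -100, 127]
print(dequantize(quantized, scale))
```

The last input, 3.0, lies outside the calibrated range and saturates at 127, which is why static quantization depends on representative calibration data.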

| Model Detail | Description |
| --- | --- |
| Model Authors - Company | Intel |
| Date | April 10, 2022 |
| Version | 1 |
| Type | Text Generation |
| Paper or Other Resources | - |
| License | Apache 2.0 |
| Questions or Comments | Community Tab |

| Intended Use | Description |
| --- | --- |
| Primary intended uses | You can use the raw model for text generation inference |
| Primary intended users | Anyone doing text generation inference |
| Out-of-scope uses | This model in most cases will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people. |

How to use

Download the model and script by cloning the repository:

git clone https://huggingface.co/Intel/gpt-j-6B-int8-static-inc

Then run inference with the model using the notebook 'evaluation.ipynb' included in the repository.

Metrics (Model Performance):

| Model | Model Size (GB) | Lambada Acc |
| --- | --- | --- |
| FP32 | 23 | 0.7954 |
| INT8 | 6 | 0.7944 |
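
The trade-off in the table can be worked out directly. The short calculation below uses only the four reported numbers; the variable names are illustrative.

```python
# Values taken from the metrics table above.
fp32_size_gb, int8_size_gb = 23, 6
fp32_acc, int8_acc = 0.7954, 0.7944

size_ratio = fp32_size_gb / int8_size_gb   # how many times smaller INT8 is
abs_drop = fp32_acc - int8_acc             # absolute Lambada accuracy drop
rel_drop_pct = 100 * abs_drop / fp32_acc   # drop relative to FP32, in percent

print(f"size reduction: {size_ratio:.2f}x")          # size reduction: 3.83x
print(f"absolute accuracy drop: {abs_drop:.4f}")     # absolute accuracy drop: 0.0010
print(f"relative accuracy drop: {rel_drop_pct:.3f}%")  # relative accuracy drop: 0.126%
```

In other words, the INT8 model is roughly 3.8 times smaller than FP32 while losing about 0.1 percentage points of Lambada accuracy.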
