vLLM is one of the most exciting LLM projects today. With over 200k monthly downloads and a permissive Apache 2.0 license, vLLM is becoming an increasingly popular way to serve LLMs at scale.

In this tutorial, I’ll show you how you can configure and run vLLM to serve open-source LLMs in production.

Getting started with vLLM

For those new to vLLM, let’s first explain what vLLM is.

vLLM is an open-source project for LLM inference and serving. Inference means you can download model weights and pass them to vLLM to generate text via its Python API; here’s an example from the documentation:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# initialize
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="facebook/opt-125m")

# perform the inference
outputs = llm.generate(prompts, sampling_params)

# print outputs
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

In this regard, vLLM is similar to Hugging Face’s transformers library. As a comparison, here’s how you’d run inference on the same model using transformers:

from transformers import pipeline

generator = pipeline('text-generation', model="facebook/opt-125m")
generator("Hello, my name is")

Running inference via the Python API, as shown in the previous example, is fine for quick testing, but in a production setting we want to offer a simple interface so other parts of the system can call the model easily. A great solution is to expose the model via an API.

Let’s say you found out about vLLM and now want to build a REST API to serve a model; you might build a Flask app like this:

from flask import Flask, request, jsonify
from vllm import LLM, SamplingParams

app = Flask(__name__)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="facebook/opt-125m")

@app.route('/generate', methods=['POST'])
def generate():
    data = request.get_json()
    prompts = data.get('prompts', [])

    outputs = llm.generate(prompts, sampling_params)

    # Prepare the outputs.
    results = []

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        results.append({
            'prompt': prompt,
            'generated_text': generated_text
        })

    return jsonify(results)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Our users can now consume our model by hitting the /generate endpoint. However, this approach has many limitations: if many users hit the endpoint simultaneously, Flask will try to run them concurrently and crash. We also need to implement our own authentication mechanism. Finally, interoperability is limited: users must read our model’s REST API documentation to interact with it.

This is where the serving part of vLLM shines, since it provides all of this for us. If vLLM’s Python API is akin to the transformers library, vLLM’s server is akin to Hugging Face’s Text Generation Inference (TGI).

Now that we have explained the basics of vLLM, let’s install it!

Installing vLLM

Installing vLLM is simple:

pip install vllm

Keep in mind that vLLM requires Linux and Python >=3.8. Furthermore, it requires a GPU with compute capability >=7.0 (e.g., V100, T4, RTX20xx, A100, L4, H100).

Finally, vLLM is compiled with CUDA 12.1, so you need to ensure your machine is running that CUDA version. To check it, run:

nvcc --version

If you’re not running CUDA 12.1, you can either install a version of vLLM compiled for the CUDA version you’re running (see the installation instructions to learn more) or install CUDA 12.1.

Checking your installation

Before continuing, I’d advise you to check your installation by running some sanity checks:

# ensure torch is working with CUDA, this should print: True
python -c 'import torch; print(torch.cuda.is_available())'
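
You can also check which CUDA version your PyTorch build was compiled against; this should line up with the CUDA 12.1 requirement mentioned above:

# print the CUDA version PyTorch was built with, e.g., 12.1
python -c 'import torch; print(torch.version.cuda)'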

Now, store the following in a check-vllm.py file:

from vllm import LLM, SamplingParams
prompts = [
    "Mexico is famous for ",
    "The largest country in the world is ",
]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="facebook/opt-125m")
responses = llm.generate(prompts, sampling_params)

for response in responses:
    print(response.outputs[0].text)

And run the script:

python check-vllm.py

After the model is loaded, you’ll see some output; in my case, I got this:

~~national~~ cultural and artistic art. They've already worked with him.

~~the country~~ a capitalist system with the highest GDP per capita in the world

Starting the vLLM server

Now that we have vLLM installed, let’s start the server. The basic command is as follows:

python -m vllm.entrypoints.openai.api_server --model=MODELTORUN

Where MODELTORUN is the model you want to serve. For example, to serve google/gemma-2b:

python -m vllm.entrypoints.openai.api_server --model=google/gemma-2b

Note that some models, such as google/gemma-2b, require you to accept their license; hence, you need to create a Hugging Face account, accept the model’s license, and generate a token.

For example, when opening google/gemma-2b on Hugging Face (you need to be logged in), you’ll see this:

(Screenshot: prompt to accept the Gemma license on Hugging Face)

Once you accept the license, head over to the tokens section and grab a token. Then, before starting vLLM, set the token as follows:

export HF_TOKEN=YOURTOKEN

Once the token is set, you can start the server.

python -m vllm.entrypoints.openai.api_server --model=google/gemma-2b

Note that the token is required even if you’ve already downloaded the weights; otherwise, you’ll get the following error:

  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/hf_file_system.py", line 863, in _raise_file_not_found
    raise FileNotFoundError(msg) from err
FileNotFoundError: google/gemma-2b (repository not found)

Setting the dtype

One important setting to consider is dtype, which controls the data type for the model weights. You might need to tweak this parameter depending on your GPU. For example, trying to run google/gemma-2b:

# --dtype=auto is the default value
python -m vllm.entrypoints.openai.api_server --model=google/gemma-2b --dtype=auto

On an NVIDIA Tesla T4 yields the following error:

ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0.
Your Tesla T4 GPU has compute capability 7.5. You can use float16 instead by explicitly
setting the`dtype` flag in CLI, for example: --dtype=half.

Changing the --dtype flag allows us to run the model on a T4:

python -m vllm.entrypoints.openai.api_server --model=google/gemma-2b --dtype=half

If this is the first time you start vLLM with a given --model, it’ll take a few minutes since it has to download the weights. Subsequent starts will be faster since the weights are cached in the ~/.cache directory; however, the model still has to be loaded into memory, so startup will take some time depending on the model size.

If you see a message like this:

INFO:     Started server process [428]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:80 (Press CTRL+C to quit)

vLLM is ready to accept requests!
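
Before sending completion requests, you can run a quick sanity check by listing the models the server exposes via the OpenAI-compatible /v1/models endpoint. Here’s a minimal sketch that assumes the server is reachable at http://localhost:80:

# quick sanity check: list the models the vLLM server exposes
# (assumes the server is reachable at http://localhost:80)
import requests

response = requests.get("http://localhost:80/v1/models")
print(response.json())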

Making requests

Once the server is running, you can make requests; here’s an example using google/gemma-2b and the Python requests library:

# remember to run: pip install requests
import requests
import json

# change for your host
VLLM_HOST = "https://autumn-snow-1380.ploomberapp.io"
url = f"{VLLM_HOST}/v1/completions"

headers = {"Content-Type": "application/json"}
data = {
    "model": "google/gemma-2b",
    "prompt": "JupySQL is",
    "max_tokens": 100,
    "temperature": 0
}

response = requests.post(url, headers=headers, data=json.dumps(data))

print(response.json()["choices"][0]["text"])

This is the response that I got:

JupySQL is a Python library that allows you to create and run SQL queries in Jupyter notebooks. It is a powerful tool for data analysis and visualization, and can be used to explore and manipulate large datasets.

How does JupySQL work?

JupySQL works by connecting to a database server and executing SQL queries. It supports a wide range of databases, including MySQL, PostgreSQL, and SQLite.

Once you have connected to a database, you can create and run SQL queries in

Accurate!

Using the OpenAI client

vLLM exposes an API that mimics OpenAI’s, which means you can use OpenAI’s Python package but direct the calls to your vLLM server. Let’s see an example:

# NOTE: remember to run: pip install openai
from openai import OpenAI

# we haven't configured authentication, we pass a dummy value
openai_api_key = "EMPTY"
# modify this value to match your host, remember to add /v1 at the end
openai_api_base = "https://autumn-snow-1380.ploomberapp.io/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(model="google/gemma-2b",
                                      prompt="JupySQL is",
                                      max_tokens=20)
print(completion.choices[0].text)

I got the following output:

a powerful SQL editor and IDE. It integrates with Google Jupyter Notebook,
which allows users to create and

Using the chat API

The previous example used the completions API, but you might be more familiar with the chat API. Note that if you use the chat API, you must use an instruction-tuned model. google/gemma-2b is not tuned for instructions, so let’s use google/gemma-2b-it instead. Let’s start our vLLM server with that model:

python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 --port 80 \
    --model google/gemma-2b-it \
    --dtype=half

Now we can use the client.chat.completions.create function:

# NOTE: remember to run: pip install openai
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "https://autumn-snow-1380.ploomberapp.io/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="google/gemma-2b-it",
    messages=[
        {"role": "user", "content": "Tell me in one sentence what Mexico is famous for"},
    ]
)
print(chat_response.choices[0].message.content)

Output:

Mexico is known for its rich culture, vibrant cities, stunning natural beauty,
and delicious cuisine.

Sounds accurate!

If you’ve used OpenAI’s API before, you might remember that the messages argument usually contains some messages with {"role": "system", "content": ...}:

chat_response = client.chat.completions.create(
    model="google/gemma-2b-it",
    messages=[
        {"role": "system", "content": "You're a helpful assistant."},
        {"role": "user", "content": "Tell me in one sentence what Mexico is famous for"},
    ]
)

However, some models do not support the system role. For example, google/gemma-2b-it returns the following:

BadRequestError: Error code: 400 - {'object': 'error', 'message': 'System role not
supported', 'type': 'BadRequestError', 'param': None, 'code': 400}

Check your model’s documentation to know how to use the chat API.
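
If your model rejects the system role, a common workaround (sketched below, reusing the client from the previous snippets; this isn’t specific to vLLM) is to fold the system instructions into the user message:

# workaround for models that reject {"role": "system", ...}:
# prepend the system instructions to the user message instead
system_instructions = "You're a helpful assistant."
user_message = "Tell me in one sentence what Mexico is famous for"

chat_response = client.chat.completions.create(
    model="google/gemma-2b-it",
    messages=[
        {"role": "user", "content": f"{system_instructions}\n\n{user_message}"},
    ],
)
print(chat_response.choices[0].message.content)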

Security settings

By default, your server won’t have any authentication. If you’re planning to expose your server to the internet, ensure you set an API key; you can generate one as follows:

export VLLM_API_KEY=$(python -c 'import secrets; print(secrets.token_urlsafe())')
# print the API key
echo $VLLM_API_KEY

And start vLLM:

python -m vllm.entrypoints.openai.api_server --model google/gemma-2b-it --dtype=half

Now, our server will be protected, and all requests that don’t have the API key will be rejected. Note that in the previous command, we did not pass --api-key because vLLM will automatically read the VLLM_API_KEY environment variable.

Test that your server enforces API key authentication by making a call using any of the earlier Python snippets without passing a key; you’ll see the following error:

AuthenticationError: Error code: 401 - {'error': 'Unauthorized'}

To fix this, initialize the OpenAI client with the correct API key:

from openai import OpenAI

openai_api_key = "THE_ACTUAL_API_KEY"
openai_api_base = "https://autumn-snow-1380.ploomberapp.io/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
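
If you’re calling the server with the requests library instead of the OpenAI client, pass the key in the Authorization header, using the same Bearer scheme the OpenAI API uses; here’s a minimal sketch:

import requests

# same host as before; replace the key with the value of VLLM_API_KEY
VLLM_HOST = "https://autumn-snow-1380.ploomberapp.io"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer THE_ACTUAL_API_KEY",
}
data = {
    "model": "google/gemma-2b-it",
    "prompt": "JupySQL is",
    "max_tokens": 20,
}

response = requests.post(f"{VLLM_HOST}/v1/completions", headers=headers, json=data)
print(response.json()["choices"][0]["text"])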

Another essential security requirement is to serve your API via HTTPS; however, this requires extra configuration, such as getting a TLS certificate. If you want to avoid this headache, skip to the final section, where we show a one-click solution for securely deploying a vLLM server.

Considerations for a production deployment

Here are some considerations for a production deployment:

When deploying vLLM, you must ensure that the API restarts if it crashes (or if the physical server is restarted). You can do so with tools such as systemd.
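
For example, here’s a minimal systemd unit sketch; the file path, environment values, and model are assumptions, so adjust them to your setup:

# /etc/systemd/system/vllm.service (sketch; adjust values to your environment)
[Unit]
Description=vLLM OpenAI-compatible server
After=network.target

[Service]
Environment=HF_TOKEN=YOURTOKEN
Environment=VLLM_API_KEY=YOURKEY
ExecStart=/usr/bin/python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 80 --model google/gemma-2b-it --dtype=half
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

After saving the file, you’d enable it with systemctl daemon-reload followed by systemctl enable --now vllm.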

To make your deployment more portable, we recommend using Docker (more in the next section). Also, make sure to pin all Python dependencies so upgrades don’t break your installation (e.g., using pip freeze), as sketched below.
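
For pinning, a simple approach is to freeze the environment you’ve verified works and install from that file going forward:

# capture the exact version of every installed package
pip freeze > requirements.txt

# later (e.g., when rebuilding your image), reproduce the same environment
pip install -r requirements.txt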

Using PyTorch’s docker image

We recommend using PyTorch’s official Docker image since it already comes with torch and CUDA drivers installed.

Here’s a sample Dockerfile you can use:

FROM pytorch/pytorch:2.1.2-cuda12.1-cudnn8-devel

WORKDIR /srv
RUN pip install vllm==0.3.3 --no-cache-dir

# if the model you want to serve requires you to accept its license terms,
# you must pass an HF_TOKEN environment variable; also pass a VLLM_API_KEY
# environment variable to authenticate your API
ENTRYPOINT ["python", "-m", "vllm.entrypoints.openai.api_server", \
            "--host", "0.0.0.0", "--port", "80", \
            "--model", "google/gemma-2b-it", \
            # depending on your GPU, you might or might not need to pass --dtype
            "--dtype=half"]

Cautionary tale about a bug in the transformers==4.39.1 package

tl;dr: when installing vLLM in the official PyTorch Docker image, ensure you use the image with the correct PyTorch version. To do so, check the corresponding pyproject.toml file.

While developing this guide, we encountered a bug in the transformers package. We wrote a Dockerfile that used torch==2.2.2 (the most recent version at the time of writing) and then installed vllm==0.3.3:

FROM pytorch/pytorch:2.2.2-cuda12.1-cudnn8-devel
RUN pip install vllm==0.3.3

However, when starting the vLLM server, we encountered the following error:

File /opt/conda/lib/python3.10/site-packages/transformers/utils/generic.py:478
    475     return output_type(**dict(zip(context, values)))
    477 if version.parse(get_torch_version()) >= version.parse("2.2"):
--> 478     _torch_pytree.register_pytree_node(
    479         ModelOutput,
    480         _model_output_flatten,
    481         partial(_model_output_unflatten, output_type=ModelOutput),
    482         serialized_type_name=f"{ModelOutput.__module__}.{ModelOutput.__name__}",
    483     )
    484 else:
    485     _torch_pytree._register_pytree_node(
    486         ModelOutput,
    487         _model_output_flatten,
    488         partial(_model_output_unflatten, output_type=ModelOutput),
    489     )

AttributeError: module 'torch.utils._pytree' has no attribute 'register_pytree_node'

Upon further investigation, we realized that the problem is in the transformers package, specifically in the _is_package_available function. This function determines the current torch version, which is used in several parts of the codebase. Even though vLLM does not use transformers for inference, it seems to use it for loading model configuration parameters. The problem is that the transformers library uses a method that might return an incorrect version.

In our case, the Docker image had torch==2.2.2, but since vllm==0.3.3 requires torch==2.1.2, running pip install vllm==0.3.3 downgraded PyTorch to version 2.1.2. However, transformers still thought it had torch==2.2.2, crashing execution.

This happened with transformers==4.39.1, so it might be fixed in future versions.
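
To catch this kind of mismatch early, you can print the versions that actually ended up installed inside the image; a quick check:

# verify which torch and transformers versions are actually installed
python -c 'import torch, transformers; print(torch.__version__, transformers.__version__)'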

Deploying on Ploomber Cloud

If you want to skip the configuration headache, you can deploy vLLM on Ploomber Cloud with one click. We ensure that:

  • All the proper CUDA drivers are installed
  • The hardware vLLM runs on is optimized to maximize efficiency
  • A TLS certificate is provided so you can serve over HTTPS
  • You can stop the server at any time to save costs

To learn more, check our documentation.