A guide to deploying Outlines and structuring LLM outputs


Overview of Outlines

Outlines is a Python library that simplifies structured generation with Large Language Models (LLMs). Structured generation constrains the output of an LLM so that it follows a predefined format, which is very useful when you rely on LLMs to produce any form of structured data. Here are a few reasons why you might want to use it:

  • Structured output ensures that the generated text conforms to a specific format or schema, making it easier to integrate with other systems, APIs, or applications.
  • Outputs generated in standardized formats such as JSON or CSV can be easily parsed by automated scripts, algorithms, or data pipelines. This enables efficient extraction of relevant information and subsequent analysis.
  • Organizing the information into well-defined fields makes it easier to understand the content and its context.
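
To make the parsing point concrete, here is a minimal sketch using only Python's standard library. The two sample strings are made up for illustration and are not real model outputs.

```python
import json

# Hypothetical model outputs for the same question (illustrative only).
free_text = "Sure! The city is Paris and its population is about 2.1 million."
structured = '{"city": "Paris", "population": 2100000}'

# Structured output parses in one call and exposes typed fields.
record = json.loads(structured)
print(record["city"], record["population"])

# Free text is not valid JSON, so a pipeline would need custom extraction logic.
try:
    json.loads(free_text)
except json.JSONDecodeError:
    print("free text could not be parsed as JSON")
```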

Main features

Now let’s discuss the key features provided by Outlines:

  • JSON structured generation: Outlines can make any open-source model return a JSON object that follows a user-specified structure. This is useful when the model’s output needs downstream processing within our codebase, for example:
    • Parse the answer (e.g. with Pydantic), store it somewhere, return it to a user, etc.
    • Call a function with the result
  • JSON mode for vLLM: Lets you deploy an LLM service with vLLM that returns JSON structured output.
  • Make LLMs follow a Regex: This feature guarantees that the text generated by the LLM matches a given regular expression.
  • Powerful prompt templating: Outlines simplifies prompt management with prompt functions. Prompt functions are Python functions containing prompt templates in their docstrings. Arguments of these functions correspond to prompt variables, and when invoked, they return the template populated with argument values.
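
The idea behind prompt functions can be sketched with the standard library alone. This is an illustration of the concept, not Outlines' implementation: the `prompt` decorator below is a hypothetical stand-in, and it fills placeholders with `str.format` rather than the templating engine Outlines uses.

```python
import inspect
from functools import wraps


def prompt(fn):
    """Turn a function into a prompt function: its docstring is the template."""
    template = inspect.cleandoc(fn.__doc__ or "")
    signature = inspect.signature(fn)

    @wraps(fn)
    def render(*args, **kwargs):
        # Bind the call's arguments to the function's parameter names.
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()
        # Fill the template placeholders with the argument values.
        return template.format(**bound.arguments)

    return render


@prompt
def classify(review, labels="positive, negative"):
    """Classify the following review as one of: {labels}.

    Review: {review}
    Label:"""


print(classify("The battery lasts all day."))
```

Calling `classify(...)` never runs the function body; it only returns the docstring with `{review}` and `{labels}` replaced by the argument values.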

To find out more about the features of Outlines, check out their documentation.


Outlines can be installed by running the following command:

pip install outlines

Outlines can also be deployed as an LLM service with vLLM and a FastAPI server. vLLM isn’t installed by default, so you’ll need to install it separately:

pip install outlines[serve]

Keep in mind that vLLM requires Linux and Python >=3.8. Furthermore, it requires a GPU with compute capability >=7.0 (e.g., V100, T4, RTX20xx, A100, L4, H100).

Finally, vLLM is compiled with CUDA 12.1, so you need to make sure your machine is running that CUDA version. To check it, run:

nvcc --version

If you’re not running CUDA 12.1 you can either install a version of vLLM compiled with the CUDA version you’re running (see the installation instructions to learn more), or install CUDA 12.1.
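
If you would rather run this check from a script, a minimal sketch follows. It assumes the output of `nvcc --version` contains a fragment such as `release 12.1`; the helper names are hypothetical.

```python
import re
import shutil
import subprocess


def parse_cuda_release(nvcc_output):
    """Return the CUDA release (e.g. "12.1") found in `nvcc --version` output, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def installed_cuda_release():
    """Run nvcc if it is on PATH and return its release string, else None."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=False
    )
    return parse_cuda_release(result.stdout)


release = installed_cuda_release()
if release is None:
    print("Could not determine the CUDA release (is nvcc on PATH?)")
elif release == "12.1":
    print("CUDA 12.1 detected")
else:
    print(f"CUDA {release} detected; a matching vLLM build or CUDA 12.1 is needed")
```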

For step-by-step instructions on installing vLLM, see this blog post.

Start an Outlines server

Once vLLM is installed you can start the server by running:

python -m outlines.serve.serve --model=<model_name>

Alternatively, you can install and run the server with Outlines' official Docker image using the command:

docker run -p 8000:8000 outlinesdev/outlines --model=<model_name>

Let’s see an example of starting the server using the google/gemma-2b model. Note that some models, such as this one, require you to accept their license. Hence, you need to create a HuggingFace account, accept the model’s license, and generate a token.

For example, when opening google/gemma-2b on HuggingFace (you need to be logged in), you’ll see a prompt asking you to accept the Gemma license.

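Once you accept the license, head over to the tokens section and grab a token. Then, before starting vLLM, set the token as an environment variable. The example below assumes the HF_TOKEN variable, the same name used for the Ploomber Cloud secret later in this post:

export HF_TOKEN=<your-token>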

Once the token is set, you can start the server.

python -m outlines.serve.serve --model=google/gemma-2b-it

Making requests

Once the server is up and running, you’re ready to send requests. You can query the model by providing a prompt along with either a JSON Schema specification or a Regex pattern.
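
For the Regex case, the request body could be built along these lines. This is a sketch under one assumption: that the server reads the pattern from a `regex` field, mirroring the `schema` field used for JSON Schema requests. The helper name, prompt, and pattern are hypothetical.

```python
import json
import re


def build_regex_request(prompt, pattern):
    """Build the JSON body for a regex-constrained request.

    Assumption: the server reads the pattern from a "regex" field,
    mirroring the "schema" field used for JSON Schema requests.
    """
    re.compile(pattern)  # fail early on an invalid pattern
    return json.dumps({"prompt": prompt, "regex": pattern})


# Hypothetical prompt and pattern: a date in YYYY-MM-DD form.
body = build_regex_request(
    "When was the first Moon landing? Answer with a date: ",
    r"\d{4}-\d{2}-\d{2}",
)
print(body)
```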

Let’s look at an example using the JSON Schema specification. We’ll use the google/gemma-2b model and the Python requests library:

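import json
import requests

# change for your host (assumption: a server running locally on port 8000)
OUTLINES_HOST = "http://localhost:8000"
url = f"{OUTLINES_HOST}/generate"

# JSON Schema that the generated output must follow
schema = {
  "type": "object",
  "properties": {
    "a": {
      "type": "integer"
    },
    "b": {
      "type": "integer"
    }
  },
  "required": ["a", "b"]
}

headers = {"Content-Type": "application/json"}
data = {
    "prompt": "Return two integers named a and b respectively. a is odd and b even.",
    "schema": schema
}

response = requests.post(url, headers=headers, data=json.dumps(data))
# assumption: the server returns the generated strings under the "text" key
print(response.json()["text"])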


The output generated was as follows:

['Return two integers named a and b respectively. a is odd and b even.{"a": 1, "b": 2}']
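
Notice that the returned string contains the prompt followed by the generated JSON. A minimal sketch for recovering the JSON object, using the output shown above as input (the `extract_json` helper is hypothetical):

```python
import json

prompt = "Return two integers named a and b respectively. a is odd and b even."
# The response text shown above: the prompt followed by the generated JSON.
texts = ['Return two integers named a and b respectively. a is odd and b even.{"a": 1, "b": 2}']


def extract_json(text, prompt):
    """Drop the echoed prompt (if present) and parse the remaining JSON."""
    if text.startswith(prompt):
        text = text[len(prompt):]
    return json.loads(text)


result = extract_json(texts[0], prompt)
print(result)  # {'a': 1, 'b': 2}
print(result["a"] % 2 == 1, result["b"] % 2 == 0)  # True True
```

The schema only guarantees that `a` and `b` are integers; whether `a` is odd and `b` is even depends on the model, so a check like the last line is still worthwhile.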

Deploying on Ploomber Cloud

To avoid the hassle of configuration, you can deploy vLLM on Ploomber Cloud with just one click.

Start by creating an account on Ploomber Cloud.

For a sample application, refer to the example repository. The deployment steps are similar to those outlined in the vLLM deployment guide.

Generate a zip file from the Dockerfile and the dependencies file. Log in to your Ploomber Cloud account and follow the steps here to deploy it as a Docker application.

If your model requires license acceptance, you will need to provide a valid HF_TOKEN in the Secrets section for vLLM to download the weights. Additionally, you can protect your server by setting the VLLM_API_KEY secret, which you can generate with the following command:

python -c 'import secrets; print(secrets.token_urlsafe())'


You also need to select a GPU for the deployment to work.


Once the deployment is complete, the application logs should show that the server is running.


After the deployment, you can interact with the model either by sending cURL requests or by writing a Python script that uses the requests library.

Refer to the Serve with vLLM guide to learn more.


Let’s quickly recap the key points discussed in the post:

  • Outlines is a powerful Python library that enables structured output generation from LLMs.
  • Structured output enables easy integration with other systems or APIs, improves data accessibility, and simplifies data analysis through standardized formats like JSON and CSV.
  • The library can be served via vLLM, and the model can be accessed through cURL requests or the Python requests library.
  • If you lack a GPU or prefer avoiding configuration hassles, you can opt for deployment in Ploomber Cloud.
