Building Reliable LLMs for Production: Structured Outputs and Data Validation

1. Introduction

Ever tried to wrangle a large language model into giving you exactly what you need? When we're building real-world applications with AI, we need the models to produce structured, reliable outputs. This is crucial for creating systems that can be depended upon in production environments.

In this post, I'm going to walk you through the challenges we face when trying to make LLMs behave predictably, and why it's essential to get this right. We'll cover a range of techniques, from basic prompt engineering to some advanced Python implementations, that'll help you build more robust AI systems.

So, are you ready to dive into building reliable AI systems? Let's get started with the fundamentals!

2. Understanding the Basics

When you first start working with OpenAI's models (or really, any LLM), you typically get back a string of text. It's great for creative writing or general conversation, but not so great for building reliable systems. Let me show you what I mean:

import openai  # the pre-1.0 OpenAI SDK interface used throughout this post
 
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What's the weather like?"}]
)
 
# The reply is just free-form text
print(response.choices[0].message.content)

This might output something like "The weather is sunny and warm today." Nice, but how do we reliably extract temperature or conditions from that? What if we need to populate a database or trigger specific actions based on the weather? That's where the need for structured output comes in.
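
Just to make that concrete, here's a rough sketch of what hand-parsing a reply like that could look like (the sample sentence and the regular expressions are purely illustrative):

import re
 
# A hypothetical free-text reply from the model
reply = "The weather is sunny and warm today, around 24 degrees."
 
# Brittle: relies on the model phrasing things exactly this way
temp_match = re.search(r"(-?\d+(?:\.\d+)?)\s*degrees", reply)
condition_match = re.search(r"is\s+(\w+)", reply)
 
temperature = float(temp_match.group(1)) if temp_match else None
condition = condition_match.group(1) if condition_match else None
 
print(temperature, condition)  # 24.0 sunny -- until the wording changes

One rephrased sentence from the model and both regexes silently stop matching. That fragility is exactly what the rest of this post is about avoiding.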

As we build more complex applications, we need more control over the format and content of our LLM outputs. This is especially crucial when integrating AI into existing systems or when consistency is key for user experience. In the next sections, we'll explore how to achieve this control and reliability.

3. Prompt Engineering for Structured Output

Prompt engineering is our first line of defense in getting structured outputs from LLMs. The idea is simple: we tell the model exactly how we want the response formatted. Here's a basic example:

system_prompt = """
Please respond in JSON format with the following structure:
{
    "temperature": <temperature in Celsius>,
    "condition": <weather condition>,
    "humidity": <humidity percentage>
}
"""
 
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What's the weather like in New York?"}
    ]
)

This approach can work well, but it's not foolproof. The model might still deviate from the requested format, especially with complex structures. Plus, it's vulnerable to prompt injections, where a user could potentially override your carefully crafted prompt. While it's a good starting point, we'll need more robust methods for production-grade applications.
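
As a stopgap, we can at least parse defensively. Here's a minimal sketch, where extract_json is a hypothetical helper of my own (not part of the OpenAI API), that tries to pull a JSON object out of the reply even if the model wraps it in extra prose:

import json
import re
 
def extract_json(text: str):
    """Parse the reply as JSON, falling back to the first {...} block if the model added extra prose."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if match:
            try:
                return json.loads(match.group(0))
            except json.JSONDecodeError:
                pass
    return None
 
weather = extract_json(response.choices[0].message.content)
if weather is None:
    print("Model did not return parseable JSON")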

4. JSON Mode

OpenAI introduced JSON mode to address the need for more consistent structured outputs. It's a flag we can set in our API call that tells the model to always return syntactically valid JSON. Two caveats: only newer models (gpt-3.5-turbo-1106 and later) support it, and the API requires that the word "JSON" appear somewhere in your messages. Here's how we use it:

import json
 
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-1106",  # JSON mode requires the 1106 models or newer
    messages=[{"role": "user", "content": "What's the weather like in New York? Respond in JSON."}],
    response_format={"type": "json_object"}
)
 
weather_data = json.loads(response.choices[0].message.content)

This guarantees the response is syntactically valid JSON, which we can easily parse in Python. However, it doesn't guarantee the specific structure of that JSON: we might get different keys or nested structures each time. So while it's more reliable than basic prompt engineering, we still need to handle potential variations in our code.
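
In practice that means adding a defensive layer between json.loads and the rest of our code. A small sketch (the expected key names come from our prompt; JSON mode itself enforces none of them):

expected_keys = {"temperature", "condition", "humidity"}
missing = expected_keys - weather_data.keys()
 
if missing:
    # JSON mode guarantees valid JSON, not our schema
    print(f"Response is valid JSON but missing keys: {missing}")
else:
    temperature = weather_data.get("temperature")
    condition = weather_data.get("condition")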

5. Function Calling

Function calling is a feature that allows us to define the structure we want more precisely. We describe a function to the model, and it generates data to match that function's parameters. Here's how it looks:

function_description = {
    "name": "get_weather",
    "description": "Get the current weather in a location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string"},
            "temperature": {"type": "number"},
            "condition": {"type": "string"},
            "humidity": {"type": "number"}
        },
        "required": ["location", "temperature", "condition", "humidity"]
    }
}
 
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "What's the weather like in New York?"}],
    functions=[function_description],
    function_call={"name": "get_weather"}
)
 
weather_data = json.loads(response.choices[0].message.function_call.arguments)

This gives us more control over the structure of the output, as the model tries to fill in the parameters we've defined. However, it's still not perfect - the model might sometimes return invalid values or miss required fields. In our next sections, we'll look at how to add stronger validation to ensure we're getting exactly what we need.
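
To see why that matters, here's the kind of manual checking we'd otherwise have to bolt on by hand (a quick sketch; looks_valid is my own illustrative helper that just mirrors the JSON Schema above):

def looks_valid(data: dict) -> bool:
    """Hand-rolled checks mirroring the schema in function_description."""
    required = ["location", "temperature", "condition", "humidity"]
    if any(field not in data for field in required):
        return False
    if not isinstance(data["temperature"], (int, float)):
        return False
    if not isinstance(data["humidity"], (int, float)) or not 0 <= data["humidity"] <= 100:
        return False
    return True
 
if not looks_valid(weather_data):
    print("Function call arguments failed basic validation:", weather_data)

Writing and maintaining checks like this for every response type gets tedious fast, which is where Pydantic comes in.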

6. Introducing Pydantic for Data Validation

Now that we're getting structured data from our LLM, we need a way to ensure it meets our expectations. Enter Pydantic, a data validation library that uses Python type annotations. Here's a basic example of how we can define a model for our weather data:

from pydantic import BaseModel, Field, ValidationError
 
class WeatherData(BaseModel):
    location: str
    temperature: float = Field(..., ge=-273.15)  # can't be below absolute zero
    condition: str
    humidity: float = Field(..., ge=0, le=100)  # 0-100%
 
# Usage:
try:
    weather = WeatherData(**weather_data)
    print(f"It's {weather.temperature}°C and {weather.condition} in {weather.location}")
except ValidationError as e:
    print(f"Invalid data: {e}")

Pydantic allows us to define the exact structure we expect, including type checking and custom validators. If the data doesn't match our model, Pydantic raises a ValidationError, which we can catch and handle. This gives us a powerful way to ensure the LLM's output meets our requirements before we use it in our application.
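
One detail worth knowing: the ValidationError carries structured information about exactly which fields failed, which becomes useful later when we ask the LLM to correct itself. A quick sketch with deliberately bad input:

bad_data = {"location": "New York", "temperature": -500, "condition": "sunny", "humidity": 150}
 
try:
    WeatherData(**bad_data)
except ValidationError as e:
    # e.errors() is a list of dicts: field location, message, error type
    for err in e.errors():
        print(err["loc"], err["msg"])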

7. Combining OpenAI Outputs with Pydantic

Now that we have our Pydantic model, let's use it to validate the output from our function calling API. This combination gives us both structured output from the LLM and strong validation:

def get_weather_data(location: str) -> WeatherData:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user", "content": f"What's the weather like in {location}?"}],
        functions=[function_description],
        function_call={"name": "get_weather"}
    )
 
    raw_data = json.loads(response.choices[0].message.function_call.arguments)
 
    try:
        return WeatherData(**raw_data)
    except ValidationError as e:
        print(f"LLM returned invalid data: {e}")
        raise
 
weather = get_weather_data("New York")

In this setup, if the LLM returns data that doesn't match our Pydantic model, we'll get a ValidationError. This allows us to catch and handle any inconsistencies before they propagate through our system. It's a powerful combination that gives us more confidence in the data we're working with.

8. Advanced Techniques with Python and Pydantic

Let's take our weather example a step further. We can use Python's Enum class (which Pydantic validates against natively) for controlled categorization and add a confidence score to our model:

from enum import Enum
from pydantic import BaseModel, Field, validator
 
class WeatherCondition(str, Enum):
    SUNNY = "sunny"
    CLOUDY = "cloudy"
    RAINY = "rainy"
    SNOWY = "snowy"
 
class WeatherData(BaseModel):
    location: str
    temperature: float = Field(..., ge=-273.15)
    condition: WeatherCondition
    humidity: float = Field(..., ge=0, le=100)
    confidence: float = Field(..., ge=0, le=1)
 
    @validator('confidence')
    def check_confidence(cls, v):
        if v < 0.5:
            print(f"Warning: Low confidence score ({v})")
        return v

This enhanced model ensures that the weather condition is one of our predefined options and includes a confidence score. We've also added a custom validator that warns us if the confidence is low. By combining Python's Enum with Pydantic's constrained fields and validators, we're adding another layer of reliability to our LLM-powered system.
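
Here's what that buys us in practice (the sample values are made up for illustration):

# Valid, but with a low confidence score -- triggers the warning from our validator
WeatherData(location="Oslo", temperature=-3, condition="snowy", humidity=80, confidence=0.4)
 
# "foggy" is not in our WeatherCondition enum -- raises ValidationError
try:
    WeatherData(location="Oslo", temperature=-3, condition="foggy", humidity=80, confidence=0.9)
except ValidationError as e:
    print(e)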

9. Implementing Self-Correction and Retries

Sometimes, despite our best efforts, the LLM might return data that doesn't pass our Pydantic validation. Instead of immediately failing, we can implement a retry mechanism that gives the LLM a chance to correct itself. Here's how we might do that:

from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_fixed
 
# Retry only when Pydantic validation fails, waiting 1 second between attempts
@retry(stop=stop_after_attempt(3), wait=wait_fixed(1), retry=retry_if_exception_type(ValidationError))
def get_validated_weather_data(location: str) -> WeatherData:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user", "content": f"What's the weather like in {location}?"}],
        functions=[function_description],
        function_call={"name": "get_weather"}
    )
 
    raw_data = json.loads(response.choices[0].message.function_call.arguments)
 
    try:
        return WeatherData(**raw_data)
    except ValidationError as e:
        print(f"Validation failed: {e}. Retrying...")
        raise  # This will trigger a retry
 
weather = get_validated_weather_data("New York")

In this setup, we're using the tenacity library to automatically retry our function up to three times if a ValidationError occurs. This gives the LLM multiple chances to provide valid data, increasing the reliability of our system. Of course, we need to balance the desire for perfect data with the cost and time of multiple API calls.
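
The snippet above simply retries with the same prompt each time. True self-correction goes one step further: feed the validation errors back to the model so it can fix its own mistake. Here's a rough sketch of that idea; the loop structure and the wording of the correction message are my own assumptions, not a library feature:

def get_self_correcting_weather_data(location: str, max_attempts: int = 3) -> WeatherData:
    messages = [{"role": "user", "content": f"What's the weather like in {location}?"}]
 
    for attempt in range(max_attempts):
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo-0613",
            messages=messages,
            functions=[function_description],
            function_call={"name": "get_weather"}
        )
        raw_data = json.loads(response.choices[0].message.function_call.arguments)
 
        try:
            return WeatherData(**raw_data)
        except ValidationError as e:
            # Show the model its own output and the validation errors, then ask again
            messages.append({"role": "assistant", "content": json.dumps(raw_data)})
            messages.append({
                "role": "user",
                "content": f"That response failed validation: {e}. Please call get_weather again with corrected values."
            })
 
    raise ValueError(f"Could not get valid weather data after {max_attempts} attempts")

This costs extra tokens per attempt, but the error message gives the model concrete guidance instead of hoping it behaves differently by chance.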

10. Content Filtering and Moderation

Even with structured outputs and validation, we still need to be cautious about the content of LLM responses. Let's implement a basic content filter using both keyword checking and sentiment analysis:

from textblob import TextBlob
 
class ModeratedWeatherData(WeatherData):
    @validator('condition')
    def check_content(cls, v, values):
        # values only holds fields validated before this one; location may be absent if it failed
        text = f"The weather in {values.get('location', 'the requested location')} is {v}"
 
        # Keyword check
        keywords = ['disaster', 'catastrophe', 'emergency']
        if any(keyword in text.lower() for keyword in keywords):
            raise ValueError(f"Inappropriate content detected: {text}")
 
        # Sentiment check
        sentiment = TextBlob(text).sentiment.polarity
        if sentiment < -0.5:
            raise ValueError(f"Overly negative content detected: {text}")
 
        return v
 
weather = ModeratedWeatherData(**raw_data)

This example uses a simple keyword check and the TextBlob library for sentiment analysis. We're ensuring that the weather description doesn't include alarming words or overly negative sentiment. For more robust content moderation, you might consider using dedicated content moderation APIs or more sophisticated NLP techniques.
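
If you want something more battle-tested than keyword lists, OpenAI also exposes a dedicated moderation endpoint. A minimal sketch in the same pre-1.0 SDK style as the rest of this post (double-check the exact response fields against the current docs):

def is_flagged(text: str) -> bool:
    """Check text against OpenAI's moderation endpoint."""
    result = openai.Moderation.create(input=text)
    return result["results"][0]["flagged"]
 
summary = f"The weather in {weather.location} is {weather.condition}"
if is_flagged(summary):
    raise ValueError(f"Content flagged by moderation endpoint: {summary}")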

Remember, the goal here is to add an extra layer of safety to prevent potentially inappropriate or alarming content from being passed through our system. This becomes especially important in user-facing applications or when dealing with sensitive topics.