Structured outputs: don’t put the cart before the horse

Categories: llm, openai, pydantic, python

Author: Dylan Castillo

Published: November 9, 2024
Modified: November 12, 2024

Not long ago, we couldn’t reliably ask LLMs to provide a response using a specific format. Building tools that used LLM outputs was painful.

Eventually, first through function calling and then through structured outputs, we gained the ability to instruct LLMs to respond in specific formats¹. Extracting information from LLM outputs in a reliable way stopped being a problem.

But then I started noticing that structured outputs are not the silver bullet people often think they are. Defining a response format adds a sort of safety net, and people forget that underneath it, they're still dealing with an LLM. Setting up a Pydantic model for your API calls is not the same as setting up a Pydantic model for your LLM outputs.

For example, take the following question from the LiveBench dataset:

Suppose I have a physical, solid, equilateral triangle, and I make two cuts. The two cuts are from two parallel lines, and both cuts pass through the interior of the triangle. How many pieces are there after the cuts? Think step by step, and then put your answer in bold as a single integer (for example, 0). If you don’t know, guess.

Let’s say I write a simple system prompt and two Pydantic models to format the responses:

from pydantic import BaseModel

system_prompt = (
    "You're a helpful assistant. You will help me answer a question."
    "\nYou will use this JSON schema for your response:"
    "\n{response_format}"
)

class ResponseFormatA(BaseModel):
    reasoning: str
    answer: str

class ResponseFormatB(BaseModel):
    answer: str
    reasoning: str

Do you think that there will be a difference in performance between ResponseFormatA and ResponseFormatB? If so, which one do you think will perform better?

Not sure? Well, you’re in luck! Let’s run some experiments to find out.

Set up the environment

First, start by importing the necessary libraries:

import asyncio
import json
from asyncio import Semaphore
from datetime import datetime
from pathlib import Path

import numpy as np
import pandas as pd
from dotenv import load_dotenv
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import AsyncOpenAI
from pydantic import BaseModel
from scipy import stats

np.random.seed(42)

load_dotenv()

client = wrap_openai(AsyncOpenAI())

This sets up all the infrastructure needed to run the experiments. I like using LangSmith to track runs, which is what wrap_openai (and the @traceable decorators later on) are for.

To run the experiment, you need some data. I ended up using a subset of the reasoning questions from LiveBench. You can download it and save it in the data directory.
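If you'd rather fetch it programmatically, here's a rough sketch. It assumes the LiveBench reasoning questions are mirrored on Hugging Face as livebench/reasoning with a test split, so adjust the dataset name, split, or paths if your copy differs:

# Rough sketch (assumes the questions are available on Hugging Face as
# "livebench/reasoning" with a "test" split; adjust names and paths as needed).
from pathlib import Path

from datasets import load_dataset

reasoning_dir = Path("data") / "live_bench" / "reasoning"
reasoning_dir.mkdir(parents=True, exist_ok=True)

ds = load_dataset("livebench/reasoning", split="test")
ds.to_pandas().to_json(reasoning_dir / "question.jsonl", orient="records", lines=True)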

Then, you can read it into a pandas DataFrame:

data_dir = Path().absolute().parent / "data" / "live_bench"
reasoning_dir = data_dir / "reasoning"
live_bench_json = reasoning_dir / "question.jsonl"

df = (
    pd.read_json(live_bench_json, lines=True)
    .query("livebench_release_date == '2024-07-26'")
    .assign(
        turns_str=lambda x: x.turns.str[0], 
        expects_integer=lambda x: x.turns.str[0].str.contains("integer", case=False)
    )
    .reset_index()
    .rename(columns={"index": "data_point_id"})
)
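As a quick sanity check, you can peek at the columns the rest of the experiment relies on:

# Illustrative: confirm the columns used later (data_point_id, turns_str, ground_truth) are in place.
print(f"{len(df)} questions")
print(df[["data_point_id", "turns_str", "ground_truth"]].head())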

Next, define the system prompt and the Pydantic models you’ll use to format the responses:

system_prompt_template = (
    "You're a helpful assistant. You will help me answer a question."
    "\nYou will use this JSON schema for your response:"
    "\n{response_format}"
)

class ResponseFormatA(BaseModel):
    reasoning: str
    answer: str 

class ResponseFormatB(BaseModel):
    answer: str 
    reasoning: str

In the system prompt you send to the LLM, you’ll replace {response_format} with the JSON schema of the response format you want to use.
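For example, this is roughly what the formatted system prompt looks like for ResponseFormatA (the exact schema dict may differ slightly across Pydantic versions):

print(system_prompt_template.format(response_format=ResponseFormatA.model_json_schema()))
# You're a helpful assistant. You will help me answer a question.
# You will use this JSON schema for your response:
# {'properties': {'reasoning': {'title': 'Reasoning', 'type': 'string'}, 'answer': {'title': 'Answer', 'type': 'string'}}, 'required': ['reasoning', 'answer'], 'title': 'ResponseFormatA', 'type': 'object'}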

Then, let’s define a few helper functions to run the experiment:

def validate_response(response_json, response_format):
    response_dict = json.loads(response_json)
    expected_keys = list(response_format.model_json_schema()["properties"].keys())
    actual_keys = list(response_dict.keys())
    if actual_keys != expected_keys:
        raise ValueError(f"Response keys {actual_keys} do not match expected keys {expected_keys}")
    return response_format.model_validate_json(response_json)

@traceable
async def process_row(
    row: pd.Series, 
    response_format: type[ResponseFormatA] | type[ResponseFormatB],
    semaphore: Semaphore
) -> ResponseFormatA | ResponseFormatB:
    system_prompt = system_prompt_template.format(
        response_format=response_format.model_json_schema()
    )
    async with semaphore:
        for _ in range(3):
            try:
                response = await client.chat.completions.create(
                    model="gpt-4o", 
                    messages=[
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": f"Question:\n{row.turns_str}"}
                    ],
                    response_format={"type": "json_object"}
                )
                response_json = response.choices[0].message.content
                return validate_response(response_json, response_format)
            except Exception:
                pass
        raise Exception("Failed to generate a valid response")

@traceable
async def main(df, response_format, concurrency: int = 30):
    semaphore = Semaphore(concurrency)
    tasks = [process_row(row, response_format, semaphore) for _, row in df.iterrows()]
    responses = await asyncio.gather(*tasks)

    return responses

def extract_answer(answer):
    return str(answer).replace("**", "").strip()

In this code, validate_response checks that the response is valid JSON and that its keys match the schema's properties in the same order. If so, it returns the validated model. Otherwise, it raises an exception.

extract_answer strips ** from the answer when present. Some of the LiveBench questions ask for the answer in bold, so the model sometimes wraps it in Markdown asterisks that we need to remove before comparing against the ground truth.
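A quick illustration of both helpers (not part of the experiment itself; json.loads preserves key order, which is what makes the order check possible):

ok = validate_response('{"reasoning": "Two parallel cuts...", "answer": "3"}', ResponseFormatA)
print(ok.answer)  # 3

try:
    validate_response('{"answer": "3", "reasoning": "Two parallel cuts..."}', ResponseFormatA)
except ValueError as e:
    print(e)  # Response keys ['answer', 'reasoning'] do not match expected keys ['reasoning', 'answer']

print(extract_answer("**3**"))  # 3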

process_row is used to process a single row of the DataFrame. It sends the system prompt to the LLM and validates the response. It includes a simple retry mechanism in case the validation fails. Each run is tracked in LangSmith.

Finally, main is used to run the experiment. It runs the process_row function concurrently for each row in the DataFrame.

Running the experiment

Now, you can run the experiment using the two response formats:

n_runs = 3
df_runs = []

for run in range(n_runs):
    print(f"Run {run + 1}/{n_runs}")
    df_copy = df.copy()
    
    responses_A = asyncio.run(main(df_copy, ResponseFormatA))
    df_copy["raw_answer_A"] = [r.answer for r in responses_A]
    df_copy["response_A"] = df_copy["raw_answer_A"].apply(extract_answer)
    df_copy["is_correct_A"] = (df_copy["response_A"] == df_copy["ground_truth"]).astype(int)
    
    responses_B = asyncio.run(main(df_copy, ResponseFormatB))
    df_copy["raw_answer_B"] = [r.answer for r in responses_B]
    df_copy["response_B"] = df_copy["raw_answer_B"].apply(extract_answer)
    df_copy["is_correct_B"] = (df_copy["response_B"] == df_copy["ground_truth"]).astype(int)
    
    df_copy["run"] = run
    df_run = df_copy[["data_point_id", "ground_truth", "is_correct_A", "is_correct_B", "run"]]
    
    df_runs.append(df_run)

We run the experiment multiple times with the same inputs to account for the randomness in the LLM's responses. Ideally, we'd run it more than three times, but I'm poor, so three it is.

df_all_runs = pd.concat(df_runs, ignore_index=True)

n_bootstraps = 10000
bootstrap_accuracies_A = []
bootstrap_accuracies_B = []

data_point_ids = df_all_runs['data_point_id'].unique()
n_data_points = len(data_point_ids)

grouped_A = df_all_runs.groupby('data_point_id')['is_correct_A']
grouped_B = df_all_runs.groupby('data_point_id')['is_correct_B']

df_correct_counts_A = grouped_A.sum()
df_total_counts_A = grouped_A.count()
df_correct_counts_B = grouped_B.sum()
df_total_counts_B = grouped_B.count()

for _ in range(n_bootstraps):
    sampled_ids = np.random.choice(data_point_ids, size=n_data_points, replace=True)
    sampled_counts = pd.Series(sampled_ids).value_counts()
    counts_index = sampled_counts.index
    
    total_correct_counts_A = (df_correct_counts_A.loc[counts_index] * sampled_counts).sum()
    total_observations_A = (df_total_counts_A.loc[counts_index] * sampled_counts).sum()
    mean_accuracy_A = total_correct_counts_A / total_observations_A
    bootstrap_accuracies_A.append(mean_accuracy_A)
    
    total_correct_counts_B = (df_correct_counts_B.loc[counts_index] * sampled_counts).sum()
    total_observations_B = (df_total_counts_B.loc[counts_index] * sampled_counts).sum()
    mean_accuracy_B = total_correct_counts_B / total_observations_B
    bootstrap_accuracies_B.append(mean_accuracy_B)

ci_A = np.percentile(bootstrap_accuracies_A, [2.5, 97.5])
ci_B = np.percentile(bootstrap_accuracies_B, [2.5, 97.5])

mean_accuracy_A = df_all_runs['is_correct_A'].mean()
mean_accuracy_B = df_all_runs['is_correct_B'].mean()

print(
    f"Response format A - Mean: {mean_accuracy_A * 100:.2f}% CI: {ci_A[0] * 100:.2f}% - {ci_A[1] * 100:.2f}%"
)
print(
    f"Response format B - Mean: {mean_accuracy_B * 100:.2f}% CI: {ci_B[0] * 100:.2f}% - {ci_B[1] * 100:.2f}%"
)

Then, you can build bootstrap confidence intervals for the accuracies of the two response formats. Since each question is asked multiple times, I went with cluster bootstrapping: it resamples whole questions (clusters) rather than individual runs, which accounts for the fact that runs of the same question are not independent. Rather than materializing each resampled DataFrame, the loop above weights each question's correct and total counts by how many times that question was drawn.
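For intuition, that weighted-counts trick is equivalent to this more literal (but slower) version, which materializes each resampled DataFrame:

def naive_cluster_bootstrap_once(df_all_runs):
    # Resample whole questions (clusters) with replacement, keep all of their runs,
    # and recompute the mean accuracy on the resampled frame.
    ids = df_all_runs["data_point_id"].unique()
    sampled_ids = np.random.choice(ids, size=len(ids), replace=True)
    resampled = pd.concat([df_all_runs[df_all_runs["data_point_id"] == i] for i in sampled_ids])
    return resampled["is_correct_A"].mean(), resampled["is_correct_B"].mean()

Repeating this 10,000 times gives the same bootstrap distribution (up to sampling noise); the counting version above is just much faster.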

The full experiment takes a few minutes to run. Once it's done, you should see results like these:

Response Format    Accuracy (95% CI)
A                  46.67% (35.33% – 58.00%)
B                  33.33% (22.67% – 44.67%)

These results suggest that the order of the fields in the JSON schema does matter.

But if you’re still unsure, you can perform a t-test to see if the two response formats are statistically different:

accuracies_A = df_all_runs.pivot(index='data_point_id', columns='run', values='is_correct_A')
accuracies_B = df_all_runs.pivot(index='data_point_id', columns='run', values='is_correct_B')

mean_accuracies_A = accuracies_A.mean(axis=1)
mean_accuracies_B = accuracies_B.mean(axis=1)

t_stat, p_value = stats.ttest_rel(mean_accuracies_A, mean_accuracies_B, alternative='greater')

print(f"t-statistic: {t_stat}, p-value: {p_value}")

I got a p-value below 0.01, which means I can reject the null hypothesis that the two response formats perform equally well.

Conclusion

Based on the results of the experiment, we can safely say that ResponseFormatA performs better than ResponseFormatB on this task.

But why?

In this case, it’s simple.

These response formats are meant to help the LLM reason step by step before arriving at the answer. This is known as chain-of-thought reasoning. For it to work, the LLM must produce the reasoning first and only then the answer.

In ResponseFormatA, we defined the Pydantic model with the reasoning first and the answer second, so the LLM writes out its reasoning before committing to an answer, which is exactly what we want.

ResponseFormatB works the other way around: the LLM commits to an answer first and only then produces the reasoning. Our chain-of-thought prompt effectively becomes a zero-shot prompt, and the reasoning turns into a post-hoc justification of the answer rather than the path that led to it.
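You can see this directly in the schemas: Pydantic preserves the model's field order in the schema's properties, and, because validate_response enforces that order, it's also the order in which every accepted response lists its fields:

print(list(ResponseFormatA.model_json_schema()["properties"]))  # ['reasoning', 'answer']
print(list(ResponseFormatB.model_json_schema()["properties"]))  # ['answer', 'reasoning']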

So, to summarize, when using structured outputs, don’t put the cart before the horse.

That’s all! Let me know if you have any questions in the comments.

Footnotes

  1. I’m referring to OpenAI models here. Open-weight models already allowed this through grammar-based constrained decoding.

Citation

BibTeX citation:
@online{castillo2024,
  author = {Castillo, Dylan},
  title = {Structured Outputs: Don’t Put the Cart Before the Horse},
  date = {2024-11-09},
  url = {https://dylancastillo.co/posts/llm-pydantic-order-matters.html},
  langid = {en}
}
For attribution, please cite this work as:
Castillo, Dylan. 2024. “Structured Outputs: Don’t Put the Cart Before the Horse.” November 9, 2024. https://dylancastillo.co/posts/llm-pydantic-order-matters.html.