import asyncio
import json
from asyncio import Semaphore
from datetime import datetime
from pathlib import Path
import numpy as np
import pandas as pd
from dotenv import load_dotenv
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import AsyncOpenAI
from pydantic import BaseModel
from scipy import stats
np.random.seed(42)
load_dotenv()
client = wrap_openai(AsyncOpenAI())
Structured outputs: don’t put the cart before the horse
Not long ago, you couldn’t reliably ask an LLM to provide you with a response using a specific format. Building tools that used LLM outputs was painful.
Then, through function calling and structured outputs, we could instruct LLMs to respond in specific formats. So, extracting information from LLM outputs stopped being a problem.
But then I started noticing that structured outputs also had their own set of problems. Most importantly, the apparent rigidity of a Pydantic model can make you forget that underneath, you’re still dealing with an LLM. Setting up a response model for your API calls is not the same as setting up a response model for your LLM outputs.
For example, take the following question from the LiveBench dataset:
Suppose I have a physical, solid, equilateral triangle, and I make two cuts. The two cuts are from two parallel lines, and both cuts pass through the interior of the triangle. How many pieces are there after the cuts? Think step by step, and then put your answer in bold as a single integer (for example, 0). If you don’t know, guess.
Let’s say I write a simple system prompt and two Pydantic models to format the responses:
system_prompt_template = (
    "You're a helpful assistant. You will help me answer a question."
    "\nYou will use this JSON schema for your response:"
    "\n{response_format}"
)

class ResponseFormatA(BaseModel):
    reasoning: str
    answer: str

class ResponseFormatB(BaseModel):
    answer: str
    reasoning: str
Do you think that there will be a difference in performance between ResponseFormatA and ResponseFormatB? If so, which one do you think will perform better?
Not sure? Well, you’re in luck! Let’s run some experiments to find out.
Set up the environment
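First, import the necessary libraries and create the OpenAI client, using the import block at the top of this post.
This sets up the infrastructure needed to run the experiments. I like using LangSmith to track runs, which is why the client is wrapped with wrap_openai.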
To run the experiment, you need some data. I ended up using a subset of the reasoning questions from LiveBench. You can download it and save it in the data directory.
Then, you can read it into a pandas DataFrame:
data_dir = Path().absolute().parent / "data" / "live_bench"
reasoning_dir = data_dir / "reasoning"
live_bench_json = reasoning_dir / "question.jsonl"

df = (
    pd.read_json(live_bench_json, lines=True)
    # Keep only the questions from a single LiveBench release.
    .query("livebench_release_date == '2024-07-26'")
    .assign(
        # The first turn holds the question text.
        turns_str=lambda x: x.turns.str[0],
        expects_integer=lambda x: x.turns.str[0].str.contains("integer", case=False),
    )
    .reset_index()
    .rename(columns={"index": "data_point_id"})
)
Next, define the system prompt and the Pydantic models you’ll use to format the responses. These are the same system prompt template and ResponseFormatA and ResponseFormatB models shown earlier.
In the system prompt you send to the LLM, you’ll replace {response_format} with the JSON schema of the response format you want to use.
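One detail worth checking is what actually ends up in the prompt after that replacement. The sketch below uses a hand-written dict as a stand-in for what ResponseFormatA.model_json_schema() might return (the exact dict is an assumption, so compare it against your Pydantic version). It contrasts interpolating the dict directly with serializing it through json.dumps first.

```python
import json

# Hand-written stand-in for a Pydantic-generated schema (an assumption for illustration).
schema_a = {
    "properties": {
        "reasoning": {"title": "Reasoning", "type": "string"},
        "answer": {"title": "Answer", "type": "string"},
    },
    "required": ["reasoning", "answer"],
    "title": "ResponseFormatA",
    "type": "object",
}

template = "You will use this JSON schema for your response:\n{response_format}"

# Interpolating the dict inserts its Python repr (single-quoted keys).
prompt_repr = template.format(response_format=schema_a)

# Serializing first inserts valid JSON (double-quoted keys), in the same key order.
prompt_json = template.format(response_format=json.dumps(schema_a))

# Field order survives serialization: "reasoning" appears before "answer".
reasoning_first = prompt_json.index('"reasoning"') < prompt_json.index('"answer"')
```

The experiment code in this post interpolates the dict directly. Whether the quoting style affects accuracy is not something this post measures.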
Then, let’s define a few helper functions to run the experiment:
def validate_response(response_json, response_format):
    response_dict = json.loads(response_json)
    expected_keys = list(response_format.model_json_schema()["properties"].keys())
    actual_keys = list(response_dict.keys())
    # Compare lists (not sets) so that out-of-order keys are rejected.
    if actual_keys != expected_keys:
        raise ValueError(f"Response keys {actual_keys} do not match expected keys {expected_keys}")
    return response_format.model_validate_json(response_json)

@traceable
async def process_row(
    row: pd.Series,
    response_format: type[ResponseFormatA] | type[ResponseFormatB],
    semaphore: Semaphore,
) -> ResponseFormatA | ResponseFormatB:
    system_prompt = system_prompt_template.format(
        response_format=response_format.model_json_schema()
    )
    # The semaphore caps how many requests are in flight at once.
    async with semaphore:
        # Try up to three times before giving up on this row.
        for _ in range(3):
            try:
                response = await client.chat.completions.create(
                    model="gpt-4o",
                    messages=[
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": f"Question:\n{row.turns_str}"},
                    ],
                    response_format={"type": "json_object"},
                )
                response_json = response.choices[0].message.content
                return validate_response(response_json, response_format)
            except Exception:
                # Swallow API and validation errors and retry.
                pass
        raise Exception("Failed to generate a valid response")

@traceable
async def main(df, response_format, concurrency: int = 30):
    semaphore = Semaphore(concurrency)
    tasks = [process_row(row, response_format, semaphore) for _, row in df.iterrows()]
    responses = await asyncio.gather(*tasks)
    return responses

def extract_answer(answer):
    return str(answer).replace("**", "").strip()
In this code, validate_response checks that the response is valid, meaning its keys match the properties of the JSON schema in the same order and it passes Pydantic validation. If it is, it returns the parsed response. Otherwise, it raises an exception.
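To see why the order check works, here is a minimal stdlib-only sketch. The keys_in_order helper and the sample payloads are made up for illustration and are not part of the experiment code. It relies on json.loads keeping keys in the order they appear in the JSON text.

```python
import json

def keys_in_order(response_json: str, expected_keys: list[str]) -> bool:
    """Return True when the JSON object's keys match expected_keys, order included."""
    return list(json.loads(response_json).keys()) == expected_keys

# Two payloads with identical content but different key order.
reasoning_first = '{"reasoning": "step 1, step 2", "answer": "42"}'
answer_first = '{"answer": "42", "reasoning": "step 1, step 2"}'

expected = ["reasoning", "answer"]
ok_reasoning_first = keys_in_order(reasoning_first, expected)
ok_answer_first = keys_in_order(answer_first, expected)
```

Comparing sets instead of lists would accept both payloads, which would defeat the purpose of an experiment about field order.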
extract_answer is used to remove ** from the answer if it exists in the response. Some of the questions in the LiveBench dataset included instructions to put the answer in bold, which is why we need to remove it.
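A quick way to see what this helper does and does not handle is to run it on a few made-up answers (the sample strings are assumptions, not real model outputs):

```python
def extract_answer(answer):
    # Same logic as the helper above: drop Markdown bold markers and surrounding whitespace.
    return str(answer).replace("**", "").strip()

clean = extract_answer("**3**")          # bold markers removed
padded = extract_answer("  **12** ")     # whitespace trimmed too
wordy = extract_answer("**3** pieces")   # extra words are kept as-is
```

Since correctness is later checked with an exact string comparison against the ground truth, an answer like the last one would count as wrong even if the number is right.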
process_row processes a single row of the DataFrame. It sends the system prompt and the question to the LLM and validates the response. It includes a simple retry mechanism (up to three attempts) in case the call or the validation fails. Each run is tracked in LangSmith.
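The retry pattern can be tried out without calling any API. The sketch below is an illustration under assumptions: call_with_retries and make_flaky_call are hypothetical names, and the fake call stands in for the request plus validation step. Unlike process_row, it only retries on ValueError and keeps the last error attached to the final exception.

```python
import asyncio

async def call_with_retries(make_call, semaphore, max_attempts=3):
    """Await make_call() up to max_attempts times while holding the semaphore."""
    async with semaphore:
        last_error = None
        for _ in range(max_attempts):
            try:
                return await make_call()
            except ValueError as exc:
                # Remember the failure and try again.
                last_error = exc
        raise RuntimeError("Failed to generate a valid response") from last_error

def make_flaky_call(fail_times):
    """Build a fake call that raises ValueError fail_times times, then succeeds."""
    state = {"calls": 0}

    async def call():
        state["calls"] += 1
        if state["calls"] <= fail_times:
            raise ValueError("invalid response")
        return state["calls"]  # the attempt number that succeeded

    return call

async def demo():
    semaphore = asyncio.Semaphore(2)
    return await asyncio.gather(
        call_with_retries(make_flaky_call(0), semaphore),  # succeeds on attempt 1
        call_with_retries(make_flaky_call(2), semaphore),  # succeeds on attempt 3
    )

results = asyncio.run(demo())
```

Chaining the last error with `from` is a design choice that makes failures easier to debug than silently discarding them.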
Finally, main runs the experiment. It runs the process_row function concurrently for each row in the DataFrame, with a semaphore capping the number of in-flight requests (30 by default).
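Two properties of this setup matter for the experiment: the semaphore limits how many calls run at once, and asyncio.gather returns results in the order the tasks were passed in, so responses line up with DataFrame rows. Here is a small sketch with fake workers (worker, run_all, and the tracker dict are hypothetical names for illustration):

```python
import asyncio

async def worker(i, semaphore, tracker):
    async with semaphore:
        tracker["active"] += 1
        tracker["peak"] = max(tracker["peak"], tracker["active"])
        # Later tasks sleep less, so they finish sooner once they start.
        await asyncio.sleep(0.01 * (5 - i))
        tracker["active"] -= 1
        return i

async def run_all(n_tasks=5, concurrency=2):
    semaphore = asyncio.Semaphore(concurrency)
    tracker = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(worker(i, semaphore, tracker) for i in range(n_tasks))
    )
    return results, tracker["peak"]

results, peak = asyncio.run(run_all())
```

Here results keeps the input order regardless of which worker finishes first, and peak never exceeds the concurrency limit.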
Running the experiment
Now, you can run the experiment using the two response formats:
n_runs = 3
df_runs = []

for run in range(n_runs):
    print(f"Run {run + 1}/{n_runs}")
    df_copy = df.copy()

    responses_A = asyncio.run(main(df_copy, ResponseFormatA))
    df_copy["raw_answer_A"] = [r.answer for r in responses_A]
    df_copy["response_A"] = df_copy["raw_answer_A"].apply(extract_answer)
    df_copy["is_correct_A"] = (df_copy["response_A"] == df_copy["ground_truth"]).astype(int)

    responses_B = asyncio.run(main(df_copy, ResponseFormatB))
    df_copy["raw_answer_B"] = [r.answer for r in responses_B]
    df_copy["response_B"] = df_copy["raw_answer_B"].apply(extract_answer)
    df_copy["is_correct_B"] = (df_copy["response_B"] == df_copy["ground_truth"]).astype(int)

    df_copy["run"] = run
    df_run = df_copy[["data_point_id", "ground_truth", "is_correct_A", "is_correct_B", "run"]]
    df_runs.append(df_run)
We run the experiment multiple times with the same inputs to account for the randomness in the LLM’s responses. Ideally, we should run it more than three times, but I’m poor. So, we’ll just do it 3 times.
Then, you can build bootstrap confidence intervals for the accuracies of the two response formats. Given that I’m asking the LLM the same question multiple times, I went with an approach called cluster bootstrapping. It resamples whole questions (with all of their runs) rather than individual responses, which accounts for the fact that repeated runs of the same question are not independent.
df_all_runs = pd.concat(df_runs, ignore_index=True)

n_bootstraps = 10000
bootstrap_accuracies_A = []
bootstrap_accuracies_B = []

data_point_ids = df_all_runs['data_point_id'].unique()
n_data_points = len(data_point_ids)

# Per-question totals across runs: number correct and number of observations.
grouped_A = df_all_runs.groupby('data_point_id')['is_correct_A']
grouped_B = df_all_runs.groupby('data_point_id')['is_correct_B']
df_correct_counts_A = grouped_A.sum()
df_total_counts_A = grouped_A.count()
df_correct_counts_B = grouped_B.sum()
df_total_counts_B = grouped_B.count()

for _ in range(n_bootstraps):
    # Resample questions (clusters) with replacement and count how often each was drawn.
    sampled_ids = np.random.choice(data_point_ids, size=n_data_points, replace=True)
    sampled_counts = pd.Series(sampled_ids).value_counts()
    counts_index = sampled_counts.index

    # Weight each question's totals by the number of times it was drawn.
    total_correct_counts_A = (df_correct_counts_A.loc[counts_index] * sampled_counts).sum()
    total_observations_A = (df_total_counts_A.loc[counts_index] * sampled_counts).sum()
    mean_accuracy_A = total_correct_counts_A / total_observations_A
    bootstrap_accuracies_A.append(mean_accuracy_A)

    total_correct_counts_B = (df_correct_counts_B.loc[counts_index] * sampled_counts).sum()
    total_observations_B = (df_total_counts_B.loc[counts_index] * sampled_counts).sum()
    mean_accuracy_B = total_correct_counts_B / total_observations_B
    bootstrap_accuracies_B.append(mean_accuracy_B)

# Percentile-based 95% confidence intervals.
ci_A = np.percentile(bootstrap_accuracies_A, [2.5, 97.5])
ci_B = np.percentile(bootstrap_accuracies_B, [2.5, 97.5])

mean_accuracy_A = df_all_runs['is_correct_A'].mean()
mean_accuracy_B = df_all_runs['is_correct_B'].mean()

print(
    f"Response format A - Mean: {mean_accuracy_A * 100:.2f}% CI: {ci_A[0] * 100:.2f}% - {ci_A[1] * 100:.2f}%"
)
print(
    f"Response format B - Mean: {mean_accuracy_B * 100:.2f}% CI: {ci_B[0] * 100:.2f}% - {ci_B[1] * 100:.2f}%"
)
It should take a few seconds to run. Once it’s done, you should see output like the following:
| Response Format | Accuracy (95% CI) |
|---|---|
| A | 46.67% (35.33% – 58.00%) |
| B | 33.33% (22.67% – 44.67%) |
These results suggest that the order of the fields in the JSON schema does matter.
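If the cluster bootstrap feels opaque, the idea fits in a few lines of plain Python. This is a minimal sketch on made-up data, not the code that produced the numbers above: cluster_bootstrap_ci and the toy outcomes are assumptions for illustration. Each key is a question, and each list holds that question’s 0/1 outcomes across runs.

```python
import random

def cluster_bootstrap_ci(outcomes_by_cluster, n_bootstraps=2000, alpha=0.05, seed=42):
    """Percentile CI for accuracy, resampling whole clusters with replacement."""
    rng = random.Random(seed)
    cluster_ids = list(outcomes_by_cluster)
    accuracies = []
    for _ in range(n_bootstraps):
        # Draw as many clusters as there are in the data, with replacement.
        sampled = rng.choices(cluster_ids, k=len(cluster_ids))
        correct = sum(sum(outcomes_by_cluster[c]) for c in sampled)
        total = sum(len(outcomes_by_cluster[c]) for c in sampled)
        accuracies.append(correct / total)
    accuracies.sort()
    lower = accuracies[int((alpha / 2) * n_bootstraps)]
    upper = accuracies[int((1 - alpha / 2) * n_bootstraps) - 1]
    return lower, upper

# Toy data: 5 questions, 3 runs each (made up for illustration).
toy = {0: [1, 1, 1], 1: [0, 0, 0], 2: [1, 0, 1], 3: [0, 0, 1], 4: [1, 1, 0]}
lower, upper = cluster_bootstrap_ci(toy)
```

Because all runs of a question are drawn together, the interval reflects uncertainty across questions instead of treating every response as an independent sample.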
But if you’re still unsure, you can run a one-sided paired t-test on the per-question mean accuracies to check whether format A is significantly more accurate than format B:
accuracies_A = df_all_runs.pivot(index='data_point_id', columns='run', values='is_correct_A')
accuracies_B = df_all_runs.pivot(index='data_point_id', columns='run', values='is_correct_B')
mean_accuracies_A = accuracies_A.mean(axis=1)
mean_accuracies_B = accuracies_B.mean(axis=1)
t_stat, p_value = stats.ttest_rel(mean_accuracies_A, mean_accuracies_B, alternative='greater')
print(f"t-statistic: {t_stat}, p-value: {p_value}")
I got a p-value <0.01, meaning I can reject the null hypothesis that ResponseFormatA is no more accurate than ResponseFormatB.
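For intuition, the statistic behind a paired t-test is just the mean of the pairwise differences divided by its standard error. The sketch below computes it by hand on made-up numbers (paired_t_statistic and the toy lists are assumptions for illustration). stats.ttest_rel computes this same statistic and then converts it into a p-value using a t distribution with n - 1 degrees of freedom.

```python
import math
import statistics

def paired_t_statistic(sample_a, sample_b):
    """t = mean(d) / (stdev(d) / sqrt(n)), where d holds the pairwise differences."""
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error

# Toy per-question mean accuracies (made up for illustration).
toy_a = [1.0, 0.0, 1.0, 1.0]
toy_b = [0.0, 0.0, 1.0, 0.0]
t_stat = paired_t_statistic(toy_a, toy_b)
```

A larger positive t_stat means format A beats format B more consistently across questions, relative to the spread of the differences.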
Conclusion
Based on the results of the experiment, we can say with reasonable confidence that ResponseFormatA performs better than ResponseFormatB on this task.
But why?
In this case, it’s simple.
These response formats are meant to help the LLM reason step by step to arrive at the answer. This is known as chain of thought reasoning. However, for it to work, we need the LLM to first provide us with the reasoning of how it arrived at the answer and then the answer.
In ResponseFormatA, we defined our Pydantic model with the reasoning first and the answer second. This means that the LLM will give us the reasoning first, and then provide the answer. Which is exactly what we want.
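ResponseFormatB works in the opposite way. This means that the LLM will give us the answer first, and then provide the reasoning. Because the model generates its output token by token, the answer is written before any reasoning exists for it to build on. So our chain of thought reasoning becomes a zero-shot prompt. In this case, the reasoning is a byproduct of the answer.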
So, to summarize, when using structured outputs, don’t put the cart before the horse.
That’s all! Let me know if you have any questions in the comments.
Citation
@online{castillo2024,
  author = {Castillo, Dylan},
  title = {Structured Outputs: Don’t Put the Cart Before the Horse},
  date = {2024-11-09},
  url = {https://dylancastillo.co/posts/llm-pydantic-order-matters.html},
  langid = {en}
}