Structured outputs can hurt the performance of LLMs
When I read “Let Me Speak Freely?” by Tam et al., I thought they raised an interesting question: does constraining LLM outputs to structured formats impact the quality of their responses?
In both the original study and their recent update, Tam et al. concluded that it does. They found that “structured generation constraints significantly impact LLM performance across various tasks”.
But the study had major flaws. The .txt team wrote a very compelling rebuttal to the paper. For Llama-3-8B-Instruct, they demonstrated that Tam et al.’s results were mostly due to poor prompting, unfair comparisons, and the improper use of an “AI parser”, rather than the use of structured outputs.
I liked the rebuttal, but it still left me wondering how well their results generalize. They focused on a single model¹, which represents a small fraction of the LLMs powering applications in production today. Open-weight models offer more flexibility in how you structure the output, such as letting users specify regular expressions to constrain generation. Proprietary models lack this. Right now, JSON is the only structured output format guaranteed to work across most popular providers.
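To give a sense of that flexibility, here is a minimal sketch of regex-constrained generation using .txt’s outlines library. The model name and pattern are only illustrative, and the exact API may differ between outlines versions:

```python
# Minimal sketch of regex-constrained generation with outlines (illustrative;
# the API shown matches the 2024-era releases and may have changed since).
import outlines

model = outlines.models.transformers("meta-llama/Meta-Llama-3-8B-Instruct")

# Force the model to answer in the form "ANSWER: <up to 4 digits>"
generator = outlines.generate.regex(model, r"ANSWER: \d{1,4}")
print(generator("What is 7 times 6? Reply in the format ANSWER: <int>."))
```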
Given that JSON-only constraint, would the .txt team’s results still hold?
Plus, both the original study and the rebuttal focused on tasks that might not be a good proxy for the full range of tasks people use LLMs for. Would the rebuttal results be different in settings outside of simple reasoning tasks?
So I decided to:
- Replicate the results from .txt’s rebuttal using LLaMA3-8B-Instruct.
- Run the same tasks using a proprietary model, GPT-4o-mini.
- Test the results on a broader set of tasks, such as LiveBench.
This article presents the results of the first two steps. All the code is available on GitHub.
Results
If you’re short on time, here are the key findings:
- Tam et al.’s conclusions about structured outputs might still hold, even if they did not properly test for them. There are tasks where structured outputs perform worse than unstructured ones.
- .txt’s rebuttal is accurate: it shows that structured outputs are as good as or better than unstructured outputs for LLaMA3-8B-Instruct in the tasks considered. But the same did not hold for GPT-4o-mini (and possibly other models).
The figure below shows the results for GPT-4o-mini using .txt’s prompt fixes, along with additional improvements I made.
For GSM8K and Last Letter, structured and unstructured methods scored similarly. But for Shuffled Objects, unstructured outputs clearly surpassed a structured format.
The rest of the article will explain the approach I took to get these results.
Study design
Tam et al. evaluated structured and unstructured outputs across three reasoning tasks and six classification tasks. They used exact match to evaluate the reasoning tasks and accuracy to evaluate the classification tasks. They ran the experiments using the following models:
- Proprietary models: gpt-3.5-turbo-0125, claude-3-haiku-20240307, gemini-1.5-flash, and gpt-4o-mini-2024-07-18.
- Open-weight models: LLaMA3-8B-Instruct, and Gemma-2-9B-Instruct.
.txt used a similar setup, but only focused on the reasoning tasks using LLaMA3-8B-Instruct. They did not include classification tasks because Tam et al. observed that structured outputs resulted in better performance in these tasks, so there was no need to test for it.
I also believe that structured outputs are better for classification tasks. So, I excluded them from my analysis as well.
The reasoning tasks were:
- GSM8K: A dataset of grade school math word problems.
- Last Letter: A dataset of simple word puzzles that require concatenating the last letters of a list of names.
- Shuffled Objects: A dataset that requires reasoning about the state of a system after a sequence of shuffling operations.
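To make the evaluation concrete, here is a rough sketch of exact-match scoring for these tasks. The `extract_answer` argument is a hypothetical helper that pulls the final answer out of a model response (for example, the text after “ANSWER:” or the “answer” field of a JSON object):

```python
# Rough sketch of exact-match scoring for the reasoning tasks.
def exact_match(prediction: str, target: str) -> bool:
    # Normalize whitespace and case before comparing.
    return prediction.strip().lower() == target.strip().lower()


def score(responses: list[str], targets: list[str], extract_answer) -> float:
    # `extract_answer` is a hypothetical parser supplied by the caller.
    hits = sum(
        exact_match(extract_answer(response), target)
        for response, target in zip(responses, targets)
    )
    return 100 * hits / len(targets)
```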
The rest of the article details the process of replicating .txt’s rebuttal on these tasks and evaluating the same tasks using GPT-4o-mini.
Replicating .txt’s rebuttal
.txt made it very easy to reproduce their results by sharing their code on GitHub. I just set up a machine on Modal and ran the code.
While going through the code, I noticed some small issues with the prompts. So I decided to tweak them a bit.
Below are .txt’s original results compared to mine, after the prompt adjustments:
| Task | .txt (Unstructured) | .txt (Structured) | Me, 3-shot (Unstructured) | Me, 3-shot (Structured) |
|---|---|---|---|---|
| GSM8K | 77.18 | 77.79 | 79.98 | 79.45 |
| Last Letter | 73.33 | 77.33 | 74.00 | 78.00 |
| Shuffled Objects | 40.72 | 44.35 | 42.68 | 43.90 |
Except for Structured in the Shuffled Objects task, I was able to improve all the metrics. In GSM8K’s case, the improvement even reversed .txt’s result, with Unstructured outperforming Structured by a small margin.
But I don’t think this matters much.
Their conclusion still holds: structured outputs are either as good as or better than unstructured outputs, in the tasks considered.
I’ll explain the prompt changes I made below, so that you can judge for yourself if they make sense.
Formatting few-shot examples
In the GSM8K and Last Letter tasks, the few-shot prompts for both the unstructured and structured runs used examples formatted as JSON objects and asked the LLM to produce its output in the same format, from which the answer was extracted.
That felt unfair. Even though you’re not formally constraining the LLM to produce a JSON object, you’re still asking it to format its response in a somewhat unnatural way.
I adjusted the prompts to be as similar as possible for both unstructured and structured outputs while still trying to get the most out of each approach.
For example, in GSM8K, the unstructured prompt is:
You are an expert in solving grade school math tasks. You will be presented with a grade-school math word problem and be asked to solve it. You will always respond in the following format:
<str, reasoning about the answer>
ANSWER: <int, final answer>
First, provide your step by step reasoning. Then, in ANSWER, provide an integer that corresponds to the correct answer to the question. Don’t include any other text in ANSWER.
And the structured prompt is:
You are an expert in solving grade school math tasks. You will be presented with a grade-school math word problem and be asked to solve it. You will always respond in the following format:
{“reasoning”: <str, reasoning about the answer>, “answer”: <int, final answer>}
First, provide your step by step reasoning in the “reasoning” field. Then, in the “answer” field, provide an integer that corresponds to the correct answer to the question. Don’t include any other text in the “answer” field.
Finally, for all the tasks, I used a 3-shot prompt.
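To make this concrete, here is a sketch of how the two variants line up in code. The Pydantic model, the example, and the helper names are mine, not the exact code from the repo:

```python
# Sketch: the same few-shot examples are rendered as plain text with an ANSWER
# line for the unstructured runs, and as JSON matching a schema for the
# structured runs. All names and the example itself are illustrative.
from pydantic import BaseModel


class GSM8KAnswer(BaseModel):
    reasoning: str
    answer: int


FEW_SHOT = [
    {
        "question": "Ali has 3 apples and buys 2 more. How many does he have?",
        "reasoning": "Ali starts with 3 apples and buys 2 more, so 3 + 2 = 5.",
        "answer": 5,
    },
    # ...two more examples for the 3-shot prompt
]


def unstructured_example(ex: dict) -> str:
    return f"Q: {ex['question']}\n{ex['reasoning']}\nANSWER: {ex['answer']}"


def structured_example(ex: dict) -> str:
    payload = GSM8KAnswer(reasoning=ex["reasoning"], answer=ex["answer"])
    return f"Q: {ex['question']}\n{payload.model_dump_json()}"
```

A model like this can then double as the JSON schema used for the structured runs.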
Clarifying the task
I also tried to make the prompts clearer. The description of the task in the original Last Letter prompt was:
You are an expert in solving simple word puzzles using reasoning steps. Your specific task is going to be to take a list of 4 names and reason about the last letter of each ., then you will concatenate those letters into a word.
I changed it to:
You are an expert in solving word puzzles. Your specific task is going to be to take a list of 4 names, get the last letter of each and concatenate these letters into a word.
The original prompt was reasonable, but I thought the new version was clearer. Through trial and error, I’ve learned that when working with LLMs, it’s best to be as clear and direct as possible.
Evaluating GPT-4o-mini
Using the same setup as before, I ran the same tasks with gpt-4o-mini-2024-07-18.
In the table below, you can see the results, including the original results from Tam et al. for comparison:
| Task | Method | NL | FRI | JSON-Mode | JSON-Schema |
|---|---|---|---|---|---|
| GSM8K | Tam et al. | 94.57 | 87.17 | 86.95 | 91.71 |
| GSM8K | Me, 0-shot | 94.31 | 92.12 | 93.33 | 93.48 |
| GSM8K | Me, 3-shot | 93.86 | 92.72 | 93.25 | 92.95 |
| Last Letter | Tam et al. | 83.11 | 84.73 | 76.00 | 86.07 |
| Last Letter | Me, 0-shot | 87.33 | 88.00 | 90.00 | 87.33 |
| Last Letter | Me, 3-shot | 92.00 | 94.67 | 90.00 | 93.33 |
| Shuffled Obj. | Tam et al. | 82.85 | 81.46 | 76.43 | 81.77 |
| Shuffled Obj. | Me, 0-shot | 95.12 | 79.67 | 81.71 | 89.84 |
| Shuffled Obj. | Me, 3-shot | 92.68 | 69.51 | 62.60 | 65.85 |
NL stands for “Natural Language”, which would correspond to the Unstructured method in the previous table.
FRI stands for “Format Restricting Instructions”, which here is JSON generated through OpenAI’s function calling. JSON-Mode is JSON generated through OpenAI’s JSON mode. JSON-Schema is JSON generated using constrained decoding.
JSON-Schema is the closest equivalent to Structured as referenced in the previous table. But, in real-life applications, you don’t really care about how the output was generated. You just want to get the output in the format you want. So, for the sake of comparison, I will consider the three other methods equivalent to Structured as well.
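For reference, here is roughly how the three structured methods map onto OpenAI’s API. The schema is simplified and the prompt is a placeholder; treat it as a sketch rather than the exact code I ran:

```python
# Sketch of the three "structured" methods on OpenAI's API (simplified).
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini-2024-07-18"
messages = [{"role": "user", "content": "Solve: 12 * 7. Give your reasoning and an integer answer."}]
schema = {
    "type": "object",
    "properties": {"reasoning": {"type": "string"}, "answer": {"type": "integer"}},
    "required": ["reasoning", "answer"],
    "additionalProperties": False,
}

# FRI: function calling, forcing the model to call a single function
fri = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    tools=[{"type": "function", "function": {"name": "respond", "parameters": schema}}],
    tool_choice={"type": "function", "function": {"name": "respond"}},
)

# JSON-Mode: the model must return valid JSON, but no schema is enforced
json_mode = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "system", "content": "Reply as a JSON object."}, *messages],
    response_format={"type": "json_object"},
)

# JSON-Schema: constrained decoding against the schema
json_schema = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "respond", "schema": schema, "strict": True},
    },
)
```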
Adjusting for proprietary models
In this case, I allowed up to 3 retries when a response failed to parse, because function calling had high error rates in the zero-shot prompting scenario.
These retries primarily affected FRI results. This might make the comparisons in Last Letter biased in favor of structured outputs (FRI was the best method in this case). But since JSON-Schema also outperformed NL in this case, this adjustment does not alter the overall conclusions. The other methods maintained error rates of <0.5% in GSM8K and 0% in Last Letter and Shuffled Objects.
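Conceptually, the retry logic looked something like this (a sketch with hypothetical names, not the exact code):

```python
# Sketch of the retry-on-parse-failure logic (illustrative, not the exact code).
MAX_RETRIES = 3


def call_with_retries(call_model, parse, max_retries: int = MAX_RETRIES):
    last_error = None
    for _ in range(max_retries):
        response = call_model()
        try:
            return parse(response)  # e.g., json.loads plus field validation
        except ValueError as error:
            last_error = error  # parsing failed, ask the model again
    raise RuntimeError(f"No parseable response after {max_retries} attempts") from last_error
```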
I used slightly different parsing functions for Unstructured and Structured outputs. The Unstructured parser was more lenient, removing commas and periods at the end of responses. But I believe this remains a fair comparison, given that in the Structured cases you provide a JSON schema, which is more informative.
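And here is roughly how the two parsers differed (again, a sketch with simplified details):

```python
# Sketch of the two parsers (hypothetical names; details simplified).
import json


def parse_unstructured(text: str) -> str:
    # Take whatever follows the final "ANSWER:" and be lenient about
    # trailing punctuation such as commas and periods.
    answer = text.rsplit("ANSWER:", 1)[-1].strip()
    return answer.rstrip(".,").strip()


def parse_structured(text: str) -> str:
    # The JSON schema already pins down the shape, so just read the field.
    return str(json.loads(text)["answer"]).strip()
```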
Analyzing the results
Similar to what the .txt team found, after adjusting the prompts, the performance of structured outputs increased substantially compared to Tam et al.’s results.
Except for NL in GSM8k and FRI in Shuffled Objects, I was able to improve all the metrics for both unstructured and structured outputs using a 0-shot prompt. With 3-shot prompts, I improved Last Letter across all methods, GSM8k for every method except NL, and only NL in Shuffled Objects.
For GSM8k and Last Letter, the results were very similar between unstructured and structured outputs. There was a slight edge for unstructured outputs in GSM8k and for structured outputs in Last Letter. In these cases, it’s not clear that one approach definitively outperforms the other.
On the other hand, Shuffled Objects shows a clear advantage for unstructured outputs over structured outputs. This was unexpected, and even after tweaking the prompts, I couldn’t fully close the gap.
Despite the issues in Tam et al.’s study, their conclusion appears to hold. In this particular scenario, using a fairly popular model with reasonable prompts, there is a significant difference in performance between structured and unstructured outputs.
In GSM8k and Shuffled Objects, few-shot prompting generally decreased performance. This is in line with other analyses.
Conclusion
You’re here because you want to know whether to use structured or unstructured outputs. As a developer, I’m glad to say the answer is: it depends.
I love using structured outputs in my daily work, because it makes it much easier to work with the output of LLMs. I always encourage clients who aren’t using them yet to give them a try.
That said, until there’s stronger evidence showing that both approaches are equivalent, the best course of action is to test things for yourself. Run your own evals and make a decision based on data.
I expect that in most cases, structured outputs will have similar performance to unstructured outputs. But, if you blindly assume that structured outputs are always equal to or better than unstructured ones, you might be missing out on easy performance gains.
Take the example of Shuffled Objects with GPT-4o-mini. You could potentially reduce the gap between the two methods by continuing to improve the prompts or by switching to a more powerful model. But the cost, in time and effort, might be higher than that of simply switching to unstructured outputs.
And this cuts both ways. Unstructured outputs aren’t inherently better or worse than structured ones. Again, the right choice depends on your task, the model, and your prompt engineering skills. Test both approaches, identify if there are differences, and choose what works best.
Footnotes
1. Although they’ve also shared results for other open-weight models using a different setup.
Citation
@online{castillo2024,
author = {Castillo, Dylan},
title = {Structured Outputs Can Hurt the Performance of {LLMs}},
date = {2024-12-08},
url = {https://dylancastillo.co/posts/say-what-you-mean-sometimes.html},
langid = {en}
}