The good, the bad, and the ugly of Gemini’s structured outputs
For a long time, I didn’t pay much attention to the idea that structured outputs could have an impact on the performance of LLMs. But after reading Let Me Speak Freely? and .txt’s rebuttal, I started to question my assumptions.
I decided to run some benchmarks myself using popular proprietary models, starting with GPT-4o-mini. During my analysis, I found that structured outputs did, in fact, decrease performance on some tasks.
After that, I was curious to see if the same was true for Gemini 1.5 Flash. This time, the answer wasn’t so straightforward, which is why I decided to write a separate post about it. In the process, I ran into an issue in Gemini’s Generative AI SDK that can break your application if you’re not careful.
In this article, I’ll share the results from running various benchmarks for Gemini 1.5 Flash comparing structured and unstructured outputs, along with the lessons I learned along the way.
You can find the full code to run the benchmarks in this GitHub repository.
Results (TLDR)
Google recently released a new Python SDK for Gemini, which seems to address the automatic key-sorting issue.
If you’re in a hurry, here are my key findings:
- The good: Overall, Gemini’s structured outputs performed on par with unstructured outputs.1
- The bad: This only holds for the less rigid interpretation of “structured outputs”. When testing constrained decoding specifically, Gemini’s structured outputs performed worse than unstructured outputs.
- The ugly: Function calling and constrained decoding have a big design flaw. The order of the keys specified in the schema is not preserved when using the Generative AI Python SDK. This will break chain-of-thought reasoning in your applications unless you use a workaround (that only works for constrained decoding).
The figure below shows the overall results for Gemini 1.5 Flash:
The figure above compares the performance of Gemini’s structured outputs to unstructured outputs. NL stands for Natural Language, which means the model writes the output in a free-form manner. In contrast, JSON-Prompt and JSON-Schema involve structured outputs that follow a predefined JSON schema.
For JSON-Prompt, the JSON schema is included in the prompt, instructing the model to generate JSON formatted output based on its MIME type configuration. JSON-Schema works similarly, but the schema is set directly in the model’s configuration instead of being included in the prompt.
When considering both JSON-Prompt and JSON-Schema, Gemini’s structured outputs performed comparably to unstructured outputs. However, with JSON-Schema alone (i.e., constrained decoding), performance drops compared to unstructured outputs. This difference is most evident in the Shuffled Objects task, where NL achieved a score of 97.15%, while JSON-Schema scored 86.18%.
Structured outputs 101
In case you’re not familiar with the concept, structured outputs are a group of methods that “ensure that the model outputs adhere to a specific structure”. In open-weight models, a structure can mean anything from a JSON schema to a specific regex pattern. With proprietary models, it usually means a JSON schema.
In its less rigid interpretation, this includes any method that can generate LLM outputs adhering to a structured format, such as prompting, function calling, JSON mode, or constrained decoding.
In its more rigid interpretation, this includes only constrained decoding, as it is the only method that will guarantee the output adheres to the schema you provide.
Study design
In Let Me Speak Freely?, Tam et al. evaluated structured and unstructured outputs across three reasoning tasks and six classification tasks. They used exact match to evaluate the reasoning tasks and accuracy to evaluate the classification tasks. They ran the experiments using the following models:
- Proprietary models: gpt-3.5-turbo-0125, claude-3-haiku-20240307, gemini-1.5-flash, and gpt-4o-mini-2024-07-18.
- Open-weight models: LLaMA3-8B-Instruct, and Gemma-2-9B-Instruct.
For this article, I just focused on Gemini 1.5 Flash and the reasoning tasks. I already ran the benchmarks for GPT-4o-mini and Llama3-8B-Instruct in my previous post.
I excluded the classification tasks because I believe structured outputs perform better in classification tasks, and this is also in line with the study’s original findings. So I just focused on the three reasoning tasks:
- GSM8K: A dataset of grade school math word problems.
- Last Letter: A dataset of simple word puzzles that require concatenating the last letters of a list of names.
- Shuffled Objects: A dataset that requires reasoning about the state of a system after a sequence of shuffling operations.
The rest of the article details the process of re-evaluating these benchmarks using Gemini 1.5 Flash.
Structured outputs with Gemini
Gemini has three ways of generating structured outputs:
- Forced function calling (FC): You force the model to call a “function”, which makes the model generate JSON that matches the function’s argument schema.
- Schema in prompt (JSON-Prompt): You include a JSON schema in the prompt, set `mime_type='application/json'`, and the model generates a response that matches the schema.
- Schema in model configuration (JSON-Schema): You provide a JSON schema in the model configuration, set `mime_type='application/json'` in the request, and the model generates a response that matches the schema. This is the only method that seems to use controlled generation.
I’ve included JSON-Prompt and JSON-Schema in the analysis, but had to exclude FC because it’s not possible to use it for chain-of-thought reasoning, which is a requirement for the benchmarks.
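For reference, here’s a minimal sketch of how the two methods I kept can be configured with the Generative AI Python SDK (google-generativeai). The schema, prompt, and environment variable below are illustrative, not the exact setup from the benchmark repository:

```python
import os

import google.generativeai as genai
import typing_extensions as typing


# Hypothetical schema, used only to illustrate the two configurations.
class CityInfo(typing.TypedDict):
    city: str
    country: str


genai.configure(api_key=os.environ["GOOGLE_API_KEY"])  # assumes your key is in this env var
model = genai.GenerativeModel("gemini-1.5-flash")

# JSON-Prompt: the schema is described in the prompt; only the MIME type is set.
json_prompt_response = model.generate_content(
    'Name one European capital. Respond as JSON: {"city": <string>, "country": <string>}',
    generation_config=genai.GenerationConfig(response_mime_type="application/json"),
)

# JSON-Schema: the schema goes into the model configuration (constrained decoding).
json_schema_response = model.generate_content(
    "Name one European capital.",
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema=CityInfo,
    ),
)

print(json_prompt_response.text)
print(json_schema_response.text)
```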
Issues with Gemini’s structured outputs
When running the three benchmarks, I quickly ran into a performance issue with FC and JSON-Schema. In all tasks, both showed double-digit performance drops compared to NL.
This didn’t make a lot of sense, so I started investigating.
I was using the following response schema for all structured outputs:
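(a minimal sketch, assuming the two string fields referenced in the prompt below; the exact representation in the repo may differ)

```python
# Sketch of the response schema; the exact representation in the repo may differ.
RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},
        "answer": {"type": "string"},
    },
    "required": ["reasoning", "answer"],
}
```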
The prompts were similar to the one below, adjusted according to the task:
You are an expert in solving grade school math tasks. You will be presented with a grade-school math word problem and be asked to solve it.
You will always respond with JSON matching the following schema:
[RESPONSE_SCHEMA]
First, provide your step by step reasoning in the “reasoning” field. Then, in the “answer” field, provide your final answer. Don’t include any other text in the “answer” field.
I eventually realized that the performance drop in JSON-Schema was due to the keys in the schema being reversed when generating the response. I then noticed that Tam et al. briefly mentioned in Let Me Speak Freely? that JSON-Schema responses failed to produce valid JSON due to this exact problem, so they did not include it in their analysis.
I didn’t want to exclude it from the analysis, so I started looking for a way to control the order of the keys in the schema. I found that the order of the keys in the schema gets sorted alphabetically before the response is generated. To confirm this, I ran a test with 100 randomly generated schemas. Every resulting output had its keys sorted alphabetically, so it’s clear this is not by chance.
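A minimal version of that probe, assuming the legacy google-generativeai SDK and an API key in the GOOGLE_API_KEY environment variable, looks roughly like this:

```python
import json
import os

import google.generativeai as genai
import typing_extensions as typing


class Probe(typing.TypedDict):
    zebra: str  # declared first, but sorts last alphabetically
    apple: str  # declared second, but sorts first alphabetically


genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

response = model.generate_content(
    "Fill both fields with any short word.",
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema=Probe,
    ),
)

# If this prints ['apple', 'zebra'], the declared key order was not preserved.
print(list(json.loads(response.text).keys()))
```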
I also found that the Vertex AI documentation mentions a `propertyOrdering` parameter, which should allow control over the order of keys. However, this feature doesn’t appear to work with the Generative AI Python SDK.

Unable to use the `propertyOrdering` parameter, I resorted to a quick workaround: I named the keys in a way that forced the desired order alphabetically. Instead of using `reasoning` and `answer`, I used `reasoning` and `solution`. This preserved the chain-of-thought reasoning in the responses and resolved the performance drop in JSON-Schema.
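In code, the workaround is just a rename of the schema fields. A minimal sketch, assuming a TypedDict-style schema:

```python
import typing_extensions as typing


# Workaround: choose key names whose alphabetical order matches the desired
# order ("reasoning" < "solution"), since the keys get sorted alphabetically.
class Response(typing.TypedDict):
    reasoning: str
    solution: str
```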
But FC was a different story. Unlike JSON-Schema, the order of the keys follows a less predictable pattern, and I couldn’t find a way to control it. So I decided to exclude FC from the analysis.
Benchmarks of Gemini 1.5 Flash
After applying the key ordering workaround, and additional improvements discussed in my previous post, I recomputed the benchmarks.
The table below shows the results for Gemini 1.5 Flash compared to the original results from Tam et al.
| Task | Method | NL | JSON-Prompt | JSON-Schema |
|---|---|---|---|---|
| GSM8K | Tam et al. | 89.33 | 89.21 | 47.78 |
| GSM8K | Me, 0-shot | 93.71 | 93.78 | 93.03 |
| GSM8K | Me, 3-shot | 94.84 | 94.16 | 93.63 |
| Last Letter | Tam et al. | 65.45 | 77.02 | 0.67 |
| Last Letter | Me, 0-shot | 82.67 | 80.00 | 81.33 |
| Last Letter | Me, 3-shot | 80.00 | 82.00 | 80.67 |
| Shuffled Obj. | Tam et al. | 58.21 | 65.07 | N/A |
| Shuffled Obj. | Me, 0-shot | 97.15 | 92.28 | 86.18 |
| Shuffled Obj. | Me, 3-shot | 92.68 | 98.37 | 84.96 |
Using 0-shot and 3-shot prompts, I was able to improve all the metrics on the tasks and methods Tam et al. used for their benchmarks. Which is great!

NL and JSON-Prompt are tied, without a clear winner between them: each method got a slight edge in 3 of the 6 task/prompt combinations. On the other hand, JSON-Schema performed worse than NL in 5 of the 6. Plus, in Shuffled Objects, it did so with a gap of more than 10 percentage points: 97.15% for NL vs. 86.18% for JSON-Schema.
Tam et al. defined structured outputs as any method that “involves providing output in standardized formats like JSON or XML through format restriction”, which is in line with the less rigid interpretation of structured outputs. Using this definition, the results show no performance gap between structured and unstructured outputs. This directly contradicts the study’s original claim.

But if you take the equally common interpretation that constrained decoding is the only form of structured generation, then the study’s original conclusion still applies: there is indeed a significant performance gap between structured and unstructured outputs.
Weird, I know. But that’s the way it is.
Conclusion
Results are mixed for Gemini 1.5 Flash.
The good news is that structured outputs perform on par with unstructured ones. The bad news is that this only holds if you adopt the more flexible definition of “structured outputs.” And the ugly news is that Gemini’s Generative AI SDK has a major issue in how it handles the order of keys in the provided schema.
Based on the results, I’d suggest the following when using Gemini:
- Avoid FC for any tasks that require chain-of-thought reasoning.
- Default to JSON-Prompt over JSON-Schema for reasoning tasks.
Finally, I want to emphasize that I love working with structured outputs. They save a lot of time. But I know there are tasks where they might perform worse (or better!) than unstructured outputs. There’s not enough evidence to support one or the other, so I just run my own evals and decide based on that.
That’s the real takeaway: run your own evals and choose the approach that works best for you. Don’t blindly trust random posts online.
You can find the code to replicate my results in this GitHub repository.
Footnotes
1. Assuming the less rigid interpretation of “structured outputs”.
Citation
@online{castillo2024,
author = {Castillo, Dylan},
title = {The Good, the Bad, and the Ugly of {Gemini’s} Structured
Outputs},
date = {2024-12-27},
url = {https://dylancastillo.co/posts/gemini-structured-outputs.html},
langid = {en}
}