Working with LLMs can feel like magic. With a well-crafted prompt, you can get them to do almost anything or create nearly anything for you. Despite this flexibility, they are also limited: in essence, they take in a string and output a string. You can do many things to make that output string look like something else. You can tell the model to produce a list, a dictionary, or a JSON object. But that is a bit of a hack. In the end, it is still a string, and your ability to enforce the format is limited. You can tell the LLM to always output a dictionary, but there is no guarantee that it will.
Here's a practical example from something I built. I would feed the LLM a string of comma-separated URLs. It was supposed to evaluate the URLs and then return a comma-separated list of URLs. This sometimes worked well, but occasionally the LLM would throw in a '\n' after each comma, which broke the downstream data processing. To counteract this, I added instructions to the prompt to never include a '\n'. Despite the prompt changes, it would still include them sporadically. So I added logic after the output was returned to strip them out whenever they appeared.
import openai

def remove_duplicates(article_set):
    """Many sources may cover the same story in a given day. When this happens we only want to include the story one time using one source.
    The URLs for all scraped articles are sent to GPT, and using the keywords in the URL it determines which URLs represent unique stories.
    In cases where there are multiple sources for a story, it selects what it believes to be the most reputable source and returns that URL.

    :param str article_set: All the URLs for the scraped articles are combined into a string so they can be sent to GPT for evaluation
    :return: str: The unique URLs as a comma-separated string
    """
    for attempt in range(5):
        try:
            # In this case I used gpt-3.5-turbo-16k because I needed to send all URLs at once so the needed token count was much higher.
            # But this model is twice as expensive so where possible I used gpt-3.5-turbo
            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo-16k",
                messages=[
                    {"role": "system",
                     "content": "You will receive a set of URLs. Analyze the keywords within each URL to identify potential overlapping content. If multiple URLs seem to discuss the same topic based on shared keywords (for instance, if 3 URLs contain the terms 'microsoft' and 'teams'), choose only one URL, giving preference to the most reputable source based on general knowledge about the source's reputation. After your analysis, provide a comma-separated list of unique URLs that correspond to distinct topics. Your response should only be the list of URLs, without any additional text, line breaks, or '\\n' characters."},
                    {"role": "user",
                     "content": article_set}
                ],
                max_tokens=10000,
                temperature=.2,
            )
            # GPT only returns things in string format. So though the prompt asks for a comma-separated list, the list actually comes back
            # as a string that you need to parse. On occasion GPT was appending a \n to each URL, which caused the subsequent parsing and
            # matching to break. In case that happens, this strips out the \n
            deduped_urls = response["choices"][0]["message"]["content"].replace('\n', '').strip()
            return deduped_urls
        except Exception:
            # The API call can fail transiently (rate limits, timeouts); retry up to five times before giving up
            continue
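For context on how the returned string gets used downstream, here is a minimal sketch. The URLs are made up, and the parsing step is just the one implied by the comments above: the comma-separated string has to be split back into a list before it can be matched against the scraped articles.

article_set = "https://example.com/microsoft-teams-update, https://sample.org/microsoft-teams-new-features"
deduped_urls = remove_duplicates(article_set)
# Split the comma-separated string back into a clean list of URLs for matching
unique_urls = [url.strip() for url in deduped_urls.split(",") if url.strip()]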
This was a simple structuring request, and the LLM still had trouble following it. Imagine if I had more complex structuring needs. I am not the only one who has run into issues like this; the need for more complex structures is widespread when working with LLMs, and many people have been working to fix the flaws in these outputs. Large providers like OpenAI and Anthropic introduced "JSON modes" to their models. These let you define a structure and then tell the model to return its output in that format. They worked better, but they were complex to set up and still not always reliable. After doing the work to define the format and prompt the LLM, there would be occasions when it would not follow instructions. And unlike my example above, it was hard to guard against the inconsistencies because they could vary so much: the model could ignore the format completely, create keys that weren't in the format, or add whitespace endlessly.
import json

import openai

# Define the structure for car information
car_info_schema = {
    "type": "object",
    "properties": {
        "make": {"type": "string", "description": "The manufacturer of the car"},
        "model": {"type": "string", "description": "The specific model of the car"},
        "year": {"type": "integer", "description": "The year the car was manufactured"},
        "features": {"type": "array", "items": {"type": "string"}, "description": "A list of notable features"},
        "description": {"type": "string", "description": "A brief description of the car"}
    },
    "required": ["make", "model", "year", "features", "description"]
}

# Function to extract car information from user input
def extract_car_info(user_input: str) -> dict:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",  # Use a model that supports function calling
        messages=[
            {"role": "system", "content": "You are a helpful assistant that extracts car information from user input."},
            {"role": "user", "content": f"Extract car information from the following text: {user_input}"}
        ],
        functions=[{
            "name": "extract_car_info",
            "description": "Extracts structured car information from text",
            "parameters": car_info_schema
        }],
        function_call={"name": "extract_car_info"}
    )
    # Parse the function call arguments (which are returned as a JSON string)
    function_args = json.loads(response["choices"][0]["message"]["function_call"]["arguments"])
    return function_args
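One way to guard against those failure modes is to wrap the call in defensive parsing. Here is a minimal sketch, not the exact code from my project; the helper name and retry count are illustrative. It catches malformed JSON and only accepts output that actually contains the required keys from the schema.

def extract_car_info_safely(user_input: str, retries: int = 3) -> dict | None:
    # Call extract_car_info and only accept output that matches the schema
    for _ in range(retries):
        try:
            car_info = extract_car_info(user_input)
        except json.JSONDecodeError:
            # The model returned arguments that were not valid JSON; try again
            continue
        # Reject outputs that drop required keys or come back in a different shape
        if all(key in car_info for key in car_info_schema["required"]):
            return car_info
    return None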
There are other methods to enforce formatting. The most prominent of these is Instructor. It is similar in that you define a format for the model to follow, but it has more controls and guardrails built in to ensure that the outputs follow the format at all times. You can see in the example below that the configuration is similar to the JSON modes, but a bit simpler.
import instructor
from pydantic import BaseModel
from openai import OpenAI

# Define your desired output structure
class UserInfo(BaseModel):
    name: str
    age: int

# Patch the OpenAI client
client = instructor.from_openai(OpenAI())

# Extract structured data from natural language
user_info = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=UserInfo,
    messages=[{"role": "user", "content": "John Doe is 30 years old."}],
)

print(user_info.name)
#> John Doe
print(user_info.age)
#> 30
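Part of what makes Instructor reliable is that it validates the response against the Pydantic model and can automatically re-ask the model when validation fails. A minimal sketch of that, using the same patched client as above (the retry count here is just an example):

# If the response fails UserInfo validation, Instructor re-prompts the model
# with the validation errors, up to the retry limit
user_info = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=UserInfo,
    max_retries=2,
    messages=[{"role": "user", "content": "John Doe is 30 years old."}],
)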
Most recently, OpenAI took a page from Instructor's playbook and introduced a Structured Outputs capability. The configuration is identical to Instructor's setup, and the consistency of the outputs is just as good. The only real drawback seems to be that it is a bit slower than using Instructor.
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,
)

event = completion.choices[0].message.parsed
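The parsed result is a CalendarEvent instance rather than a raw string, so the fields can be read directly as typed attributes (the exact values depend on what the model extracts):

# Typed access to the parsed fields; no string parsing needed
print(event.name)
print(event.date)
print(event.participants)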
Between Instructor and OpenAI Structured Outputs, getting consistent outputs seems to be a solved problem. You can use the LLM of your choice based on quality, cost, speed, etc., and get the needed format. This gives users lots of power and flexibility to build in the way that makes the most sense to them.
To test this, I created a book summarizer tool. The user inputs a summary of a book, the model extracts the required details from the summary, and it returns those details in the specified format. Since the Instructor and OpenAI formats are identical, I was able to build this for four different LLMs in under an hour. Now you can input one summary, get the outputs side-by-side, and assess the quality of each output based on the model used. This is a simple example that could be expanded to fit more complex and challenging use cases.
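To give a sense of the pattern, here is a rough sketch rather than the exact code from the repo: a hypothetical BookDetails model shared across providers, with Instructor patching each client so every model returns the same structure. The field names, model names, and summary text are placeholders.

import instructor
from anthropic import Anthropic
from openai import OpenAI
from pydantic import BaseModel

# Hypothetical response model; the real tool extracts its own set of details
class BookDetails(BaseModel):
    title: str
    author: str
    genre: str
    themes: list[str]

summary = "A young girl in the American South watches her father defend an innocent man accused of a crime."
prompt = f"Extract the book details from this summary: {summary}"

# Same response model, two different providers
openai_client = instructor.from_openai(OpenAI())
anthropic_client = instructor.from_anthropic(Anthropic())

openai_result = openai_client.chat.completions.create(
    model="gpt-4o",
    response_model=BookDetails,
    messages=[{"role": "user", "content": prompt}],
)

anthropic_result = anthropic_client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    response_model=BookDetails,
    messages=[{"role": "user", "content": prompt}],
)

# Both results are BookDetails instances, so the outputs line up side-by-side
print(openai_result)
print(anthropic_result)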
This video demo walks through some of the background and setup. It also shows examples of the different outputs side-by-side. They all return the information in the same format, but the quality of what they return varies. It also gives you some ideas for combining the power of structured outputs with the knowledge embedded in the models themselves.
You can give this a try for yourself by grabbing the repo here: https://github.com/brayden-s-haws/structured_outputs_testing/tree/main