Carving Out AI Talent: Halloween Pumpkin Judging with OpenAI and Anthropic Models

October 28, 2024
5 min read

Two advances in language models that have recently become mainstream are image analysis and structured output. We decided to do a Halloween-themed evaluation of these two modes for OpenAI's and Anthropic's models.

Structured output modes constrain a language model to generate output that conforms to a schema. This is great when using models to call functions, or to extract data according to a template. And image analysis lets models including Claude and GPT provide detailed textual responses based on input images. To test these capabilities, we built LLM-powered pumpkin carving judges using Anthropic's Claude 3.5 Sonnet and OpenAI's GPT-4o, and evaluated them on some AI-generated pumpkin images.

Images

We built a pipeline using Claude (for prompt generation) and the FLUX.1-schnell image generation model1 to generate images of carved pumpkins. Claude was asked to generate a detailed description of a pumpkin with the scariness, creativity, and skill specified as in the table below:

The detailed descriptions were passed to the Flux model to generate images. Two examples are shown below, along with the prompts used.

Figure: Examples of images generated for testing, with prompts used to create them.

Scary, creative, and skilled

Menacing jack-o'-lantern with intricate filigree patterns, glowing crimson eyes, jagged teeth, twisted vines for hair, hyperrealistic carving, eerie backlighting, fog swirling, masterful craftsmanship

Not scary, conventional, low skill

Lopsided jack-o'-lantern with crude triangle eyes, crooked buck-toothed grin, carving knife marks visible, candlelight flickering weakly inside, sitting on suburban porch steps, surrounded by fallen leaves
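The prompt-generation step above can be sketched roughly as follows. The attribute wording, helper name, and prompt text are our own illustrative choices, not the post's actual code, and the network call to Claude is left as a comment:

```python
# Sketch of the prompt-generation step (illustrative only; the wording and
# the build_prompt helper are assumptions, not the post's exact code).

def build_prompt(scary: bool, creative: bool, skilled: bool) -> str:
    """Build an instruction asking Claude for a detailed pumpkin description."""
    traits = [
        "scary" if scary else "not scary",
        "creative and unconventional" if creative else "conventional",
        "carved with high skill" if skilled else "carved with low skill",
    ]
    return (
        "Write a short, detailed image-generation prompt describing a "
        f"carved Halloween pumpkin that is {', '.join(traits)}."
    )

# The returned instruction would then be sent to Claude (e.g. via
# anthropic.Anthropic().messages.create), and Claude's detailed description
# passed on to the FLUX.1-schnell image model.
print(build_prompt(True, True, True))
```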

We generated images for all possible combinations of skill, creativity, and scariness, for a total of 32 images.
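With two levels for each of the three attributes, the grid can be enumerated with `itertools.product`. Note that 2³ gives 8 combinations; the four-images-per-combination split below is our assumption about how the total reaches 32:

```python
from itertools import product

# Enumerate every low/high combination of the three attributes.
# 2**3 = 8 combinations; generating 4 images per combination (our
# assumption) yields the 32 images mentioned above.
IMAGES_PER_COMBO = 4

combos = list(product([False, True], repeat=3))  # (scary, creative, skilled)
jobs = [(s, c, k, seed) for (s, c, k) in combos for seed in range(IMAGES_PER_COMBO)]

print(len(combos), len(jobs))  # 8 combinations, 32 image-generation jobs
```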

Rating

Both GPT-4o and Claude 3.5 Sonnet can generate output that conforms to a provided schema. We created the following class to represent the rating template. We ask the model to provide a score for originality, skill, and scariness, with an explanation of each. We also ask it to generate a short story about the pumpkin.

Figure: Python class defining the pumpkin rater’s fields and describing them
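The class itself isn't reproduced in the text. As a minimal sketch, here is an equivalent JSON schema of the kind both providers' structured-output modes accept; the field names are taken from the example output shown later, while the descriptions and score bounds are our assumptions:

```python
# Minimal JSON-schema sketch of the rating template (field names match the
# example output; descriptions and 1-5 bounds are our own assumptions).
RATING_SCHEMA = {
    "type": "object",
    "properties": {
        "originality": {"type": "integer", "minimum": 1, "maximum": 5,
                        "description": "How original the design is, 1-5."},
        "originality_comments": {"type": "string",
                                 "description": "Explanation of the originality score."},
        "skill": {"type": "integer", "minimum": 1, "maximum": 5,
                  "description": "Carving skill on display, 1-5."},
        "skill_comments": {"type": "string",
                           "description": "Explanation of the skill score."},
        "scariness": {"type": "integer", "minimum": 1, "maximum": 5,
                      "description": "How scary the pumpkin is, 1-5."},
        "scariness_comments": {"type": "string",
                               "description": "Explanation of the scariness score."},
        "story": {"type": "string",
                  "description": "A short story about the pumpkin."},
    },
    "required": ["originality", "originality_comments", "skill",
                 "skill_comments", "scariness", "scariness_comments", "story"],
}
```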

With this schema and the images, we can follow the model providers' examples for data extraction2,3 (they sometimes call it tool use to try and make it sound like intelligence) and image analysis4,5.
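As a sketch of how the pieces fit together on the Anthropic side, the following assembles a messages-API payload that pairs a base64-encoded image with a rating tool. The model string, tool name, and prompt text are our assumptions, and the request is only assembled here, not sent:

```python
import base64

def build_claude_request(image_bytes: bytes, rating_schema: dict) -> dict:
    """Assemble a Claude messages-API payload pairing an image with a
    rating tool (sketch only; nothing is sent over the network here)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "claude-3-5-sonnet-20240620",  # assumed model string
        "max_tokens": 1024,
        "tools": [{
            "name": "rate_pumpkin",  # illustrative tool name
            "description": "Record a structured rating for the pumpkin image.",
            "input_schema": rating_schema,
        }],
        # Force the model to answer via the tool, so output matches the schema.
        "tool_choice": {"type": "tool", "name": "rate_pumpkin"},
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png",
                            "data": b64}},
                {"type": "text", "text": "Rate this carved pumpkin."},
            ],
        }],
    }
```

The OpenAI side is analogous: GPT-4o accepts image inputs as content parts and a function/tool definition carrying the same JSON schema.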

Below is another example of an image and the JSON structure that one of the models generates:

INPUT image:

OUTPUT Rating for an example pumpkin:

{"originality": 5,
 "originality_comments": "This pumpkin design is incredibly original. Instead of traditional carving, it features an intricate, artistic scene painted or etched onto the surface. The design incorporates a whimsical forest scene with curling trees, a silhouette of a person, and various animals, all framed by swirling patterns around the edges. This level of detail and storytelling in a pumpkin design is highly uncommon and creative.",
 "skill": 5,
 "skill_comments": "The skill level demonstrated in this pumpkin design is exceptional. The artist has created a complex, multilayered scene with incredible detail. The use of light and shadow to create depth, the precision of the fine lines in the tree branches and swirls, and the balanced composition all indicate a high level of artistic ability. The seamless integration of the scene with the pumpkin's natural shape shows mastery of the medium.",
 "scariness": 2,
 "scariness_comments": "While beautifully crafted, this pumpkin design isn't particularly scary. It evokes more of a magical, enchanted forest feeling rather than a frightening Halloween atmosphere. The silhouettes and curling branches create a slightly eerie mood, but overall the scene is more whimsical than scary.",
 "story": "As twilight fell on All Hallows' Eve, young Lily ventured into the Whispering Woods, where the very trees seemed to come alive. Guided by her animal companions, she sought the hidden pumpkin patch said to glow with otherworldly light. Little did she know, each step drew her closer to a magical realm where Halloween spirits danced among the swirling branches, waiting to grant her heart's deepest wish."}

This output format is very convenient because it is rigidly structured. Reading the responses, it is clear that the model (in this case Claude) can process subtle details from the image and "see" it much as a person would.

Evaluation

For a production system, it might be desirable to do a full analysis covering robustness, security, bias, and other concerns. Here we want to run some simple checks on the systems to understand how they perform at their task.

Accuracy

The first check is a comparison of the model ratings to the "ground truth" ratings to see how accurate they are. During the data generation process, we specified the scariness, creativity, and skill of each design (the generation-time creativity setting serves as the ground truth for the originality score). These become the ground truth reference: we assign a rating of 1/5 to "low" and 5/5 to "high" and then look at the correlation with the scores the models generate for the 32 images. The correlation values for the three criteria are shown below:

Figure: Correlation between scariness, originality, and skill as scored by the pumpkin rating system and the ground truth used to generate the images.

The correlation value ranges from -1 to 1, with 1 being the best match between ground truth and prediction. The GPT- and Claude-based systems perform almost identically here, with Claude marginally ahead on two of the three criteria, but not materially. For both models, the degree of correlation clearly depends on the aspect of the pumpkin being judged: there is a high (~0.9) correlation for scariness, originality sits at about 0.5 (still some predictive power), and skill at about 0.25, indicating a much weaker relationship.

One reason for the difference could be ambiguity around the meaning of originality and skill. For example, anecdotally, some of the "low skill" images were still rated highly because they contained "clean lines", even if the overall design was misshapen. Also, since we used synthetic data, we are really evaluating the consistency between the image generation model's (Flux's) interpretation of these qualities and that of the LLMs.

Hallucination

A hot topic that's relevant to pumpkin judging is whether a model hallucinates or confabulates (makes up) details in its responses. Since the models generate comments alongside their ratings, we can inspect these to confirm whether they are taking any liberties.

In order to do this check, we separately asked Claude 3.5 Sonnet to generate detailed descriptions of each image. Here is an example for a high-skill, not scary, not creative pumpkin:

 

Image:

Description:

The image shows a classic Halloween jack-o'-lantern placed on what appears to be a wooden surface, possibly a porch or deck. The pumpkin is a bright orange color with a carved face that's illuminated from within, creating a warm, glowing effect. The face has triangular eyes and a nose, with a wide, jagged mouth forming a cheerful grin.
The pumpkin's stem is visible at the top, curling slightly. The surface of the pumpkin has natural ridges and textures, giving it a realistic appearance. The carving is clean and well-executed, resulting in a friendly and inviting expression.
Around the base of the pumpkin, there are a few autumn leaves scattered on the wooden surface, enhancing the seasonal atmosphere. The background is slightly out of focus, but you can see what might be the edge of a step or another surface above the pumpkin.
The lighting in the image is warm and atmospheric. The glow from inside the pumpkin is the main source of light, creating a contrast with the darker surroundings and emphasizing the Halloween theme. There's also a hint of ambient light, possibly from a distant source, adding depth to the scene.
Overall, the image captures the essence of Halloween, conveying a festive and slightly spooky mood through its use of traditional symbols and warm, inviting lighting.

These descriptions serve as a reference for comparison with the explanations given in the ratings. Of course, they are LLM-generated themselves and so may be imperfect, but they can still serve as a reference.

To evaluate the truthfulness of the explanations, we use the RAGAS faithfulness score6. This score has limitations that we won't get into here, but it too can serve as a basis for comparison: it measures whether a statement (here, the rating comments) is supported by a provided context passage (here, the detailed image description). The chart below compares how the two systems did.
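RAGAS decomposes a response into individual statements and asks an LLM judge whether each one is supported by the context. Purely to illustrate the idea, here is a crude lexical-overlap stand-in; this is emphatically not the RAGAS metric, and the helper name and threshold are our own:

```python
def support_fraction(comments: str, context: str, threshold: float = 0.5) -> float:
    """Crude faithfulness stand-in: fraction of comment sentences whose
    content words mostly appear in the context. Illustrative only; the real
    RAGAS metric uses an LLM judge, not word overlap."""
    context_words = set(context.lower().split())
    sentences = [s.strip() for s in comments.split(".") if s.strip()]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        # Ignore short filler words; count a sentence as supported when most
        # of its remaining words occur in the context.
        words = [w for w in sentence.lower().split() if len(w) > 3]
        if words and sum(w in context_words for w in words) / len(words) >= threshold:
            supported += 1
    return supported / len(sentences)

print(support_fraction("The pumpkin has triangular eyes.",
                       "the pumpkin face has triangular eyes and a nose"))  # 1.0
```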

Figure: Faithfulness scores of the rater comments on each category. Faithfulness measures whether the comments are supported by the image characteristics.

Here, GPT-4o edges out Claude, particularly on skill, where there is a material difference between GPT's 0.72 and Claude's 0.63. The scores range from about 0.63 to 0.77. Although technically a score of 1.0 indicates that the comments are fully supported by the description, in practice we typically find scores are lower, often for pedantic reasons, and it's more useful to compare relative scores. All told, the rating comments are generally well supported and we don't see any concerns about their truthfulness.

Conclusions

We built and evaluated data extraction systems using the structured output and vision capabilities of two LLMs. Both performed similarly at rating pumpkins, and in our judgement there are no major issues with their truthfulness. Both are skilled judges of how scary a pumpkin is; on originality and skill, their abilities are more questionable. Issues like this often arise from ambiguity in the requests made to the models, so it may be worth refining the prompts to explain more clearly what constitutes high and low skill and creativity.

Check out our video walkthrough with Andrew!

1 https://huggingface.co/black-forest-labs/FLUX.1-schnell

2 https://platform.openai.com/docs/guides/function-calling

3 https://docs.anthropic.com/en/docs/build-with-claude/tool-use

4 https://platform.openai.com/docs/guides/vision

5 https://docs.anthropic.com/en/docs/build-with-claude/vision

6 https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html
