Evaluating Claude 3 at converting screenshots to code
Claude 3 dropped yesterday, claiming to rival GPT-4 on a wide variety of tasks. I maintain a very popular open source project (41k+ Github stars) called “screenshot-to-code” (this one!) that uses GPT-4 vision to convert screenshots/designs into clean code. Naturally, I was excited to see how good Claude 3 was at this task!
Evaluation Setup
I don’t know of a public benchmark for the “screenshot to code” task, so I created a simple evaluation setup for testing purposes*:
- Evaluation Dataset: 16 screenshots with a mix of UI elements, landing pages, dashboards and popular websites.
- Evaluation Metric: Replication accuracy, as in “How close does the generated code look to the screenshot?” There are other important metrics for sure, but this is by far the #1 thing that most developers and users of the repo care about. Each output is subjectively rated by a human (me!) on a scale from 0 to 4, where 4 = very close to an exact replica and 0 = nothing like the screenshot. With 16 screenshots, the maximum any model can score is 64. I like to compare the percentage of the maximum possible score for each run (see the small scoring sketch after this list).
*I’ve used it mostly for testing prompt variations until now since no model has come close to GPT-4 vision yet.
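For concreteness, here is a minimal sketch of the scoring arithmetic. The function name and structure are my own illustration, not code from the repo: sum the per-screenshot ratings, divide by the maximum possible score (64), and report it as a percentage.

```python
# Hypothetical sketch of the scoring math described above: 16 screenshots,
# each rated 0-4 by a human, reported as a percentage of the maximum (64).
def replication_score(ratings: list[int], max_rating: int = 4) -> float:
    """Return the percentage of the maximum possible score for one run."""
    assert all(0 <= r <= max_rating for r in ratings)
    return 100 * sum(ratings) / (max_rating * len(ratings))

# Example: a run where every screenshot is rated 3 out of 4 scores 75%.
print(replication_score([3] * 16))  # 75.0
```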
To make the evaluation process easy, I created a Python script that runs code for all the inputs in parallel. I also made a simple UI to do a side-by-side comparison of the input and output, shown below.
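As a rough illustration of how such a script can be structured (the generate_code stub, directory names, and model identifier below are placeholders of mine, not the repo’s actual code):

```python
# Minimal sketch of a parallel eval runner, assuming a generate_code()
# helper that calls the vision model API for one screenshot.
# generate_code and the file layout are hypothetical; the real script
# in the repo may differ.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def generate_code(screenshot_path: Path, model: str) -> str:
    """Call the vision model and return the generated HTML (stub)."""
    raise NotImplementedError

def run_evals(screenshot_dir: str, output_dir: str, model: str) -> None:
    screenshots = sorted(Path(screenshot_dir).glob("*.png"))
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # Generate code for all screenshots in parallel, then write one
    # HTML file per input so outputs can be compared side by side.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = pool.map(lambda p: generate_code(p, model), screenshots)
        for shot, html in zip(screenshots, results):
            (Path(output_dir) / f"{shot.stem}.html").write_text(html)

# Example usage (paths and model name are illustrative):
# run_evals("evals/inputs", "evals/outputs/claude-3-sonnet", "claude-3-sonnet")
```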
Results
Quick note about what kind of code we’ll be generating: currently, screenshot-to-code supports generating code in HTML + Tailwind, React, Vue, and several other frameworks. The stack can impact replication accuracy quite a bit. For example, because Bootstrap uses a relatively restrictive set of UI elements, generations using Bootstrap tend to have a distinct "Bootstrap" style.
I only ran the evals on HTML + Tailwind here, which is the stack where GPT-4 Vision tends to perform best.
Here are the results (average of 3 runs for each model):
- GPT-4 Vision obtains a score of 65.10% - this is what we’re trying to beat
- Claude 3 Sonnet receives a score of 70.31%, about 5 percentage points higher.
- Surprisingly, Claude 3 Opus, which is supposed to be the smarter (and slower) model, scores worse than both GPT-4 Vision and Claude 3 Sonnet, coming in at 61.46%. Strange result.
Overall, a very strong showing for Claude 3. Obviously, there's a lot of subjectivity involved in this evaluation, but Claude 3 is definitely on par with GPT-4 Vision, if not better.
You can see the side-by-side comparison for a run of Claude 3 Sonnet here, and for a run of GPT-4 Vision here.
Some other notes:
- The prompts used are optimized for GPT-4 Vision. I played around with adjusting the prompts for Claude, and that did yield a small improvement. But it was nothing game-changing, and potentially not worth the trade-off of maintaining two sets of prompts.
- All the models excel at code quality - the output is usually comparable to what a human would write, or better.
- Claude 3 is much less lazy than GPT-4 Vision. When asked to recreate Hacker News, GPT-4 Vision will only create two items in the list and leave comments in the code like <!-- Repeat for each news item --> and <!-- ... other news items ... -->. Claude 3 Sonnet can sometimes be lazy too, but most of the time it does what you ask it to do!
- For some reason, all the models struggle with side-by-side "flex" layouts
- Claude 3 Sonnet is a lot faster
- Claude 3 gets background and text colors wrong quite often!
- Claude 3 Opus likely just requires more prompting on my part
Overall, I'm very impressed with Claude 3 Sonnet as a multimodal model for this use case. I've added it as an alternative to GPT-4 Vision in the open source repo (hosted version update coming soon).
If you’d like to contribute to this effort, I have some documentation on running these evals yourself here. If you know of a good tool for running evals like this (image input, code output), please let me know - it would save me the effort of building the tooling from scratch.