screenshot-to-code/blog/evaluating-claude.md
2024-03-05 21:35:38 -05:00

57 lines
5.1 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Evaluating Claude 3 at converting screenshots to code
Claude 3 dropped yesterday, claiming to rival GPT-4 on a wide variety of tasks. I maintain a very popular open source project (41k+ Github stars) called “screenshot-to-code” (this one!) that uses GPT-4 vision to convert screenshots/designs into clean code. Naturally, I was excited to see how good Claude 3 was at this task!
## Evaluation Setup
I dont know of a public benchmark for the “screenshot to code” task so I created simple evaluation set up for the purposes of testing\*:
- **Evaluation Dataset**: 16 screenshots with a mix of UI elements, landing pages, dashboards and popular websites.
<img width="784" alt="Screenshot 2024-03-05 at 3 05 52PM" src="https://github.com/abi/screenshot-to-code/assets/23818/c32af2db-eb5a-44c1-9a19-2f0c3dd11ab4">
- **Evaluation Metric**: Replication accuracy, as in “How close does the generated code look to the screenshot?” There are other metrics that are important for sure but this is by far the #1 thing that most developers and users of the repo care about. Each output is subjectively rated by a human (me!) on a rating scale from 0 to 4. 4 = very close to an exact replica while 0 = nothing like the screenshot. With 16 screenshots, the maximum any model can score is 64. I like to compare the percentage of the maximum possible score for each run.
\*Ive used it mostly for testing prompt variations until now since no model has come close to GPT-4 vision yet.
To make the evaluation process easy, I created [a Python script](https://github.com/abi/screenshot-to-code/blob/main/backend/run_evals.py) that runs code for all the inputs in parallel. I also made a simple UI to do a side-by-side comparison of the input and output, shown below.
![Google Chrome](https://github.com/abi/screenshot-to-code/assets/23818/38126f8f-205d-4ed1-b8cf-039e81dcc3d0)
## Results
Quick note about what kind of code well be generating: currently, screenshot-to-code supports generating code in HTML + Tailwind, React, Vue, and several other frameworks. Stacks can impact the replication accuracy quite a bit. For example, because Bootstrap uses a relatively restrictive set of user elements, generations using Bootstrap tend to have a distinct "Bootstrap" style.
I only ran the evals on HTML/Tailwind here which is the stack where GPT-4 vision tends to perform the best.
Here are the results (average of 3 runs for each model):
- GPT-4 Vision obtains a score of **65.10%** - this is what were trying to beat
- Claude 3 Sonnet receives a score of **70.31%**, which is ~7.5% better.
- Surprisingly, Claude 3 Opus which is supposed to be the smarter and slower model scores worse than both GPT-4 vision and Claude 3 Sonnet, coming in at **61.46%**. Strange result.
Overall, a very strong showing for Claude 3. Obviously, there's a lot of subjectivity involved in this evaluation but Claude 3 is definitely on par with GPT-4 Vision, if not better.
You can see the [side-by-side comparison for a run of Claude 3 Sonnet here](https://github.com/abi/screenshot-to-code-files/blob/main/sonnet%20results.png). And for [a run of GPT-4 Vision here](https://github.com/abi/screenshot-to-code-files/blob/main/gpt%204%20vision%20results.png).
Some other notes:
- The prompts used are optimized for GPT-4 vision. I played around with adjusting the prompts for Claude and that did yield a small improvement. But it was nothing game-changing and potentially not worth the trade-off of maintaining two sets of prompts.
- All the models excel at code quality - the quality is usually comparable to a human or better
- Claude 3 is much less lazy than GPT-4 Vision. When asked to recreate Hacker News, GPT-4 Vision will only create two items in the list and leave comments in this code like `<!-- Repeat for each news item -->` and `<!-- ... other news items ... -->`.
<img width="699" alt="Screenshot 2024-03-05 at 9 25 04PM" src="https://github.com/abi/screenshot-to-code/assets/23818/04b03155-45e0-40b0-8de0-b1f0b4382bee">
But Claude 3 Sonnet can sometimes be lazy too but most of the time, does what you ask it to do!
<img width="904" alt="Screenshot 2024-03-05 at 9 30 23PM" src="https://github.com/abi/screenshot-to-code/assets/23818/b7c7d1ba-47c1-414d-928f-6989e81cf41d">
- For some reasons, all the models struggle with side-by-side "flex" layouts
<img width="1090" alt="Screenshot 2024-03-05 at 9 20 58PM" src="https://github.com/abi/screenshot-to-code/assets/23818/8957bb3a-da66-467d-997d-1c7cc24e6d9a">
- Claude 3 Sonnet is a lot faster
- Claude 3 gets background and text colors wrong quite often!
- Claude 3 Opus likely just requires more prompting on my part
Overall, I'm very impressed with Claude 3 Sonnet as a multimodal model for this use case. I've added it as an alternative to GPT-4 Vision in the open source repo (hosted version update coming soon).
If youd like to contribute to this effort, I have some documentation on [running these evals yourself here](https://github.com/abi/screenshot-to-code/blob/main/Evaluation.md). If you know of a good tool for running evals like this (image input, code output), it would save me a lot of effort from having to build the tool from scratch.