Update evaluating-claude.md
This commit is contained in:
parent
f2cb1a4dc3
commit
0245f2c359
@ -1,7 +1,5 @@
|
||||
# Evaluating Claude 3 at converting screenshots to code
|
||||
|
||||
[video running both]
|
||||
|
||||
Claude 3 dropped yesterday, claiming to rival GPT-4 on a wide variety of tasks. I maintain a very popular open source project (41k+ Github stars) called “screenshot-to-code” (this one!) that uses GPT-4 vision to convert screenshots/designs into clean code. Naturally, I was excited to see how good Claude 3 was at this task!
|
||||
|
||||
## Evaluation Setup
|
||||
@ -9,13 +7,16 @@ Claude 3 dropped yesterday, claiming to rival GPT-4 on a wide variety of tasks.
|
||||
I don’t know of a public benchmark for the “screenshot to code” task so I created simple evaluation set up for the purposes of testing\*:
|
||||
|
||||
- **Evaluation Dataset**: 16 screenshots with a mix of UI elements, landing pages, dashboards and popular websites.
|
||||
<img width="784" alt="Screenshot 2024-03-05 at 3 05 52 PM" src="https://github.com/abi/screenshot-to-code/assets/23818/c32af2db-eb5a-44c1-9a19-2f0c3dd11ab4">
|
||||
|
||||
- **Evaluation Metric**: Replication accuracy, as in “How close does the generated code look to the screenshot?” There are other metrics that are important for sure but this is by far the #1 thing that most developers and users of the repo care about. Each output is subjectively rated by a human (me!) on a rating scale from 0 to 4. 4 = very close to an exact replica while 0 = nothing like the screenshot. With 16 screenshots, the maximum any model can score is 64. I like to compare the percentage of the maximum possible score for each run.
|
||||
|
||||
\*I’ve used it mostly for testing prompt variations until now since no model has come close to GPT-4 vision yet.
|
||||
|
||||
To make the evaluation process easy, I created [a Python script](https://github.com/abi/screenshot-to-code/blob/main/backend/run_evals.py) that runs code for all the inputs in parallel. I also have a simple UI to do a side-by-side comparison of the input and output, shown below.
|
||||
To make the evaluation process easy, I created [a Python script](https://github.com/abi/screenshot-to-code/blob/main/backend/run_evals.py) that runs code for all the inputs in parallel. I also made a simple UI to do a side-by-side comparison of the input and output, shown below.
|
||||
|
||||

|
||||
|
||||
[video of input/output]
|
||||
|
||||
## Results
|
||||
|
||||
@ -29,18 +30,27 @@ Here are the results (average of 3 runs for each model):
|
||||
- Claude 3 Sonnet receives a score of **70.31%**, which is ~7.5% better.
|
||||
- Surprisingly, Claude 3 Opus which is supposed to be the smarter and slower model scores worse than both GPT-4 vision and Claude 3 Sonnet, coming in at **61.46%**. Strange result.
|
||||
|
||||
Overall, a very strong showing for Claude 3! Obviously, there's a lot of subjectivity involved in this evaluation but Claude 3 is definitely on par with GPT-4 Vision, if not better. It's also a lot less lazy than GPT-4 Vision.
|
||||
Overall, a very strong showing for Claude 3. Obviously, there's a lot of subjectivity involved in this evaluation but Claude 3 is definitely on par with GPT-4 Vision, if not better.
|
||||
|
||||
You can see the results for a run of Claude 3 Sonnet here. You can see the results for a run of GPT-4 Vision here.
|
||||
You can see the [side-by-side comparison for a run of Claude 3 Sonnet here](https://github.com/abi/screenshot-to-code-files/blob/main/sonnet%20results.png). And for [a run of GPT-4 Vision here](https://github.com/abi/screenshot-to-code-files/blob/main/gpt%204%20vision%20results.png).
|
||||
|
||||
Some other notes:
|
||||
|
||||
- The prompts used are optimized for GPT-4 vision. I played around with adjusting the prompts for Claude and that did yield a small improvement. Nothing game-changing and potentially not worth the trade-off of maintaining two sets of prompts.
|
||||
- All these models excel at code quality - the quality is usually comparable to a human or better
|
||||
- All the models seem to struggle with side-by-side layouts (the struggle is mostly in seeing, not in implementing) TODO: add iamge
|
||||
- Claude 3 is much less lazy than GPT-4 Vision. TODO: Hacker News example
|
||||
- Claude 3 Sonnet is faster and more steerable
|
||||
- The prompts used are optimized for GPT-4 vision. I played around with adjusting the prompts for Claude and that did yield a small improvement. But it was nothing game-changing and potentially not worth the trade-off of maintaining two sets of prompts.
|
||||
- All the models excel at code quality - the quality is usually comparable to a human or better
|
||||
- Claude 3 is much less lazy than GPT-4 Vision. When asked to recreate Hacker News, GPT-4 Vision will only create two items in the list and leave comments in this code like `<!-- Repeat for each news item -->` and `<!-- ... other news items ... -->`.
|
||||
<img width="699" alt="Screenshot 2024-03-05 at 9 25 04 PM" src="https://github.com/abi/screenshot-to-code/assets/23818/04b03155-45e0-40b0-8de0-b1f0b4382bee">
|
||||
|
||||
Overall, I'm very impressed with Claude 3 Sonnet as a multimodal model for this use case. I've added it as an alternative to GPT-4 Vision in the open source repo (hosted version update coming soon). I'm sure it's possible to get Opus to be quite good as well.
|
||||
But Claude 3 Sonnet can sometimes be lazy too but most of the time, does what you ask it to do!
|
||||
|
||||
<img width="904" alt="Screenshot 2024-03-05 at 9 30 23 PM" src="https://github.com/abi/screenshot-to-code/assets/23818/b7c7d1ba-47c1-414d-928f-6989e81cf41d">
|
||||
|
||||
- For some reasons, all the models struggle with side-by-side "flex" layouts
|
||||
<img width="1090" alt="Screenshot 2024-03-05 at 9 20 58 PM" src="https://github.com/abi/screenshot-to-code/assets/23818/8957bb3a-da66-467d-997d-1c7cc24e6d9a">
|
||||
- Claude 3 Sonnet is a lot faster
|
||||
- Claude 3 gets background and text colors wrong quite often!
|
||||
- Claude 3 Opus likely just requires more prompting on my part
|
||||
|
||||
Overall, I'm very impressed with Claude 3 Sonnet as a multimodal model for this use case. I've added it as an alternative to GPT-4 Vision in the open source repo (hosted version update coming soon).
|
||||
|
||||
If you’d like to contribute to this effort, I have some documentation on [running these evals yourself here](https://github.com/abi/screenshot-to-code/blob/main/Evaluation.md). If you know of a good tool for running evals like this (image input, code output), it would save me a lot of effort from having to build the tool from scratch.
|
||||
|
||||
Loading…
Reference in New Issue
Block a user