Gen AI Writing Showdown

Comparing language model performance on creative writing transformations

Summary Scores by Model

Introduction

There is a GenAI Image Editing Showdown in which image editing models are asked to transform images in certain ways, like adding hair to a bald character, and the results are subjectively graded. Now that AI models are all so good, this strikes me as a better way to do evaluations. Only in contrived cases are we still testing whether a model can do something at all. The better question, the same one you might ask of someone you hire, is how good a job the model can do.

A natural parallel to image editing is editing text. Specifically, here we look at creative text and how well models can transform it, like changing the setting or the style, while keeping the other elements intact, particularly the images or feelings that a good piece of writing evokes.

Methodology

I selected ten passages from books that I thought had some interesting properties, and for each came up with a prompt requesting a transformation. I then gave each passage and its prompt to the ten models listed above. I used the default settings on OpenRouter for all of them, copied from the relevant Quickstart. I took the first response and did not re-run or optimize any prompts.
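For reference, here is a minimal sketch of how such a run could be scripted against OpenRouter's OpenAI-compatible chat completions endpoint. It is not the exact script used for this evaluation; the model ID, passage, and instruction below are illustrative placeholders.

```python
# Sketch only: query one model on OpenRouter with default sampling settings
# and keep the first response. Model ID, passage, and instruction are placeholders.
import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
API_KEY = os.environ["OPENROUTER_API_KEY"]

def transform(model: str, passage: str, instruction: str) -> str:
    """Send one passage plus its transformation prompt, return the first reply."""
    response = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": model,  # e.g. "google/gemini-3-pro" -- placeholder ID
            "messages": [
                {"role": "user", "content": f"{instruction}\n\n{passage}"}
            ],
            # No temperature/top_p overrides: defaults are used throughout.
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```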

I graded all of the responses on a four-point scale: fail, ok, good, excellent. I did all the grading "blind", with the model names hidden, to avoid bias. In some cases a model gave multiple options; I only ever looked at the first one. In some cases it provided explanations; I did not read these. Grading is subjective and according to my taste. For each eval I have some notes about what I think makes a good response. The final score for each model is calculated by adding 0, 1, 2, or 3 points for each fail, ok, good, or excellent respectively.
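The scoring itself is simple enough to state in a few lines. This sketch just restates the point mapping above; the grades in the example are made up, not real results.

```python
# Grade-to-points mapping described above: fail=0, ok=1, good=2, excellent=3.
GRADE_POINTS = {"fail": 0, "ok": 1, "good": 2, "excellent": 3}

def final_score(grades: list[str]) -> int:
    """Sum of points over one model's ten graded responses."""
    return sum(GRADE_POINTS[g] for g in grades)

# Hypothetical grades for one model (illustration only):
print(final_score(["ok", "good", "fail", "excellent", "ok",
                   "ok", "good", "ok", "fail", "good"]))  # -> 13
```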

Comments

All of the models are very good, so the criticism, and the difference between fail and excellent, is mostly in the margins. The same is true of real-life writing. The superficial difference in skill between a random person on the street and a renowned author isn't that big: both can express themselves, and the edit distance between what they would write might not be that large, but there are orders of magnitude of difference in the impact. So I think asking models to do writing tasks and then being really critical of the results is a great way to assess their skill. It does require a lot of manual effort, both to evaluate and for a third party to verify.

Also worth noting: there is so much commonality in the results across all these models that it is, unfortunately, a little boring. More variety would be welcome.


Grade Distribution by Model

Notes

The modal grade was "ok". I find this reasonable: the models are all quite good, and for the most part they perform capable but unremarkable transformations. The order roughly reflects what one might expect from other benchmarks. At the time of the evaluation, Gemini 3 Pro was also at the top of the Artificial Analysis Intelligence Index, while e.g. GPT OSS and Qwen were at the lower end of the top models. Llama 3.3 (Llama 4 was not on OpenRouter) punches above its weight here, probably because it generates shorter replies, which I favor. GPT 5.2 was the most interesting: it had the second-highest number of "good" grades but also the second-highest number of "fail" grades, and so a lower average score. It would be interesting to know whether its performance is consistent across runs, e.g. whether it always fails on certain cases, or whether re-running the same prompt a few times would produce some good results in most cases. Finally, Deepseek wins the "most boring" award, with almost all "ok" across the board.