Post · Kakapo Social

Casey

@kc@chaos.social · 6 days ago

When you tell AI models what specifically to look out for in a coding task…

…they repeatedly, consistently, just won't care. At all. Ever.

That's your "vibe coding" for y'all.

Btw, I’m working on a benchmark for #a11y #accessibility stuff for „AI“.

AI models on average produce 50% fewer accessibility errors when told in the prompt to „make the UI accessible“. Telling them what specifically to look out for does not change that: error rates stay consistent with the simple prompt.
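For readers who want to see what a claim like "50% fewer errors" means in numbers, here is a minimal sketch of a relative-reduction calculation. The function name and the per-task error counts are hypothetical and purely illustrative; they are not the benchmark's data or its actual scoring code.

```python
from statistics import mean


def relative_reduction(baseline_errors: list[int], prompted_errors: list[int]) -> float:
    """Fraction by which the mean error count drops versus the baseline.

    Returns 0.5 for a 50% reduction. Assumes one error count per UI task.
    """
    baseline_mean = mean(baseline_errors)
    if baseline_mean == 0:
        raise ValueError("baseline has no errors, reduction is undefined")
    return (baseline_mean - mean(prompted_errors)) / baseline_mean


# Hypothetical per-task error counts (NOT real benchmark results).
unguided = [4, 3, 5, 4]        # plain prompt, no mention of accessibility
simple_prompt = [2, 2, 2, 2]   # prompt says "make the UI accessible"

reduction = relative_reduction(unguided, simple_prompt)
print(f"relative reduction: {reduction:.0%}")
```

With these made-up counts the mean drops from 4 to 2 errors per task, which is the kind of halving described above.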

Casey

@kc@chaos.social · 6 days ago

Well, except for the flagship models of OpenAI and Anthropic, GPT 5.2 and Claude Opus 4.6. They perform SIGNIFICANTLY WORSE when expert guidance on how to build something is present.

What we're seeing at play here is synthetic data, and that synthetic data is *bad*.

AI is such a joke.

OpenAI gpt 5.2 and Anthropic Claude Opus 4.6 perform significantly worse when the prompt details what specifically to look out for

Robert Böttner

@robertobottoni@troet.cafe · 6 days ago

@kc this is going to blow up in all of our faces, big time. 😱 And not only in software development.

Claudius Link

@realn2s@infosec.exchange · 6 days ago

@kc

WTF
Sadly the joke isn't funny at all

Do you have an explanation for this?

The regression could be caused by accessibility being generally underrepresented.
I would assume this representation declines with the visibility of the projects, meaning large, well-known projects contain more accessibility work than obscure code snippets in the dark corners of the internet.

If this is the case, enlarging the training data by scraping the last bit of code would lead to a statistically worse representation of accessibility.

The worse performance with expert guidance is "interesting". It shows again the core problem of LLMs, or any existing AI: they don't, and can't, reason.
Nevertheless, I would expect that providing the expert guidance would increase the statistical correlation with the intended outcome.
But I could also imagine that there is a threshold of underrepresentation below which the expert guidance is more strongly correlated with random outcomes than with the intended outcome.

Tongue in cheek, there is a simple solution

The AI competitors could "solve" this by increasing the representation of accessibility in the training data by financing a massive push for accessibility.

That would be money well spent even when AI fails in the end. But I sadly don't expect it to happen

Casey

@kc@chaos.social · 6 days ago

@realn2s I have a broad idea of what's going on here, but I haven't verified it yet. I’m assuming that the models are "overthinking" the described guidelines, which leads to more complex outputs. However, the data shows that outputs of these guided prompts, after reasoning, are generally shorter than outputs of those without guidance. To verify this, I'll need a way to judge the complexity of the result, but that might be a stretch for a project like this.

Casey

@kc@chaos.social · 6 days ago

One more big „oof“, or perhaps laugh, for tonight:

gpt-3.5-turbo - the model that ChatGPT launched with over three years ago, scored 68/100 points on that benchmark. It's also the highest score of any model tested. The current gpt-5.2 scores 22/100.

Remarkable.

Casey

@kc@chaos.social · 6 days ago

This was due to scoring on the overall number of accessibility errors. While that is a good approach in general, newer models output significantly more tokens. I'm experimenting with changing the scoring to an "errors per 1000 output tokens" approach, but that'll have to wait a few days.
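A minimal sketch of what an "errors per 1000 output tokens" normalization could look like. The function name and the numbers are hypothetical and only illustrate the idea; the benchmark's real scoring may work differently.

```python
def errors_per_1k_tokens(error_count: int, output_tokens: int) -> float:
    """Normalize an error count by output length (errors per 1000 output tokens)."""
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    # Multiply first so that round inputs give exact float results.
    return error_count * 1000 / output_tokens


# Hypothetical numbers (NOT real benchmark results).
short_output = errors_per_1k_tokens(error_count=4, output_tokens=800)
long_output = errors_per_1k_tokens(error_count=6, output_tokens=3000)

print(f"short output: {short_output:.1f} errors per 1k tokens")
print(f"long output:  {long_output:.1f} errors per 1k tokens")
```

In this made-up example the longer output has more errors in total (6 vs. 4) but a lower rate per 1000 tokens, which is exactly the effect a raw error count hides.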

Casey

@kc@chaos.social · 6 days ago

I've spent the night on building a campaign site, benchmarking even more models, and tinkering with the score calculation. And I've been trying to understand what happens here.

I have to be somewhere in 5 hours though, and need sleep desperately.

Good night, fedi.

Average Accessibility Errors per UI task, consistent with my previous posts. Smaller, older OpenAI models far exceed newer generation models. The large midfield of models has unguided results between 3 and 4 barriers per task, with about half of that when any form of guidance is present.

Casey

@kc@chaos.social · 6 days ago

Also, one last time: Benchmarking these models in a useful manner cost me several hundred euros, and of the big, most expensive models, I've only tested GPT 5.2, Opus 4.6 and Kimi K2.5 as of now. Gemini 3.1 pro, Claude Sonnet and gpt-5.3-codex should also be tested before taking this to media outlets, but I can't afford that right now.

If you can, I’d really appreciate your financial support: https://steady.page/de/bye-bye-barrieren/about

Casey

@kc@chaos.social · 6 days ago

Why I’ve posted about this today: I finalized the plan today, after spending the last couple of days planning this out and writing prompts. It started when one commenter here said that you gotta tell AI to make stuff accessible, and I remembered the bullshit AI study by Aktion Mensch that I discussed a couple of weeks ago. I started the model runs today, and I'm only a tiny single private researcher. So please bear with me: this will evolve further, like everything I do.

Shriram Krishnamurthi

@shriramk@mastodon.social · 6 days ago

@kc Is there any way to get access to the tasks you're giving, how you're evaluating, or any other details? Thanks!

Casey

@kc@chaos.social · 6 days ago

@shriramk I’ll have a write-up ready soon.

However, to avoid models being overfitted to the benchmark, I cannot release the benchmark prompts or the complete methodology.