Forum: TFSI

Claude just beat GPT-5, Gemini, and Grok in real-world job tasks,

From TechnologyDaily@1337:1/100 to All on Mon Sep 29 11:30:08 2025

Claude just beat GPT-5, Gemini, and Grok in real-world job tasks, according
to OpenAI's own study

Date:
Mon, 29 Sep 2025 10:13:04 +0000

Description:
According to OpenAI, Claude is the top AI model for getting actual work done

FULL STORY ======================================================================
OpenAI has released GDPval, a new evaluation system to test how AI performs at work-related tasks.
Claude Opus 4.1 comes out in the lead, with 'ChatGPT-5 high' in second place.
Tasks include things like emailing a response to a dissatisfied customer.

We're all familiar with AI benchmarks, which measure performance at certain tasks, but often these tasks don't reflect the real world and how people actually use AI, especially at work.

To combat this problem, OpenAI, the maker of ChatGPT, is introducing GDPval, a new way of measuring AI model performance on real-world work tasks, compared against a real human, across 44 occupations, from software developers and lawyers to registered nurses and mechanical engineers.

Surprisingly, the OpenAI study shows that the best-performing model was Anthropic's Claude Opus 4.1, which outpaced not only OpenAI's GPT-5 but also Gemini and Grok.

GDPval win rate (Image credit: OpenAI)

This graph shows the overall GDPval win rate (the share of tasks on which the AI did better than an industry expert). Claude Opus 4.1 is out in the lead with a win rate of 47.6%, with ChatGPT-5 high coming second at 38.8% and ChatGPT o3 high at 34.1%. ChatGPT-4o scores the lowest, with a win rate of 12.4%, which is significantly behind both Grok 4 and Gemini 2.5 Pro.

The study found that Claude was the highest-performing across eight of the nine industry sectors it tested, including government, health care, and social assistance.

The results clearly show that Claude Opus 4.1 leads across a diverse range of work-related tasks. (Image credit: OpenAI)

Examples of the tasks include emailing a response to a dissatisfied customer requesting a return, optimizing a table layout for a spring vendor fair, and auditing price inconsistencies in purchase orders.

What's in a name?

The name used by OpenAI, GDPval, comes from the concept of Gross Domestic Product (GDP) as a key economic indicator. OpenAI wants GDPval to be widely adopted to help ground conversations about future AI improvements in evidence rather than guesswork.

Releasing results that show a competitor out in front appears to be an exercise in radical transparency by OpenAI, but that fits in perfectly with the company's philosophy. "Our mission is to ensure that artificial general intelligence benefits all of humanity. As part of our mission, we want to transparently communicate progress on how AI models can help people in the real world," reads a statement from OpenAI.

The paper, which is available to read in its entirety online, comes a week after OpenAI released a more consumer-focused paper that showed that the majority of ChatGPT users (70%) were actually using it at home, rather than at work.

That study was conducted by OpenAI's Economic Research team and Harvard economist David Deming for the National Bureau of Economic Research (NBER). The results surprised a lot of people, as new ChatGPT releases had previously focused heavily on work-related tasks like coding, making presentations, and serving as a research tool.

The news that Claude Opus 4.1 is better at actual work-related tasks, not just benchmarks, than even ChatGPT-5 high could mean a renewed focus by OpenAI on its changing user base.

======================================================================
Link to news story: https://www.techradar.com/ai-platforms-assistants/claude/claude-just-beat-gpt-5-gemini-and-grok-in-real-world-job-tasks-according-to-openais-own-study

--- Mystic BBS v1.12 A49 (Linux/64)
* Origin: tqwNet Technology News (1337:1/100)
