OpenAI’s GPT-5 rollout is not going smoothly

The launch of OpenAI’s long-anticipated new model, GPT-5, is off to a rocky start, to say the least.

Even forgiving errors in charts and voice demos during yesterday’s livestreamed presentation of the new model (actually four separate models, and a ‘Thinking’ mode that can be engaged for three of them), a number of user reports have emerged since GPT-5’s release showing it erring badly when solving relatively simple problems that preceding OpenAI models, and rivals from competing AI labs, answer correctly.

For example, data scientist Colin Fraser posted screenshots showing GPT-5 getting a math proof wrong (whether 8.888 repeating is equal to 9; it is, of course, not).

Wow, I was just playing around before but it actually is stupid pic.twitter.com/ao51nOH0Ui— Colin Fraser (@colin_fraser) August 8, 2025
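For readers who want to check the claim, here is the standard repeating-decimal argument written out. It is a textbook derivation, not something taken from Fraser's screenshots, and the symbol x is introduced only for illustration.

```latex
\begin{align*}
x &= 8.\overline{8} \\
10x &= 88.\overline{8} \\
10x - x &= 88.\overline{8} - 8.\overline{8} = 80 \\
x &= \tfrac{80}{9} = 9 - \tfrac{1}{9} < 9
\end{align*}
```

So 8.888 repeating falls short of 9 by exactly one ninth.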

It also failed on a simple algebra problem that elementary schoolers could probably nail: 5.9 = x + 5.11.

This is concerning. Benjamin De Kraker (@BenjaminDEKR) August 8, 2025
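The expected answer to that equation is easy to verify. The snippet below is a minimal sketch in Python using exact decimal arithmetic; the helper name solve_for_x is hypothetical and chosen only for this illustration.

```python
from decimal import Decimal


def solve_for_x(total: str, addend: str) -> Decimal:
    """Solve total = x + addend for x using exact decimal arithmetic.

    Hypothetical helper for illustration; inputs are strings so that
    no binary floating-point rounding is introduced.
    """
    return Decimal(total) - Decimal(addend)


# 5.9 = x + 5.11  =>  x = 5.9 - 5.11
x = solve_for_x("5.9", "5.11")
print(x)  # 0.79
```

Subtracting 5.11 from both sides gives x = 0.79, and adding 5.11 back recovers 5.9.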

Using GPT-5 to judge OpenAI’s own erroneous presentation charts also did not yield helpful or correct responses.

Q. Prove using an LLM-as-a-judge still doesn't work
A. pic.twitter.com/KnCK5Xs9ja— Kangwook Lee (@Kangwook_Lee) August 7, 2025

It also failed on the trickier math word problem below (which, to be fair, stumped this human at first, though Elon Musk’s Grok 4 AI answered it correctly). For a hint, think of the fact that the flagstones in this case can’t be divided into smaller portions. They must remain intact as 80 separate units, so no halves or quarters.

Careful not to cut yourself on the jagged frontier pic.twitter.com/buJGgJ6baI— Greg Burnham (@GregHBurnham) August 8, 2025

Even though OpenAI’s internal benchmarks and some third-party external ones have shown GPT-5 to outperform all other models at coding, in real-world usage Anthropic’s recently updated Claude Opus 4.1 seems to do a better job at “one-shotting” certain tasks, that is, completing the user’s desired application or software build to their specifications in a single attempt. See an example below from developer Justin Sun posted to X:

Opus 4.1's one-shot attempt at "create a 3d capybara petting zoo" – 8 minutes total
This was honestly pretty insane, not only are the capybaras way cuter and moving, there are individual pet affinity levels, a day/night switcher, feeding, and even a screenshot feature pic.twitter.com/FiKTO3FKK4— justin (@justinsunyt) August 7, 2025

Unfortunately, OpenAI is slowly deprecating its older models, including the former default GPT-4o and the powerful reasoning model o3, for users of ChatGPT, though they’ll continue to be available in the application programming interface (API) for developers for the foreseeable future.

In addition, a report from security firm SPLX found that OpenAI’s internal safety layer left major gaps in areas like business alignment and vulnerability to prompt injection and obfuscated logic attacks.

While anecdotal, a temperature check on how the model is faring with early AI adopters indicates a chilly reception.

AI influencer and former Googler Bilawal Sidhu posted a poll on X asking for a “vibe check” from his followers and the wider userbase, and so far, with 172 votes in, the overwhelming response is “Kinda mid.”

Alright, GPT-5 vibe check— Bilawal Sidhu (@bilawalsidhu) August 7, 2025

And as the pseudonymous AI Leaks and News account wrote, “The overwhelming consensus on GPT-5 from both X and the Reddit AMA are overwhelmingly negative.”

The overwhelming consensus on GPT-5 from both X and the Reddit AMA are overwhelmingly negative
Most users are disgruntled about the broken model picker and non-pro users not having access to legacy models
What are your initial thoughts on GPT-5?— AI Leaks and News (@AILeaksAndNews) August 8, 2025

Tibor Blaho, lead engineer at AIPRM and a popular AI leaks and news poster on X, summarized the many problems with the ChatGPT-5 rollout in an excellent post, highlighting that one of the new marquee features, an automatic “router” in ChatGPT that chooses a thinking or non-thinking mode for the underlying GPT-5 model depending on the difficulty of the query, has become one of the chief complaints, given that the model seemed to default to non-thinking mode for many users.

A bit sad how the GPT-5 launch is going so far, especially after the long wait and high expectations
– The automatic switching between models (the router) seems partly broken/unreliable
– It's unclear exactly which model you're actually interacting with (standard or mini,…— Tibor Blaho (@btibor91) August 8, 2025
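To make the router idea concrete, here is a toy sketch of difficulty-based routing in Python. It is emphatically not OpenAI's implementation, whose internals are not described here: the keyword list, the scoring weights, the threshold, and every name in the snippet are assumptions invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical cue words that suggest a query needs multi-step reasoning.
REASONING_HINTS = ("prove", "solve", "step by step", "how many", "calculate")


@dataclass
class RoutingDecision:
    mode: str     # "thinking" or "non-thinking"
    score: float  # heuristic difficulty estimate in [0, 1]


def estimate_difficulty(query: str) -> float:
    """Crude, made-up difficulty heuristic based on cue words, digits, and length."""
    text = query.lower()
    hint_hits = sum(hint in text for hint in REASONING_HINTS)
    has_digits = any(ch.isdigit() for ch in text)
    length_term = min(len(text.split()) / 100, 1.0)
    score = 0.3 * min(hint_hits, 2) + (0.2 if has_digits else 0.0) + 0.2 * length_term
    return min(score, 1.0)


def route(query: str, threshold: float = 0.4) -> RoutingDecision:
    """Pick the thinking mode only when the estimated difficulty clears the threshold."""
    score = estimate_difficulty(query)
    mode = "thinking" if score >= threshold else "non-thinking"
    return RoutingDecision(mode=mode, score=score)


print(route("Solve 5.9 = x + 5.11 for x").mode)                 # thinking
print(route("hello there").mode)                                # non-thinking
print(route("Solve 5.9 = x + 5.11 for x", threshold=0.9).mode)  # non-thinking
```

The last call shows how the kind of complaint described above could arise in a design like this: if the threshold is set too high, or the difficulty estimate runs low, even a math question gets sent to the non-thinking mode.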

Thus, the sentiment toward ChatGPT-5 is far from universally positive, highlighting a major problem for OpenAI as it faces increasing competition from major U.S. rivals like Google and Anthropic, and a growing list of free, open source and powerful Chinese LLMs offering features that many U.S. models lack.

Take the Alibaba Qwen Team of AI researchers, who just today updated their highly performant Qwen 3 model to have a 1 million token context, giving users the ability to exchange nearly 4x as much information with the model in a single back-and-forth interaction as GPT-5 offers.

Given that OpenAI’s other big release this week, that of new open source gpt-oss models, also received a mixed reception from early users, things are not looking up for the number one dedicated AI company by users right now (700 million weekly active users of ChatGPT as of this month).

Other power users, like Otherside AI co-founder and CEO Matt Shumer, who received early access to GPT-5 and blogged about it favorably in a review, opined that views would shift as more people figured out the best ways to use the new model and adjusted their integration approaches:

A lot of folks who are having a bad experience are using GPT-5 in agent harnesses that aren't yet optimized for it.
For every new model release, there's a time lag between release + when companies that integrate the model have it truly working well.
Agent companies rush to…— Matt Shumer (@mattshumer_) August 8, 2025

While it’s still early days for GPT-5, and the sentiment could change dramatically as more users get their hands on it and try it for different tasks, the early indications are that this is not a “home run” release for OpenAI in the same way that prior releases such as GPT-4, or even the newer 4o and o3, were. And that’s a concerning indicator for a company that just raised yet another funding round, yet remains unprofitable due to its high costs of research and development.

Source: Venturebeat
