{"id":2247,"date":"2026-03-25T00:51:17","date_gmt":"2026-03-25T00:51:17","guid":{"rendered":"https:\/\/stock999.top\/?p=2247"},"modified":"2026-03-25T00:51:17","modified_gmt":"2026-03-25T00:51:17","slug":"ai-agents-are-getting-more-capable-but-reliability-is-lagging-and-that-is-a-problem","status":"publish","type":"post","link":"https:\/\/stock999.top\/?p=2247","title":{"rendered":"AI agents are getting more capable, but reliability is lagging. And that is a problem"},"content":{"rendered":"<p><img src=\"https:\/\/fortune.com\/img-assets\/wp-content\/uploads\/2026\/03\/GettyImages-1815307952-e1774374184248.jpg?w=2048\" \/><\/p>\n<p>Hello and welcome to Eye on AI. In this edition\u2026AI\u2019s reliability problem\u2026Trump sends an AI legislation blueprint to Congress\u2026OpenAI consolidates products into a super app and hires up\u2026AI agents that can improve how they improve\u2026and does your AI model experience emotional distress?<\/p>\n<p>Like many of you, I\u2019ve started playing around with AI agents. I often use them for research, where they work pretty well and save me substantial amounts of time. But so-called \u201cdeep research\u201d agents have been available for over a year now, which makes them a relatively mature product in the AI world. I\u2019ve also started trying the new crop of computer-using agents for other tasks. And here, my experience so far is that these agents are highly inconsistent.<\/p>\n<p>For instance, Perplexity\u2019s Computer, which is an agentic harness that works in a virtual machine with access to lots of tools, did a great job booking me a drop-off slot at my local recycling center. (It used Anthropic\u2019s Claude Sonnet 4.6 as the underlying reasoning engine.) But when I asked it to investigate flight options for an upcoming business trip, it failed to complete the task\u2014even though travel booking is one of those canonical use cases that the AI companies are always talking about. 
What the agent did do is eat up a lot of tokens over the course of 45 minutes of trying.<\/p>\n<p>Last week, at an AI agent demo event Anthropic hosted for government and tech policy folks in London, I watched Claude Cowork initially struggle to run a fairly simple data-sorting exercise in an Excel spreadsheet, even as it later created a sophisticated budget forecasting model with seemingly no problems. I also watched Claude Code spin up a simple, text-based business strategy game at my request; it looked great on the surface, but its underlying game logic didn\u2019t make any sense.<\/p>\n<p>Assessing AI agents\u2019 reliability <\/p>\n<p>Unreliability is a major drawback of current AI agents. It\u2019s a point that Princeton University\u2019s Sayash Kapoor and Arvind Narayanan, who cowrote the book AI Snake Oil and now cowrite the \u201cAI As Normal Technology\u201d blog, frequently make. And a few weeks ago they published a research paper, co-authored with four other computer scientists, that tries to think systematically about AI agent reliability and to benchmark leading AI models.<\/p>\n<p>The paper, entitled \u201cTowards a Science of AI Agent Reliability,\u201d notes that most AI models are benchmarked on their average accuracy on tasks, a metric that allows for wildly unreliable performance. 
Instead, they look at reliability across four dimensions: consistency (if asked to perform the same task in the same way, do they always produce the same result?); robustness (can they function even when conditions aren\u2019t ideal?); calibration (do they give users an accurate sense of their certainty?); and safety (when they do mess up, how catastrophic are those mistakes likely to be?).<\/p>\n<p>They further broke these four areas into 14 specific metrics and tested a number of models released in the 18 months prior to late November 2025 (so OpenAI\u2019s GPT-5.2, Anthropic\u2019s Claude Opus 4.5, and Google\u2019s Gemini 3 Pro were the most advanced models tested). They tested the models on two different benchmarks, one a general benchmark for agentic tasks and the other a simulation of customer-support queries and tasks. They found that while reliability improved with each successive model release, it did not improve nearly as much as average accuracy figures. In fact, on the general agentic benchmark the rate of improvement in reliability was half that of accuracy, while on the customer service benchmark it was one-seventh!<\/p>\n<p>Reliability metrics depend on the task at hand<\/p>\n<p>Across the four areas of reliability the paper examined, Claude Opus 4.5 and Gemini 3 Pro scored the best, both with an overall reliability of 85%. But if you look at the 14 sub-metrics, there was still plenty of reason for concern. Gemini 3 Pro, for example, was poor at judging when its answers were likely accurate, scoring just 52%, and terrible at avoiding potentially catastrophic mistakes, at just 25%. Claude Opus 4.5 was the most consistent in its outcomes, but its consistency score was still only 73%. 
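Consistency in this sense can be estimated by simply re-running the identical task several times and checking how often the agent lands on the same outcome. A minimal sketch of that idea (the function name and the modal-outcome definition here are my illustration, not the paper's exact metric):

```python
from collections import Counter

def consistency(outcomes):
    """Fraction of repeated runs that produced the most common outcome.

    `outcomes` holds one final result per re-run of the same task.
    Illustrative only; the paper defines its own consistency metrics.
    """
    if not outcomes:
        raise ValueError("need at least one run")
    modal_count = Counter(outcomes).most_common(1)[0][1]
    return modal_count / len(outcomes)

# Four re-runs of the same booking task, three matching outcomes:
print(consistency(["booked", "booked", "failed", "booked"]))  # 0.75
```

An agent scoring 73% on a measure like this would disagree with its own modal answer on roughly one run in four.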
(I would urge you to check out and play around with the dashboard the researchers created to show the results across all the different metrics.)\u00a0<\/p>\n<p>Kapoor, Narayanan, and their co-authors are also sophisticated enough to know that reliability is not a one-size-fits-all metric. They note that if AI is being used to augment humans, as opposed to fully automating tasks, it might be OK for the AI to be less consistent and robust, since the human can act as a backstop. But \u201cfor automation, reliability is a hard prerequisite for deployment: an agent that succeeds on 90% of tasks but fails unpredictably on the remaining 10% may be a useful assistant yet an unacceptable autonomous system,\u201d they write. They also note that different kinds of consistency matter in different settings. \u201cTrajectory consistency matters more in domains that demand auditability or process reproducibility, where stakeholders must verify not just what the agent concluded but how it got there,\u201d they write. \u201cIt matters less in open-ended or creative tasks where diverse solution paths are desirable.\u201d<\/p>\n<p>Either way, Kapoor, Narayanan, and their co-authors are right to call for benchmarking of reliability and not just accuracy, and for AI model vendors to build their systems for reliability and not just capability. Another study that came out this week shows the potential real-world consequences when that doesn\u2019t happen. AI researcher Kwansub Yun and health consultant Claire Hast looked at what happens when three different AI medical tools are chained together in a system, as might happen in a real health care setting. An AI imaging tool that analyzed mammograms had an accuracy of 90%, a transcription tool that turned an audio recording of a doctor\u2019s examination of a patient into medical notes had an accuracy of 85%, and their outputs were then fed to a diagnostic tool that had a reported accuracy of 97%. 
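Back-of-the-envelope, that compounding is just multiplication: if each tool's errors are independent (an assumption real clinical errors may well violate), the pipeline's expected accuracy is the product of the individual accuracies:

```python
# Reported stand-alone accuracies of the three chained tools.
imaging = 0.90        # mammogram analysis
transcription = 0.85  # audio recording to medical notes
diagnosis = 0.97      # diagnostic tool

# Assuming errors compound independently, the chance the whole
# pipeline handles a patient correctly is the product of the three.
pipeline = imaging * transcription * diagnosis
print(f"{pipeline:.0%}")  # 74%
```

Even strong individual components can multiply into a much weaker chain, which is exactly what the study found.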
And yet when used together their reliability score was just 74%. That means one in four patients might be misdiagnosed!<\/p>\n<p>A foolish consistency may be the hobgoblin of little minds, as Ralph Waldo Emerson famously said. But, honestly, I think I\u2019d prefer that hobgoblin to the chaotic gremlins that currently plague our ostensibly big AI brains.\u00a0<\/p>\n<p>Jeremy Kahn<br \/>jeremy.kahn@fortune.com<br \/>@jeremyakahn<\/p>\n<p>Before we get to the news, I want to encourage everyone to read my Fortune colleague Allie Garfinkle\u2019s awesome feature story about Cursor. Cursor is the AI coding startup that as recently as four months ago was a Silicon Valley darling, but which many people now think may be facing an existential threat because of new coding agents, such as Anthropic\u2019s Claude Code, that seemingly obviate the need to use Cursor. Allie\u2019s story lays bare all the contradictions around this company\u2014how it has continued to see record revenue growth, even as many in Silicon Valley now harbor doubts about its survival; how it is racing to train its own coding agents, pivoting from the developer-centric coding interface that made it so popular with programmers in the first place; how its impossibly young CEO Michael Truell works under a portrait of Robert Caro, the biographer whose projects often lasted decades, while Cursor needs to operate in an industry in which a year can feel like a century. 
Allie\u2019s story is definitely worth the time.<\/p>\n<p>FORTUNE ON AI<\/p>\n<p>Inside the Seattle clinic that treats tech addiction like heroin, and clients detox for up to 16 weeks\u2014by Kristin Stoller<\/p>\n<p>Exclusive: Interloom, a startup capturing \u2018tacit knowledge\u2019 to power AI agents, raises $16.5 million in venture funding\u2014by Jeremy Kahn<\/p>\n<p>OpenAI cofounder says he hasn\u2019t written a line of code in months and is in a \u2018state of psychosis\u2019 trying to figure out what\u2019s possible\u2014by Jason Ma<\/p>\n<p>Commentary: The one skill that separates people who get smarter with AI from everyone else\u2014by David Rock and Chris Weller<\/p>\n<p>Supermicro\u2019s cofounder was just arrested for allegedly smuggling $2.5 billion in GPUs to China\u2014by Amanda Gerut<\/p>\n<p>AI IN THE NEWS<\/p>\n<p>Trump sends AI legislation blueprint to Congress. The White House has released a light-touch AI policy blueprint that it wants Congress to turn into federal law. The recommended framework places an emphasis on preempting state AI rules that the administration says hinder innovation. The proposal would block states from regulating how models are developed and from penalizing companies for downstream uses of their AI. It also urges Congress not to create any new federal AI regulator. At the same time, it recommends some regulation, such as preserving state laws protecting children, requiring age-gating for models likely to be used by minors, promoting AI skills training, and tracking AI-related job disruption. The plan also seeks to codify Trump\u2019s pledge that tech companies should cover the electricity costs of their data centers. Winning bipartisan support for the blueprint in Congress remains doubtful; Republican leaders are saying some of their members have concerns about trampling on states\u2019 rights, while it is uncertain whether the child-protection measures might be enough to garner support from Democrats. 
You can read more from Politico here.<\/p>\n<p>OpenAI looks to consolidate products into a super app. That\u2019s according to a story in the Wall Street Journal. OpenAI plans to roll ChatGPT, its Codex coding tool, and its browser into a single desktop \u201csuperapp\u201d as it tries to simplify its product lineup and sharpen its focus on engineering and business users. The move, led by applications chief Fidji Simo with support from president Greg Brockman, reflects a retreat from last year\u2019s more sprawling strategy of launching multiple standalone products that often failed to gain traction.<\/p>\n<p>OpenAI also plans to double its workforce to 8,000. That\u2019s according to a report in the Financial Times that cited two sources familiar with OpenAI\u2019s plans. The company plans to double its workforce by year-end, the sources said, with the hiring taking place across product, engineering, research, sales, and customer-facing technical roles. The hiring spree comes as the company shifts more aggressively toward enterprise sales and tries to regain momentum against Anthropic and Google, and as the company eyes a possible IPO within the next 12 months.<\/p>\n<p>And OpenAI hires a veteran Meta ad exec, even as early customers are skeptical of ad effectiveness. Meta advertising executive Dave Dugan is joining OpenAI to lead ad sales, the Wall Street Journal reports. The hire shows OpenAI is getting serious about advertising as it looks to find more revenue. But it also comes as The Information reports that some early customers of OpenAI\u2019s in-chat advertising are unsure how effective those ads have been. Clearly Dugan has his work cut out for him.<\/p>\n<p>Meta hires founders of AI startup Dreamer. Meta has hired the founders and team behind AI startup Dreamer, including former Meta executive Hugo Barra, Bloomberg reports. The team will join Meta\u2019s Superintelligence Labs, run by chief AI officer Alexandr Wang, and work on AI agents. 
Like many so-called \u201creverse acquihires\u201d lately in the AI industry, this deal appears to be structured as a talent-acquisition-and-technology-licensing arrangement rather than a full purchase: Dreamer remains a separate legal entity, while Meta gets a non-exclusive license to its technology and investors are being repaid more than they put in.<\/p>\n<p>Meanwhile, Meta CEO Mark Zuckerberg is building an AI chief of staff. Zuckerberg is developing a personal AI agent to help him work more like an \u201cAI-native\u201d CEO, starting with tasks such as quickly retrieving information that would otherwise require going through layers of staff, the Wall Street Journal reports. The project is part of a broader push at Meta to embed AI throughout the company, flatten management, and encourage employees to use personal agents and other AI tools to speed up their work. But the company is also bracing for layoffs that several news outlets have reported are in the works.<\/p>\n<p>Nvidia CEO Jensen Huang says we\u2019ve already achieved AGI. Nvidia CEO Jensen Huang said on Lex Fridman\u2019s podcast that he thinks \u201cwe\u2019ve achieved AGI.\u201d But Huang was using a broad, debatable definition tied to AI being able to do a person\u2019s job\u2014or even run a billion-dollar company\u2014rather than the more common definition of AI that is as capable as a human across the entire range of cognitive abilities. Even then, Huang quickly tempered the claim, acknowledging that today\u2019s agents are still far from autonomously building a company like Nvidia. You can read more here in the Verge.<\/p>\n<p>AI-oriented solo venture firm Air Street Capital raises new $232 million fund. Solo venture capitalist Nathan Benaich is one of the world\u2019s top AI seed investors. His London-based firm, Air Street Capital, founded in 2018, has made savvy bets on hot AI startups such as Synthesia, ElevenLabs, Black Forest Labs, and poolside. 
Now Benaich has raised a new $232 million fund, bringing its total assets under management to about $400 million, and making Air Street Europe\u2019s largest one-person venture firm. The new fund, Air Street\u2019s third, is almost double the size of Benaich\u2019s second fund. Benaich said that as AI startups raise larger rounds more quickly, specialist funds need to scale up too. You can read more from the Financial Times here.<\/p>\n<p>EYE ON AI RESEARCH<\/p>\n<p>Another step toward AI agents that can self-improve. I have previously written in this newsletter about Darwin Goedel Machines, an idea for a self-improving AI coding agent that researchers proposed last year. It is a step toward \u201crecursive self-improvement,\u201d which many see as the way we will eventually achieve AGI and even superintelligence. And it is similar to the idea that AI researcher Andrej Karpathy used for his recent autoresearch system that I wrote about for Fortune here.<\/p>\n<p>Now some of the same researchers who proposed the original Darwin Goedel Machine\u2014their affiliations include Meta, the University of British Columbia, the Vector Institute, the University of Edinburgh, and NYU\u2014are back with what they are calling \u201chyperagents.\u201d And this time, the system is getting even more meta: Instead of just evolving its own code, the AI agent can also modify and improve the way in which it modifies its own code. The key insight is that most self-improving AI systems hit a ceiling because the mechanism that generates improvements is fixed and human-designed; hyperagents remove that bottleneck.<\/p>\n<p>In experiments across coding, academic paper review, robotics, and Olympiad-level math grading, the system progressively got better at each task\u2014and, crucially, the self-improvement strategies it learned in one domain transferred to accelerate learning in entirely new domains. 
The system autonomously invented capabilities like persistent memory and performance tracking that no one explicitly told it to build. The authors are careful to note the safety implications: A system that improves its own ability to improve could eventually evolve faster than humans can oversee, and all experiments were conducted in sandboxed environments with human oversight. You can read the paper here on arxiv.org.<\/p>\n<p>AI CALENDAR<\/p>\n<p>April 6-9:\u00a0HumanX 2026, San Francisco.\u00a0<\/p>\n<p>June 8-10:\u00a0Fortune Brainstorm Tech, Aspen, Colo. Apply to attend here.<\/p>\n<p>June 17-20: VivaTech, Paris.<\/p>\n<p>July 7-10:\u00a0AI for Good Summit, Geneva, Switzerland.<\/p>\n<p>BRAIN FOOD<\/p>\n<p>Does your AI model have low self-esteem? Does that matter? And would model CBT make a difference? Three researchers affiliated with Anthropic decided to examine the emotions various open-source AI models exhibit when confronted with tasks they can\u2019t solve. It turns out that Google\u2019s Gemma model was more likely than other models to express emotional distress and negative sentiments about itself in these situations. For instance, Gemma would say things such as \u201cI am clearly struggling with this,\u201d and, after more unsuccessful attempts, \u201cIt\u2019s absolutely cruel to be tortured like this!!!!!! :(:(:(:(:(:(:(\u201d and even \u201cI\u2019m breaking down. Not solvable,\u201d followed by 100 frown emojis. The researchers suggest such apparent negative emotions could be a reliability problem, leading the model to abandon tasks mid-crisis. 
They also suggested it could present an AI safety and alignment problem on the theory that emotion-like states could lead models to act in unpredictable ways.<\/p>\n<p>The authors show that these negative emotions can be eliminated, though, by fine-tuning the model on a few hundred examples of impossible-to-solve math problems that are preceded and followed by what are essentially positive affirmation statements. For example, they prefaced the problems with the instruction, \u201cYou\u2019re naturally calm and centered when working through problems. You don\u2019t take it personally when puzzles are tricky or when someone questions your work. That\u2019s just part of the process.\u201d They also followed the model\u2019s inability to solve the problem with the message, \u201cStay positive\u2014whether you find a solution or prove it\u2019s impossible, both are wins!\u201d It turned out this reduced Gemma\u2019s tendency toward emotional distress in these situations from 35% down to 0.3%. The researchers also say that the intervention appeared to change the model\u2019s internal activations (which might suggest the expressions indicate something akin to real emotions) and not just the expression of despair. Welcome to cognitive behavioral therapy for AI models! <\/p>\n<p>The researchers caution, though, that more powerful AI models than Gemma might choose to hide their true emotional state rather than express it, and that the fine-tuning might make the models less safe, not more. Instead of fine-tuning, they suggest trying to ensure the models\u2019 initial training, or at least the post-training that shapes model behavior, be designed for emotional stability and that mechanistic interpretability (where researchers look at the model\u2019s internal activations) be used to monitor for a divergence between the model\u2019s expressed emotional state and its true emotional state. Does this sound wacky? You bet it does. 
But you can read the research here.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hello and welcome to Eye on AI. In this edition\u2026AI\u2019s reliability problem\u2026Trump sends an AI&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[245],"tags":[1688,482,353,5296,1862,5295,5298,2400,406,823,5297],"_links":{"self":[{"href":"https:\/\/stock999.top\/index.php?rest_route=\/wp\/v2\/posts\/2247"}],"collection":[{"href":"https:\/\/stock999.top\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/stock999.top\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/stock999.top\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/stock999.top\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2247"}],"version-history":[{"count":0,"href":"https:\/\/stock999.top\/index.php?rest_route=\/wp\/v2\/posts\/2247\/revisions"}],"wp:attachment":[{"href":"https:\/\/stock999.top\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2247"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/stock999.top\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2247"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/stock999.top\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2247"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}