New Scale AI benchmark reveals AI agents rarely meet professional work standards
Companies are deploying AI agents faster than research can keep pace. But a new benchmark from Scale AI and the Center for AI Safety offers a rare dose of empirical cold water. When it comes to completing real-world, professional work from start to finish, even the most advanced AI agents succeed less than five per cent of the time.
That’s the central finding of the Remote Labor Index (RLI), which measures how well AI agents perform on real, paid digital work tasks drawn from professional freelance platforms. At its launch in late 2025, the top-performing agent automated just 2.5% of projects to a professional standard. As of mid-2026, that number has barely moved.
“After six months we are still seeing less than 5%,” said Udari Madhushani Sehwag, security and policy research lead at Scale AI in San Francisco and a contributor to the RLI research. “They started very low initially and we are still in this very low region.”
Measuring what actually matters
Most AI benchmarks test isolated skills like answering questions, writing code snippets, and summarizing text. The RLI was built on a different premise. The research team wanted to know whether an AI agent could take a task from beginning to end the way a paid professional would, and whether the output would meet a paying client’s standard.
Tasks were sourced from digital labor platforms like Upwork and spanned 23 sectors, including video editing, logo and leaflet design, architecture, data analysis, jewelry design, and game development. Evaluators then compared AI-generated deliverables against human-produced ones with one question in mind: would a client actually pay for this?
“If you consider creating a window and you have designs and all that, it could be the case that AI can create something very aesthetically pleasing,” Sehwag said. “But if the dimensions are incorrect, it doesn’t matter how pleasing it looks. A human is not going to actually pay for that.”
The benchmark also maintains a live leaderboard of AI agent scores, updated as new models are evaluated. The top performer as of mid-2026, claude-opus-4-6 via the CoWork platform, sits at 4.17%. Everything else is lower.
The reliability gap
The low automation rate isn’t simply a matter of AI agents producing bad work. Sehwag points to something more specific.
“The keyword is reliability,” she said. “They can complete parts of the tasks, but for the most part they’re not able to complete the end to end tasks reliably.”
There were pockets of stronger performance, however. In image generation tasks, such as logo creation, AI outputs sometimes outperformed human work in evaluator preference. Sehwag attributes this partly to the subjectivity baked into that type of work. Models trained on vast visual datasets can produce aesthetically polished results that sway human judgment. But those results don’t extend to more complex, specification-driven work.
The slow pace of improvement also stood out. Other AI benchmarks, which test discrete skills, tend to show steep progress curves. The RLI, grounded in end-to-end task completion, shows something flatter.
It’s a finding that runs counter to the rapid organizational momentum behind AI agents. A Salesforce survey of 200 chief human resources officers (CHROs), reported by HRD America, found that 89 per cent believe AI agents will empower them to reassign employees to new roles, and roughly 23 per cent of the workforce is expected to be redeployed as a result of the technology.
Augmentation, not automation
So what should organizations actually do with this information? Sehwag’s answer is to work from evidence rather than forecasts.
“Decisions should be based on what we can see and the proof that exists, not based on the projections we have in mind,” she said.
That doesn’t mean avoiding AI agents altogether. Sehwag sees real value in using them to accelerate work that still gets reviewed by a person. A task that used to take 30 minutes might take 10 with an agent’s help.
“You can use these agents more in the form of a copilot where they can actually help you complete the task more efficiently and faster, but not as a replacement for you that they would actually be able to complete the whole task reliably,” she said.
It’s a meaningful distinction for organizations still weighing how far to take their AI deployments. “They can do augmentation, but not automation as of now,” she said.
That framing matters for organizations that have already woven AI agents into client-facing workflows. Sehwag’s advice isn’t to pull back, but to maintain human oversight at every stage.
How long that oversight will be necessary depends on how quickly AI agents improve.
“I wouldn’t expect to see a rapid jump,” Sehwag said. “And this is what we’ve been seeing since late 2025 as well.”
What leaders should watch
The RLI is an active benchmark, not a one-time study. New models are being added, and Scale AI continues to research what’s driving the performance plateau.
Sehwag points to three interconnected gaps. Agents need to fully understand a task brief, complete all component parts, and then assemble those parts into a coherent whole. Until all three click into place consistently, full end-to-end automation remains out of reach. That has direct implications for any organization weighing workforce decisions, whether on hiring, role redesign, or the scope of what gets delegated to an AI system.
The RLI’s findings land at a moment when many organizations are still figuring out where AI agents fit into their workforce strategy. Research covered by HRD America suggests employees are already drawing clear lines around what they’ll accept from these tools, and the RLI results indicate those instincts are well-founded. For organizations that have approached AI as a strategic workforce question rather than a technology one, the RLI is a timely reality check on AI agents’ current capabilities.