HR owns AI governance now. Which LLM is good at what?

HR is being handed governance responsibility for tools that in some cases it doesn't fully understand, and the exposure that creates is landing in places most people functions aren't prepared for.

By Stephen Owens

22 Jun 2026

Lydia Wu, Senior Director and Industry Analyst of AI in HR at Gartner, puts the number plainly. "About 95% of HR teams have AI initiatives underway, but only 18% are achieving significant transformational value." That is not a rollout problem. HR teams may be deploying tools they don't understand well enough, operationally, to govern, and the consequences are showing up in the data.

79% of employers are concerned about AI-related litigation, according to Littler's 2026 cross-functional AI adoption report, and 77% say AI adoption is already outpacing their current governance capabilities. And in another finding that should stop any CHRO mid-coffee: 51% of Americans are already using AI, yet fewer than one in four trust what it produces most or almost all of the time. A substantial portion of the American workforce is regularly using a tool it doesn't trust to inform business decisions.

Niloy Ray, co-chair of Littler's AI and Technology Practice Group, framed the liability directly: "Employers will likely remain on the hook for how these tools are used."

Whether HR is ready to be on that hook is a different question.

Measuring adoption was the wrong instinct

Most organizations defaulted to measuring AI success by adoption - seat utilization, weekly active users, token counts. As HRD has reported, that approach produced Amazon's tokenmaxxing crisis, where employees gamed AI usage leaderboards by running low-value tasks purely to inflate their scores. When the metric became the target, it stopped measuring anything useful.

The crisis exposed something deeper: most organizations, including most HR teams, can't distinguish between AI activity and AI value because they don't understand what the tools actually do, or where they fail. As Gartner's Swagatam Basu put it: "Most leaders are mistaking basic access or adoption metrics for transformation. This 'enablement illusion' is hiding risks and draining ROI."

Working knowledge of the main AI platforms - not becoming a technologist, but understanding what distinguishes them - is what allows HR to make defensible decisions about which tools touch which work, what data should and shouldn't flow through them, and what "the AI said so" actually means when someone uses it to justify an employment decision.

The five platforms HR needs to understand

ChatGPT (GPT-5.5) is the platform most employees arrived at first, and for many organizations it remains the default. Its strength is breadth - it handles drafting, research, data summarization and workflow automation competently across a wide range of tasks, and its interface is the easiest to onboard staff to. On Harvey's BigLaw Bench, an independent benchmark for legal and professional document reasoning, GPT-5.5 scored 91.7% in April 2026. Context window is one million tokens, sufficient for most HR document tasks.

The governance-relevant limitation is consistency. GPT-5.5 is a confident model, and confident models fail confidently. On Harvey's Legal Agent Benchmark - a stricter test measuring end-to-end autonomous task completion - it scores 3.75%, significantly below Claude's 10.4%. For employment documentation that a human reviews before acting on, that gap is manageable. For automated HR workflows running without close oversight, it matters more. HR leaders deploying GPT-5.5 for sensitive work - accommodation letters, disciplinary documentation, termination notices - need robust human review protocols, because its failure mode is to be wrong without flagging uncertainty.

Claude (Anthropic) leads Harvey's Legal Agent Benchmark at 10.4%. It doesn't train on inputs from commercial plan users, which matters when employee records, investigation notes or medical accommodation requests are flowing through the system. Enterprise users consistently describe it as more cautious and precise on high-stakes document work - that calibration toward accuracy over confidence matters for HR's most legally exposed outputs. The weaknesses: it is not the strongest model for numerical analysis or workforce data modeling, and enterprise pricing requires a sales conversation rather than self-service, which slows smaller teams trying to move quickly.

Google Gemini 2.5 Pro makes its case on integration. For HR teams on Google Workspace, Gemini sits natively inside Gmail, Docs, Drive and Meet - AI assistance without switching tools. It handles mixed-format documents well, which matters for onboarding paperwork and benefit forms. On Harvey's Legal Agent Benchmark it scored 0.8%, placing it clearly behind the leaders for complex autonomous document work. It's also the youngest major enterprise platform here, having launched enterprise-grade offerings only in late 2025.

Microsoft Copilot is the platform most HR teams are already technically paying for. It's GPT-5.5 embedded in Microsoft 365, running inside Outlook, Word, Excel, Teams and SharePoint with your organization's existing security policies and compliance controls inherited automatically. For HR operations teams processing high volumes of similar transactions - onboarding sequences, policy acknowledgments, benefits queries - the 2026 workflow automation capabilities have real operational value. The Excel integration is worth noting specifically: Copilot can work directly inside compensation spreadsheets and headcount models without moving data outside the Microsoft environment. The pricing, though, needs scrutiny. Copilot Business at $30 per user per month sits on top of a mandatory Microsoft 365 base subscription, bringing the true all-in cost to approximately $42.50 per user per month.

Grok 4 (xAI) is the only platform that updates on live data. Every other model works from a training cutoff; Grok draws on information published this morning. For HR's need to stay current on EEOC guidance, DOL regulatory updates, state-level employment law developments and fast-moving court decisions, that's a genuine advantage no other platform offers. The caveat: xAI is the youngest major platform by a considerable margin, with the thinnest enterprise compliance track record. Employment decisions, performance records and sensitive personal data need more governance certainty than a new vendor's history currently provides. Use it for research and intelligence; hold off building sensitive employment workflows around it.

One platform not on this list for HR use: DeepSeek, the Chinese open-source model that benchmarks competitively and costs almost nothing. DeepSeek's own privacy policy states that data is stored in China and subject to Chinese law, including legislation requiring organizations to cooperate with state intelligence on request. The U.S. National Counterintelligence and Security Center has issued specific warnings. Any HR team loading employee records, investigation notes, accommodation requests or compensation data into DeepSeek's public interface is creating an exposure that's very difficult to defend.

What platform knowledge actually changes

As Lauren McKee, practice leader at LegalVision, warned HRD: "The most common exposure is employees inadvertently inputting confidential or personal information into external AI platforms."

That exposure comes from staff who don't understand what the tool does with what they feed it, working under HR policies that aren't specific enough to actually prevent it.

Platform knowledge turns vague AI policy into enforceable policy. Instead of "use AI responsibly," you can say: don't input personally identifiable employee data into any platform that hasn't been cleared by IT and HR; use frontier models for complex document drafting but route routine tasks to lower-cost alternatives; any AI-generated output that will inform an employment decision requires documented human review before it's acted on. The first version sounds good in a handbook. The second version holds up when something goes wrong.
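For teams that want to see how specific those rules are, two of them can be written as checks a reviewer or an intake form could apply. This is a minimal sketch, not a compliance tool: the AIUse record, its field names and the CLEARED_PLATFORMS allow-list are hypothetical, invented here for illustration, and the cost-routing rule is left out because it is guidance rather than a pass/fail test.

```python
from dataclasses import dataclass

# Hypothetical allow-list. In practice this would be whatever platforms
# IT and HR have jointly cleared; the entry below is a placeholder name.
CLEARED_PLATFORMS = {"example-cleared-platform"}


@dataclass(frozen=True)
class AIUse:
    """One proposed use of an AI tool (illustrative fields only)."""
    platform: str
    contains_employee_pii: bool
    informs_employment_decision: bool
    human_review_documented: bool


def policy_violations(use: AIUse) -> list[str]:
    """Return the policy rules this use would break; empty means none."""
    violations = []
    # Rule 1: no personally identifiable employee data on uncleared platforms.
    if use.contains_employee_pii and use.platform not in CLEARED_PLATFORMS:
        violations.append("PII sent to a platform not cleared by IT and HR")
    # Rule 2: output informing an employment decision needs documented review.
    if use.informs_employment_decision and not use.human_review_documented:
        violations.append("employment decision lacks documented human review")
    return violations


if __name__ == "__main__":
    draft = AIUse(
        platform="unvetted-chatbot",
        contains_employee_pii=True,
        informs_employment_decision=True,
        human_review_documented=False,
    )
    for problem in policy_violations(draft):
        print(problem)
```

A rule like "use AI responsibly" cannot be expressed this way at all, which is the point: a policy that can be written as a yes/no check is one that can be audited later.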

68% of employees are using unsanctioned AI tools, and 57% are feeding sensitive corporate data into them, HRD has reported. HR is best positioned to fix that - but only if it knows enough about the actual tools to design solutions that address what employees are doing, rather than what a generic policy assumes they're doing.

Tokens, cost, and the trust problem HR ends up managing

AI platforms that once charged a flat monthly subscription are shifting to token-based billing - charging for every prompt, every document processed, every automated agent task. A token is roughly three-quarters of a word; a comprehensive workforce planning analysis might consume 200,000 tokens, a complex employee relations letter might use 4,000, and an onboarding agent running for an hour might burn through 500,000 or more.
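To make those figures concrete, the arithmetic can be sketched in a few lines. The three-quarters-of-a-word rule and the three token counts come from the paragraph above; the price per million tokens is an assumption chosen only for illustration, since real rates vary by vendor, by model and by whether tokens are input or output.

```python
# Rule of thumb from the text: one token is roughly three-quarters of a word.
WORDS_PER_TOKEN = 0.75

# ASSUMPTION: a hypothetical blended price, for illustration only.
ASSUMED_USD_PER_MILLION_TOKENS = 10.00

# Token counts quoted in the article for three typical HR tasks.
TASK_TOKENS = {
    "workforce planning analysis": 200_000,
    "employee relations letter": 4_000,
    "onboarding agent (one hour)": 500_000,
}


def words_to_tokens(words: int) -> int:
    """Estimate how many tokens a text of the given word count uses."""
    return round(words / WORDS_PER_TOKEN)


def tokens_to_words(tokens: int) -> int:
    """Estimate how many words a given token count represents."""
    return round(tokens * WORDS_PER_TOKEN)


def estimate_cost(tokens: int,
                  usd_per_million: float = ASSUMED_USD_PER_MILLION_TOKENS) -> float:
    """Estimated spend in US dollars for a task of the given token count."""
    return tokens / 1_000_000 * usd_per_million


def cost_table() -> dict[str, float]:
    """Estimated cost per task, rounded to cents, at the assumed rate."""
    return {task: round(estimate_cost(t), 2) for task, t in TASK_TOKENS.items()}


if __name__ == "__main__":
    for task, cost in cost_table().items():
        print(f"{task}: ${cost:.2f}")
```

At the assumed rate, the letter costs a few cents while an hour of agent work costs about 125 times as much. The actual dollar amounts will differ with real pricing, but the ratio between tasks is what a usage guideline needs to explain.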

77% of technology leaders say they lack full visibility into real-time AI spend, according to IBM's 2026 report. When that spend is invisible, the usage caps and rollback decisions that follow land without context - and without context, they read to employees as arbitrary restrictions on tools they were told were essential. That trust damage is HR's problem to manage. As HRD has covered, companies that rushed to incentivize AI adoption are now trying to walk those incentives back, and the communications fallout lands in HR's inbox.

Being able to explain to employees why a usage guideline exists - "heavy document work costs more than a quick email summary; we're asking you to match the tool to the task" - is the difference between a policy that lands as a reasonable operational decision and one that reads as institutional distrust.

The audit before the strategy

Most organizations aren't choosing between these platforms. They're inheriting whichever ones got deployed before governance caught up - which, per the data, is most of them. Darren Lonsdale, managing director at Prosci ANZ, put the precondition plainly to HRD: "If there's weak executive sponsorship or no executive sponsorship, then the program's destined to fail before you start."

For most HR teams, the practical starting point isn't a platform selection exercise. It's an audit: which AI tools are actually in use across the organization, what employee data is flowing through them, what governance currently exists, and where the real exposure sits. That picture - not a vendor comparison matrix - is what makes the governance questions actionable.

Kate Major, head of people and culture at ACC New Zealand, told HRD: "If you design it from a people-first perspective, then you get better adoption, you get better outcomes and you get fewer unintended impacts." That applies to governance as much as adoption. Policies written by people who understand what the tools actually do tend to hold. Policies written in the abstract tend to get routed around.

The 18% achieving transformational value from AI got there by understanding what they were deploying before they deployed it at scale. That's not a technology advantage; it's a knowledge one.