How to benchmark your team's AI proficiency
Most teams can tell you how many people use AI; almost none can tell you how well. That difference is AI proficiency, and you can benchmark it without turning it into surveillance.
In short
- To benchmark AI proficiency, define a literacy-to-fluency ladder, assess people on real role-specific tasks instead of quizzes or self-ratings, map results by role, and re-test on a cadence. Then use the map to target training, not to police usage.
- Usage dashboards measure logins; 88% of employees use AI but only 5% use it in advanced ways. Proficiency is the gap between those two numbers.
- A benchmark and surveillance are different things: one is a consented, point-in-time task that produces a training plan the person sees; the other watches behavior covertly.
Why you can't read AI proficiency off a usage dashboard
Most teams measuring AI are measuring the wrong thing. They can tell you how many people logged into a tool last month; they can't tell you whether those people are any good with it. That second question is AI proficiency, and the gap between the two numbers is enormous. EY's 2025 Work Reimagined survey of 15,000 employees found that 88% use AI at work, but only 5% use it in advanced ways that change how they work. A usage log counts the 88%. It says nothing about the 5%, and the distance between them is exactly what you'd want a benchmark to surface. The good news is that AI proficiency is observable and benchmarkable; the catch is that you have to measure capability on real work, not logins, and not self-ratings, and you have to do it in a way that builds people up rather than watches them.
Here's how to benchmark it in a way that produces a training plan instead of a leaderboard.
How to benchmark AI proficiency, step by step
Five steps to measure what your people can actually do with AI, so the result points at training rather than at individuals.
- 1. Define the ladder, literacy to fluency
Write three or four named levels before you measure anything, so a score means something. A workable spine: Aware (knows what tools do and their limits, can spot a bad output), Literate (completes a real task with AI under guidance), Proficient (independently breaks a task down, picks the right tool, edits and verifies), Fluent (redesigns how the work gets done with AI and helps others). The EY data lives in the gap between Literate and Fluent.
- 2. Assess on real tasks, not quizzes
A multiple-choice test measures recall, not capability. Give a representative work artifact and watch the process: whether they clarify the goal and constraints first, whether they choose a tool deliberately and say why, and whether they correct and verify the output or accept it blind. Score the work, not a self-rating, because self-perception often deviates from real ability.
- 3. Benchmark by role, not one company-wide bar
Fluent means different skills for a recruiter, a financial analyst, and a support lead. Build the task and the rubric from the actual job. A single company average hides where the real gap sits, which is usually on the frontline, not in the leadership team.
- 4. Turn the map into a training plan
A benchmark that doesn't change what you teach is just a scoreboard. Read the distribution by level and by role, then target training where the cluster sits. BCG found that five or more hours of training, delivered with coaching, is what actually moves people up a rung.
- 5. Re-test on a cadence and recalibrate the bar
Re-assess on a regular interval (quarterly is a reasonable default), and reset what each level means as the tools and the baseline move. Today's Proficient is next year's Literate.
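One way to make the ladder and rubric concrete is a small scoring sketch. The code below is illustrative only: the 0-3 scale, the thresholds, the `redesigns_workflow` flag, and every name in it are assumptions made for the example, not part of the EY or BCG research or of any standard instrument. It scores one observed task on decompose, select, and oversee, maps the result to a ladder level, and tallies levels by role.

```python
from collections import Counter
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    AWARE = 1
    LITERATE = 2
    PROFICIENT = 3
    FLUENT = 4


@dataclass(frozen=True)
class TaskAssessment:
    """One observed, consented task. All fields are hypothetical."""
    person: str
    role: str
    decompose: int  # 0-3: clarified the goal and broke the task down
    select: int     # 0-3: chose a tool deliberately and could say why
    oversee: int    # 0-3: corrected and verified the output
    redesigns_workflow: bool = False  # assessor's judgment, assumed flag


def assign_level(a: TaskAssessment) -> Level:
    """Map rubric scores to a ladder level using assumed thresholds."""
    scores = (a.decompose, a.select, a.oversee)
    if any(not 0 <= s <= 3 for s in scores):
        raise ValueError("rubric scores must be between 0 and 3")
    weakest = min(scores)  # the weakest dimension caps the level
    if weakest >= 3 and a.redesigns_workflow:
        return Level.FLUENT
    if weakest >= 2:
        return Level.PROFICIENT
    if weakest >= 1:
        return Level.LITERATE
    return Level.AWARE


def distribution_by_role(assessments):
    """Count how many people sit at each level, per role."""
    dist = {}
    for a in assessments:
        dist.setdefault(a.role, Counter())[assign_level(a)] += 1
    return dist


if __name__ == "__main__":
    sample = [
        TaskAssessment("p1", "support", decompose=1, select=2, oversee=1),
        TaskAssessment("p2", "analyst", decompose=3, select=2, oversee=2),
    ]
    for role, counts in distribution_by_role(sample).items():
        print(role, {lvl.name: n for lvl, n in counts.items()})
```

Using the weakest rubric dimension to cap the level is a deliberate choice in this sketch: someone who picks tools well but never verifies output should not read as Proficient. A team could just as reasonably weight the dimensions differently.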
A proficiency benchmark measures capability on a task you both agreed to, and the output is a plan the person sees. Surveillance measures behavior continuously and covertly, and the output is a dashboard the person never sees. Different inputs, different outputs, different intent.
"You can't measure a soft skill" and "this is just surveillance"
Two objections come up every time, and both deserve a straight answer. The first: you can't put a number on judgment. Partly true, and a single number would be reductive, but you're not scoring a personality trait. You're observing whether someone can complete a real, role-specific task with AI and verify the result, and the rubric of decompose, select, and oversee makes a fuzzy thing legible without pretending it's one digit. The second objection is the one that matters most for how this lands: isn't benchmarking just surveillance with a nicer name? It has real evidence behind it: monitored workers report markedly higher stress, and trust in AI has fallen even as adoption rises, partly because people hide the very tools that make them productive when they feel watched. The answer is in the design. A benchmark is point-in-time and consented, built from a task the person agreed to, and its output is a training plan they see. That is the opposite of a covert behavior dashboard. Get that distinction right and benchmarking levels people up; get it wrong and you teach your team to hide. This is the human side of any AI transformation, and it sits inside a broader AI adoption effort rather than off to the side.
Why the benchmark has to be role-specific
The reason a company-wide AI proficiency score misleads is that the average hides the gap that matters. BCG's 2025 research found regular AI use among leaders and managers has climbed while frontline use stalled around 51%, a 'silicon ceiling' that a single number papers right over. Benchmark by role and the picture sharpens: you see that your support team is Literate but stuck, or that your analysts are Proficient and ready to go Fluent, and you can aim training accordingly instead of broadcasting the same generic course to everyone. That is why proficiency work runs role by role, the way our roles tracks do, and why rolling it across a whole team means measuring each function on its own work. The benchmark isn't the deliverable; the targeted training it points to is, and that is where the business return shows up, since organizations with mature, organization-wide upskilling are far more likely to report real AI ROI.
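A toy calculation shows how a company-wide number can hide the gap. The roles and levels below are made-up illustrative data, not results from BCG, EY, or any real team, and averaging ordinal levels is itself a simplification used here only to make the point.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ladder levels per person: 1=Aware, 2=Literate,
# 3=Proficient, 4=Fluent. Illustrative data only.
results = [
    ("leadership", 4), ("leadership", 3), ("leadership", 4),
    ("analyst", 3), ("analyst", 3), ("analyst", 3),
    ("support", 2), ("support", 1), ("support", 2), ("support", 2),
]


def average_by_role(rows):
    """Group levels by role and return the mean level for each role."""
    grouped = defaultdict(list)
    for role, level in rows:
        grouped[role].append(level)
    return {role: round(mean(levels), 2) for role, levels in grouped.items()}


company_average = round(mean(level for _, level in results), 2)
by_role = average_by_role(results)

if __name__ == "__main__":
    print("company-wide:", company_average)  # 2.7
    for role, avg in by_role.items():
        print(role, avg)  # leadership 3.67, analyst 3, support 1.75
```

In this made-up team the single figure of 2.7 suggests everyone is approaching Proficient, while the role view shows support sitting below Literate. The per-role view is the one that tells you where to aim training.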
Common questions
What is AI proficiency?
AI proficiency is how well someone can actually do their work with AI, as opposed to whether they use it at all. It runs along a ladder from Aware (knows what the tools do and their limits) through Literate and Proficient to Fluent (redesigns how the work gets done and helps others). EY found 88% of employees use AI but only 5% use it in advanced ways, and that gap is what proficiency measures.
How do you measure AI proficiency?
Not with a quiz or a self-rating, both of which miss real ability. Give people a representative, role-specific task and watch the process: whether they clarify the goal, pick a tool deliberately, and verify the output. Score the work against a simple rubric of decompose, select, and oversee, benchmark by role, and re-test on a cadence so the bar keeps up with the tools.
Isn't benchmarking AI skills just surveillance?
No, if you design it right. A benchmark is point-in-time and consented, built from a task the person agreed to, and its output is a training plan they see. Surveillance watches behavior continuously and covertly, and its output is a dashboard the person never sees. Monitored workers report higher stress and tend to hide their AI use, so a covert approach defeats itself. Benchmark to level people up, not to police them.
How often should you re-benchmark AI proficiency?
Quarterly is a reasonable default, and you should reset what each level means as the tools and the baseline move. AI capability that looked advanced a year ago is now table stakes, so a static bar quietly understates the gap. Re-testing on a cadence keeps the training plan pointed at the right people.
Benchmark AI proficiency, then close the gap
Candova AI maps where each role actually stands with AI and delivers the hands-on, role-specific training that moves people up the ladder, with coaching on their real work.

Written by
Laura Dansbury
SVP of Product and Content at Candova
Laura has spent more than 15 years building and scaling products across consumer and B2B, with product and UX leadership roles at LinkedIn, Ancestry, and Movoto before Study.com and Candova. Her work has consistently centered on the same thing: turning a strategy into a product real people actually use, and getting the conversion and growth numbers to prove it.