HIT Consultant reframes AI evaluation

- HIT Consultant on May 15 published a playbook by Cota Capital's Vikram Venkat urging healthcare AI buyers to evaluate safety, fairness and workflow fit.
- The piece cited a review finding 95.4% of healthcare LLM studies focused on accuracy, while only 5% used real patient-care data.
- The full framework and cited studies are available in Venkat's May 15 article and linked research on HIT Consultant.

Vikram Venkat, a principal at Cota Capital, used a May 15 article in HIT Consultant to argue that healthcare AI should be judged less by demo-style accuracy scores and more by how systems behave in live clinical and operational settings. Venkat wrote that hospitals and vendors should measure safety, fairness, calibration, workflow compatibility and human-AI reliability alongside standard metrics such as AUROC, F1 and recall. The article focused on healthcare broadly, but its examples map closely to behavioral-health and utilization-review workflows where staff decisions, payer deadlines and missing documentation can compound errors.

A 2024 systematic review cited in the piece found that 95.4% of healthcare large-language-model evaluation studies primarily measured accuracy, while fairness, bias and toxicity were measured in 15.8%, deployment considerations in 4.6%, and calibration and uncertainty in 1.2%. The same review found only 5% of studies used real patient-care data for evaluation. (hitconsultant.net)

### Why is raw accuracy no longer the main benchmark?

The HIT Consultant piece said retrospective accuracy is "necessary but not sufficient" for real-world deployment because clinical and administrative decisions often depend on thresholds, confidence levels and local workflows, not just rank-order predictions. Venkat wrote that a model can score well on historical data and still fail if it is poorly calibrated, brittle across sites or hard for staff to use under time pressure. (pubmed.ncbi.nlm.nih.gov)

The U.S. hospital market gives that argument more weight because predictive AI is already widespread. A federal data brief based on the 2023 and 2024 American Hospital Association IT Supplement found 71% of nonfederal acute care hospitals reported using predictive AI integrated with electronic health records in 2024, up from 66% in 2023. (hitconsultant.net)

### What does the playbook say teams should measure instead?
Venkat wrote that healthcare organizations should test for calibration and uncertainty, subgroup fairness, robustness across settings, and performance when humans and AI are working together. In behavioral-health and utilization-review settings, that logic points to operational measures such as override rates, downstream rework, denial-related follow-up and whether staff can safely overrule a system recommendation when records are incomplete or payer criteria change. (healthit.gov)

The article says continuous evaluation across the AI lifecycle is more useful than one-time external validation. ARGO Managing Director Shanna Dugan described the kind of workflow strain those measures are meant to catch in a separate April 17 HIT Consultant article on behavioral-health utilization review. Dugan wrote that a single missing note or delayed assessment can trigger hours of rework or an automatic denial, and that many teams still rely on spreadsheets, shared inboxes and manual chart review under tight deadlines. (hitconsultant.net)

### What evidence does the article use on safety and bias?

Venkat cited a study reporting that clinicians shown biased AI predictions saw diagnostic accuracy worsen by 11.3%. The article used that finding to argue that model bias can degrade clinician performance rather than simply produce a bad standalone output. (hitconsultant.net)

The American Nurses Association made a related point on May 6 in a consensus report covered by HIT Consultant. The association said "automation bias" and erosion of professional judgment are primary risks to patient safety and that human oversight must remain non-negotiable, with nurses as final decision-makers. (hitconsultant.net)

### Why does this matter for behavioral-health automation?

Behavioral-health workflows often combine clinical judgment, payer documentation and urgent timing, which makes downstream effects visible quickly.
Dugan wrote that utilization-review nurses may manage dozens of cases at once and that delayed clinical notes or fragmented documentation can weaken authorization calls and increase denials. In that setting, a tool that looks accurate in testing but raises override rates or creates more rework can still damage operations. (hitconsultant.net)

Venkat's article does not announce a product or standard, but it gives hospital operators and vendors a checklist for procurement and deployment reviews. The May 15 HIT Consultant post includes the cited studies and framework categories, and the next step for readers is the source article itself and the underlying research it references. (hitconsultant.net 1) (hitconsultant.net 2)
