A new explainability standard

Stop explaining AI like software.
Start reserving against it like an actuary.

Today's AI governance inherits software-engineering defaults: one accuracy number, a "95% confidence" figure, and feature attributions (SHAP / LIME). Those answer which inputs mattered and how often the model is right on average. They are silent on the questions a regulator, a board, or a denied patient actually asks: how badly does each error hurt, how much trust does this output deserve, how much capital should we hold against tail failures, and will a credentialed professional sign their name to it?

Actuarial science answers exactly those questions for uncertain future losses — and has for a century. This project makes the case, with math and a live worked example, that the actuarial toolkit should be the standard for measuring and governing the AI black box.

  • Tail loss (TVaR₉₅): $17.4k per decision, which is what "95% confidence" throws away
  • Reserve owed: $15.7M for IBNR model failures not yet surfaced
  • A/E, 85+ cohort: 1.40, drift a single accuracy number hides
  • Credibility Z, 85+: 0.28, how much trust that cohort earns

Figures computed live in the tool from a 10,000-decision synthetic book.

The contribution

An inversion, not another governance checklist

The IAA AI Governance Framework, the SOA AI Task Force, and the NIST AI RMF all describe actuaries governing AI. None proposes the actuarial measurement toolkit as the explainability standard itself. That inversion — using credibility, reserving, A/E studies and TVaR to quantify model risk — is the idea developed here.

The software-engineering default
  • One headline metric (e.g. "0.94 AUC" or "94% accuracy"), a portfolio average that hides who gets hurt.
  • "95% confidence" — discards the 5% tail that contains the catastrophe.
  • SHAP / LIME — attribute which features mattered, not how much the error costs.
  • One-time validation — a snapshot, blind to cohort drift after deployment.
  • No reserve, no capital, no signature — nobody is accountable for tail risk.

The actuarial standard (this framework)
  • Frequency × Severity — a full loss distribution: how often and how badly.
  • VaR / TVaR (CTE) — explicitly price the tail "95% confidence" ignores.
  • Credibility (Z = n/(n+k)) — how much trust each output earns from its evidence.
  • Actual-to-Expected studies — continuous, cohort-level drift monitoring.
  • IBNR reserve + economic capital + a signed opinion — accountable governance.
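
The credibility formula in the list above, Z = n/(n+k), can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the function names, the cohort size n = 280, the constant k = 720 and both error rates are hypothetical values, chosen only so that Z lands at the 0.28 shown for the 85+ cohort.

```python
def credibility_weight(n: float, k: float) -> float:
    """Buhlmann-style credibility factor Z = n / (n + k)."""
    if n < 0 or k <= 0:
        raise ValueError("n must be non-negative and k must be positive")
    return n / (n + k)


def credibility_blend(observed: float, prior: float, z: float) -> float:
    """Weight the cohort's own experience by Z and the base rate by 1 - Z."""
    return z * observed + (1.0 - z) * prior


# Hypothetical inputs: 280 observed decisions in a thin cohort, k = 720.
z = credibility_weight(n=280, k=720)

# Hypothetical rates: the cohort shows a 14% error rate, the whole book 10%.
blended_rate = credibility_blend(observed=0.14, prior=0.10, z=z)

print(f"Z = {z:.2f}, blended error rate = {blended_rate:.4f}")
```

With a low Z, the blended rate stays close to the base rate: a thin cohort's own experience earns little weight until more evidence accumulates.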

Why now

The law already made actuarial reasoning the yardstick

Under Colorado SB21-169 / Regulation 10-1-1, an insurer's AI is permissible only where differential outcomes have a "legitimate actuarial basis." The NAIC Model AI Bulletin (adopted by 20+ states) demands a governed, risk-commensurate AI program. In insurance, actuarial reasoning is already the statutory test for acceptable AI. This framework formalizes that test into measurable metrics — and shows the same metrics generalize beyond insurance.

See the regulatory anchor and prior-art positioning in the paper →

The intellectual heart

Seven actuarial methods, mapped to AI governance questions

| Actuarial method | The governance question it answers for an AI system |
|---|---|
| Frequency × Severity | Not one accuracy number, but how often the model errs × how badly each error hurts: a full loss distribution. |
| Credibility (Bühlmann) | How much weight a given AI output deserves vs. the base rate, given how much relevant experience backs it. |
| IBNR / reserving | "Incurred but not reported" model failures (errors baked in but not yet surfaced) and the reserve to hold against them. |
| Actual-to-Expected study | Cohort-level, statistically rigorous drift / miscalibration monitoring vs. one-time validation. |
| VaR / TVaR (CTE) | Sizing catastrophic tail failure, exactly what "95% confidence" discards in its 5% tail. |
| Economic capital | The buffer to hold against AI tail risk (TVaR − expected loss). |
| Control cycle + signed opinion | A credentialed accountability layer: a "Statement of Actuarial AI Opinion" attesting to model risk. |
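
The VaR / TVaR and economic-capital rows can be made concrete with an empirical calculation over a list of per-decision losses. The sketch below is one simple convention, not the tool's own code: the function names and the 40-loss sample are hypothetical, and VaR is taken as the smallest loss inside the worst 5% of outcomes.

```python
def var_tvar(losses, level=0.95):
    """Empirical VaR and TVaR at the given level.

    VaR is the smallest loss in the worst (1 - level) share of outcomes;
    TVaR is the mean of that tail.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    ordered = sorted(losses)
    # Number of outcomes in the tail, at least one.
    tail_count = max(1, round(len(ordered) * (1.0 - level)))
    tail = ordered[-tail_count:]
    return tail[0], sum(tail) / len(tail)


def economic_capital(losses, level=0.95):
    """Buffer above the mean: TVaR minus expected loss."""
    expected = sum(losses) / len(losses)
    _, tvar = var_tvar(losses, level)
    return tvar - expected


# Hypothetical book: mostly no loss, some small errors, two severe ones.
losses = [0] * 30 + [1_000] * 8 + [5_000, 20_000]

var_95, tvar_95 = var_tvar(losses)
capital = economic_capital(losses)
print(var_95, tvar_95, capital)
```

In this sample the mean loss is 825 while the tail average is 12,500, which is the gap a single average conceals.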

Not just insurance

One lens, any agentic system

The framework needs only four things from a system: a stream of decisions, a definition of an error, a dollar severity, and cohorts to watch. Any AI or agentic system that decides at volume supplies all four, so its errors form an insurable book. The interactive tool measures eight such systems, each computed live from the same actuarial functions.
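
The four inputs named above map naturally onto a small record type. The sketch below is a hedged illustration of how frequency × severity and an Actual-to-Expected ratio could be computed per cohort from such records. The `Decision` fields, the `cohort_report` and `make_cohort` helpers, and the sample numbers are all assumptions for illustration, not the tool's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    cohort: str            # cohort to watch, e.g. an age band
    expected_error: float  # error probability expected for this decision
    is_error: bool         # the system's definition of an error, applied
    severity: float        # dollar cost if the decision is an error


def cohort_report(book):
    """Per cohort: error frequency, mean severity, loss per decision, A/E."""
    groups = defaultdict(list)
    for decision in book:
        groups[decision.cohort].append(decision)

    report = {}
    for cohort, rows in groups.items():
        errors = [d for d in rows if d.is_error]
        frequency = len(errors) / len(rows)
        mean_severity = (
            sum(d.severity for d in errors) / len(errors) if errors else 0.0
        )
        expected_errors = sum(d.expected_error for d in rows)
        report[cohort] = {
            "frequency": frequency,
            "mean_severity": mean_severity,
            "loss_per_decision": frequency * mean_severity,
            "a_to_e": (
                len(errors) / expected_errors
                if expected_errors > 0 else float("nan")
            ),
        }
    return report


def make_cohort(cohort, n, n_errors, severity):
    """Hypothetical helper: n decisions, the first n_errors of them wrong."""
    return [
        Decision(cohort, 0.25, i < n_errors, severity if i < n_errors else 0.0)
        for i in range(n)
    ]


# Hypothetical book: both cohorts are expected to err 25% of the time.
book = make_cohort("65-84", 20, 5, 2_000.0) + make_cohort("85+", 20, 7, 1_000.0)
report = cohort_report(book)
for cohort, metrics in report.items():
    print(cohort, metrics)
```

In this sample the 65-84 cohort comes out at an A/E of 1.0 and the 85+ cohort at 1.4, so the second cohort is erring 40% more often than expected even though the book-wide average would look only mildly off.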

Regulated decisioning
Health insurer: prior authorization
Wrongful-denial book; A/E flags the 85+ cohort at 1.40; $15.7M IBNR reserve.

Autonomous operations
BPO: customer-ops agent
Mis-resolutions under client SLAs; A/E flags the KYC queue at 1.43; $3.2M reserve.

Autonomous growth
Marketing: campaign agent
Tiny per-action cost, but the fattest tail (TVaR ≈ 19× expected): the rare brand/compliance blowup.

Autonomous engineering
Software: coding agent
Latent defects: IBNR (453) > reported (302); a $27M reserve a green test suite never shows.

Financial crime / AML
Bank: fraud & AML agent
A/E catches a new laundering typology (crypto, trade-based) before the regulator; $11.2M reserve.

Regulated lending
Lender: credit-underwriting agent
Insurance's near-twin (ECOA fair-lending); seasoning defaults ⇒ IBNR > reported; $23.8M reserve.

Clinical decisioning
Health system: triage agent
Highest stakes: mean severity $15.8k, a $42.9M reserve; A/E flags geriatric & mental-health under-triage.

Customer experience
SaaS: customer-support agent
High-frequency, low-severity, but the tail is a churned enterprise account or a viral security miss.
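
Several of the cards above report more IBNR failures than reported ones, for example 453 against 302 for the coding agent. One simple way such a split can arise is to gross up reported failures by an assumed reporting fraction. The page does not say how the tool derives its IBNR figures, so the sketch below is only an illustration: the function names, the 40% reporting fraction and the $10,000 mean severity are hypothetical. The fraction was picked so the count matches the card; the dollar result is not meant to reproduce the $27M reserve.

```python
def ibnr_count(reported: int, reported_fraction: float) -> int:
    """Estimate failures incurred but not yet reported.

    ultimate = reported / reported_fraction, IBNR = ultimate - reported.
    """
    if not 0.0 < reported_fraction <= 1.0:
        raise ValueError("reported_fraction must be in (0, 1]")
    ultimate = reported / reported_fraction
    return round(ultimate - reported)


def ibnr_reserve(reported: int, reported_fraction: float,
                 mean_severity: float) -> float:
    """Simplified reserve: IBNR count times an assumed mean severity."""
    return ibnr_count(reported, reported_fraction) * mean_severity


# Hypothetical: only 40% of eventual failures have surfaced so far.
latent = ibnr_count(reported=302, reported_fraction=0.40)

# Hypothetical mean severity of $10,000 per latent failure.
reserve = ibnr_reserve(302, 0.40, mean_severity=10_000.0)

print(latent, reserve)
```

The lower the share of failures that has surfaced, the larger the reserve relative to what the reported numbers show, which is why a clean test suite or a quiet complaints queue is weak evidence on its own.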