Why Government AI ROI Math Looks Different from Commercial
Commercial AI ROI models are designed to convince a CFO that an enterprise software investment will pay back through revenue lift or cost reduction inside a private P&L. Government ROI math has a different structure and a different audience. The agency leader presenting to a city council, a state legislature, a Congressional appropriations subcommittee, or an OMB Passback review needs three things the commercial model rarely produces.
First, the cost analysis must be defensible at line-item granularity. "We will save 30%" is not a budget submission. "We will reduce contracted call-center spend by $4.2M annually beginning Q3 of FY27, with one-time implementation cost of $1.6M in FY26 and continuing operating cost of $850K annually" is a budget submission. The numbers must trace to specific line items in the existing appropriation and to specific budget objects in the next request.
Second, cost reduction alone is rarely enough. Government leaders are evaluated on policy outcomes. The Medicaid director is judged on procedural disenrollment rate. The court administrator is judged on FTA reduction. The 311 director is judged on resolution time and resident satisfaction. The CIO is judged on whether the system actually delivered the policy outcome, not just whether the spend went down. ROI presentations that lead with cost and bury outcome lose to presentations that lead with outcome and substantiate with cost.
Third, the analysis must respect the multi-year horizon and the funding-cycle realities of government. State funding cycles run 1-2 years; federal civilian appropriations run 1 year, sometimes extended by multi-year supplemental funding; capital improvement programs run 3-5 years. The ROI calculation needs to be expressed in the funding-cycle terms the audience uses, not in the rolling 12-month payback frames a private CFO uses. The legislative requestor's question is "what does this cost in FY26, FY27, and FY28, and what does it save in those same years?"
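A sketch of the fiscal-year arithmetic behind a submission like the call-center example above, in Python. The figures are the illustrative numbers from that example; the half-year factor assumes go-live at the start of Q3 FY27, so FY27 carries two quarters of both savings and operating cost:

```python
# Illustrative fiscal-year cash flow for the budget-submission example above.
# Go-live is assumed at the start of Q3 FY27, so FY27 carries two of four
# quarters of steady-state savings and operating cost.

def fy_cash_flows(one_time, annual_savings, annual_opex, golive_quarters_left):
    """Return net cash flow per fiscal year as a dict."""
    frac = golive_quarters_left / 4  # fraction of the go-live FY at steady state
    return {
        "FY26": -one_time,
        "FY27": (annual_savings - annual_opex) * frac,
        "FY28": annual_savings - annual_opex,
    }

flows = fy_cash_flows(1.6e6, 4.2e6, 0.85e6, golive_quarters_left=2)
for fy, net in flows.items():
    print(f"{fy}: ${net / 1e6:+.2f}M")
```

Expressing the flows by fiscal year, rather than as a rolling payback period, is what lets the numbers drop directly into a budget submission.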
Building the Baseline: What You Need Before You Can Measure Anything
The single most common reason an AI voice business case fails post-deployment is that the agency never built a defensible baseline. Without a baseline, every "savings" claim post-go-live becomes a debate. With a baseline, the savings are arithmetic.
- Fully loaded cost per handled call. Not the agent wage. The full annual cost of running the contact center divided by handled call volume. Includes agent labor, supervisors, QA staff, training, facilities, telecom, technology licenses, BPO contract overhead, IT support overhead, HR allocation, and benefits. Most agencies underestimate this by 40-60% because they only count direct labor.
- Call volume by intent. Top 10-20 intents the contact center actually handles, with monthly volume. Without this segmentation, AI containment projections are meaningless because containment varies dramatically by intent (status checks: 95%+; complex casework: 5%-15%).
- Call volume by language. English / Spanish / next 3-8 languages including LanguageLine fallback usage. LEP volume is often 20-40% of total and disproportionately expensive under per-minute interpreter contracts.
- Service level and abandonment baseline. Service level (% answered within target seconds), abandonment rate, average speed to answer, average handle time, average hold time, after-call work time, callback rate, repeat-caller rate.
- First-call resolution rate. The single most predictive metric for citizen satisfaction. Often poorly measured at baseline because the data is hard to extract from legacy ACDs.
- Per-minute interpreter spend. LanguageLine, Voiance, CyraCom, Propio, Akorbi, GLOBO. Annual line-item spend, by language, with average minutes per call.
- FTE roster and reallocation potential. How many FTEs sit in the contact center today. Which roles are tied to specific call types. Which roles are at risk in the agency's next budget cycle and which are protected. Which would be redeployed to higher-value casework versus being eliminated.
- Policy outcome baseline. The political KPIs the agency leader will be measured on. Examples: Medicaid procedural disenrollment rate, FTA rate, permit issuance time, right-party contact rate on outbound benefit calls, 311 resolution time. AI ROI must be tied to changes in these.
- Customer experience baseline. CSAT or NPS where measured. Voice-of-customer themes from existing surveys.
- Compliance posture cost. What the agency currently pays for HIPAA, MARS-E, ARS, FedRAMP, StateRAMP, IRS Pub 1075, Section 508, and Section 1557 compliance on the existing platform. AI replaces some of this overhead and adds some of its own.
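As a concrete illustration of the first bullet, a fully loaded cost-per-call calculation might look like the following sketch. Every component figure and the call volume are hypothetical, chosen only to show how far labor-only math undercounts:

```python
# Minimal sketch of the fully loaded cost-per-call baseline described above.
# All component figures and the handled volume are illustrative assumptions;
# the point is that direct agent labor is only one line among many.

cost_components = {                # annual $, illustrative
    "agent_labor": 6_400_000,
    "benefits": 1_900_000,
    "supervisors_qa": 1_100_000,
    "training": 400_000,
    "facilities_telecom": 900_000,
    "technology_licenses": 700_000,
    "bpo_overhead": 1_200_000,
    "it_hr_allocation": 600_000,
}
handled_calls = 1_450_000          # annual handled volume, illustrative

fully_loaded = sum(cost_components.values()) / handled_calls
labor_only = cost_components["agent_labor"] / handled_calls
print(f"fully loaded: ${fully_loaded:.2f}/call vs labor-only: ${labor_only:.2f}/call")
```

With these assumed inputs, labor-only math captures less than half the true cost per call, which is consistent with the 40-60% underestimate noted above.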
How an Agency Builds the AI Voice Business Case
- Frame the policy outcome first. What outcome is this AI investment supposed to improve? Procedural disenrollment? Court FTA? Permit cycle time? LEP service equity? The outcome anchors every subsequent metric.
- Build the baseline (4-8 weeks). See the previous section. Without this, the rest of the case is hand-waving.
- Segment the call portfolio by AI containment potential. Group calls by intent. For each intent, estimate the realistic AI containment rate based on benchmark data and pilot outcomes from peer agencies. Status-lookup intents: 90%+. Recertification reminders: 80%+. Complex eligibility determinations: 10%-25%. Multi-step casework: 5%-15%. Containment is not uniform; the business case must reflect that.
- Project the post-AI operational state. For each intent: AI handles X%, warm-transfers Y%, human handles 100% of Z%. Project the resulting cost per call, service level, and abandonment by intent.
- Calculate net savings. Total post-AI cost vs. baseline cost. Subtract one-time implementation cost (integration, training, change management, ATO/MARS-E authorization, professional services). Subtract ongoing AI platform cost. Add back any cost that survives the deployment (residual interpreter contract, supervisory headcount, QA, compliance overhead).
- Project the policy outcome change. Based on pilot data and peer benchmarks, project the change on the framing policy outcome. Be conservative; agency leaders have seen too many overpromises.
- Build the multi-year cash flow. Year 1: net cost (implementation + partial benefit). Year 2: cash positive. Year 3: full benefit. Express in the agency's fiscal calendar, not rolling months.
- Map to procurement vehicle and timeline. Which existing contract, vehicle, or solicitation will fund this? When does the next budget cycle close? When does the existing contact-center contract come up for re-procurement?
- Document the risks and mitigations. ATO timeline risk, integration risk, change management risk, vendor concentration risk. Document each with a mitigation plan. Risks honestly disclosed are far more credible than risks omitted.
- Build the agency leader's one-page summary. Top half: the policy outcome and the cash math in three lines. Bottom half: the implementation timeline and the risks. Everything else is appendix.
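The segmentation, projection, and net-savings steps above reduce to a containment-weighted cost model. A minimal sketch, with intent volumes, containment rates, and per-call costs that are illustrative assumptions rather than benchmarks:

```python
# Containment-weighted savings sketch for the steps above. Volumes, containment
# rates, and per-call costs are illustrative assumptions, not benchmarks.

intents = [
    # (intent, annual volume, projected AI containment, baseline human $/call)
    ("status_lookup",    600_000, 0.92,  8.50),
    ("recert_reminder",  250_000, 0.82,  9.00),
    ("eligibility",      150_000, 0.20, 14.00),
    ("complex_casework",  80_000, 0.10, 18.00),
]
AI_COST_PER_CALL = 1.20            # illustrative AI cost per AI-handled call
ANNUAL_PLATFORM_COST = 450_000     # illustrative fixed AI platform fee

baseline = sum(v * c for _, v, _, c in intents)
post_ai = ANNUAL_PLATFORM_COST + sum(
    v * (rate * AI_COST_PER_CALL + (1 - rate) * c)
    for _, v, rate, c in intents
)
print(f"baseline ${baseline:,.0f}  post-AI ${post_ai:,.0f}  "
      f"net annual savings ${baseline - post_ai:,.0f}")
```

Because containment is modeled per intent rather than as a single blended rate, the output holds up when a reviewer asks why complex casework saves so much less than status lookups.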
The Six Categories of Metrics That Matter
1. Cost Metrics
Fully loaded cost per handled call. Cost per intent type. Per-minute interpreter spend. Annual contract value. Capital vs. operating expense breakdown. Cost per FTE-equivalent hour freed.
2. Service Quality Metrics
Service level (% answered within target). Abandonment rate. Average speed to answer. Average handle time. First-call resolution rate. Repeat-caller rate. Caller-effort score. CSAT / NPS where measured.
3. Containment and Throughput Metrics
AI containment rate by intent (the share of calls AI fully resolves without warm transfer). Warm-transfer rate. Calls per hour of AI capacity. Concurrent call capacity. Outbound dial throughput.
4. Policy Outcome Metrics
The metrics specific to the agency's mission. For Medicaid: procedural disenrollment rate, renewal completion rate, reconsideration-period reinstatement rate. For courts: FTA rate, fine collection rate. For permitting: cycle time, status-inquiry deflection. For 311: resolution time, repeat-call rate. For public health: appointment completion, outreach reach rate.
5. Equity and Access Metrics
LEP language coverage. Right-party contact rate by language. Service-level parity across language and demographic groups. Disability access (TTY/RTT/VRS) usage and resolution. Equity is no longer a soft metric; it is reportable to OMB under multiple federal directives and to state legislatures under state language access laws.
6. Compliance and Risk Metrics
ATO status and renewal date. POA&M open-finding count by severity. Audit log completeness. Vulnerability scan compliance rate. Incident response SLA performance. Privacy impact assessment currency. Bias and disparate impact testing results for the AI model.
Where the Numbers Come From: Source Systems and Reporting
- ACD and contact center platforms. Amazon Connect, Genesys Cloud, Five9, NICE CXone, Cisco Webex Contact Center, Avaya. Provide call volume, AHT, ASA, abandonment, service level.
- Workforce management. NICE WFM, Verint, Calabrio, Aspect. Provide schedule adherence, occupancy, FTE utilization.
- IVR / self-service analytics. Provide containment rates by self-service path before AI overlay.
- Quality monitoring. Calabrio QM, Verint Speech Analytics, NICE Nexidia. Provide first-call resolution proxies, sentiment, compliance scoring.
- Interpreter contract billing. LanguageLine, Voiance, CyraCom, Propio, Akorbi, GLOBO invoices and call detail records. Provide per-minute spend, language mix, average call time.
- CRM and case management. Salesforce Public Sector, ServiceNow Public Sector, Microsoft Dynamics Government, NEOGOV, Tyler Public Sector. Provide case outcomes that link calls to dispositions.
- System of record (the policy outcome layer). MMIS (Gainwell, Conduent, Optum), IES (Deloitte, Accenture, Wipro), state Medicaid platforms, court CMS (Tyler Odyssey, Journal Technologies eCourt), permit/licensing (Accela, Tyler EnerGov, Cityworks), 311 (Salesforce, ServiceNow, Cityworks PLL), utility billing (Tyler Munis, Cayenta, Oracle CC&B). Provide the policy outcome metrics.
- Survey and CSAT. Qualtrics, Medallia, SurveyMonkey, in-call surveys. Provide caller satisfaction baseline.
- FOIA and audit logs. Tamper-evident logs that satisfy state and federal records-retention obligations.
- Executive BI layer. Tableau, Power BI, Looker, agency-built dashboards. Where the agency leader actually reads the numbers.
ROI Reporting Inside the Compliance Envelope
ROI reporting on a federal or federally funded AI deployment must operate inside the compliance posture the deployment carries. The reporting design itself is an audit object.
- HIPAA and minimum-necessary. ROI dashboards must not expose PHI to viewers without authorization. Aggregate volumes and rates are fine; per-call records are not, except to authorized roles.
- MARS-E and ARS audit logs. Every metric report run is itself an event that must be logged. Privileged users querying the data warehouse leave an audit trail.
- OMB M-24-10 AI use-case inventory. Federal agencies maintain an AI use-case inventory. ROI dashboards for the deployment feed the inventory's effectiveness reporting.
- NIST AI RMF measurement function. The "MEASURE" function of NIST AI RMF asks agencies to assess AI system performance, trustworthiness, and risk on an ongoing basis. ROI reports satisfy part of this function when properly scoped.
- Equity and bias reporting. Federal directives require periodic disparate-impact assessment of AI systems used for benefit, eligibility, or service delivery. The ROI reporting framework should include disaggregated metrics by demographic and language to enable this.
- FOIA exposure. Government dashboards may be subject to FOIA. Design accordingly: aggregate metrics and methodology can be released without exposing individual case data.
- State auditor and HHS OIG reviews. State auditors increasingly review AI deployments for both efficacy and risk. The ROI documentation is the artifact reviewed.
- Section 508 compliance of the dashboard itself. The dashboard interface used by agency staff must meet accessibility standards.
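One concrete pattern behind the minimum-necessary and FOIA bullets above is aggregate-only reporting with small-cell suppression. A minimal sketch; the threshold of 11 follows the convention in CMS's cell-size suppression policy, but the right value for a given agency is a local policy decision:

```python
# Aggregate-only reporting with small-cell suppression, a common pattern for
# satisfying minimum-necessary and FOIA constraints. The threshold of 11
# follows the CMS cell-size convention; confirm the local policy before use.

SUPPRESSION_THRESHOLD = 11

def suppress_small_cells(counts_by_group):
    """Replace any group count below the threshold with None (suppressed)."""
    return {
        group: (n if n >= SUPPRESSION_THRESHOLD else None)
        for group, n in counts_by_group.items()
    }

calls_by_language = {"english": 48_210, "spanish": 12_940, "hmong": 7}
print(suppress_small_cells(calls_by_language))
```

Suppressing small cells lets the same dashboard serve both internal reporting and FOIA release without re-identifying individual callers in low-volume language groups.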
The Reference Before-After KPI Table
| Metric | Before AI | After AI (Steady State) |
|---|---|---|
| Fully loaded cost per handled call | $7-$22 | $2.50-$8 (blended human + AI) |
| AI containment rate (share of call volume) | n/a | 65-85% |
| Service level (% answered within 30s) | 50-70% | 95-99% |
| Abandonment rate | 14-32% | 3-8% |
| Average speed to answer | 90-450 seconds | Under 5 seconds |
| Average handle time (resolved calls) | 7-14 minutes | 3-6 minutes (AI) · 8-12 minutes (escalated) |
| First-call resolution rate | 45-65% | 78-92% |
| Right-party contact (outbound LEP) | 14-22% | 42-58% |
| Per-minute interpreter spend | $1.20-$3.50/min | Near-zero on AI-handled · fallback only |
| Outbound dial throughput | Caseworker-limited | 60K-250K calls/day |
| Languages with native conversational coverage | 2-4 + interpreter line | 60+ native |
| After-hours coverage | Limited | 24/7 |
| FTE-hours freed per month (mid-size agency) | baseline | 1,500-4,500 hours |
| Cash payback period | n/a | 6-14 months |
| Year-3 ROI (vs. cumulative investment) | n/a | 3-6x |
Two notes on the table. First, these are reference ranges from production deployments. The actual numbers for a specific agency will land somewhere inside these ranges depending on call complexity, integration depth, and starting baseline; vendor projections that promise the top of every range simultaneously are usually optimistic. Second, the policy outcome metrics (procedural disenrollment, FTA rate, permit cycle time, etc.) are not in this generic table because they vary by agency mission. The business case must include them, calibrated to the specific outcome the agency is accountable for.
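The FTE-hours-freed row above is a straightforward product of contained volume and baseline handle time. A back-of-envelope check with illustrative inputs:

```python
# Back-of-envelope check on the "FTE-hours freed" row above. All inputs are
# illustrative: contained monthly volume times baseline handle time.

monthly_calls = 40_000
containment = 0.70
avg_handle_minutes = 9        # inside the 7-14 minute baseline range

hours_freed = monthly_calls * containment * avg_handle_minutes / 60
print(f"{hours_freed:,.0f} FTE-equivalent hours freed per month")
```

With these assumed inputs the result lands inside the 1,500-4,500 hour range in the table; an agency should substitute its own baseline volume and handle time.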
Mapping ROI to the Government Budget Cycle
Government funding cycles do not map to commercial 12-month payback frames. The ROI presentation has to express benefits and costs in the same cycle the funding decision is made.
- Federal annual appropriation. Standard federal civilian. Funding is appropriated annually; the agency must show the FY benefit and the FY cost. Multi-year benefit projection is supplemental but not the lead.
- Federal multi-year supplemental. Some federal agencies have multi-year IT modernization funding (TMF, IT Working Capital Funds). ROI presentation can extend over the life of the supplemental.
- State biennial budget. Many states budget on a 2-year cycle (Texas, Wisconsin, Oregon, Virginia, others). The ROI math is presented in 2-year frames matching the budget request.
- State annual budget. Most states. ROI presented year by year matching legislative review.
- Capital improvement program. 3-5 year cycle for capital-funded technology. AI implementation cost can be capitalized in some jurisdictions; operating cost is annual.
- Federal grant cycle. CMS unwinding TA, HRSA grants, SAMHSA expansion grants, ACF program funding. Often 1-3 year award periods. AI investment can be scoped to the grant award period.
- MCO/contractor capitation. Where the work is delivered by an MCO or capitated contractor, AI investment may be funded under the existing capitation rate without a separate appropriation. ROI flows to the MCO P&L, with state oversight on quality measure performance.
- Cooperative purchasing vehicles. NASPO ValuePoint, Texas DIR, Sourcewell. Pre-negotiated rates compress procurement timeline; ROI realization starts faster but contract length may constrain investment scope.
Procurement Pathways That Reflect ROI Posture
- Pilot first, then scale. Many agencies procure a 6-12 month bounded pilot under existing innovation procurement authority, validate the ROI math against actual call volume, and then scope the production contract. Reduces risk, enables a defensible Phase 2 ask.
- Outcome-based pricing where allowed. Some vehicles permit performance-based or outcome-based pricing tied to specific KPIs (containment rate, abandonment reduction, procedural disenrollment reduction). Aligns vendor and agency incentives but requires careful KPI definition.
- Existing contact-center contract amendment. Where the agency has an existing contact-center BPO contract, AI scope can be added as a change order with clear before-after ROI baked into the SLA.
- Cooperative purchasing. NASPO ValuePoint, Texas DIR (BetaQuick delivers Texas DIR work through partner Compass Solutions, LLC, DIR-CPO-6057, active through October 2030), Sourcewell, OMNIA Partners, COSTARS, GSA MAS.
- 8(a) sole-source within threshold. Where an 8(a) prime can credibly deliver the scope under the $4.5M services threshold, sole-source procurement compresses the timeline by 6-12 months.
- Re-procurement of existing call-center contract. When an existing BPO contract is up for re-bid, AI-native delivery can be specified in the RFP, with the bidders required to show their ROI math against the agency's published baseline.
Frequently Asked Questions
What is the typical cost per call before and after AI voice in government?
Fully loaded cost per handled call in a government contact center, including agent labor, supervision, facilities, technology, training, and benefits, typically ranges from $7 to $22 depending on agency, locality, and call complexity. State Medicaid call centers and federal beneficiary lines tend to land at the higher end. AI voice agent cost per handled call is typically $0.40 to $2.50 depending on call duration, language complexity, and integration depth. The 70-90% per-call cost reduction is the headline number, but the more honest ROI calculation reflects that AI handles roughly 60-85% of call volume natively while humans handle the residual complex cases, which means agency total cost drops in the 35-60% range, not the headline 70-90%, once you account for the human escalation overlay.
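The blended math in that answer can be checked directly. A sketch using illustrative mid-range figures, not benchmarks:

```python
# Check the blended-savings arithmetic above. Figures are illustrative values
# drawn from inside the ranges quoted in the answer, not benchmarks.

human_cost = 10.00      # fully loaded $/call, inside the $7-$22 range
ai_cost = 2.00          # $/call on AI-handled calls, inside $0.40-$2.50
containment = 0.65      # share of volume AI resolves natively

blended = containment * ai_cost + (1 - containment) * human_cost
per_call_drop_on_ai = 1 - ai_cost / human_cost
total_cost_drop = 1 - blended / human_cost
print(f"per-call drop on AI-handled calls: {per_call_drop_on_ai:.0%}")
print(f"agency total cost drop (blended): {total_cost_drop:.0%}")
```

With these assumed inputs the per-call reduction is 80% but the blended agency-level reduction is 52%, which is exactly the gap between the headline number and the honest number described above.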
How long does it take to see ROI on a government AI voice deployment?
Cash payback on a government AI voice deployment is typically 6-14 months from go-live. The first 90 days after go-live are usually break-even or slightly negative because of integration, training, change management, and tuning. Months 4-9 see steeply positive cash flow as the AI's call resolution rate climbs from initial 35-50% to steady-state 65-85%. By month 12, most deployments are running at 3-6x annual return on the implementation investment. The key driver of payback timing is call volume - high-volume contact centers (200K+ calls per year) see payback in the 6-9 month range; lower-volume deployments (under 50K calls per year) extend to 14-18 months. Deployments that are scoped to a specific high-pain workflow (Medicaid recertification outreach, post-discharge follow-up, permit-status inquiry) tend to show ROI faster than horizontal contact-center modernization.
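The payback-timing logic described above can be sketched as a cumulative cash-flow loop with a linear containment ramp. All inputs are illustrative assumptions:

```python
# Sketch of the payback-timing logic above: monthly savings ramp as AI
# resolution climbs from launch toward steady state. All inputs, including
# the linear ramp shape, are illustrative assumptions.

def payback_month(implementation_cost, monthly_calls, human_cost, ai_cost,
                  start_containment, steady_containment, ramp_months):
    """First month after go-live where cumulative net savings turn positive."""
    cumulative = -implementation_cost
    month = 0
    while cumulative < 0:
        month += 1
        # linear ramp from launch containment to steady-state containment
        t = min(month / ramp_months, 1.0)
        containment = start_containment + t * (steady_containment - start_containment)
        cumulative += monthly_calls * containment * (human_cost - ai_cost)
        if month > 60:
            return None  # no payback inside five years
    return month

print(payback_month(900_000, monthly_calls=30_000, human_cost=10.0,
                    ai_cost=2.0, start_containment=0.40,
                    steady_containment=0.75, ramp_months=9))
```

The same function shows the volume sensitivity noted above: cutting monthly volume while holding implementation cost fixed pushes the payback month out toward the long end of the range.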
What metrics should an agency CIO put in an AI voice business case?
A defensible AI voice business case for a government CIO presentation should report nine metrics: (1) call volume by intent and language, (2) baseline cost per call (fully loaded), (3) baseline service level and abandonment rate, (4) baseline first-call resolution, (5) baseline interpreter spend (LanguageLine, Voiance, CyraCom), (6) projected AI containment rate by intent, (7) projected post-AI cost per call, (8) projected change in service level and abandonment, and (9) capacity reallocation - the FTE-equivalent hours freed per month and where they will be redeployed. Pair the operational metrics with policy outcomes the agency leader cares about: reduced procedural disenrollment for Medicaid, reduced FTA rate for courts, faster permit issuance, higher right-party contact on outbound benefit outreach. Quantitative cost analysis alone rarely wins a government funding cycle; cost analysis plus a documented improvement on a politically visible policy outcome usually does.
How do we account for FTE reallocation if the agency is not reducing headcount?
Most government AI deployments do not reduce headcount, both because of civil service protections and because the policy goal is usually to reallocate capacity to higher-value work rather than to cut. The honest ROI accounting captures FTE-equivalent hours freed and the value of the work those hours are redeployed to. For a Medicaid agency, freed caseworker hours might be redirected to complex eligibility determinations, appeals, or LTSS casework where backlogs are causing constituent harm. The reallocation value is not the FTE labor cost; it is the value of clearing backlogs that would otherwise require additional contracted staff or accept worse policy outcomes. Document the reallocation explicitly in the business case so the savings narrative does not collapse the moment a council member asks "are you cutting staff?"
How do equity metrics fit into the ROI calculation?
Equity metrics are increasingly required as a separate dimension of the ROI calculation, not as a sub-bullet under cost. Federal directives (OMB M-24-10, NIST AI RMF), state language access laws, and HHS Office for Civil Rights enforcement all expect agencies to disaggregate AI performance by demographic and language and to report disparate-impact testing results. For the business case, this means baselining service quality and outcome metrics by language, race/ethnicity where the data permits, and geography; projecting how the AI deployment changes those disaggregated metrics; and committing to ongoing monitoring of equity gaps. A deployment that improves average performance but widens the equity gap is not a successful deployment under current federal guidance, and a business case that does not address this risk is incomplete.
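A minimal sketch of the disaggregated monitoring described above: compare each group's metric to a reference group and flag gaps beyond a tolerance. Group names, metric values, and the 5-point tolerance are illustrative assumptions:

```python
# Disaggregated parity check for the equity reporting described above.
# Metric values and the 5-point tolerance are illustrative assumptions.

def parity_gaps(metric_by_group, reference_group, tolerance_pts):
    """Flag groups whose metric trails the reference by more than the tolerance."""
    ref = metric_by_group[reference_group]
    return {
        group: round(ref - value, 1)
        for group, value in metric_by_group.items()
        if group != reference_group and ref - value > tolerance_pts
    }

# illustrative first-call-resolution rates (%) by language
fcr_by_language = {"english": 86.0, "spanish": 84.5, "vietnamese": 78.0, "somali": 71.5}
print(parity_gaps(fcr_by_language, reference_group="english", tolerance_pts=5.0))
```

Running this check on every release, not just at baseline, is what turns the equity commitment in the business case into the ongoing monitoring current federal guidance expects.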
Ready to Build Your Agency's AI Voice Business Case?
BetaQuick partners with city, state, and federal agency leaders to baseline existing contact-center operations, project realistic AI containment by intent, and build the multi-year ROI math that wins the next funding cycle. SAM.gov active.