Why most AI projects stall at ATO.
You can build a working AI system in eight weeks. You can spend the next eight months trying to get it authorized for federal production. That gap is where most agency AI projects die. Not because the AI does not work. Because the team that built it did not understand what the authorizing official needs to see, did not produce the documentation in a form the AI review office recognizes, and did not bake the controls into the architecture from the first commit.
The work of authorization is not a checkbox added at the end. It is a parallel track that runs alongside the build. The architecture decisions made in week one (where the model runs, what data crosses what boundary, what gets logged, what gets evaluated, who reviews the outputs) are the same decisions that determine whether the system clears review in three months or twelve. Teams that retrofit governance after a build always pay more than teams that wire it in from the start.
This guide is the playbook for what an authorization-ready AI build looks like. The frameworks (NIST AI RMF, OMB M-24-10, FedRAMP), the architecture patterns, the deliverables, the timeline, and what we ship in a discovery sprint to settle these questions before the production build commits.
Two things are out of scope. First, this is a federal-focused guide. State and local agencies use related but distinct frameworks (StateRAMP, state-specific AI policies). The patterns transfer; the specific instruments do not. Second, this is about authorizing AI systems that the agency operates. AI products sold to agencies under FedRAMP marketplace listings follow a different path covered in the commercial section.
The federal AI governance stack.
Federal AI governance is a stack, not a single framework. Each layer sits on top of the next and the bottom layers are non-negotiable. The packages your AI review office actually reads tie all of them together.
The foundation. Every federal IT system needs a FISMA-grade authorization. AI workloads on cloud need a FedRAMP-authorized boundary. Azure Government, AWS GovCloud, and Google Cloud Assured Workloads are the three usual environments. Inside the boundary, the system inherits the cloud provider's authorized control set; outside the boundary, you are doing your own ATO from scratch.
The control catalog every federal system implements. Low / Moderate / High baselines determine which subset applies. AI systems handling CUI or PII almost always land at Moderate; systems touching law enforcement, health, or national security data run High. The System Security Plan documents which controls are implemented, by whom, and how they are verified.
NIST AI RMF defines four functions (Govern, Map, Measure, Manage) and the practices under each. The AI 600-1 GenAI profile adds generative-AI-specific concerns (hallucination, prompt injection, harmful outputs, training data provenance). The mapping artifact ties each NIST AI RMF practice to a specific implementation in your build with evidence (logs, eval results, documentation). This is the artifact your agency reviewer reads first.
The binding policy memo that defines the federal AI risk management requirements. Identifies rights-impacting and safety-impacting use cases and requires additional governance: an impact assessment, public AI inventory entry, ongoing performance monitoring, redress mechanisms for affected individuals. Almost every AI system that touches benefits decisions, public-facing services, or sensitive determinations qualifies. Plan accordingly from day one.
Executive Order 14110 requires agency AI use case inventories, made public. Each agency layers its own AI governance policy on top (DHS, DoD, HHS, VA, GSA all have versions). The agency-specific layer is where most surprise requirements appear; the AI review office will tell you what theirs are early in the conversation if you ask.
One adjacent standard is worth knowing about: ISO 42001, the international AI management system standard. Some agencies and contractors are starting to map their AI governance programs against it as a credibility signal alongside the federal stack. Not required for ATO today; useful for commercial credibility and for organizations doing business across borders.
The 5-stage ATO process for AI.
Federal AI authorization runs as a five-stage process. The timeline ranges from 4 months (low-sensitivity AI on a mature FedRAMP boundary with a well-staffed AI review office) to 12+ months (high-sensitivity AI requiring custom controls and a new agency policy). Plan around the realistic end of that range.
Determine the impact level (Low / Moderate / High under FIPS 199), the AI use case category (general AI vs rights-impacting vs safety-impacting under M-24-10), and the applicable agency-specific policies. This stage produces the scoping document that drives everything downstream. Getting categorization wrong is the most expensive mistake in the process because you discover it during review and have to redo controls work.
Pick the NIST 800-53 baseline (Low / Moderate / High), select the NIST AI RMF practices applicable to your use case, implement the agency-specific additions. Document every control: who implemented it, how it is verified, where the evidence lives. This is the System Security Plan and the AI RMF mapping document combined.
Independent assessment of every implemented control. For traditional FISMA this is the 3PAO or internal assessor. For the AI-specific layer it is the agency AI review office running through the eval pack and the impact assessment. Findings get logged, remediations get tracked, the security assessment report and the AI impact assessment get produced.
The Authorizing Official reviews the package and grants (or denies) the Authority to Operate. For AI systems the AO often consults the AI review office and the agency's privacy office before signing. The output is an ATO letter with conditions: monitoring frequency, reauthorization timeline, plan-of-action-and-milestones for any outstanding items.
Continuous monitoring of the technical controls (FedRAMP ConMon) plus AI-specific ongoing measurement (accuracy drift, fairness metrics over time, incident reporting, model version changes). Material drift or a model change triggers a reauthorization conversation. Continuous monitoring is the line item teams forget to staff for and pay for later.
Two parallel tracks run alongside the main process. Privacy compliance (Privacy Impact Assessment, System of Records Notice if applicable, Computer Matching Agreement for data sharing) runs through the agency privacy office. Public inventory and disclosure (M-24-10 inventory entry, public-facing redress mechanism if the use case is rights-impacting) runs through the agency communications and policy offices. Plan timelines for these in week one or they become the critical path in week thirty.
The eval pack.
The eval pack is the single most-read artifact in the authorization package. It is the bundled evidence that the AI system meets its accuracy, safety, fairness, and security commitments. A good eval pack is concise (under 40 pages of executive-readable content, with technical appendices) and gives the AI review office everything they need to ask sharp questions and grant approval.
The contents that hold up across agencies:
- Baseline accuracy on a labeled hold-out set, broken down by data segment. Top-1 accuracy is the headline; per-segment accuracy reveals fairness issues.
- Bias and fairness tests across protected classes. Demographic parity, equal opportunity, or calibration metrics depending on the use case. Run on representative test data, not synthetic.
- Robustness tests against adversarial inputs. Typo robustness, paraphrase robustness, common edge cases that production traffic will produce.
- Red-team findings and remediation. A documented adversarial exercise covering jailbreaks, prompt injection, training data leakage, harmful output generation, social engineering. Findings without remediation are worse than no red team; document both.
- Prompt injection test results for LLM-based systems. Test the system with malicious user inputs that try to override instructions, leak system prompts, or escalate privileges.
- Hallucination rate measurements for generative use cases. Specific to the question types your system handles, with a defined hallucination definition.
- Drift detection thresholds. What level of accuracy degradation triggers an alert. Who gets the alert. What the remediation process is.
- Monitoring runbook. The metrics that get watched in production, who watches them, on what cadence, what triggers reauthorization conversations.
- Model card for each model version, including training data summary (where lawful to share), intended use, known limitations, evaluation results.
- Datasheet for the training and evaluation datasets, including provenance, licensing, sensitive content, and refresh cadence.
The pack lives in the agency's GRC system (eMASS for DoD and many civilian agencies, Xacta for some, ServiceNow GRC for newer deployments). The format matters less than the content; what kills approvals is missing evidence on a control that the reviewer expected to see covered.
FedRAMP-aligned model options.
The model choice determines the boundary, the latency budget, and the contractual posture. Three categories of model deployment cover almost every federal AI workload today.
Managed FedRAMP-authorized models
Azure OpenAI Service in Azure Government carries FedRAMP High authorization and offers GPT-4 class and o-series models inside the boundary. Standard for civilian agencies running on Azure. AWS Bedrock in GovCloud carries FedRAMP High and offers Claude, Llama 3, and Titan models. Standard for agencies on AWS GovCloud. Google Vertex AI on Assured Workloads supports FedRAMP boundary deployment for agencies on GCP. Pick the one your existing FedRAMP boundary uses; do not stand up a second cloud just for the model.
Self-hosted open-weight models
Llama 3, Qwen, Mistral, and other open-weight models can be self-hosted inside a FedRAMP-authorized boundary on GovCloud, Azure Government, or an on-premise environment with an existing ATO. The model inherits the boundary's authorization. Latency and cost favor self-hosting at scale; the operational burden (model serving infrastructure, MLOps pipeline, capacity management) is real and worth budgeting for. Production agencies use this for sensitive workloads where contractual no-training terms from a managed provider are not sufficient.
Specialized FedRAMP-authorized AI services
FedRAMP-authorized AI services for specific tasks: AWS Comprehend (NLP, PII redaction), Azure AI Document Intelligence (forms/document processing), AWS Transcribe (speech-to-text), Azure Speech Services (speech-to-text with PHI redaction). These are typically lower-risk to deploy because the use case is narrow and the FedRAMP authorization already covers the service. Used as supporting components alongside the primary LLM in many builds.
The contractual posture matters as much as the technical fit. No-training terms (the model provider does not train on your data) must be contractually documented, not just verbally asserted. For Azure OpenAI Service the terms are in the contract by default; for AWS Bedrock the no-training defaults are in the service terms but worth verifying; for any third-party API the no-training clause is the first thing the agency contracting office will look for.
Ongoing monitoring and reauthorization.
ATO is not granted forever. Federal authorizations expire (typically 3 years) and require reauthorization. AI-specific changes can trigger reauthorization sooner: a new model version, a meaningful change in the eval pack, an incident in production, a change in the underlying FedRAMP authorization.
The continuous monitoring program for an AI system has three concentric loops.
- Technical security monitoring (FedRAMP ConMon). Standard FISMA continuous monitoring: vulnerability scans, control reviews, incident reporting. Cadence is monthly for moderate, weekly for high. The cloud provider handles the boundary; you handle anything you built on top.
- AI performance monitoring. Accuracy drift against the labeled hold-out set, run weekly. Fairness metrics, run monthly. Latency, throughput, error rates, run continuously. Thresholds defined in the eval pack. Alerts route to the operations team and the AI review office.
- Incident reporting. When the AI produces an output that causes harm (incorrect benefit decision, biased outcome, sensitive disclosure), the incident is logged, reported to the agency, investigated, and remediated. Pattern of incidents triggers reauthorization conversation.
Model version changes are the most common reauthorization trigger. A new model version (GPT-4o → GPT-4.5, Claude 3.5 → Claude 4) shifts the eval pack baseline and requires a fresh evaluation run. Agencies vary widely on how they handle this: some require a re-test and a memo; some require a full reassessment; some treat it like any other patch. Confirm the agency's posture before the build commits.
Open-source model updates for self-hosted deployments behave similarly. Pinning a model version in production and choosing when to update is the right pattern; auto-updating to latest is a fast way to break your ATO.
When this is the right capability.
Governance-as-capability pays off when:
- You are building or acquiring a federal AI system. Production deployment to a federal network, processing government data, producing decisions that the agency owns. The ATO is unavoidable.
- You are a federal prime or sub. Your customer is the agency. Your AI capability has to clear their authorization. The vehicle is your delivery contract.
- You are a commercial vendor selling AI to federal customers. FedRAMP marketplace listing (Moderate or High) is the entry ticket; agency-specific authorizations layer on top.
- You are a healthcare or financial-services organization treating AI as a regulated capability. Different frameworks (HIPAA, SR 11-7, SOC 2, ISO 42001) but the same shape of governance work: documented architecture, eval pack, ongoing monitoring, incident reporting.
- You are bidding on AI work and need credible governance language in the proposal. Even if no ATO is involved, the procurement team will score governance posture.
When it is not the right answer.
If any of the following describe your situation, the full federal governance package is overkill or misdirected.
- Research and experimentation environment. AI prototyping in a sandbox, with synthetic or de-identified data, no production deployment. A lightweight governance posture (model card, internal eval, clear sandbox boundary) is enough.
- Pure commercial deployment with no federal customer. SOC 2, ISO 42001, sector-specific compliance (HIPAA, PCI, SR 11-7) are the right frameworks. NIST AI RMF is voluntary and useful; OMB M-24-10 does not apply.
- The agency does not have an AI review process yet. Some smaller agencies are still building theirs. Working without that gate is faster in the short term, but the policy is coming and retrofitting is expensive. We recommend building to the framework even if the gate is informal.
- Cost-prohibitive at scale. If the AI use case is small and contained, the full ATO package may cost more than the system saves. The honest framing is: pick a simpler authorization path (assessment-only, type authorization inheritance) and document the limited scope.
Saying no early is cheaper than discovering it during review. The discovery sprint exists partly to scope governance correctly.
Buyer's checklist.
If you are evaluating a vendor's AI governance posture for a federal engagement, the questions below separate production-ready answers from "we know the acronyms." Use the list in a vendor conversation; the ones who cannot answer ten of twelve crisply are not ready for federal work.
- Show me the NIST AI RMF mapping for a comparable production system. Practice-by-practice with evidence.
- What FedRAMP boundary does the AI run inside? Cloud, region, authorization level, current authorization date.
- What model do you run, with what no-training terms? Specific model name, provider, contractual evidence.
- Show me the eval pack from a prior deployment (redacted as needed). Accuracy, bias, robustness, red-team, prompt injection.
- What is your impact assessment process under OMB M-24-10? Who completes it, how long does it take, what is the deliverable.
- How do you handle model version changes? Re-test, re-authorize, document in plan of action.
- What is your incident response process for AI-specific incidents (bias outputs, hallucinations causing harm, leaked data)?
- How is PII / PHI / CUI redaction handled before the model sees inputs?
- What does your continuous monitoring look like for accuracy drift and fairness drift?
- How do you participate in agency AI review? Have you been through it before? Which agencies, what timeline.
- What artifacts transfer to our team if we take the build in-house? Eval suite, monitoring code, documentation.
- What is the total cost of governance over the 3-year authorization cycle? Build, authorization, ongoing monitoring.
Crisp answers to ten of twelve with named tech, named agencies, and production data means the vendor has done this before. Fewer than seven means the vendor is selling federal-sounding language without the underlying work. We go through the same twelve in a discovery sprint.
What's in the discovery sprint.
The governance discovery sprint runs 3 to 4 weeks and exists to settle the authorization questions before the production build commits.
What we do during the sprint.
- Sit with your AI review office and the program team. Confirm scoping: impact level, M-24-10 categorization, applicable agency-specific policies.
- Audit the planned build against the NIST AI RMF practices. Identify which are already covered, which need work, which are missed.
- Confirm FedRAMP boundary choice. Validate the model and provider against the boundary's authorization.
- Draft the System Security Plan outline, the AI RMF mapping skeleton, and the impact assessment outline.
- Run a pilot eval pack on representative data. Bias tests, robustness tests, a small-scale red team. Identify gaps that block authorization.
- Walk the package with the agency AI review office for early feedback before the full assessment.
- Produce the fixed plan and fixed price for the production governance work alongside the build.
What you walk away with.
- Scoping document confirming impact level and applicable frameworks
- NIST AI RMF gap analysis with prioritized remediation list
- FedRAMP boundary architecture diagram with model placement
- System Security Plan outline ready for full implementation
- Pilot eval pack with initial bias, robustness, and red-team findings
- Impact assessment outline ready for the privacy office
- Documented early feedback from the agency AI review office
- Fixed plan and fixed price for the production governance package
If we are the right partner and the math works, you greenlight the build. If we are not, or if the math does not work, you keep the artifacts. The scoping, the gap analysis, the eval pack, the SSP outline. You can hand it to another vendor or use it to inform your own work. We have not earned the next engagement and we do not pretend we have.
How it lands per audience.
The frameworks above are universal. The shape of the engagement is not. Three audience-specific deep dives are in progress. Below is the one-line version of how this capability lands per audience.
Set up the agency's own AI governance program. Policy alignment with M-24-10, AI review board operating model, eval pack templates, agency-wide inventory and monitoring. Often paired with the first production AI authorization as the proving exercise.
Deep dive coming. Book a Call to discuss now.
Governance work bundled with the AI capability we sub into your delivery contract. NIST AI RMF mapping, eval pack, agency review participation. Your prime gets one delivery team. The agency sees one governance package.
Deep dive coming. Book a Call to discuss now.
FedRAMP marketplace listing strategy, agency-specific authorization layer, governance documentation that agency procurement teams can consume. The package that converts FedRAMP-authorized into agency-deployed.
Deep dive coming. Book a Call to discuss now.
Frequently asked.
An AI Authority to Operate is the formal authorization from an agency's Authorizing Official that the AI system can run in a production environment. The process layers a standard NIST 800-53 control set on top of FedRAMP boundary controls, then adds AI-specific overlays from NIST AI RMF (with the AI 600-1 generative AI profile where applicable), OMB M-24-10 risk-management requirements, and any agency-specific AI governance policies. Deliverables are a System Security Plan, a NIST AI RMF mapping, an eval pack with red-team results, an AI impact assessment, and ongoing monitoring artifacts. Timeline runs 4 to 12 months depending on data sensitivity and the agency's AI review pipeline maturity.
Systems that produce or process government data on government networks need an authorization to operate. The form varies. A research prototype on a sandbox does not need a full ATO. A production AI system that affects benefits decisions, public-facing services, or sensitive operations always does. OMB M-24-10 expanded the scope to include rights-impacting and safety-impacting AI use cases, which now carry additional governance burden including impact assessments, public inventory entries, and bias testing.
Azure OpenAI Service in Azure Government carries FedRAMP High and offers GPT-4 class models. AWS Bedrock in GovCloud carries FedRAMP High and offers Claude, Llama, and Titan model families. Google Vertex AI on Assured Workloads supports FedRAMP boundary deployment. For open-weight models you can self-host inside an authorized boundary on AWS GovCloud or Azure Government and inherit the boundary's authorization. Choice is a function of workload sensitivity, the agency's preferred cloud, and the latency budget.
NIST AI RMF is the voluntary framework that defines four functions (Govern, Map, Measure, Manage) and the practices under each. OMB M-24-10 is the binding policy memorandum that requires federal agencies to apply AI risk management to rights-impacting and safety-impacting AI. M-24-10 cites NIST AI RMF as the recommended framework, so a credible governance package maps every M-24-10 requirement to a specific NIST AI RMF practice and produces the evidence that the practice was implemented.
The eval pack is the bundled evidence that the AI system meets its accuracy, safety, fairness, and security commitments. It contains baseline accuracy benchmarks, bias and fairness tests across protected classes, robustness tests against adversarial inputs, red-team findings and remediation, prompt-injection test results, hallucination rate measurements, drift detection thresholds, and a monitoring runbook. The agency's AI review office reads it first; the Authorizing Official reads the summary. A good eval pack is under 40 pages of executive-readable content with technical appendices.
Low-sensitivity AI on a mature FedRAMP boundary with a well-staffed AI review office: 4 to 6 months from kickoff to authorization. Moderate-sensitivity AI requiring impact assessment and agency-specific review: 6 to 9 months. High-sensitivity AI requiring custom controls or new agency policy: 9 to 18 months. Plan around the realistic end of the range and start the conversation with the AI review office on day one to compress the timeline.
Yes. The eval suite, the SSP, the AI RMF mapping, the monitoring dashboards, the templates for future systems all transfer to your team. We document the patterns as we go and run a handoff so your team can apply them to the next AI capability without us in the loop. Most agencies use the first authorization as the template for everything that follows.
Twenty minutes. Bring the use case, the data sensitivity, the agency, and the timeline. We will tell you whether a discovery sprint is the right next step.