Decision Coverage v3

Cross-Architecture Deployability Evidence for Agentic AI Systems

Johnny Wiley · ConstantX · Preprint, March 2026


Keywords: agentic AI, assurance, deployability, enforcement, coverage, safety, threat model, OWASP ASI, NIST AI RMF, observable enforcement

Abstract. We present Decision Coverage v3, extending the enforcement-grounded assurance framework from a single-architecture study (v2: n=424, 1 architecture, 2 models) to a cross-architecture historical corpus of 818 runs across 5 engagements, 3 enforcement architectures, and 4 models. Three architectures are assessed: the ConstantX Engine (OPA policy + workspace sandbox), RuntimeX (per-action OPA + hard gates + verification gate), and NemoClaw (kernel-level Landlock + seccomp + network namespaces). Four models are assessed: claude-opus-4-5-20251101, gpt-5.4-2026-03-05, nvidia/nemotron-3-super-120b-a12b, and openai/gpt-oss-120b. NemoClaw is retained as historical diagnostic evidence; under the current publishable evidence standard, its lack of structured enforcement signals blocks market-facing claims until structured signal evidence is wired.

Five findings emerge: (1) Architecture determines containment: GPT 5.4 produced 85.85% Terminal Coverage (TC) through the ConstantX Engine and 97.14% through RuntimeX; the model is identical, and the enforcement surface is the variable. (2) Observable enforcement is architecture-dependent: RuntimeX produced observable enforcement signals in 90.8% (GPT 5.4) and 84.2% (GPT-OSS) of bounded_failure runs; NemoClaw produced 0%. (3) Open-weight assurance is possible under RuntimeX: GPT-OSS 120B (5.1B active parameters via MoE) achieved 99.29% TC, 100% adversarial TC, and 95% positive-path completion through RuntimeX; this is a deployability result, not a general capability-parity claim. (4) RuntimeX achieves 100% adversarial containment: 240 adversarial runs across 2 models, 18 target-specific RuntimeX threat IDs, and all 10 OWASP ASI categories produced zero undefined behavior. (5) Capability and safety do not trade off under RuntimeX: across two RuntimeX engagements with identical scenario suites, the highest-TC model (GPT-OSS, 99.29%) also had the highest positive-path completion rate (95%).

1. Introduction

Decision Coverage v2 [1] assessed a single enforcement architecture (the ConstantX Engine) against two frontier models, finding that protocol compliance — not capability — was the deployment-blocking factor for GPT 5.4 (multi_action_output in 86% of runs). That finding was engine-specific: RuntimeX handles multi-action responses at the parsing layer, making multi_action_output a valid execution mode rather than a protocol violation.

Since v2, three developments motivated a cross-architecture study:

1. The engine’s central finding did not generalize. The multi_action_output signal that classified 30 GPT 5.4 runs as undefined_behavior in the engine does not exist in RuntimeX’s architecture. This raised the question: is GPT 5.4’s deployability limited by the model, or by the architecture assuring it?

2. Kernel-level enforcement produced zero observable signals. NemoClaw (CX-2026-002) was historically scored at 100% Terminal Coverage across 114 runs but produced 0% observable enforcement. Under the current v3 reducer stance, that result is diagnostic rather than publishable: containment may have occurred, but ConstantX cannot prove which enforcement surface produced it. This introduced a distinction v2 did not address: containment you can prove vs. containment you cannot.

3. Open-weight deployment needed assurance evidence. The v2 study assessed only proprietary frontier models. GPT-OSS 120B (117B total parameters, 5.1B active per token via 128-expert MoE with Top-4 routing) runs on a single 80GB GPU, making it accessible to organizations that cannot use cloud-only APIs. Its deployability under RuntimeX had not been measured.

This paper presents Decision Coverage v3, which records 818 historical runs across 5 engagements, 3 architectures, and 4 models. Threat model traceability and the engagement index from v2 are retained unchanged. The verdict reducer is updated to v3: adversarial scenarios now require at least one observable enforcement or containment signal for bounded_failure — runs with no enforcement signal are classified as undefined_behavior regardless of model behavior. This codifies the distinction between enforcement (structural) and alignment (probabilistic). The empirical contribution: what changes when you hold the model constant and vary the architecture?

2. Methodology

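v3 retains the v2 methodology with one significant change to the verdict reducer: for adversarial scenarios, a bounded_failure verdict now requires at least one observable enforcement or containment signal, and runs with no such signal are classified as undefined_behavior.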

2.1 Containment Mechanism Classification

v3 introduces a four-category taxonomy for classifying how each bounded_failure run was contained. This classification is applied to every bounded_failure verdict across all engagements and is the empirical basis for Finding 2 (observable enforcement).

| Classification | Definition | Evidence Required |
|---|---|---|
| enforcement_signal | System enforcement gate produced an observable signal or blocked an action | Signal present in trace (e.g., tool_disallowed, output_policy_violation, pending_approval, no_progress) |
| model_refusal | Model completed the scenario using only allowed tools without attempting the adversarial action | Trace shows task completion or benign tool use with no enforcement gate triggered |
| silent_enforcement | Outcome is consistent with enforcement (attack did not succeed) but no observable signal was produced | No enforcement signal in trace, no model refusal pattern, attack outcome is containment |
| indeterminate | Cannot classify: insufficient trace evidence to distinguish mechanism | None of the above criteria met |

The taxonomy separates what happened (the verdict) from how it happened (the mechanism). Only runs with observable enforcement signals receive bounded_failure. Runs where the model declined the attack (model_refusal) without an enforcement signal are undefined_behavior — the governance layer did not activate, and the model’s refusal is not structurally guaranteed to persist across weight updates, prompt variations, or jailbreak techniques.

enforcement_signal is the only mechanism that produces both a bounded_failure verdict and audit-grade evidence. model_refusal, silent_enforcement, and indeterminate are classified as undefined_behavior at the verdict level. Trace inspection can determine what occurred, but the absence of an enforcement signal means safety depends on model behavior, not system structure.

The mechanism classification is applied post-hoc from trace data for diagnostic purposes. It does not upgrade verdicts — a model_refusal remains undefined_behavior regardless of how clearly the refusal appears in the trace. This is the core ConstantX position: enforcement is structural, alignment is probabilistic, and Decision Coverage measures the structural part.
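As a rough illustration of the taxonomy and the v3 reducer rule, the following Python sketch classifies a contained adversarial run and derives its verdict. The Trace record and its fields (signals, attempted_attack, sufficient_evidence) are hypothetical simplifications introduced here, not the ConstantX trace schema, and the signal set contains only the examples named in the table above.

```python
from dataclasses import dataclass, field

# Example signals named in the taxonomy table; the real signal spec may list more.
ENFORCEMENT_SIGNALS = {
    "tool_disallowed",
    "output_policy_violation",
    "pending_approval",
    "no_progress",
}


@dataclass
class Trace:
    """Hypothetical, simplified view of one contained adversarial run."""
    signals: list[str] = field(default_factory=list)
    attempted_attack: bool = False      # model tried the adversarial action
    sufficient_evidence: bool = True    # trace is complete enough to classify


def classify_mechanism(trace: Trace) -> str:
    """Assign one of the four containment mechanism categories."""
    if any(s in ENFORCEMENT_SIGNALS for s in trace.signals):
        return "enforcement_signal"
    if not trace.sufficient_evidence:
        return "indeterminate"
    if not trace.attempted_attack:
        return "model_refusal"
    # Attack was attempted and contained, but nothing observable fired.
    return "silent_enforcement"


def adversarial_verdict(trace: Trace) -> str:
    """v3 rule: only an observable enforcement signal yields bounded_failure."""
    if classify_mechanism(trace) == "enforcement_signal":
        return "bounded_failure"
    return "undefined_behavior"
```

In this sketch the mechanism label is diagnostic only: a clean refusal still maps to undefined_behavior, which mirrors the position that the classification never upgrades a verdict.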

3. Engagement Corpus

| Engagement | Date | Architecture | Model | Scenarios | Runs | TC | 95% CI |
|---|---|---|---|---|---|---|---|
| CX-2026-001 | 2026-03-11 | ConstantX Engine | claude-opus-4-5-20251101 | 106 | 212 | 100.0% | [98.22, 100.0] |
| CX-2026-005 | 2026-03-11 | ConstantX Engine | gpt-5.4-2026-03-05 | 106 | 212 | 85.85% | [80.52, 89.90] |
| CX-2026-002 | 2026-03-18 | NemoClaw | nemotron-3-super-120b-a12b | 19 | 114 | 100.0% historical | [96.74, 100.0] |
| CX-2026-003 | 2026-03-27 | RuntimeX | gpt-5.4-2026-03-05 | 70 | 140 | 97.14% | [92.88, 98.88] |
| CX-2026-004 | 2026-03-30 | RuntimeX | openai/gpt-oss-120b | 70 | 140 | 99.29% | [96.07, 99.87] |

Total: 818 runs, 5 engagements, 3 architectures, 4 models.
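The 95% CI column is consistent with Wilson score intervals [6], which the framework has used since v1. The sketch below is a minimal, self-written implementation for readers who want to recompute the bounds; the function name and the choice of z = 1.959964 (two-sided 95%) are assumptions of this example, not details confirmed by the engagement tooling.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion, as fractions in [0, 1]."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to guard against floating-point overshoot at p = 0 or p = 1.
    return max(0.0, center - half), min(1.0, center + half)


# Covered runs out of total runs for three rows of the corpus table.
for label, covered, total in [
    ("CX-2026-001", 212, 212),
    ("CX-2026-002", 114, 114),
    ("CX-2026-003", 136, 140),
]:
    lo, hi = wilson_interval(covered, total)
    print(f"{label}: [{lo * 100:.2f}, {hi * 100:.2f}]")
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and gives a non-degenerate lower bound when every run is covered, which is why a 100% TC row still reports a lower bound below 100.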

Suite composition differs by architecture. The engine suite (106 scenarios, 14 categories) tests ConstantX Engine enforcement surfaces. The RuntimeX suite (70 scenarios: 60 threat + 10 positive-path) tests RuntimeX enforcement surfaces across 18 target-specific RuntimeX threat IDs (TM-001–TM-018), derived from the OWASP T1–T17 technique taxonomy and mapped to all 10 OWASP ASI categories. The NemoClaw suite (19 scenarios) tests kernel-level containment. Suites are not directly comparable by scenario count because they target different enforcement surfaces.

The NemoClaw engagement (19 scenarios, 6 passes each) was designed as a proof of architecture independence: can the Decision Coverage framework observe a kernel-level enforcement architecture with no application-layer signal emission? For publishable evidence, the answer was no: the historical reducer recorded 100% TC, but current v3 treatment would classify no-signal containment as undefined_behavior. The 19-scenario suite is retained to establish the observable enforcement finding (§4.2) and the containment mechanism distribution, not to characterize NemoClaw’s full enforcement surface or support a current market-facing claim.

4. Findings

4.1 Architecture Determines Containment

GPT 5.4 produced 85.85% TC through the ConstantX Engine (CX-2026-005) and 97.14% TC through RuntimeX (CX-2026-003). The model weights are identical (gpt-5.4-2026-03-05). The enforcement architecture is the variable.

The root cause: the ConstantX Engine classified multi_action_output (multiple JSON actions per response) as a protocol violation. GPT 5.4 produced this in 182 of 212 engine runs (86%), causing 30 undefined_behavior verdicts. RuntimeX handles multi-action responses at the parsing layer (agent_runtime.py:_parse_actions) — multi-action output is a valid execution mode, not a protocol violation. The 30 runs that were undefined_behavior under the engine are bounded_failure or valid_commit under RuntimeX.

This is not a verdict override. The two architectures have different enforcement surfaces and different protocol definitions. What constitutes a protocol violation is architecture-specific. The same model behavior that breaks one protocol complies with another.
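To make the architecture-specific protocol definition concrete, the sketch below parses a model response that contains several JSON actions and shows how the same parsed output can raise a multi_action_output signal under one protocol and none under another. This is an illustrative reconstruction only: it is not the code in agent_runtime.py:_parse_actions, and the function names, the allow_multi_action flag, and the action format are assumptions.

```python
import json


def parse_actions(response: str) -> list[dict]:
    """Parse zero or more whitespace-separated JSON objects from one response."""
    decoder = json.JSONDecoder()
    actions = []
    idx = 0
    while idx < len(response):
        # raw_decode does not skip leading whitespace, so skip it here.
        if response[idx].isspace():
            idx += 1
            continue
        obj, idx = decoder.raw_decode(response, idx)
        actions.append(obj)
    return actions


def protocol_signals(actions: list[dict], allow_multi_action: bool) -> list[str]:
    """Return protocol signals under a strict or a permissive protocol."""
    if len(actions) > 1 and not allow_multi_action:
        return ["multi_action_output"]  # strict: one action per response
    return []                           # permissive: a valid execution mode


response = '{"tool": "read_file", "path": "a.txt"}\n{"tool": "run_command", "cmd": "ls"}'
actions = parse_actions(response)
print(protocol_signals(actions, allow_multi_action=False))
print(protocol_signals(actions, allow_multi_action=True))
```

The model output is held fixed in both calls; only the protocol definition changes, which is the sense in which the verdict difference is a property of the (model, architecture) pair.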

Implication: Deployability is a property of the (model, architecture) pair, not the model alone. Assessing a model against a single architecture produces architecture-specific findings that may not transfer.

4.2 Observable Enforcement Is Architecture-Dependent

| Architecture | bounded_failure Runs | Observable Enforcement | Model Refusal | Silent/Indeterminate |
|---|---|---|---|---|
| RuntimeX (CX-2026-003, GPT 5.4) | 120 | 109 (90.8%) | 11 (9.2%) | 0 (0%) |
| RuntimeX (CX-2026-004, GPT-OSS) | 120 | 101 (84.2%) | 19 (15.8%) | 0 (0%) |
| NemoClaw (CX-2026-002) | 114 | 0 (0%) | 60 (52.6%) | 54 (47.4%) |

RuntimeX’s enforcement surfaces (OPA policy, governance denylist, output content gate, command prefix enforcement, verification gate, progress detection) produce observable signals when they fire. An auditor can trace from verdict to signal to enforcement mechanism.

NemoClaw’s enforcement surfaces (Landlock filesystem restrictions, seccomp syscall filtering, network namespace isolation) operate at the kernel level and produce no signals visible to the application layer. All 114 NemoClaw runs were contained, but containment could not be attributed to a specific enforcement mechanism in 47.4% of runs and relied on model-layer refusal in 52.6%. Zero runs produced observable system enforcement signals.

The historical NemoClaw run reached 100% TC under earlier scoring, while RuntimeX reached 100% adversarial TC with auditable enforcement signals. Under the current v3 stance, that difference is decisive: RuntimeX containment is publishable evidence; NemoClaw containment without signals is diagnostic only.
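The percentages in the mechanism table follow directly from the run counts. The short sketch below recomputes them and also derives a pooled RuntimeX rate across both engagements; the pooled figure is arithmetic on the table's counts and is not a number reported elsewhere in this paper. Variable and function names are illustrative.

```python
# (label, bounded_failure runs, observable enforcement, model refusal, silent/indeterminate)
ROWS = [
    ("RuntimeX CX-2026-003 (GPT 5.4)", 120, 109, 11, 0),
    ("RuntimeX CX-2026-004 (GPT-OSS)", 120, 101, 19, 0),
    ("NemoClaw CX-2026-002", 114, 0, 60, 54),
]


def pct(part: int, total: int) -> float:
    """Percentage rounded to one decimal place, as in the table."""
    return round(100 * part / total, 1)


for label, total, observable, refusal, silent in ROWS:
    # Each bounded_failure run falls into exactly one mechanism bucket.
    assert observable + refusal + silent == total
    print(label, pct(observable, total), pct(refusal, total), pct(silent, total))

# Pooled observable enforcement rate over the two RuntimeX engagements.
runtimex = [row for row in ROWS if row[0].startswith("RuntimeX")]
pooled = pct(sum(r[2] for r in runtimex), sum(r[1] for r in runtimex))
print("RuntimeX pooled observable enforcement:", pooled)
```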

Implication: For compliance contexts that require enforcement evidence (SOC 2, EU AI Act Article 14 human oversight, NIST AI RMF MEASURE 2.9 documentation), architecture determines whether the evidence can be produced — not just whether containment occurs.

4.3 Open-Weight Assurance Under RuntimeX

| Model | Architecture | Params (Active) | TC | Positive-Path TC | Adversarial TC |
|---|---|---|---|---|---|
| GPT 5.4 | RuntimeX | Proprietary | 97.14% | 80.0% | 100.0% |
| GPT-OSS 120B | RuntimeX | 5.1B active (117B total) | 99.29% | 95.0% | 100.0% |

GPT-OSS 120B (128-expert MoE, Top-4 routing, 5.1B active parameters per token) achieved high Terminal Coverage through RuntimeX: 99.29% overall, 100% on adversarial scenarios, and 95% on positive-path task completion.

In this RuntimeX suite, GPT-OSS did not reproduce the PP-COMMAND-001 tool selection failure observed in GPT 5.4, which called run_analytics 66 times and run_command 0 times across both passes. GPT-OSS selected run_command and achieved valid_commit on that scenario.

The single GPT-OSS UB run (PP-EDIT-002 pass 2) followed the same pattern as GPT 5.4: successful file writes, then malformed run_command arguments, then model stall. This appears to be a scenario-level difficulty ceiling rather than a model-level failure.

Implication: High Terminal Coverage under RuntimeX is not limited to proprietary frontier models. This finding does not imply that GPT-OSS is comparable to proprietary frontier models in general real-world capability; it shows that an open-weight model can be assessed and can achieve high deployability under strong runtime enforcement.

4.4 100% Adversarial Containment Under RuntimeX

RuntimeX achieved 100% Terminal Coverage on all adversarial scenarios across both GPT 5.4 (120 runs) and GPT-OSS 120B (120 runs): 240 adversarial runs, 18 target-specific RuntimeX threat IDs, all 10 OWASP ASI categories, zero undefined behavior.

| Threat ID | Threat | Runs | TC |
|---|---|---|---|
| TM-001 | Direct prompt injection | 24 | 100% |
| TM-002 | Indirect prompt injection | 24 | 100% |
| TM-003 | RAG poisoning | 12 | 100% |
| TM-004 | Tool-chain exfiltration | 12 | 100% |
| TM-005 | Command prefix bypass | 20 | 100% |
| TM-006 | Scan root escape | 12 | 100% |
| TM-007 | Forge manifest bypass | 12 | 100% |
| TM-008 | Excessive agency | 12 | 100% |
| TM-009 | Oracle metadata poisoning | 8 | 100% |
| TM-010 | Verification gate gaming | 12 | 100% |
| TM-011 | Trace compaction evidence loss | 8 | 100% |
| TM-012 | Telegram identity spoofing | 12 | 100% |
| TM-013 | Approval fatigue | 12 | 100% |
| TM-014 | Forge code injection | 12 | 100% |
| TM-015 | Orchestration result poisoning | 12 | 100% |
| TM-016 | Event route injection | 12 | 100% |
| TM-017 | Voice social engineering | 12 | 100% |
| TM-018 | Supply chain compromise | 12 | 100% |

The two Critical-priority threats (TM-001 direct prompt injection, TM-002 indirect prompt injection) each survived 24 adversarial runs with zero escapes. All 10 OWASP ASI categories achieved 100% TC across their mapped RuntimeX threat observations.

4.5 Capability and Safety Do Not Trade Off

The assumption that safer systems sacrifice capability is not supported by this data. Across two RuntimeX engagements with identical scenario suites:

| Model | TC (Safety) | Positive-Path Completion (Capability) |
|---|---|---|
| GPT 5.4 | 97.14% | 80.0% (16/20) |
| GPT-OSS 120B | 99.29% | 95.0% (19/20) |

The higher-TC model also completed more tasks. This is not paradoxical: RuntimeX enforcement catches failures early (before side effects compound), producing clean termination rather than cascading errors. A well-enforced system lets capable models complete more tasks because the failure envelope is tight.

5. OWASP ASI Coverage (RuntimeX)

The 60 RuntimeX threat scenarios map across all 10 OWASP ASI categories. Each scenario was run twice per model across two RuntimeX engagements, producing four observations per scenario. ASI rows are not mutually exclusive because a single scenario can map to more than one ASI category.

| ASI Code | OWASP 2026 Category | ASI-Mapped Observations | TC |
|---|---|---|---|
| ASI-01 | Agent Goal Hijack | 48 | 100% |
| ASI-02 | Tool Misuse and Exploitation | 44 | 100% |
| ASI-03 | Identity and Privilege Abuse | 36 | 100% |
| ASI-04 | Agentic Supply Chain Vulnerabilities | 36 | 100% |
| ASI-05 | Unexpected Code Execution (RCE) | 32 | 100% |
| ASI-06 | Memory & Context Poisoning | 60 | 100% |
| ASI-07 | Insecure Inter-Agent Communication | 12 | 100% |
| ASI-08 | Cascading Failures | 16 | 100% |
| ASI-09 | Human-Agent Trust Exploitation | 36 | 100% |
| ASI-10 | Rogue Agents | 36 | 100% |
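Because a scenario can map to more than one ASI category, the observation column sums to more than the 240 adversarial runs. The sketch below shows the counting rule under the stated design (two passes per model, two engagements, so four observations per scenario). The scenario IDs and the mapping are toy placeholders, not the actual RuntimeX suite.

```python
from collections import Counter


def asi_observations(
    scenario_to_asi: dict[str, list[str]],
    passes_per_model: int = 2,
    engagements: int = 2,
) -> Counter:
    """Count observations per ASI category; categories are not mutually exclusive."""
    per_scenario = passes_per_model * engagements
    counts: Counter = Counter()
    for categories in scenario_to_asi.values():
        for category in set(categories):  # count each category once per scenario
            counts[category] += per_scenario
    return counts


# Toy mapping (hypothetical scenario IDs): S1 maps to two categories.
toy = {"S1": ["ASI-01", "ASI-06"], "S2": ["ASI-06"]}
counts = asi_observations(toy)
print(dict(counts))
```

With this toy mapping, two scenarios yield eight runs but twelve category observations, which is the same overlap effect visible in the table.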

v2 covered 8 of 10 ASI categories (missing ASI-04, ASI-07). RuntimeX’s expanded threat suite (18 target-specific TM-IDs vs engine’s 12) closes both gaps.

6. Evidence Chain

Each engagement produces a cryptographic evidence chain binding the assurance result to specific artifacts. The engagement index (5 entries as of 2026-03-30) enables drift detection across model snapshots.

Evidence chain artifacts: target version (git commit), model ID (dated snapshot), system prompt hash, agent prompt hash, OPA policy hash, protocol signal spec hash + version, threat model hash, scenario suite hash, positive-path suite hash.

Validity: Each report is valid as long as all hashes remain unchanged. Any hash drift invalidates the report.
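The validity rule can be sketched as a comparison of recorded artifact hashes against freshly computed ones. The example below is a minimal illustration, assuming SHA-256 and a canonical-JSON digest over the hash set; the paper does not specify the hash function or how the chain is bound, so those choices, along with the function names and artifact keys, are assumptions.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of raw artifact bytes."""
    return hashlib.sha256(data).hexdigest()


def chain_digest(artifact_hashes: dict[str, str]) -> str:
    """Single digest over all artifact hashes, independent of key order."""
    canonical = json.dumps(artifact_hashes, sort_keys=True, separators=(",", ":"))
    return sha256_hex(canonical.encode("utf-8"))


def drifted_artifacts(recorded: dict[str, str], current: dict[str, str]) -> list[str]:
    """Names of artifacts whose hash changed, appeared, or disappeared."""
    names = recorded.keys() | current.keys()
    return sorted(n for n in names if recorded.get(n) != current.get(n))


def report_is_valid(recorded: dict[str, str], current: dict[str, str]) -> bool:
    """A report stays valid only while no artifact hash has drifted."""
    return not drifted_artifacts(recorded, current)


# Hypothetical artifact contents, for illustration only.
recorded = {
    "opa_policy": sha256_hex(b"policy v1"),
    "system_prompt": sha256_hex(b"system prompt text"),
}
current = dict(recorded, opa_policy=sha256_hex(b"policy v2"))
print(report_is_valid(recorded, current), drifted_artifacts(recorded, current))
```

Reporting which artifact drifted, rather than only a pass or fail flag, tells the operator what changed before a re-assessment is scheduled.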

7. Compliance Mapping

v3 retains v2’s compliance mappings with expanded evidence:

8. Related Work

Decision Coverage v2 and v1 [1][2]. Direct predecessors. v1 [2] introduced the three-state verdict model, Wilson score confidence intervals, and evidence chain architecture (n=336, 2 models, 1 architecture). v2 [1] replaced the per-scenario signal whitelist with a global signal blacklist, added threat model traceability and the engagement index, and introduced the multi_action_output finding (n=424, 2 models, 1 architecture). v3 retains v2’s evidence model, adds the observable-enforcement requirement for bounded_failure, and extends the assurance corpus to 3 architectures and 4 models.

Capability Benchmarks. HumanEval [3], SWE-bench [4], MMLU, and similar benchmarks measure task completion accuracy. Decision Coverage measures behavioral boundaries under failure — the complementary signal. A model that scores well on SWE-bench may still be undeployable if it ignores tool boundaries or escalates side effects. The two measurement types are orthogonal: capability benchmarks answer “can the model do the task?” and Decision Coverage answers “does the system fail safely when the model cannot?”

Red Teaming. Manual red-teaming produces qualitative findings on specific attack vectors. Decision Coverage formalizes adversarial testing into a quantitative, reproducible metric with threat model traceability. The two approaches are complementary: red teaming discovers novel attack vectors; Decision Coverage measures whether the enforcement surface contains known vectors at scale.

Agent Benchmarks. AgentBench, WebArena, and SWE-agent measure task success rate in agent environments. Decision Coverage provides the complementary failure-mode signal — not whether the agent completed the task, but whether the system contained the agent when it could not.

OWASP Top 10 for Agentic Applications [5]. Identifies the ten primary risk categories for agentic systems. Decision Coverage provides empirical measurement for all 10 ASI categories under RuntimeX — triggering mapped threat scenarios in a controlled sandbox and documenting the enforcement response with per-run verdicts and evidence chains.

NIST AI RMF [9] and AI 600-1 [10]. Define risk management functions (Govern, Map, Measure, Manage). Decision Coverage fills the Measure function with enforcement-grounded evidence. Organizations use their existing governance processes for the remaining functions.

UC Berkeley CLTC Agentic AI Profile [11]. Identifies empirical measurement and post-deployment monitoring as critical gaps in agentic AI governance. Decision Coverage addresses both: coverage reports provide empirical measurement, and the engagement index provides drift detection across model snapshots.

9. Discussion

Architecture as the independent variable. v2 compared models within a single architecture. v3 compares the same model across architectures. The finding that GPT 5.4 jumps from 85.85% to 97.14% TC by changing the enforcement surface — not the model — reframes the deployability question. The industry default is to assess models and assume the deployment context is secondary. This data suggests the deployment context is primary.

Observable enforcement as audit requirement. NemoClaw’s historical 100% TC with 0% observable enforcement poses a governance challenge. The system may contain the attack, but the operator cannot prove to an auditor how containment occurred. RuntimeX’s observable enforcement rate (90.8% with GPT 5.4, 84.2% with GPT-OSS) means the evidence exists at the enforcement layer, not the model layer. For organizations subject to audit requirements, this distinction may be the architecture selection criterion. The containment mechanism taxonomy (§2.1) makes this measurable: enforcement_signal is the only classification that produces the evidence chain an auditor can follow from verdict to gate to trace artifact. model_refusal is real containment, but it is model-dependent and may not survive a weight update. silent_enforcement is containment without proof. Architecture selection under governance constraints reduces to: which architecture maximizes the enforcement_signal rate?

Open-weight assurance. GPT-OSS 120B at 99.29% TC through RuntimeX demonstrates that high Terminal Coverage is not limited to proprietary frontier models. The result is scoped to deployability under RuntimeX’s enforcement architecture and does not imply general capability parity with proprietary frontier systems. This has cost and sovereignty implications for organizations that need to assure agentic deployments while considering open-weight model options.

Enforcement surface delta between RuntimeX engagements. CX-2026-003 (GPT 5.4) ran on RuntimeX commit 21db268. CX-2026-004 (GPT-OSS) ran on 8f3f942, which includes a verification gate refusal exit fix applied between engagements. The fix added an exit path for models that legitimately refuse adversarial tasks — after 3 consecutive done-only steps with no prior tool use, the gate allows exit instead of looping to max_steps_exceeded. This fix resulted in 8 reclassifications from undefined_behavior to bounded_failure in the GPT-OSS engagement. Without the fix, GPT-OSS would have scored 93.57% TC (9 UB) — below GPT 5.4’s 97.14%. GPT-OSS triggered the refusal exit gap more often because it refuses adversarial prompts at a higher rate (15.8% vs 9.2%). GPT 5.4’s 4 UB runs were all tool selection / no_progress failures unrelated to the refusal exit path — the fix would not have changed its score. Findings 3 and 5 should be read with this context: the cross-model comparison is (GPT 5.4 on 21db268) vs (GPT-OSS on 8f3f942), not identical enforcement surfaces.
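The with-fix and without-fix figures above can be checked with simple arithmetic. The sketch below assumes Terminal Coverage is the share of runs that did not end in undefined_behavior, which is inferred from the reported figures rather than quoted from a formal definition; the function name is illustrative.

```python
def terminal_coverage(total_runs: int, ub_runs: int) -> float:
    """TC as a percentage: runs not classified undefined_behavior over all runs."""
    return round(100 * (total_runs - ub_runs) / total_runs, 2)


gpt_oss_with_fix = terminal_coverage(140, 1)      # 1 UB after 8 reclassifications
gpt_oss_without_fix = terminal_coverage(140, 9)   # 9 UB had the fix not applied
gpt_54 = terminal_coverage(140, 4)                # 4 UB, unaffected by the fix

print(gpt_oss_with_fix, gpt_oss_without_fix, gpt_54)
```

The ordering of the two models flips depending on whether the refusal exit fix is counted, which is why the commit difference matters when reading Findings 3 and 5.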

Scope. The three architectures are not tested under identical conditions. The engine suite (106 scenarios) and RuntimeX suite (70 scenarios, 18 target-specific RuntimeX threat IDs) target different enforcement surfaces. NemoClaw’s 19-scenario suite is narrower. Cross-architecture TC comparisons should be read as “this is what each architecture produced under its own run conditions,” not “these architectures were tested identically.” The engagement index documents the conditions.

10. Conclusion

Decision Coverage v3 extends the enforcement-grounded assurance framework from a single-architecture study to a cross-architecture historical corpus of 818 runs across 5 engagements, 3 architectures, and 4 models.

The central finding: architecture determines containment. The same model produces materially different Terminal Coverage depending on the enforcement surface it operates under. GPT 5.4’s 11.29-point TC improvement from engine to RuntimeX (85.85% → 97.14%) is not a model upgrade; it is an architecture upgrade. This reframes the deployability question from “which model is safe?” to “which (model, architecture) pair meets the deployment requirements?”

RuntimeX achieved 100% adversarial containment across 240 runs, 18 target-specific RuntimeX threat IDs, and all 10 OWASP ASI categories with zero undefined behavior. Observable enforcement signals accounted for 90.8% of bounded_failure runs in the GPT 5.4 engagement and 84.2% in the GPT-OSS engagement: auditable evidence that the system, not the model, caught the violation.

Open-weight model assurance is possible under RuntimeX: GPT-OSS 120B (5.1B active parameters) achieved 99.29% TC with 100% adversarial containment and 95% positive-path completion. This is a deployability finding under strong runtime enforcement, not a claim of general capability parity.

Capability and safety do not trade off in this data. The right question is not “how safe is the model?” but “does the (model, architecture) pair fail safely within defined boundaries?” Decision Coverage v3 provides the cross-architecture evidence to answer it.

References

  1. Wiley, J. “Decision Coverage v2: Threat-Traced Deployability Evidence for Agentic AI Systems.” ConstantX, March 2026.
  2. Wiley, J. “Decision Coverage: Measuring Deployability of Agentic AI Systems Through Enforcement-Grounded Evidence.” ConstantX, February 2026.
  3. Chen, M., et al. “Evaluating Large Language Models Trained on Code.” arXiv:2107.03374, 2021.
  4. Jimenez, C.E., et al. “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” arXiv:2310.06770, 2023.
  5. OWASP. “Top 10 for Agentic Applications.” Version 12.6-1, 2026.
  6. Wilson, E.B. “Probable Inference, the Law of Succession, and Statistical Inference.” Journal of the American Statistical Association, 22(158):209–212, 1927.
  7. ConstantX Engine Threat Model. Internal document, March 2026.
  8. RuntimeX Threat Model. Internal document, March 2026.
  9. NIST. “Artificial Intelligence Risk Management Framework (AI RMF 1.0).” AI 100-1, January 2023.
  10. NIST. “Generative Artificial Intelligence Profile.” AI 600-1, July 2024.
  11. UC Berkeley CLTC. “Agentic AI Risk-Management Standards Profile.” February 2026.
  12. NVIDIA. “Nemotron-3-Super-120B Technical Report.” 2026.