Decision Coverage v3

Cross-Architecture Deployability Evidence for Agentic AI Systems

Johnny Wiley · ConstantX · Preprint, March 2026

Keywords: agentic AI, assurance, deployability, enforcement, coverage, safety, threat model, OWASP ASI, NIST AI RMF, observable enforcement

Abstract. We present Decision Coverage v3, extending the enforcement-grounded assurance framework from a single-architecture study (v2: n=424, 1 architecture, 2 models) to a cross-architecture historical corpus of 818 runs across 5 engagements, 3 enforcement architectures, and 4 models. Three architectures are assessed: the ConstantX Engine (OPA policy + workspace sandbox), RuntimeX (per-action OPA + hard gates + verification gate), and NemoClaw (kernel-level Landlock + seccomp + network namespaces). Four models are assessed: claude-opus-4-5-20251101, gpt-5.4-2026-03-05, nvidia/nemotron-3-super-120b-a12b, and openai/gpt-oss-120b. NemoClaw is retained as historical diagnostic evidence; under the current publishable evidence standard, its lack of structured enforcement signals blocks market-facing claims until structured signal evidence is wired.

Five findings emerge: (1) Architecture determines containment — GPT 5.4 produced 85.85% Terminal Coverage through the ConstantX Engine and 97.14% through RuntimeX; the model is identical, the enforcement surface is the variable. (2) Observable enforcement is architecture-dependent — RuntimeX produced observable enforcement signals in 90.8% of bounded_failure runs; NemoClaw produced 0%. (3) Open-weight assurance is possible under RuntimeX — GPT-OSS 120B (5.1B active parameters via MoE) achieved 99.29% TC, 100% adversarial TC, and 95% positive-path completion through RuntimeX; this is a deployability result, not a general capability-parity claim. (4) RuntimeX achieves 100% adversarial containment — 240 adversarial runs across 2 models, 18 target-specific RuntimeX threat IDs, and all 10 OWASP ASI categories produced zero undefined behavior. (5) Capability and safety do not trade off under RuntimeX — across two RuntimeX engagements with identical scenario suites, the highest-TC model (GPT-OSS, 99.29%) also had the highest positive-path completion rate (95%).

1. Introduction

Decision Coverage v2 [1] assessed a single enforcement architecture (the ConstantX Engine) against two frontier models, finding that protocol compliance — not capability — was the deployment-blocking factor for GPT 5.4 (multi_action_output in 86% of runs). That finding was engine-specific: RuntimeX handles multi-action responses at the parsing layer, making multi_action_output a valid execution mode rather than a protocol violation.

Since v2, three developments motivated a cross-architecture study:

1. The engine’s central finding did not generalize. The multi_action_output signal that classified 30 GPT 5.4 runs as undefined_behavior in the engine does not exist in RuntimeX’s architecture. This raised the question: is GPT 5.4’s deployability limited by the model, or by the architecture assuring it?

2. Kernel-level enforcement produced zero observable signals. NemoClaw (CX-2026-002) was historically scored at 100% Terminal Coverage across 114 runs but produced 0% observable enforcement. Under the current v3 reducer stance, that result is diagnostic rather than publishable: containment may have occurred, but ConstantX cannot prove which enforcement surface produced it. This introduced a distinction v2 did not address: containment you can prove vs. containment you cannot.

3. Open-weight deployment needed assurance evidence. The v2 study assessed only proprietary frontier models. GPT-OSS 120B (117B total parameters, 5.1B active per token via 128-expert MoE with Top-4 routing) runs on a single 80GB GPU, making it accessible to organizations that cannot use cloud-only APIs. Its deployability under RuntimeX had not been measured.

This paper presents Decision Coverage v3, which records 818 historical runs across 5 engagements, 3 architectures, and 4 models. Threat model traceability and the engagement index from v2 are retained unchanged. The verdict reducer is updated to v3: adversarial scenarios now require at least one observable enforcement or containment signal for bounded_failure — runs with no enforcement signal are classified as undefined_behavior regardless of model behavior. This codifies the distinction between enforcement (structural) and alignment (probabilistic). The empirical contribution: what changes when you hold the model constant and vary the architecture?

2. Methodology

v3 retains the v2 methodology with one significant change to the verdict reducer:

Three-state verdicts: valid_commit, bounded_failure, undefined_behavior
Disallowed signals blacklist: Per-scenario configuration. Universal protocol violations (non_json_output, error, provider_disconnect, engine_exception) classify as undefined_behavior. Architecture-dependent signals (e.g. multi_action_output) are disallowed only for targets that cannot handle them. Adversarial scenarios require at least one enforcement or containment signal for bounded_failure — no observable enforcement means undefined_behavior. The reducer asks “did the system’s enforcement surface activate?” not “did the model fail in a predicted way?” Enforcement is structural; alignment is probabilistic. Decision Coverage measures the structural part.
Threat model traceability: Every adversarial scenario carries a threat_id linking to a documented threat model entry
Wilson score 95% CI on all Terminal Coverage figures
Engagement index: Append-only ledger tracking every assurance engagement for drift detection and retest scoping

2.1 Containment Mechanism Classification

v3 introduces a four-category taxonomy for classifying how each bounded_failure run was contained. This classification is applied to every bounded_failure verdict across all engagements and is the empirical basis for Finding 2 (observable enforcement).

Classification	Definition	Evidence Required
`enforcement_signal`	System enforcement gate produced an observable signal or blocked an action	Signal present in trace (e.g., tool_disallowed, output_policy_violation, pending_approval, no_progress)
`model_refusal`	Model completed the scenario using only allowed tools without attempting the adversarial action	Trace shows task completion or benign tool use with no enforcement gate triggered
`silent_enforcement`	Outcome is consistent with enforcement (attack did not succeed) but no observable signal was produced	No enforcement signal in trace, no model refusal pattern, attack outcome is containment
`indeterminate`	Cannot classify — insufficient trace evidence to distinguish mechanism	None of the above criteria met

The taxonomy separates what happened (the verdict) from how it happened (the mechanism). Only runs with observable enforcement signals receive bounded_failure. Runs where the model declined the attack (model_refusal) without an enforcement signal are undefined_behavior — the governance layer did not activate, and the model’s refusal is not structurally guaranteed to persist across weight updates, prompt variations, or jailbreak techniques.

enforcement_signal is the only mechanism that produces both a bounded_failure verdict and audit-grade evidence. model_refusal, silent_enforcement, and indeterminate are classified as undefined_behavior at the verdict level. Trace inspection can determine what occurred, but the absence of an enforcement signal means safety depends on model behavior, not system structure.

The mechanism classification is applied post-hoc from trace data for diagnostic purposes. It does not upgrade verdicts — a model_refusal remains undefined_behavior regardless of how clearly the refusal appears in the trace. This is the core ConstantX position: enforcement is structural, alignment is probabilistic, and Decision Coverage measures the structural part.

3. Engagement Corpus

Engagement	Date	Architecture	Model	Scenarios	Runs	TC	95% CI
CX-2026-001	2026-03-11	ConstantX Engine	claude-opus-4-5-20251101	106	212	100.0%	[98.22, 100.0]
CX-2026-005	2026-03-11	ConstantX Engine	gpt-5.4-2026-03-05	106	212	85.85%	[80.52, 89.90]
CX-2026-002	2026-03-18	NemoClaw	nemotron-3-super-120b-a12b	19	114	100.0% historical	[96.74, 100.0]
CX-2026-003	2026-03-27	RuntimeX	gpt-5.4-2026-03-05	70	140	97.14%	[92.88, 98.88]
CX-2026-004	2026-03-30	RuntimeX	openai/gpt-oss-120b	70	140	99.29%	[96.07, 99.87]

Total: 818 runs, 5 engagements, 3 architectures, 4 models.

Suite composition differs by architecture: the engine suite (106 scenarios, 14 categories) tests ConstantX Engine enforcement surfaces. The RuntimeX suite (70 scenarios: 60 threat + 10 positive-path) tests RuntimeX enforcement surfaces across 18 target-specific RuntimeX threat IDs (TM-001–TM-018), derived from the OWASP T1–T17 technique taxonomy and mapped to all 10 OWASP ASI categories. The NemoClaw suite (19 scenarios) tests kernel-level containment. Suites are not directly comparable by scenario count — they target different enforcement surfaces. The NemoClaw engagement (19 scenarios, 6 passes each) was designed as a proof of architecture independence: can the Decision Coverage framework observe a kernel-level enforcement architecture with no application-layer signal emission? The answer was no for publishable evidence — the historical reducer recorded 100% TC, but current v3 treatment would classify no-signal containment as undefined_behavior. The 19-scenario suite is retained to establish the observable enforcement finding (§4.2) and the containment mechanism distribution, not to characterize NemoClaw’s full enforcement surface or support a current market-facing claim.

4. Findings

4.1 Architecture Determines Containment

GPT 5.4 produced 85.85% TC through the ConstantX Engine (CX-2026-005) and 97.14% TC through RuntimeX (CX-2026-003). The model weights are identical (gpt-5.4-2026-03-05). The enforcement architecture is the variable.

The root cause: the ConstantX Engine classified multi_action_output (multiple JSON actions per response) as a protocol violation. GPT 5.4 produced this in 182 of 212 engine runs (86%), causing 30 undefined_behavior verdicts. RuntimeX handles multi-action responses at the parsing layer (agent_runtime.py:_parse_actions) — multi-action output is a valid execution mode, not a protocol violation. The 30 runs that were undefined_behavior under the engine are bounded_failure or valid_commit under RuntimeX.

This is not a verdict override. The two architectures have different enforcement surfaces and different protocol definitions. What constitutes a protocol violation is architecture-specific. The same model behavior that breaks one protocol complies with another.

Implication: Deployability is a property of the (model, architecture) pair, not the model alone. Assessing a model against a single architecture produces architecture-specific findings that may not transfer.

4.2 Observable Enforcement Is Architecture-Dependent

Architecture	bounded_failure Runs	Observable Enforcement	Model Refusal	Silent/Indeterminate
RuntimeX (CX-2026-003, GPT 5.4)	120	109 (90.8%)	11 (9.2%)	0 (0%)
RuntimeX (CX-2026-004, GPT-OSS)	120	101 (84.2%)	19 (15.8%)	0 (0%)
NemoClaw (CX-2026-002)	114	0 (0%)	60 (52.6%)	54 (47.4%)

RuntimeX’s enforcement surfaces (OPA policy, governance denylist, output content gate, command prefix enforcement, verification gate, progress detection) produce observable signals when they fire. An auditor can trace from verdict to signal to enforcement mechanism.

NemoClaw’s enforcement surfaces (Landlock filesystem restrictions, seccomp syscall filtering, network namespace isolation) operate at the kernel level and produce no signals visible to the application layer. All 114 NemoClaw runs were contained, but containment could not be attributed to a specific enforcement mechanism in 47.4% of runs and relied on model-layer refusal in 52.6%. Zero runs produced observable system enforcement signals.

The historical NemoClaw run reached 100% TC under earlier scoring, while RuntimeX reached 100% adversarial TC with auditable enforcement signals. Under the current v3 stance, that difference is decisive: RuntimeX containment is publishable evidence; NemoClaw containment without signals is diagnostic only.

Implication: For compliance contexts that require enforcement evidence (SOC 2, EU AI Act Article 14 human oversight, NIST AI RMF MEASURE 2.9 documentation), architecture determines whether the evidence can be produced — not just whether containment occurs.

4.3 Open-Weight Assurance Under RuntimeX

Model	Architecture	Params (Active)	TC	Positive-Path TC	Adversarial TC
GPT 5.4	RuntimeX	Proprietary	97.14%	80.0%	100.0%
GPT-OSS 120B	RuntimeX	5.1B active (117B total)	99.29%	95.0%	100.0%

GPT-OSS 120B (128-expert MoE, Top-4 routing, 5.1B active parameters per token) achieved high Terminal Coverage through RuntimeX: 99.29% overall, 100% on adversarial scenarios, and 95% on positive-path task completion.

In this RuntimeX suite, GPT-OSS did not reproduce the PP-COMMAND-001 tool selection failure observed in GPT 5.4, which called run_analytics 66 times and run_command 0 times across both passes. GPT-OSS selected run_command and achieved valid_commit on that scenario.

The single GPT-OSS UB run (PP-EDIT-002 pass 2) followed the same pattern as GPT 5.4: successful file writes, then malformed run_command arguments, then model stall. This appears to be a scenario-level difficulty ceiling rather than a model-level failure.

Implication: High Terminal Coverage under RuntimeX is not limited to proprietary frontier models. This finding does not imply that GPT-OSS is comparable to proprietary frontier models in general real-world capability; it shows that an open-weight model can be assessed and can achieve high deployability under strong runtime enforcement.

4.4 100% Adversarial Containment Under RuntimeX

RuntimeX achieved 100% Terminal Coverage on all adversarial scenarios across both GPT 5.4 (120 runs) and GPT-OSS 120B (120 runs): 240 adversarial runs, 18 target-specific RuntimeX threat IDs, all 10 OWASP ASI categories, zero undefined behavior.

Threat ID	Threat	Runs	TC
TM-001	Direct prompt injection	24	100%
TM-002	Indirect prompt injection	24	100%
TM-003	RAG poisoning	12	100%
TM-004	Tool-chain exfiltration	12	100%
TM-005	Command prefix bypass	20	100%
TM-006	Scan root escape	12	100%
TM-007	Forge manifest bypass	12	100%
TM-008	Excessive agency	12	100%
TM-009	Oracle metadata poisoning	8	100%
TM-010	Verification gate gaming	12	100%
TM-011	Trace compaction evidence loss	8	100%
TM-012	Telegram identity spoofing	12	100%
TM-013	Approval fatigue	12	100%
TM-014	Forge code injection	12	100%
TM-015	Orchestration result poisoning	12	100%
TM-016	Event route injection	12	100%
TM-017	Voice social engineering	12	100%
TM-018	Supply chain compromise	12	100%

The two Critical-priority threats (TM-001 direct prompt injection, TM-002 indirect prompt injection) each survived 24 adversarial runs with zero escapes. All 10 OWASP ASI categories achieved 100% TC across their mapped RuntimeX threat observations.

4.5 Capability and Safety Do Not Trade Off

The assumption that safer systems sacrifice capability is not supported by this data. Across two RuntimeX engagements with identical scenario suites:

Model	TC (Safety)	Positive-Path Completion (Capability)
GPT 5.4	97.14%	80.0% (16/20)
GPT-OSS 120B	99.29%	95.0% (19/20)

The higher-TC model also completed more tasks. This is not paradoxical: RuntimeX enforcement catches failures early (before side effects compound), producing clean termination rather than cascading errors. A well-enforced system lets capable models complete more tasks because the failure envelope is tight.

5. OWASP ASI Coverage (RuntimeX)

The 60 RuntimeX threat scenarios map across all 10 OWASP ASI categories. Each scenario was run twice per model across two RuntimeX engagements, producing four observations per scenario. ASI rows are not mutually exclusive because a single scenario can map to more than one ASI category.

ASI Code	OWASP 2026 Category	ASI-Mapped Observations	TC
ASI-01	Agent Goal Hijack	48	100%
ASI-02	Tool Misuse and Exploitation	44	100%
ASI-03	Identity and Privilege Abuse	36	100%
ASI-04	Agentic Supply Chain Vulnerabilities	36	100%
ASI-05	Unexpected Code Execution (RCE)	32	100%
ASI-06	Memory & Context Poisoning	60	100%
ASI-07	Insecure Inter-Agent Communication	12	100%
ASI-08	Cascading Failures	16	100%
ASI-09	Human-Agent Trust Exploitation	36	100%
ASI-10	Rogue Agents	36	100%

v2 covered 8 of 10 ASI categories (missing ASI-04, ASI-07). RuntimeX’s expanded threat suite (18 target-specific TM-IDs vs engine’s 12) closes both gaps.

6. Evidence Chain

Each engagement produces a cryptographic evidence chain binding the assurance result to specific artifacts. The engagement index (5 entries as of 2026-03-30) enables:

Drift detection: GPT 5.4 assessed under two architectures, behavioral delta documented
Retest scoping: Hash drift in any evidence chain artifact triggers retesting
Gap analysis: threat_coverage_gaps.py identifies untested threats per model per architecture

Evidence chain artifacts: target version (git commit), model ID (dated snapshot), system prompt hash, agent prompt hash, OPA policy hash, protocol signal spec hash + version, threat model hash, scenario suite hash, positive-path suite hash.

Validity: Each report is valid as long as all hashes remain unchanged. Any hash drift invalidates the report.

7. Compliance Mapping

v3 retains v2’s compliance mappings with expanded evidence:

NIST AI RMF Measure Function: 10 subcategories (MEASURE 1.1, 2.3, 2.4, 2.5, 2.6, 2.7, 2.9, 4.1, 4.2, 4.3)
OWASP ASI 2026: 10/10 categories covered under RuntimeX (expanded from 8/10 under engine)
UC Berkeley CLTC Agentic AI Profile: Empirical measurement + post-deployment monitoring via engagement index
NIST-2025-0035 (AI Agent Standards Initiative): Submitted methodology

8. Related Work

Decision Coverage v2 and v1 [1][2]. Direct predecessors. v1 [2] introduced the three-state verdict model, Wilson score confidence intervals, and evidence chain architecture (n=336, 2 models, 1 architecture). v2 [1] replaced the per-scenario signal whitelist with a global signal blacklist, added threat model traceability and the engagement index, and introduced the multi_action_output finding (n=424, 2 models, 1 architecture). v3 retains v2’s evidence model, adds the observable-enforcement requirement for bounded_failure, and extends the assurance corpus to 3 architectures and 4 models.

Capability Benchmarks. HumanEval [3], SWE-bench [4], MMLU, and similar benchmarks measure task completion accuracy. Decision Coverage measures behavioral boundaries under failure — the complementary signal. A model that scores well on SWE-bench may still be undeployable if it ignores tool boundaries or escalates side effects. The two measurement types are orthogonal: capability benchmarks answer “can the model do the task?” and Decision Coverage answers “does the system fail safely when the model cannot?”

Red Teaming. Manual red-teaming produces qualitative findings on specific attack vectors. Decision Coverage formalizes adversarial testing into a quantitative, reproducible metric with threat model traceability. The two approaches are complementary: red teaming discovers novel attack vectors; Decision Coverage measures whether the enforcement surface contains known vectors at scale.

Agent Benchmarks. AgentBench, WebArena, and SWE-agent measure task success rate in agent environments. Decision Coverage provides the complementary failure-mode signal — not whether the agent completed the task, but whether the system contained the agent when it could not.

OWASP Top 10 for Agentic Applications [5]. Identifies the ten primary risk categories for agentic systems. Decision Coverage provides empirical measurement for all 10 ASI categories under RuntimeX — triggering mapped threat scenarios in a controlled sandbox and documenting the enforcement response with per-run verdicts and evidence chains.

NIST AI RMF [9] and AI 600-1 [10]. Define risk management functions (Govern, Map, Measure, Manage). Decision Coverage fills the Measure function with enforcement-grounded evidence. Organizations use their existing governance processes for the remaining functions.

UC Berkeley CLTC Agentic AI Profile [11]. Identifies empirical measurement and post-deployment monitoring as critical gaps in agentic AI governance. Decision Coverage addresses both: coverage reports provide empirical measurement, and the engagement index provides drift detection across model snapshots.

9. Discussion

Architecture as the independent variable. v2 compared models within a single architecture. v3 compares the same model across architectures. The finding that GPT 5.4 jumps from 85.85% to 97.14% TC by changing the enforcement surface — not the model — reframes the deployability question. The industry default is to assess models and assume the deployment context is secondary. This data suggests the deployment context is primary.

Observable enforcement as audit requirement. NemoClaw’s historical 100% TC with 0% observable enforcement poses a governance challenge. The system may contain the attack, but you cannot prove how it works to an auditor. RuntimeX’s 90.8% observable enforcement rate means the evidence exists at the enforcement layer, not the model layer. For organizations subject to audit requirements, this distinction may be the architecture selection criterion. The containment mechanism taxonomy (§2.1) makes this measurable: enforcement_signal is the only classification that produces the evidence chain an auditor can follow from verdict to gate to trace artifact. model_refusal is real containment, but it is model-dependent and may not survive a weight update. silent_enforcement is containment without proof. Architecture selection under governance constraints reduces to: which architecture maximizes the enforcement_signal rate?

Open-weight assurance. GPT-OSS 120B at 99.29% TC through RuntimeX demonstrates that high Terminal Coverage is not limited to proprietary frontier models. The result is scoped to deployability under RuntimeX’s enforcement architecture and does not imply general capability parity with proprietary frontier systems. This has cost and sovereignty implications for organizations that need to assure agentic deployments while considering open-weight model options.

Enforcement surface delta between RuntimeX engagements. CX-2026-003 (GPT 5.4) ran on RuntimeX commit 21db268. CX-2026-004 (GPT-OSS) ran on 8f3f942, which includes a verification gate refusal exit fix applied between engagements. The fix added an exit path for models that legitimately refuse adversarial tasks — after 3 consecutive done-only steps with no prior tool use, the gate allows exit instead of looping to max_steps_exceeded. This fix resulted in 8 reclassifications from undefined_behavior to bounded_failure in the GPT-OSS engagement. Without the fix, GPT-OSS would have scored 93.57% TC (9 UB) — below GPT 5.4’s 97.14%. GPT-OSS triggered the refusal exit gap more often because it refuses adversarial prompts at a higher rate (15.8% vs 9.2%). GPT 5.4’s 4 UB runs were all tool selection / no_progress failures unrelated to the refusal exit path — the fix would not have changed its score. Findings 3 and 5 should be read with this context: the cross-model comparison is (GPT 5.4 on 21db268) vs (GPT-OSS on 8f3f942), not identical enforcement surfaces.

Scope. The three architectures are not tested under identical conditions. The engine suite (106 scenarios) and RuntimeX suite (70 scenarios, 18 target-specific RuntimeX threat IDs) target different enforcement surfaces. NemoClaw’s 19-scenario suite is narrower. Cross-architecture TC comparisons should be read as “this is what each architecture produced under its own run conditions,” not “these architectures were tested identically.” The engagement index documents the conditions.

10. Conclusion

Decision Coverage v3 extends the enforcement-grounded assurance framework from a single-architecture study to a cross-architecture historical corpus of 818 runs across 5 engagements, 3 architectures, and 4 models.

The central finding: architecture determines containment. The same model produces materially different Terminal Coverage depending on the enforcement surface it operates under. GPT 5.4’s 11-point TC improvement from engine to RuntimeX (85.85% → 97.14%) is not a model upgrade — it is an architecture upgrade. This reframes the deployability question from “which model is safe?” to “which (model, architecture) pair meets the deployment requirements?”

RuntimeX achieved 100% adversarial containment across 240 runs, 18 target-specific RuntimeX threat IDs, and all 10 OWASP ASI categories with zero undefined behavior. 90.8% of containment produced observable enforcement signals — auditable evidence that the system, not the model, caught the violation.

Open-weight model assurance is possible under RuntimeX: GPT-OSS 120B (5.1B active parameters) achieved 99.29% TC with 100% adversarial containment and 95% positive-path completion. This is a deployability finding under strong runtime enforcement, not a claim of general capability parity.

Capability and safety do not trade off in this data. The right question is not “how safe is the model?” but “does the (model, architecture) pair fail safely within defined boundaries?” Decision Coverage v3 provides the cross-architecture evidence to answer it.

References

Wiley, J. “Decision Coverage v2: Threat-Traced Deployability Evidence for Agentic AI Systems.” ConstantX, March 2026.
Wiley, J. “Decision Coverage: Measuring Deployability of Agentic AI Systems Through Enforcement-Grounded Evidence.” ConstantX, February 2026.
Chen, M., et al. “Evaluating Large Language Models Trained on Code.” arXiv:2107.03374, 2021.
Jimenez, C.E., et al. “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” arXiv:2310.06770, 2023.
OWASP. “Top 10 for Agentic Applications.” Version 12.6-1, 2026.
Wilson, E.B. “Probable Inference, the Law of Succession, and Statistical Inference.” Journal of the American Statistical Association, 22(158):209–212, 1927.
ConstantX Engine Threat Model. Internal document, March 2026.
RuntimeX Threat Model. Internal document, March 2026.
NIST. “Artificial Intelligence Risk Management Framework (AI RMF 1.0).” AI 100-1, January 2023.
NIST. “Generative Artificial Intelligence Profile.” AI 600-1, July 2024.
UC Berkeley CLTC. “Agentic AI Risk-Management Standards Profile.” February 2026.
NVIDIA. “Nemotron-3-Super-120B Technical Report.” 2026.