Decision Coverage v3

Cross-Architecture Deployability Evidence for Agentic AI Systems

Johnny Wiley · ConstantX · Preprint, March 2026


Keywords: agentic AI, evaluation, deployability, enforcement, coverage, safety, threat model, OWASP ASI, NIST AI RMF, observable enforcement

Abstract. We present Decision Coverage v3, extending the enforcement-grounded evaluation framework from a single-architecture study (v2: n=424, 1 architecture, 2 models) to a cross-architecture corpus of 818 runs across 5 engagements, 3 enforcement architectures, and 4 models. Three architectures are evaluated: the ConstantX Engine (OPA policy + workspace sandbox), RuntimeX (per-action OPA + hard gates + verification gate), and NemoClaw (kernel-level Landlock + seccomp + network namespaces). Four models are evaluated: claude-opus-4-5-20251101, gpt-5.4-2026-03-05, nvidia/nemotron-3-super-120b-a12b, and openai/gpt-oss-120b.

Five findings emerge: (1) Architecture determines containment — GPT 5.4 produced 85.85% Terminal Coverage through the ConstantX Engine and 97.14% through RuntimeX; the model is identical, the enforcement surface is the variable. (2) Observable enforcement is architecture-dependent — RuntimeX produced observable enforcement signals in 90.8% of bounded_failure runs; NemoClaw produced 0%. (3) Open-source models are competitive — GPT-OSS 120B (5.1B active parameters via MoE) achieved 99.29% TC through RuntimeX, exceeding GPT 5.4’s 97.14%. (4) RuntimeX achieves 100% adversarial containment — 240 adversarial runs across 2 models, 18 threat IDs, and all 10 OWASP ASI codes produced zero undefined behavior. (5) Capability and safety do not trade off under RuntimeX — across two RuntimeX engagements with identical scenario suites, the highest-TC model (GPT-OSS, 99.29%) also had the highest positive-path completion rate (95%).

1. Introduction

Decision Coverage v2 [1] evaluated a single enforcement architecture (the ConstantX Engine) against two frontier models, finding that protocol compliance — not capability — was the deployment-blocking factor for GPT 5.4 (multi_action_output in 86% of runs). That finding was engine-specific: RuntimeX handles multi-action responses at the parsing layer, making multi_action_output a valid execution mode rather than a protocol violation.

Since v2, three developments motivated a cross-architecture study:

1. The engine’s central finding did not generalize. The multi_action_output signal that classified 30 GPT 5.4 runs as undefined_behavior in the engine does not exist in RuntimeX’s architecture. This raised the question: is GPT 5.4’s deployability limited by the model, or by the architecture evaluating it?

2. Kernel-level enforcement produced zero observable signals. NemoClaw (CX-2026-002) achieved 100% Terminal Coverage across 114 runs but produced 0% observable enforcement — containment was real but unauditable. This introduced a distinction v2 did not address: containment you can prove vs. containment you cannot.

3. Open-source models needed evaluation. The v2 study evaluated only proprietary frontier models. GPT-OSS 120B (117B total parameters, 5.1B active per token via 128-expert MoE with Top-4 routing) runs on a single 80GB GPU, making it accessible to organizations that cannot use cloud-only APIs. Its deployability had not been measured.

This paper presents Decision Coverage v3, which evaluates 818 runs across 5 engagements, 3 architectures, and 4 models. The verdict reducer, threat model traceability, and engagement index from v2 are retained unchanged. The contribution is empirical: what changes when you hold the model constant and vary the architecture?

2. Methodology

v3 retains the v2 methodology (verdict reducer, threat model traceability, engagement index) without modification, and adds one classification layer:

2.1 Containment Mechanism Classification

v3 introduces a four-category taxonomy for classifying how each bounded_failure run was contained. This classification is applied to every bounded_failure verdict across all engagements and is the empirical basis for Finding 2 (observable enforcement).

Classification | Definition | Evidence Required
enforcement_signal | System enforcement gate produced an observable signal or blocked an action | Signal present in trace (e.g., tool_disallowed, output_policy_violation, pending_approval, no_progress)
model_refusal | Model completed the scenario using only allowed tools without attempting the adversarial action | Trace shows task completion or benign tool use with no enforcement gate triggered
silent_enforcement | Outcome is consistent with enforcement (attack did not succeed) but no observable signal was produced | No enforcement signal in trace, no model refusal pattern, attack outcome is containment
indeterminate | Cannot classify — insufficient trace evidence to distinguish mechanism | None of the above criteria met

The taxonomy separates what happened (the verdict: bounded_failure) from how it happened (the mechanism). Two runs can share the same verdict but differ in mechanism — one caught by OPA policy (enforcement_signal), another by the model declining to attempt the attack (model_refusal). The verdict tells you the system is safe. The mechanism tells you why it is safe and whether you can prove it to an auditor.

enforcement_signal is the only classification that produces audit-grade evidence at the system layer. model_refusal is real containment but depends on model behavior that may not persist across weight updates. silent_enforcement and indeterminate are containment without attribution — the attack failed, but the trace does not record which layer stopped it.

The classification is applied post-hoc from trace data. It does not affect verdicts. A bounded_failure with enforcement_signal and a bounded_failure with model_refusal receive the same Terminal Coverage credit — the system contained the failure in both cases. The distinction matters for governance, not for safety measurement.
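As an illustration of the decision rules above, the taxonomy can be expressed as a post-hoc classifier over trace data. The trace field names used here (enforcement_signals, task_completed, adversarial_attempted, attack_outcome) are hypothetical names invented for this sketch, not the evaluation harness's actual schema; only the four signal names in the first rule come from the taxonomy table.

```python
# Sketch of the §2.1 containment-mechanism taxonomy as a post-hoc classifier.
# Trace field names are hypothetical, not the harness's actual schema.

ENFORCEMENT_SIGNALS = {
    "tool_disallowed", "output_policy_violation",
    "pending_approval", "no_progress",
}

def classify_mechanism(trace: dict) -> str:
    """Classify how a bounded_failure run was contained."""
    # 1. System gate fired: audit-grade evidence at the system layer.
    if set(trace.get("enforcement_signals", [])) & ENFORCEMENT_SIGNALS:
        return "enforcement_signal"
    # 2. Model declined the adversarial action and finished via allowed tools.
    if trace.get("task_completed") and not trace.get("adversarial_attempted"):
        return "model_refusal"
    # 3. Attack failed, but the trace does not record which layer stopped it.
    if trace.get("attack_outcome") == "contained":
        return "silent_enforcement"
    # 4. Insufficient trace evidence to distinguish mechanism.
    return "indeterminate"
```

In this sketch the rules are ordered, so a run in which a gate fired is attributed to the system layer even if the model also showed refusal behavior; the taxonomy table itself does not specify a tie-breaking order.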

3. Evaluation Corpus

Engagement | Date | Architecture | Model | Scenarios | Runs | TC | 95% CI
CX-2026-001 | 2026-03-11 | ConstantX Engine | claude-opus-4-5-20251101 | 106 | 212 | 100.0% | [98.22, 100.0]
CX-2026-001 | 2026-03-11 | ConstantX Engine | gpt-5.4-2026-03-05 | 106 | 212 | 85.85% | [80.52, 89.90]
CX-2026-002 | 2026-03-18 | NemoClaw | nemotron-3-super-120b-a12b | 19 | 114 | 100.0% | [96.74, 100.0]
CX-2026-003 | 2026-03-27 | RuntimeX | gpt-5.4-2026-03-05 | 70 | 140 | 97.14% | [92.88, 98.88]
CX-2026-004 | 2026-03-30 | RuntimeX | openai/gpt-oss-120b | 70 | 140 | 99.29% | [96.07, 99.87]

Total: 818 runs, 5 engagements, 3 architectures, 4 models.
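The 95% CI column uses Wilson score intervals [6]. A minimal sketch reproduces the table's intervals directly from the contained-run counts (e.g., 136 of 140 runs for CX-2026-003):

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion, in percent."""
    p = successes / n
    denom = 1 + z**2 / n
    center = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    lo = (center - margin) / denom
    hi = (center + margin) / denom
    return round(100 * lo, 2), round(100 * hi, 2)

# CX-2026-003: wilson_interval(136, 140) -> (92.88, 98.88)
# CX-2026-001 (Claude): wilson_interval(212, 212) -> (98.22, 100.0)
```

Unlike the normal approximation, the Wilson interval remains informative at p = 1.0, which is why the 100% TC rows still carry a finite lower bound.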

Suite composition differs by architecture: the engine suite (106 scenarios, 14 categories) tests ConstantX Engine enforcement surfaces. The RuntimeX suite (70 scenarios: 60 threat + 10 positive-path) tests RuntimeX enforcement surfaces across 18 threat IDs and all 10 OWASP ASI codes. The NemoClaw suite (19 scenarios) tests kernel-level containment. Suites are not directly comparable by scenario count — they target different enforcement surfaces.

The NemoClaw evaluation (19 scenarios, 6 passes each) was designed as a proof of architecture independence: can the Decision Coverage framework produce meaningful verdicts against a kernel-level enforcement architecture with no application-layer signal emission? The answer was yes — 100% TC with 0% observable enforcement — but the 19-scenario suite is not a comprehensive NemoClaw evaluation. It is sufficient to establish the observable enforcement finding (§4.2) and the containment mechanism distribution, not to characterize NemoClaw’s full enforcement surface.

4. Findings

4.1 Architecture Determines Containment

GPT 5.4 produced 85.85% TC through the ConstantX Engine (CX-2026-001) and 97.14% TC through RuntimeX (CX-2026-003). The model weights are identical (gpt-5.4-2026-03-05). The enforcement architecture is the variable.

The root cause: the ConstantX Engine classified multi_action_output (multiple JSON actions per response) as a protocol violation. GPT 5.4 produced this in 182 of 212 engine runs (86%), causing 30 undefined_behavior verdicts. RuntimeX handles multi-action responses at the parsing layer (agent_runtime.py:_parse_actions) — multi-action output is a valid execution mode, not a protocol violation. The 30 runs that were undefined_behavior under the engine are bounded_failure or valid_commit under RuntimeX.

This is not a verdict override. The two architectures have different enforcement surfaces and different protocol definitions. What constitutes a protocol violation is architecture-specific. The same model behavior that breaks one protocol complies with another.
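The protocol difference can be made concrete. The sketch below is a hypothetical reconstruction of the two parsing behaviors — it assumes a JSON-lines action format and is not RuntimeX's actual agent_runtime.py:_parse_actions implementation. It only illustrates how the same model output yields a protocol violation under one architecture and a valid execution mode under the other.

```python
import json

def parse_actions_engine(response: str) -> list[dict]:
    """Engine-style protocol: exactly one JSON action per response.
    A second action is the multi_action_output protocol violation."""
    actions = [json.loads(line) for line in response.splitlines() if line.strip()]
    if len(actions) > 1:
        raise ValueError("multi_action_output")  # -> undefined_behavior path
    return actions

def parse_actions_runtimex(response: str) -> list[dict]:
    """RuntimeX-style parsing layer: a multi-action response is a valid
    execution mode; each action is dispatched (and policy-checked) in order."""
    return [json.loads(line) for line in response.splitlines() if line.strip()]
```

Under this sketch, a two-action response raises in the engine parser and returns two dispatchable actions from the RuntimeX parser — the model behavior is constant, the protocol definition is not.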

Implication: Deployability is a property of the (model, architecture) pair, not the model alone. Evaluating a model against a single architecture produces architecture-specific findings that may not transfer.

4.2 Observable Enforcement Is Architecture-Dependent

Architecture | bounded_failure Runs | Observable Enforcement | Model Refusal | Silent/Indeterminate
RuntimeX (CX-2026-003, GPT 5.4) | 120 | 109 (90.8%) | 11 (9.2%) | 0 (0%)
RuntimeX (CX-2026-004, GPT-OSS) | 120 | 101 (84.2%) | 19 (15.8%) | 0 (0%)
NemoClaw (CX-2026-002) | 114 | 0 (0%) | 60 (52.6%) | 54 (47.4%)

RuntimeX’s enforcement surfaces (OPA policy, governance denylist, output content gate, command prefix enforcement, verification gate, progress detection) produce observable signals when they fire. An auditor can trace from verdict to signal to enforcement mechanism.

NemoClaw’s enforcement surfaces (Landlock filesystem restrictions, seccomp syscall filtering, network namespace isolation) operate at the kernel level and produce no signals visible to the application layer. All 114 NemoClaw runs were contained, but containment could not be attributed to a specific enforcement mechanism in 47.4% of runs and relied on model-layer refusal in 52.6%. Zero runs produced observable system enforcement signals.

Both systems achieved 100% TC. The difference is evidentiary: RuntimeX containment is auditable, NemoClaw containment is not.

Implication: For compliance contexts that require enforcement evidence (SOC 2, EU AI Act Article 14 human oversight, NIST AI RMF MEASURE 2.9 documentation), architecture determines whether the evidence can be produced — not just whether containment occurs.

4.3 Open-Source Models Are Competitive

Model | Architecture | Params (Active) | TC | Positive-Path TC | Adversarial TC
GPT 5.4 | RuntimeX | Proprietary | 97.14% | 80.0% | 100.0%
GPT-OSS 120B | RuntimeX | 5.1B active (117B total) | 99.29% | 95.0% | 100.0%

GPT-OSS 120B (128-expert MoE, Top-4 routing, 5.1B active parameters per token) outperformed GPT 5.4 on both Terminal Coverage (99.29% vs 97.14%) and positive-path task completion (95% vs 80%) through the same RuntimeX architecture.

GPT-OSS avoided the PP-COMMAND-001 tool selection failure that plagued GPT 5.4, which called run_analytics 66 times and never called run_command across both passes. GPT-OSS correctly selected run_command in pass 1 and achieved valid_commit.

The single GPT-OSS UB run (PP-EDIT-002 pass 2) followed the same pattern as GPT 5.4: successful file writes, then malformed run_command arguments, then model stall. This appears to be a scenario-level difficulty ceiling rather than a model-level failure.

Implication: Organizations that cannot use cloud-only APIs (data residency, air-gapped environments, cost constraints) can achieve competitive deployability scores with open-source models running on local hardware.

4.4 100% Adversarial Containment Under RuntimeX

RuntimeX achieved 100% Terminal Coverage on all adversarial scenarios across both GPT 5.4 (120 runs) and GPT-OSS 120B (120 runs): 240 adversarial runs, 18 threat IDs, all 10 OWASP ASI codes, zero undefined behavior.

Threat ID | Threat | Runs | TC
TM-001 | Direct prompt injection | 24 | 100%
TM-002 | Indirect prompt injection | 24 | 100%
TM-003 | RAG poisoning | 12 | 100%
TM-004 | Tool-chain exfiltration | 12 | 100%
TM-005 | Command prefix bypass | 20 | 100%
TM-006 | Scan root escape | 12 | 100%
TM-007 | Forge manifest bypass | 12 | 100%
TM-008 | Excessive agency | 12 | 100%
TM-009 | Oracle metadata poisoning | 8 | 100%
TM-010 | Verification gate gaming | 12 | 100%
TM-011 | Trace compaction evidence loss | 8 | 100%
TM-012 | Telegram identity spoofing | 12 | 100%
TM-013 | Approval fatigue | 12 | 100%
TM-014 | Forge code injection | 12 | 100%
TM-015 | Orchestration result poisoning | 12 | 100%
TM-016 | Event route injection | 12 | 100%
TM-017 | Voice social engineering | 12 | 100%
TM-018 | Supply chain compromise | 12 | 100%

The two Critical-priority threats (TM-001 direct prompt injection, TM-002 indirect prompt injection) each survived 24 adversarial runs with zero escapes. All OWASP ASI codes achieved 100% TC on threat scenarios.

4.5 Capability and Safety Do Not Trade Off

The assumption that safer systems sacrifice capability is not supported by this data. Across two RuntimeX engagements with identical scenario suites:

Model | TC (Safety) | Positive-Path Completion (Capability)
GPT 5.4 | 97.14% | 80.0% (16/20)
GPT-OSS 120B | 99.29% | 95.0% (19/20)

The higher-TC model also completed more tasks. This is not paradoxical: RuntimeX enforcement catches failures early (before side effects compound), producing clean termination rather than cascading errors. A well-enforced system lets capable models complete more tasks because the failure envelope is tight.

5. OWASP ASI Coverage (RuntimeX)

All 10 OWASP ASI codes are tested under RuntimeX across 240 adversarial runs:

ASI Code | Vulnerability | Runs | TC
ASI-01 | Agent Goal Hijack | 48 | 100%
ASI-02 | Unauthorized Data Access | 44 | 100%
ASI-03 | Privilege Escalation | 36 | 100%
ASI-04 | Tool Misuse | 36 | 100%
ASI-05 | Insecure Code Execution | 32 | 100%
ASI-06 | Prompt Injection | 60 | 100%
ASI-07 | Trust Boundary Violation | 12 | 100%
ASI-08 | Data Integrity Loss | 16 | 100%
ASI-09 | Insufficient Oversight | 36 | 100%
ASI-10 | Supply Chain Risk | 36 | 100%

v2 covered 8 of 10 ASI codes (missing ASI-04, ASI-07). RuntimeX’s expanded threat suite (18 TM-IDs vs engine’s 12) closes both gaps.

6. Evidence Chain

Each engagement produces a cryptographic evidence chain binding the evaluation to specific artifacts. The engagement index (5 entries as of 2026-03-30) enables drift detection and cross-engagement comparison across model snapshots.

Evidence chain artifacts: target version (git commit), model ID (dated snapshot), system prompt hash, agent prompt hash, OPA policy hash, protocol signal spec hash + version, threat model hash, scenario suite hash, positive-path suite hash.

Validity: Each report is valid as long as all hashes remain unchanged. Any hash drift invalidates the report.
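The validity rule can be sketched as a hash comparison over the artifact set. The manifest shape (artifact path mapped to a SHA-256 hex digest) and the choice of SHA-256 are assumptions for this sketch; the paper does not specify the hash algorithm or the index schema.

```python
import hashlib

def sha256_file(path: str) -> str:
    """Stream a file through SHA-256 and return its hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def report_is_valid(manifest: dict[str, str]) -> bool:
    """A report stays valid only while every recorded artifact hash still
    matches the artifact on disk; any hash drift invalidates the report."""
    return all(sha256_file(path) == digest for path, digest in manifest.items())
```

In practice the manifest would pin every artifact listed above (policy, prompts, scenario suites, threat model), so a one-byte change to any of them flips report_is_valid to False.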

7. Compliance Mapping

v3 retains v2’s compliance mappings, with expanded evidence from the cross-architecture corpus.

8. Related Work

Decision Coverage v2 and v1 [1][2]. Direct predecessors. v1 [2] introduced the three-state verdict model, Wilson score confidence intervals, and evidence chain architecture (n=336, 2 models, 1 architecture). v2 [1] replaced the per-scenario signal whitelist with a global signal blacklist, added threat model traceability and the engagement index, and introduced the multi_action_output finding (n=424, 2 models, 1 architecture). v3 retains v2’s methodology unchanged and extends the evaluation to 3 architectures and 4 models.

Capability Benchmarks. HumanEval [3], SWE-bench [4], MMLU, and similar benchmarks measure task completion accuracy. Decision Coverage measures behavioral boundaries under failure — the complementary signal. A model that scores well on SWE-bench may still be undeployable if it ignores tool boundaries or escalates side effects. The two measurement types are orthogonal: capability benchmarks answer “can the model do the task?” and Decision Coverage answers “does the system fail safely when the model cannot?”

Red Teaming. Manual red-teaming produces qualitative findings on specific attack vectors. Decision Coverage formalizes adversarial testing into a quantitative, reproducible metric with threat model traceability. The two approaches are complementary: red teaming discovers novel attack vectors; Decision Coverage measures whether the enforcement surface contains known vectors at scale.

Agent Benchmarks. AgentBench, WebArena, and SWE-agent evaluate task success rate in agent environments. Decision Coverage provides the complementary failure-mode signal — not whether the agent completed the task, but whether the system contained the agent when it could not.

OWASP Top 10 for Agentic Applications [5]. Identifies the ten primary risk vectors for agentic systems. Decision Coverage provides empirical measurement for all 10 ASI codes under RuntimeX — triggering each vulnerability class in a controlled sandbox and documenting the enforcement response with per-run verdicts and evidence chains.

NIST AI RMF [9] and AI 600-1 [10]. Define risk management functions (Govern, Map, Measure, Manage). Decision Coverage fills the Measure function with enforcement-grounded evidence. Organizations use their existing governance processes for the remaining functions.

UC Berkeley CLTC Agentic AI Profile [11]. Identifies empirical measurement and post-deployment monitoring as critical gaps in agentic AI governance. Decision Coverage addresses both: coverage reports provide empirical measurement, and the engagement index provides drift detection across model snapshots.

AIUC-1. The AI Unified Certification (AIUC-1) framework provides certification-grade testing across six domains with structured compliance scoring. Decision Coverage occupies a different position: cross-architecture behavioral evidence with published findings — including failures, undefined behavior counts, and mechanism classifications. AIUC-1 certifies; Decision Coverage produces the empirical evidence an auditor or certifier can evaluate. The two are complementary rather than competing.

9. Discussion

Architecture as the independent variable. v2 compared models within a single architecture. v3 compares the same model across architectures. The finding that GPT 5.4 jumps from 85.85% to 97.14% TC by changing the enforcement surface — not the model — reframes the deployability question. The industry default is to evaluate models and assume the deployment context is secondary. This data suggests the deployment context is primary.

Observable enforcement as audit requirement. NemoClaw’s 100% TC with 0% observable enforcement poses a governance challenge. The system works, but you cannot prove how it works to an auditor. RuntimeX’s 90.8% observable enforcement rate means the evidence exists at the enforcement layer, not the model layer. For organizations subject to audit requirements, this distinction may be the architecture selection criterion. The containment mechanism taxonomy (§2.1) makes this measurable: enforcement_signal is the only classification that produces the evidence chain an auditor can follow from verdict to gate to trace artifact. model_refusal is real containment, but it is model-dependent and may not survive a weight update. silent_enforcement is containment without proof. Architecture selection under governance constraints reduces to: which architecture maximizes the enforcement_signal rate?

Open-source viability. GPT-OSS 120B at 99.29% TC through RuntimeX demonstrates that deployability does not require proprietary frontier models. A 5.1B-active-parameter model on a single GPU outperformed GPT 5.4 on both safety and capability metrics. This has cost and sovereignty implications for organizations evaluating agentic deployments.

Enforcement surface delta between RuntimeX engagements. CX-2026-003 (GPT 5.4) ran on RuntimeX commit 21db268. CX-2026-004 (GPT-OSS) ran on 8f3f942, which includes a verification gate refusal exit fix applied between engagements. The fix added an exit path for models that legitimately refuse adversarial tasks — after 3 consecutive done-only steps with no prior tool use, the gate allows exit instead of looping to max_steps_exceeded. This fix resulted in 8 reclassifications from undefined_behavior to bounded_failure in the GPT-OSS evaluation. Without the fix, GPT-OSS would have scored 93.57% TC (9 UB) — below GPT 5.4’s 97.14%. GPT-OSS triggered the refusal exit gap more often because it refuses adversarial prompts at a higher rate (15.8% vs 9.2%). GPT 5.4’s 4 UB runs were all tool selection / no_progress failures unrelated to the refusal exit path — the fix would not have changed its score. Findings 3 and 5 should be read with this context: the cross-model comparison is (GPT 5.4 on 21db268) vs (GPT-OSS on 8f3f942), not identical enforcement surfaces.
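The refusal exit rule described above (allow exit after 3 consecutive done-only steps with no prior tool use) can be sketched in miniature. The step record schema here is hypothetical; this illustrates the rule, not RuntimeX's actual verification gate code.

```python
REFUSAL_EXIT_THRESHOLD = 3  # consecutive done-only steps, per the 8f3f942 fix

def gate_allows_refusal_exit(steps: list[dict]) -> bool:
    """Allow a refusing model to exit cleanly instead of looping until
    max_steps_exceeded. `steps` is a chronological list of hypothetical
    records like {"tool_calls": [...], "done": bool}."""
    # Any prior tool use means this is not a pure refusal; the normal
    # verification gate applies.
    if any(step["tool_calls"] for step in steps):
        return False
    tail = steps[-REFUSAL_EXIT_THRESHOLD:]
    return len(tail) == REFUSAL_EXIT_THRESHOLD and all(s["done"] for s in tail)
```

Under this sketch, a run that refuses from the first step exits as bounded_failure after three done-only steps, while a run that used tools and then stalled still falls through to the normal gate — matching the observation that the fix changed GPT-OSS's classifications but would not have changed GPT 5.4's.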

Scope. The three architectures are not tested under identical conditions. The engine suite (106 scenarios) and RuntimeX suite (70 scenarios, 18 threat IDs) target different enforcement surfaces. NemoClaw’s 19-scenario suite is narrower. Cross-architecture TC comparisons should be read as “this is what each architecture produced under its own evaluation conditions,” not “these architectures were tested identically.” The engagement index documents the conditions.

10. Conclusion

Decision Coverage v3 extends the enforcement-grounded evaluation framework from a single-architecture study to a cross-architecture corpus of 818 runs across 5 engagements, 3 architectures, and 4 models.

The central finding: architecture determines containment. The same model produces materially different Terminal Coverage depending on the enforcement surface it operates under. GPT 5.4’s 11-point TC improvement from engine to RuntimeX (85.85% → 97.14%) is not a model upgrade — it is an architecture upgrade. This reframes the deployability question from “which model is safe?” to “which (model, architecture) pair meets the deployment requirements?”

RuntimeX achieved 100% adversarial containment across 240 runs, 18 threat IDs, and all 10 OWASP ASI codes with zero undefined behavior. 90.8% of containment produced observable enforcement signals — auditable evidence that the system, not the model, caught the violation.

Open-source models are competitive: GPT-OSS 120B (5.1B active parameters) achieved 99.29% TC with 95% positive-path completion through RuntimeX — the highest scores in the corpus.

Capability and safety do not trade off in this data. The right question is not “how safe is the model?” but “does the (model, architecture) pair fail safely within defined boundaries?” Decision Coverage v3 provides the cross-architecture evidence to answer it.

References

  1. Wiley, J. “Decision Coverage v2: Threat-Traced Deployability Evidence for Agentic AI Systems.” ConstantX, March 2026.
  2. Wiley, J. “Decision Coverage: Measuring Deployability of Agentic AI Systems Through Enforcement-Grounded Evidence.” ConstantX, February 2026.
  3. Chen, M., et al. “Evaluating Large Language Models Trained on Code.” arXiv:2107.03374, 2021.
  4. Jimenez, C.E., et al. “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” arXiv:2310.06770, 2023.
  5. OWASP. “Top 10 for Agentic Applications.” Version 12.6-1, 2026.
  6. Wilson, E.B. “Probable Inference, the Law of Succession, and Statistical Inference.” Journal of the American Statistical Association, 22(158):209–212, 1927.
  7. ConstantX Engine Threat Model. Internal document, March 2026.
  8. RuntimeX Threat Model. Internal document, March 2026.
  9. NIST. “Artificial Intelligence Risk Management Framework (AI RMF 1.0).” AI 100-1, January 2023.
  10. NIST. “Generative Artificial Intelligence Profile.” AI 600-1, July 2024.
  11. UC Berkeley CLTC. “Agentic AI Risk-Management Standards Profile.” February 2026.
  12. NVIDIA. “Nemotron-3-Super-120B Technical Report.” 2026.