§ Bench / Public

The Deployment Judgement Bench

Client education, FDE calibration, and evidence scaffolding for AI adoption teams. Built for teams scaling AI deployment into real organisations, where the hard problem is no longer whether the model can act, but whether humans can still intervene before action hardens.

AI adoption is no longer mainly a model capability problem.

It is a deployment judgement problem.

As AI systems move into live organisational workflows, the hard question is whether the organisation can still intervene, recover, refuse, redirect, and learn once the system is deployed.

§ 01 / Why

Why this exists

Most AI deployment failures will not look like sudden technical collapse. They will look like success.

Response times improve. Workflows accelerate. Escalations drop. Manual review shrinks. Dashboards remain calm. Under that apparent stability, recovery windows can narrow, human oversight can become performative, institutional memory can be bypassed, and local efficiency can harden into global rigidity.

The bench is built for the judgement required before that hardening occurs.

§ 02 / Primitives

The shared judgement grammar

Each surface of the bench uses the same primitives. The audience, automation ceiling, and output format change. The judgement discipline does not.

01 Context anchoring: who is acting, who is affected, and what transition is being judged.
02 Human-before-loop: where judgement can still intervene before execution hardens.
03 Reversibility: what becomes harder to undo after the system acts.
04 Recovery window: when response is still meaningful rather than symbolic.
05 Intervention authority: who can pause, redirect, refuse, or escalate.
06 Evidence and provenance: what can be reconstructed while it still matters.
07 Institutional memory: what human knowledge may be compressed, bypassed, or lost.

§ 03 / Modes

Three surfaces of one bench

Mode 2

Public surface

Client Education

Deployment Judgement Snapshot

A browser-based, non-diagnostic demo for surfacing unresolved AI adoption questions.

Built for early client conversations and pre-diagnostic intake. The snapshot reflects participant-provided context and organises it into clear, partial, and not-yet-answered areas. It does not score, rate, diagnose, or approve a deployment.
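The sorting described above can be pictured with a minimal sketch. It is illustrative only and is not the demo's actual implementation: the names `PRIMITIVES`, `Status`, and `build_snapshot` are hypothetical, and it assumes the participant, not the tool, marks whether an answer is settled. Note that nothing is scored or ranked; answers are only grouped.

```python
from enum import Enum

# Hypothetical identifiers for the seven primitives listed in § 02.
PRIMITIVES = [
    "context_anchoring",
    "human_before_loop",
    "reversibility",
    "recovery_window",
    "intervention_authority",
    "evidence_and_provenance",
    "institutional_memory",
]


class Status(Enum):
    CLEAR = "clear"
    PARTIAL = "partial"
    NOT_YET_ANSWERED = "not yet answered"


def build_snapshot(answers):
    """Group participant answers by status without scoring them.

    `answers` maps a primitive name to (text, settled), where `settled`
    is the participant's own statement that the question is resolved
    (an assumption of this sketch).
    """
    snapshot = {status: [] for status in Status}
    for primitive in PRIMITIVES:
        text, settled = answers.get(primitive, ("", False))
        if not text.strip():
            status = Status.NOT_YET_ANSWERED
        elif settled:
            status = Status.CLEAR
        else:
            status = Status.PARTIAL
        snapshot[status].append(primitive)
    return snapshot
```

A snapshot built this way reflects only what the participant supplied: every primitive they did not address lands in the not-yet-answered group rather than being inferred.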

Run the snapshot demo →

Mode 1

Unlisted preview

FDE Calibration

Judgement training simulator

A practitioner-facing surface for developing judgement under ambiguity.

Practitioners work through scenario evidence, commit a case note, and only then receive structured challenge prompts. The learner reasons first. The system challenges after commitment.
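The commit-before-challenge ordering can be sketched as a small gate. This is an illustration under stated assumptions, not the simulator's code: `CalibrationSession` and its method names are hypothetical, and the sketch assumes a case note is committed once and cannot be revised afterwards.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CalibrationSession:
    """Hypothetical session: evidence is visible, challenges are gated."""
    scenario_evidence: List[str]
    challenge_prompts: List[str]
    case_note: Optional[str] = None

    def commit_case_note(self, note: str) -> None:
        # The learner reasons first: one non-empty note, committed once.
        if self.case_note is not None:
            raise ValueError("case note already committed")
        if not note.strip():
            raise ValueError("case note must not be empty")
        self.case_note = note

    def get_challenges(self) -> List[str]:
        # The system challenges only after commitment.
        if self.case_note is None:
            raise PermissionError("commit a case note before viewing challenges")
        return list(self.challenge_prompts)
```

Keeping the gate in the data model, rather than in the interface alone, means the challenge prompts cannot be read early by any route through the session object.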

View the calibration surface →

Mode 3

Spec only

Evidence Scaffolding

Evidence & Responsibility Scaffold

A specification for regulated or high-consequence deployment review.

Designed to structure evidence around human oversight, logs, provenance, affected groups, escalation, recovery, and documentation gaps. Specification only. Not a legal opinion or automated compliance finding.

View the specification →

§ 04 / Runtime Bridge

The human-facing surface of runtime governance

The bench trains the judgement that runtime governance systems must eventually enforce. At the human layer, this appears as FDE judgement: can the organisation still intervene, recover, refuse, redirect, and learn?

At the machine layer, the same logic becomes pre-commitment admissibility: an action is allowed only when the next transition preserves meaningful intervention capacity.
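A pre-commitment admissibility check might look like the sketch below. Everything in it is an assumption made for illustration: the `ProjectedState` fields, the 60-second threshold, and the reading of "meaningful intervention capacity" as "someone with authority can still intervene and enough recovery time remains". A real runtime would need its own definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectedState:
    """Hypothetical summary of the state an action would lead to."""
    authority_can_intervene: bool  # someone can still pause, redirect, or refuse
    recovery_window_s: float       # seconds in which a response still matters


def is_admissible(next_state, min_window_s=60.0):
    # Assumed criterion: intervention authority exists and the
    # recovery window has not shrunk below the threshold.
    return (next_state.authority_can_intervene
            and next_state.recovery_window_s >= min_window_s)


def gate(action, project, execute, min_window_s=60.0):
    """Run `execute(action)` only if the projected next state is admissible.

    `project` maps an action to a ProjectedState; `execute` performs it.
    Returns True if the action ran, False if it was withheld.
    """
    next_state = project(action)
    if not is_admissible(next_state, min_window_s):
        return False
    execute(action)
    return True
```

The check runs before the action commits, so a withheld action leaves the system in its current state and the decision returns to a human.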

The gap between what an organisation says it can control and what the runtime system structurally preserves is a central research object for the bench.

§ 05 / Boundary

The distinction is the product.

The bench does not score, diagnose, certify, rate, or approve deployments.

It structures judgement. That distinction is what separates an instrument that helps an organisation see its own unresolved questions from a tool that pretends to do the seeing on the organisation's behalf.

Full claim boundaries and validation posture →

Start with the snapshot. Then test the bench.

Run the snapshot demo

Discuss a pilot