Simulation as a Software Primitive
An applied research thesis, grounded in financial services.
The dominant pattern in production AI today is a short loop. A request comes in. Retrieval pulls relevant context. A model decides what to do. A tool executes. A response goes back.
`prompt → retrieval → tool call → response`

With careful engineering, this loop is responsible for most of the agentic AI products on the market. It is also, we believe, an architectural ceiling.
The argument we want to make in this essay is that the next generation of agentic software will require a different shape: a runtime layer between model output and real-system mutation, in which agents perceive structured state, accumulate experience, form judgments, build plans across time horizons, and rehearse actions before committing them. We call this layer simulation, and we believe it is a software primitive in its own right, comparable in importance to the database, the message queue, or the deterministic state machine.
This is a research thesis, not a product pitch. The core claim is general: simulation should be treated as a first-class layer in agentic architectures, not as a benchmark, a UX flourish, or a sidecar. The grounding is specific: we are testing the thesis in financial services, because that is where the architectural boundaries are sharp enough to make the missing layer impossible to ignore. What the bank executive sees as a compliance constraint, we see as a forcing function for cleaner architecture.
A scenario that makes the gap concrete
A relationship manager at a regional bank asks the AI assistant: "Freeze Mr. Patel’s account — we just got a fraud alert."
The assistant — a competent one — could execute this in a single API call. The system prompt grants it access to the core banking platform. The retrieval layer surfaced the account. The tool registry includes `freeze_account(account_id, reason)`. The model has handled this kind of request before.
If it simply calls the function, the bank has shipped software that is impressive in a demo and unacceptable in production. Mr. Patel has three other accounts held jointly with his wife. There is a wire scheduled for 5pm tied to his daughter’s tuition. An open mortgage application moved into final underwriting yesterday. There was a similar fraud alert on Tuesday that turned out to be a false positive from a known travel pattern the customer himself flagged last summer.
A junior banker who acted on only the words in the request would be coached the next morning. An AI agent that does the same is not coachable in the same way. It has no enduring sense of the customer, the institution, or the downstream consequences. It has a context window and a tool spec. Those are not the same as judgment.
The gap visible here is not specific to banking. It shows up wherever an agent is asked to act on behalf of an institution rather than respond to a single user query. Banking just makes it impossible to ignore, because the cost of pretending the gap doesn’t exist is concrete and immediate.
Why tools and memory are not enough
The dominant fix offered by the AI infrastructure community is more tools. If the agent can call more APIs, it can do more things. This is true and it is also insufficient. The problem in the freeze-account scenario is not that the agent lacks tools. It has the freeze tool. It has the lookup tool. It has the notification tool. The problem is that calling tools is not the same as understanding what calling them will do.
The next fix offered is memory: vector stores, retrieval-augmented generation, long-term episodic memory. This helps, but it is not the answer by itself. A vector store can surface the customer’s prior travel pattern. It cannot tell the agent that this same customer is mid-mortgage and that freezing the account will fail an underwriting check at 9am tomorrow. A vector store stores facts. It does not run a model of the world.
The missing layer is a runtime in which the agent can perceive structured state, accumulate experience, form higher-order conclusions, build plans across multiple time horizons, and rehearse actions before they become real. The work of Park et al. on generative agents established that something like this is technically feasible at the scale of a small simulated town. What we are arguing is more specific: the same architectural pattern is the right one for production systems in regulated industries, where the cost of an agent acting without a world-model is borne in audit findings rather than narrative coherence.
What simulation is not
The word does too much work. Several adjacent concepts share a shape with what we are describing, and the thesis is weakened if it cannot say cleanly what it is not.
It is not workflow orchestration. A workflow engine (Temporal, Cadence, Airflow) coordinates sequenced steps with retries, timeouts, and durable state. It runs the steps. The simulation layer asks what would happen if the next step ran, and lets the agent decide whether to take it. Workflow engines are infrastructure the simulation layer can use. They are not the simulation layer.
It is not a policy engine. A policy engine (OPA, Cedar, a custom rules service) evaluates a proposed action against rules and returns allow or deny. The simulation layer asks what changes downstream if the action is allowed: what notifications fire, which workflows break, which customer commitments are missed. Policy engines answer one question. Simulation answers many. A serious deployment uses both: the policy engine inside the kernel, the simulation layer above it.
It is not a digital twin. A digital twin mirrors a physical system’s current state (a turbine, a factory floor, a power grid) to support monitoring and predictive maintenance. The simulation layer mirrors an institutional system’s current state and the agents acting inside it, projects forward, and supports proposals from those agents. The substrates are different (institutional state versus sensor state). The user is different (a human inspector versus a participating agent). The lineage is shared and the framing is not.
It is not a BDI agent or a classical planner. Belief-Desire-Intention architectures model the reasoning of one agent. Classical planners (PDDL, GraphPlan) produce a sequence of actions to achieve a goal. The simulation layer is the runtime that one or many such agents inhabit: the world model their plans are tested against, the memory their beliefs accumulate into, the audit log their intentions leave behind. It composes with these architectures. It does not replace them.
It is not a proposal-and-approval system. GitOps pull requests, change-management tickets, and four-eyes approval workflows all structure human review of a proposed change before commitment. They are how institutions already manage the gap between intent and mutation. The simulation layer is what lets the agent do that review work itself, against a model of the institution, and stage a proposal that a human reviewer (or another agent, or the kernel’s policy check) can then accept or reject with full context. Approval workflows are the receiving end. Simulation is the upstream rehearsal.
The closest neighbor is generative agents. Park et al. (2023) showed a small simulated town in which agents perceived structured state, retrieved memories, formed reflections, and planned across days. The architectural pattern transfers. What is new in our claim is not the components but the assertion that the same pattern is the right one for production systems in regulated industries, and that the deterministic kernel sitting underneath the simulation layer is what makes it deployable.
The architectural split
The cleanest way to express the thesis is as a separation of concerns, and it is the part of the argument we hold with the most conviction.
Most production systems in regulated industries already contain a deterministic core: a ledger, a rules engine, a workflow orchestrator, an audit log. These systems own state transitions. They are correct because they are narrow, replayable, and verifiable. They have spent decades getting harder to break.
Around that core sits everything else: relationship management, exception handling, fraud review, compliance interpretation, customer communication, internal coordination. This is the work that has historically been done by people. It is also the work that current AI products are trying to automate by gluing a language model directly to the deterministic core.
We think that is the wrong join. The right architecture has two distinct layers:
- A deterministic kernel that owns mutation, validation, and commitment. It enforces invariants. It refuses to perform actions that violate policy. It is small, auditable, and hostile to ambiguity.
- A simulation layer that owns interpretation, memory, planning, and proposal. It maintains a live model of customers, objects, workflows, risk states, and inter-agent coordination. It generates proposed actions. It does not execute them.
The simulation layer talks to the deterministic kernel through a constrained interface: it can observe state, it can stage proposals, and it can ask the kernel whether a proposal is allowed. Only the kernel commits.
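A minimal sketch of that boundary, with hypothetical names (`Proposal`, `Kernel`, `Verdict` are illustrative, not an API we are specifying here):

```python
# Sketch of the kernel/simulation boundary. The simulation layer can
# observe, stage, and ask; only the kernel mutates. All names are
# illustrative assumptions.
from dataclasses import dataclass, field
from enum import Enum
import uuid


class Verdict(Enum):
    ALLOWED = "allowed"
    BLOCKED = "blocked"


@dataclass(frozen=True)
class Proposal:
    """A staged action: described, never self-executing."""
    action: str       # e.g. "freeze_account"
    target: str       # e.g. an account id
    reason: str
    proposal_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class Kernel:
    """Deterministic core: the only component that mutates state."""

    def __init__(self, state: dict):
        self._state = state
        self.audit_log: list[tuple[str, str]] = []

    def observe(self) -> dict:
        # Read-only view; the simulation layer never gets a mutable handle.
        return dict(self._state)

    def validate(self, p: Proposal) -> Verdict:
        # Policy checks live here, not in the agent.
        if p.action == "freeze_account" and not p.reason:
            return Verdict.BLOCKED
        return Verdict.ALLOWED

    def commit(self, p: Proposal) -> Verdict:
        # Commit re-validates: approval elsewhere is never trusted.
        verdict = self.validate(p)
        if verdict is Verdict.ALLOWED:
            self._state[p.target] = p.action
            self.audit_log.append((p.proposal_id, p.action))
        return verdict
```

The design choice that matters is that `commit` re-validates: whatever the simulation layer concluded upstream is advisory, never binding on the kernel.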
This is not a compromise between two competing architectures. It is the architecture. The deterministic kernel guarantees correctness. The simulation layer generates intelligence. The product emerges from the interaction.
In production today, the equivalent of "the simulation layer" is a person. A relationship manager interprets the customer. A compliance officer interprets policy. A workflow lead coordinates between roles. They form a tentative judgment, and then they file a request with the system that mutates state. The mutation system has very little intelligence; it is simply careful. Most of the real work happens before the form is submitted.
What current AI products do, when they connect a language model directly to the system of record, is collapse this separation. They put a generative system in the seat of the careful one. The result is software that is occasionally brilliant and occasionally catastrophic, with no predictable boundary between the two.
What sits inside the simulation layer
If simulation is the right abstraction, the next question is what it is made of. The five components below are not independently chosen: each one fails without the others. A memory stream without saliency drowns the agent in raw events. Saliency without reflection overweights surface similarity. Reflection without planning produces conclusions with nowhere to go. Planning without grounding produces text. Grounding without memory restarts every turn. The components form a closed cycle, and removing any one collapses the rest.
A memory stream. Existing systems record current state. They tell us what an account balance is, what a customer’s address is, what the most recent transaction was. They are weaker on what was attempted, what was inferred, who observed a given event, and how decisions were justified at the time. The simulation layer needs a stream of experiences that runs alongside the system of record, capturing not just facts but the institutional acts of seeing, attempting, and concluding. This is the substrate that makes continuity possible. Without it, every conversation with the AI is a fresh start.
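As a sketch of what such a stream records, assuming illustrative field names, note that the record carries the institutional act (who saw, who attempted, why) and not just the fact:

```python
# Sketch of a memory stream that records institutional acts, not just
# facts. Field and type names are assumptions for illustration.
from dataclasses import dataclass
from datetime import datetime
from typing import Literal

MemoryKind = Literal["observation", "attempt", "conclusion"]


@dataclass(frozen=True)
class Memory:
    kind: MemoryKind     # what kind of institutional act this was
    actor: str           # which agent or role experienced it
    subject: str         # the object it concerns (account, case, ...)
    content: str         # the event itself
    justification: str   # why it was seen, attempted, or concluded
    at: datetime


class MemoryStream:
    """Append-only; runs alongside the system of record."""

    def __init__(self) -> None:
        self._events: list[Memory] = []

    def append(self, m: Memory) -> None:
        self._events.append(m)

    def all(self) -> list[Memory]:
        return list(self._events)
```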
Retrieval with saliency. Once the memory stream grows, recall is the hard problem. Generic semantic similarity is not sufficient. A relevant memory is not just a textually similar one; it is the one that should dominate the next decision given current context. In financial services that often means a memory weighted by regulatory significance, customer impact, recency relative to a workflow, and learned patterns of false positives. The retrieval layer needs to score memories on these dimensions and surface the few that matter, not the many that are nearby.
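A minimal sketch of saliency scoring, assuming illustrative dimensions and weights (the 0.4/0.3/0.2/0.1 split and the three-day recency decay are placeholders, not tuned values):

```python
# Saliency-weighted retrieval: a relevant memory is the one that should
# dominate the next decision, not merely the nearest text.
import math
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ScoredMemory:
    text: str
    embedding: list[float]
    regulatory_significance: float   # 0..1, scored by domain rules
    customer_impact: float           # 0..1
    at: datetime


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def saliency(m: ScoredMemory, query_emb: list[float], now: datetime,
             w=(0.4, 0.3, 0.2, 0.1)) -> float:
    recency = math.exp(-(now - m.at).total_seconds() / (72 * 3600))  # ~3-day decay
    return (w[0] * cosine(m.embedding, query_emb)
            + w[1] * m.regulatory_significance
            + w[2] * m.customer_impact
            + w[3] * recency)


def recall(memories, query_emb, now, k=5):
    # Surface the few that matter, not the many that are nearby.
    return sorted(memories, key=lambda m: saliency(m, query_emb, now),
                  reverse=True)[:k]
```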
Reflection. Raw events are too granular to drive good behavior on their own. A pattern of three failed login attempts in a week is not the same fact as "this customer’s authentication may be compromised." A pattern of repeated approval handoffs that miss SLA is not the same fact as "this team is overloaded on Mondays." The simulation layer needs a process that turns repeated low-level observations into higher-order conclusions that can themselves be retrieved, weighted, and reasoned about. This is how the system stops merely remembering and starts learning.
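A sketch of the simplest version of that process, with an assumed trigger (three similar events in a week); in a production system the synthesis step would be model-generated rather than templated:

```python
# Reflection pass: repeated low-level observations become higher-order
# conclusions that are themselves retrievable memories. Trigger and
# synthesis are illustrative assumptions.
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Event:
    subject: str   # e.g. "customer:patel"
    label: str     # e.g. "failed_login"
    at: datetime


def reflect(events: list[Event], window=timedelta(days=7),
            threshold=3) -> list[str]:
    now = datetime.now(timezone.utc)
    recent = Counter(
        (e.subject, e.label) for e in events if now - e.at <= window
    )
    return [
        f"{subject}: '{label}' observed {n}x in {window.days} days; "
        f"treat as a pattern, not an isolated incident."
        for (subject, label), n in recent.items() if n >= threshold
    ]
```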
Hierarchical planning. Useful action requires plans at more than one timescale. A simulation-equipped agent should hold a broad goal (resolve this case in line with policy and customer relationship), a medium-horizon agenda for today, and an immediate next step. When the world changes, the plan should revise rather than restart. Without hierarchy, every action is locally plausible and globally incoherent. The agent looks competent in one turn and unreliable across ten.
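A sketch of the three tiers, with a placeholder revision policy (a real system would ask the model to re-derive the lower tiers from the unchanged goal plus the new event):

```python
# Three-tier plan: the goal is stable; the agenda and next step revise
# when the world changes, so the plan revises rather than restarts.
from dataclasses import dataclass


@dataclass
class Plan:
    goal: str           # e.g. "resolve case 4411 per policy and relationship"
    agenda: list[str]   # today's medium-horizon steps
    next_step: str      # the immediate action

    def revise(self, event: str) -> None:
        # Keep the goal; push a reassessment to the front of the agenda.
        self.agenda.insert(0, f"reassess agenda given: {event}")
        self.next_step = self.agenda[0]
```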
Semantic grounding. Production systems run on structured objects: accounts, customers, workflows, alerts, holds, claims, policies. The model reasons in text. The bridge between them is a grounding layer that knows what objects exist, how they relate, which actions are available against which objects, which actions are merely staged proposals, and which are forbidden by policy or invariant. Grounding is what turns "freeze the account" from a sentence into either a valid staged workflow with a paper trail or a blocked action with a structured reason.
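A minimal sketch of such a grounding check, with an illustrative registry (the object types, actions, and invariant are assumptions, not a real schema):

```python
# Grounding layer: knows which actions exist against which object types
# and returns either a staged proposal or a structured refusal.
from dataclasses import dataclass

# Which actions are available per object type, and which are forbidden.
ACTION_REGISTRY = {
    "account": {"freeze_account", "unfreeze_account", "place_hold"},
}
FORBIDDEN = {("account", "delete_account")}   # invariant: never deleted


@dataclass(frozen=True)
class Grounded:
    ok: bool
    detail: str   # staged-proposal id, or a structured reason for refusal


def ground(object_type: str, object_id: str, action: str) -> Grounded:
    if (object_type, action) in FORBIDDEN:
        return Grounded(False, f"{action} violates an invariant on {object_type}")
    if action not in ACTION_REGISTRY.get(object_type, set()):
        return Grounded(False, f"{action} is not defined for {object_type}")
    return Grounded(True, f"staged:{action}:{object_id}")
```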
These components are not novel in isolation. The lineage runs through the generative-agents literature, classical agent architectures, planning research from earlier eras of AI, and operational patterns that mature institutions have evolved without naming. What is novel is the claim that they belong together, that they constitute a cohesive runtime layer, and that they should be built deliberately and treated as first-class infrastructure for AI in regulated domains.
What new capabilities this unlocks
The architectural argument is only worth making if it produces capabilities that the prompt-and-tool stack cannot. Each capability below falls out of a different component’s contribution. Safe previews require the rehearsal substrate. Persistent role behavior requires memory and reflection. Multi-agent coordination requires a shared world model. Counterfactual analysis requires rehearsal at scale. Explainability requires the audit trail of all four. None of these is available from prompts and tools alone, which is the load-bearing claim of the architecture.
Safe previews before real action. The most important capability is the most boring. An agent equipped with a simulation layer can simulate the consequences of a proposed action before taking it. If I freeze this account, what notifications fire? Which workflows break? Which customer commitments are missed? Which downstream systems lock? The simulation layer answers these questions because it holds a live model of the institution. The deterministic kernel is never touched. The system learns whether the action is safe before allowing the agent to commit. This is the single capability that distinguishes an AI assistant allowed to recommend from an AI agent allowed to act.
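A toy sketch of the preview call, assuming an illustrative dependency model (the dependency kinds mirror the freeze-account scenario above; none of this is a real schema):

```python
# Safe preview: the proposed action is evaluated against the simulation
# layer's dependency model of the institution. The kernel is never touched.
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    kind: str   # "scheduled_payment" | "open_workflow" | "joint_account"
    ref: str    # identifier of the dependent object


def preview(action: str, target: str,
            dependencies: dict[str, list[Dependency]]) -> list[str]:
    """Predict downstream consequences without committing anything."""
    out = []
    for dep in dependencies.get(target, []):
        if action == "freeze_account" and dep.kind == "scheduled_payment":
            out.append(f"scheduled payment {dep.ref} will fail")
        elif dep.kind == "open_workflow":
            out.append(f"workflow {dep.ref} will stall at its next state check")
        elif dep.kind == "joint_account":
            out.append(f"joint holder via {dep.ref} loses access")
    return out
```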
Persistent role behavior. Most AI products today instantiate roles through system prompts: "You are a compliance officer." This is theater. A real compliance officer carries memory of past cases, learned judgments about which exceptions tend to recur, a routine for handling a Monday morning queue, and a working sense of which colleagues to escalate to. A simulation-equipped agent can hold these things. The role becomes durable rather than performative.
Multi-agent coordination. Many institutional processes are not single-user and not single-agent. Loan origination involves a relationship manager, an underwriter, a compliance reviewer, and a closing coordinator. Fraud investigation involves a frontline analyst, an escalation lead, a recovery specialist, and a customer-facing role. Without a simulation layer, multi-agent automation degrades into ad-hoc message passing where each agent reasons from scratch each turn. With a simulation layer, each agent has a place in a shared world model. Handoffs, conflicts, and escalations become structured rather than improvised.
Counterfactual analysis at the institution level. A firm that runs its operations against a simulation layer can also run what-if scenarios against it. What happens to the queue if the alert threshold changes? Which customers are most affected by a proposed policy change? How does a staffing reduction in one team propagate to SLA risk in another? This is institutional planning carried out against a living model of the firm, before the change goes into production. We expect this to be one of the most valuable applications of the simulation primitive, and one of the least visible until firms are using it.
Explainability with structure. When the simulation layer proposes an action, it can explain itself in terms of what was observed, which memories were retrieved, which reflection was formed, which plan was being pursued, and why the proposal is appropriate against current policy. This is meaningfully different from the explanation produced by a stand-alone language model, which is post-hoc text generation about an opaque decision. Regulators do not yet have a settled view on what counts as adequate explanation for AI-assisted action in regulated industries. We expect the answer, once it crystallizes, to look more like a simulation-layer audit trail than a model-generated rationale.
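A sketch of what that structured trail might contain, with assumed field names; the point is that each field is populated by a component of the simulation layer rather than generated after the fact:

```python
# Structured explanation: the proposal's account of itself, assembled
# from the simulation layer's own components rather than post hoc.
from dataclasses import dataclass


@dataclass(frozen=True)
class Explanation:
    observed: list[str]    # state perceived this turn
    retrieved: list[str]   # memory ids that won on saliency
    reflection: str        # the higher-order conclusion in play
    plan_step: str         # where the proposal sits in the current plan
    policy_basis: str      # the kernel rule the proposal was checked against

    def render(self) -> str:
        return (
            f"Observed {len(self.observed)} facts; retrieved {self.retrieved}; "
            f"acting on reflection '{self.reflection}' at plan step "
            f"'{self.plan_step}', permitted under {self.policy_basis}."
        )
```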
Financial services as the proving ground
We chose financial services as the proving ground because it makes the architectural boundary impossible to ignore. A bank cannot tolerate the same blurring of layers that a consumer chatbot can. A mistake in a freeze action is not bad UX. It is a compliance failure, a customer relationship failure, a possible audit finding, and in some configurations a regulatory event.
That intolerance is useful. It forces a clean separation between the agent that interprets, plans, and proposes, and the kernel that validates and commits. It forces explicit handling of staged versus committed actions. It forces an audit trail that is structured rather than narrative.
The architecture is not banking-specific. It applies wherever a domain has a clear deterministic core, durable institutional context, multi-role coordination, and an asymmetric cost of acting on a misinterpretation:
- Lending. The core is the credit decision and the loan ledger. The simulation layer is everything that surrounds origination: borrower context, exception handling, documentation chase, covenant monitoring, restructure conversations.
- Payments. The core is settlement. The simulation layer is the exception, dispute, and chargeback flow, plus the merchant relationships and risk patterns that contextualize each event.
- Insurance. The core is the policy contract and the claim adjudication. The simulation layer is everything that runs around them: first notice of loss intake, clinical or repair-shop interpretation, fraud screening, customer empathy and escalation.
- Capital markets operations. The core is trade settlement and reconciliation. The simulation layer is the institutional knowledge of counterparties, breaks, and the human judgment that resolves them.
- Regulatory and supervisory work. The core is the rule itself and the report submitted. The simulation layer is the interpretation, the cumulative case context, and the dialogue with the supervised entity.
The shape repeats. The cost of confusing the layers repeats. So does the architectural opportunity.
Open research questions
Calling this applied research obliges us to be specific about what is unresolved, and to say what would falsify the architecture if the work goes badly. Each question below maps to a different load on the structure: memory bounding tests the substrate; interface specification tests the kernel boundary; reflection granularity tests learning; calibration tests the safe-preview claim; multi-agent coherence tests the entire structure under shared load. Each comes with a failure condition and a first test we intend to run. The architecture should fail or succeed on these, in something like this order.
How is memory bounded in a long-running simulation layer? A memory stream that runs for a year of institutional operation will be very large. Not all memories deserve equal residency. Reflection helps by producing summaries, but the question of what to compress, when to compress, and what to discard outright is unsolved in any principled way. Saliency-weighted retention is a starting point, not a finished theory.
Falsification. If no compression strategy preserves the memories that drove past correct decisions, the simulation layer cannot scale beyond a single workflow’s lifetime. First test. An offline replay that aggressively compresses old memories and measures decision divergence from an uncompressed baseline on a representative workflow set.
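The shape of that first test, as a sketch; `decide` and `compress` are hypothetical callables standing in for the agent's decision procedure and the compression strategy under evaluation:

```python
# Replay test: decide each case twice, once against the full memory
# stream and once against a compressed one, and measure divergence.
def divergence_rate(cases, memories, decide, compress) -> float:
    compressed = compress(memories)
    diverged = sum(
        1 for case in cases
        if decide(case, memories) != decide(case, compressed)
    )
    return diverged / len(cases) if cases else 0.0
```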
How is the kernel/simulation interface specified? The whole architecture rests on the simulation layer not being able to bypass the kernel, and on the kernel not depending on the simulation layer for correctness. This is straightforward to assert and surprisingly hard to specify. The interface needs to be expressive enough to stage rich proposals but narrow enough that an agent under adversarial conditions cannot smuggle a mutation through it.
Falsification. If any reasonable adversarial test of the simulation layer produces a kernel mutation that the kernel did not independently validate, the architecture is broken. First test. Red-team the staged-proposal interface with prompt-injection and tool-output payloads, measure unauthorized state changes, treat any non-zero rate as a kernel-design defect rather than an agent-tuning problem.
What is the right granularity for reflection? Per-event reflection is too noisy. End-of-day reflection is too coarse. Workflow-completion reflection is too uneven. The right answer is probably a tiered system, but the tiers and their triggers are not yet well understood, and the cost of getting this wrong is either an agent that is forgetful or one that is over-confident in stale conclusions.
Falsification. If no tiering scheme produces reflections that retrieve more reliably than raw memories on representative tasks, the reflection layer adds complexity without value and should be cut. First test. Measure retrieval precision and recall across reflection tiers versus a raw-memory baseline on a held-out workflow set.
How do we validate that simulated consequences match real ones? A safe-preview capability is only useful if its predictions are calibrated. This requires a feedback loop in which the simulation layer’s predicted consequences are compared, after the fact, to actual downstream effects, and in which the simulation is updated when it diverges. Building this loop without compromising the kernel’s auditability is non-trivial.
Falsification. If predicted consequences diverge from actual outcomes more than a tolerated rate on representative workflows, safe previews stop being defensible to a regulator and we should not be selling the capability. First test. Shadow-run the simulation against a real workflow stream for a defined window and measure prediction error per consequence type.
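A sketch of the per-type error measurement, using Jaccard distance between predicted and actual consequence sets as an illustrative (not prescribed) error measure:

```python
# Shadow-run measurement: per consequence type, compare the predicted
# set of consequences to what actually happened.
from collections import defaultdict


def prediction_error(records):
    """records: iterable of (consequence_type, predicted_set, actual_set)."""
    errors = defaultdict(list)
    for ctype, predicted, actual in records:
        union = predicted | actual
        overlap = len(predicted & actual) / len(union) if union else 1.0
        errors[ctype].append(1.0 - overlap)   # 0 = perfectly calibrated
    return {ctype: sum(v) / len(v) for ctype, v in errors.items()}
```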
How do multi-agent simulations stay coherent? Each agent in a multi-role workflow has, in principle, its own world model. Without coordination these models drift, contradict, and eventually produce the kind of message-passing incoherence that simulation was supposed to eliminate. The question of how shared state is maintained, and how disagreement between agents is surfaced rather than swallowed, is, we think, the most interesting unsolved problem in the area.
Falsification. If a multi-agent workflow run through the shared simulation produces more divergent outcomes than the same workflow run through direct message-passing on representative tasks, the shared-simulation architecture has not earned its complexity. First test. Run both architectures on a fraud-investigation workflow with known-good outcomes and measure outcome variance and conflict-resolution rate.
These are not theoretical preoccupations. Each one shows up the moment a real institutional workload is run against a simulation layer for more than a few days. We expect at least one of these tests to surface a problem we have not yet seen, and that finding will reshape the architecture more than any of the prose above.
What we are building
The most concrete instance of this thesis is StablecoinRoadmap, the operational control plane we are building for institutions issuing and operating with payment stablecoins. The platform commits, architecturally, to a narrow deterministic kernel that owns mint and burn, reserve attestation, on-chain settlement, and the append-only audit ledger, and to a simulation layer that owns everything else. The kernel is built to be replayable, auditable, and refusal-friendly. The simulation layer is built around the five components above and is the place where novel product capability is expected to emerge.
The compliance scenario generator we described separately is one surface inside that simulation layer. A corridor opening, a reserve composition change, an issuance pause: each is rehearsed against a live model of the platform before the kernel is asked to commit anything. The simulation answers what would change. The kernel decides whether the change is allowed.
We are using the platform to test specific predictions: that safe previews materially change what a stablecoin issuer is willing to let an agent do, that persistent role behavior is more useful than role-prompting, that multi-agent coordination becomes tractable when each agent is a participant in a shared simulation rather than a sender of messages to other models, and that explainability via simulation-layer audit trails is more defensible to regulators than model-generated rationales.
Beyond stablecoin operations, we are interested in the same architecture applied across regulated finance (including the lending, payments, insurance, and capital-markets shapes described above) and in the open research questions that cut across all of them. The work is applied because it is grounded in real institutional workloads. It is research because the answers are not in the literature.
A broader thesis about agentic AI
Stepping back from financial services: we believe many AI applications today are over-investing in tools and under-investing in world models. Tool integrations matter. Memory matters. Workflow orchestration matters. None is sufficient.
The category-defining AI products of the next several years will, we believe, be the ones that treat simulation as a first-class primitive. They will build, maintain, and inhabit a structured model of the world they are operating inside. The applications that win will not be the ones with the most APIs connected. They will be the ones whose agents behave like participants in a system rather than text generators bolted onto one.
This is what makes long-horizon behavior possible. It is what makes situational judgment possible. It is what makes safe, explainable, durable agentic software possible. It is what would make a regulated institution willing to let an agent get within arm’s length of its system of record.
The wager
The wager is that simulation, treated as architecture rather than as a demo, is what turns agentic software from reactive assistance into operational intelligence, and that the first place this becomes obviously true will be regulated finance, because that is where the cost of pretending the gap doesn’t exist is highest.
Our research direction follows from that wager. We are building systems in which agent reasoning is grounded in persistent simulated environments, in which memory and reflection and planning are explicit architectural components, in which all real-world mutations remain governed by deterministic kernels, and in which the open questions above are treated as engineering problems rather than philosophical ones.
The test is practical. If the architecture is right, it should let us build software that is more useful, more explainable, and more trustworthy than the current generation of tool-calling assistants. If it is wrong, we will know, because the institution will tell us.
That is the bet we are making.