# Quickstart: driving a task with an LLM

OpenErrand is "the pipe" — *you* bring the intelligence. The SDK's `decide(ctx)`
callback turns each `PageContext` into the next `Command`; back it with whatever
LLM you want. This guide shows it with Claude.

> **When you need this.** A signed playbook with deterministic `steps` needs **no
> LLM** for the happy path — the steps drive it. Reach for an LLM decider for the
> **cold-start / fallback path**: a flow you haven't recorded yet, or a step that
> broke because the page changed.

## The shape

`client.run({ url, userId, decide })` calls your `decide(ctx)` once per step. The
extension sends up the page as a stripped element list (refs + labels + types — no
values, no screenshot by default); your decider picks one action; the extension
**enforces it against the signed playbook fence** and executes it. Loop until `done`.
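
To make the loop concrete, here is a minimal sketch of what one task conceptually looks like from the decider's point of view. The types are pared-down stand-ins for `PageContext` and `Command`, and `Pipe`, `runLoop`, `observe`, `enforce`, `execute`, and the `maxSteps` cap are hypothetical names invented for illustration. They are not the SDK's or the extension's actual internals.

```typescript
// Pared-down stand-ins for the protocol types (assumptions, not the real definitions).
type PageElement = { ref: string; type: string; label: string };
type PageContext = { url: string; interactiveElements: PageElement[] };
type Command =
  | { action: "click"; ref: string }
  | { action: "done"; result?: Record<string, unknown> };

// Hypothetical hooks standing in for the extension side of the pipe.
interface Pipe {
  observe(): Promise<PageContext>; // capture the stripped element list
  enforce(cmd: Command): boolean; // check the command against the playbook fence
  execute(cmd: Command): Promise<void>; // perform the action in the browser
}

async function runLoop(
  pipe: Pipe,
  decide: (ctx: PageContext) => Promise<Command>,
  maxSteps = 25, // assumed safety cap so a confused decider cannot loop forever
): Promise<Record<string, unknown> | undefined> {
  for (let step = 0; step < maxSteps; step++) {
    const ctx = await pipe.observe();
    const cmd = await decide(ctx); // exactly one decision per step
    if (cmd.action === "done") return cmd.result;
    if (!pipe.enforce(cmd)) throw new Error(`blocked by fence: ${cmd.action}`);
    await pipe.execute(cmd);
  }
  throw new Error("step limit reached without done");
}
```

The point of the sketch is the ordering: the decider only proposes, and the fence check sits between the proposal and the browser.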

## Install

```bash
npm install @anthropic-ai/sdk zod
```

## A Claude-backed decider

```ts
import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";
import { zodOutputFormat } from "@anthropic-ai/sdk/helpers/zod";
import type { Command, PageContext } from "@obep/protocol";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY

// The closed OBEP action surface, as a schema the model MUST emit (structured output).
const CommandSchema = z.discriminatedUnion("action", [
  z.object({ action: z.literal("navigate"), url: z.string() }),
  z.object({ action: z.literal("click"), ref: z.string() }),
  z.object({ action: z.literal("fill"), ref: z.string(), value: z.string() }),
  z.object({ action: z.literal("fillSecret"), ref: z.string(), credentialKey: z.string() }),
  z.object({ action: z.literal("upload"), ref: z.string(), file: z.string() }),
  z.object({ action: z.literal("wait"), ref: z.string().optional(), timeoutMs: z.number().optional() }),
  z.object({ action: z.literal("extract"), ref: z.string(), as: z.string() }),
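  // Two-argument z.record (key schema, value schema) is accepted by both zod 3 and zod 4.
  z.object({ action: z.literal("done"), result: z.record(z.string(), z.unknown()).optional() }),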
]);

const SYSTEM = `You drive a web task in the user's own browser, one step at a time.
Each turn you receive the current page as a list of interactive elements (ref, type, label).
Choose exactly ONE next action from the allowed set to make progress toward the goal.

Rules:
- Address elements only by their "ref". Never invent a ref that isn't listed.
- To enter a saved login, use fillSecret with the credentialKey — never type a
  password as a value. You never see secret values.
- Call "done" when the goal is complete, with any extracted result.
- Prefer the smallest action that makes progress; do not guess at off-page navigation.`;

export function makeClaudeDecider(goal: string) {
  // Per-task memory: the goal, plus every page seen and action taken so far.
  const history: Anthropic.MessageParam[] = [{ role: "user", content: `Goal: ${goal}` }];

  return async function decide(ctx: PageContext): Promise<Command> {
    history.push({
      role: "user",
      content: `URL: ${ctx.url}\nElements:\n${ctx.interactiveElements
        .map((e) => `- ${e.ref} <${e.type}> ${e.label}`)
        .join("\n")}`,
    });

    const res = await anthropic.messages.parse({
      model: "claude-opus-4-8",
      max_tokens: 1024, // output is one small command
      system: [{ type: "text", text: SYSTEM, cache_control: { type: "ephemeral" } }],
      messages: history,
      output_config: { format: zodOutputFormat(CommandSchema) },
    });

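    const command = res.parsed_output; // validated against CommandSchema
    if (!command) {
      // No parsed output (for example a refusal or a truncated response): fail the step loudly.
      throw new Error(`No command parsed (stop_reason: ${res.stop_reason})`);
    }
    history.push({ role: "assistant", content: JSON.stringify(command) });
    return command as Command;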
  };
}
```

Wire it into a task:

```ts
import { RelayClient } from "@obep/sdk";
import { WebSocket } from "ws";

const client = new RelayClient({ url: RELAY_WS, apiKey: API_KEY, WebSocketImpl: WebSocket });

const result = await client.run({
  url: "https://portal.example.com/login",
  userId: "dana",
  decide: makeClaudeDecider("Log in, upload the claim document, then read the confirmation number."),
});
```

## Why this is safe even though the model is "untrusted"

The enforcement layer treats the LLM as adversarial — which is exactly right:

- **The fence still wins.** Whatever the model emits is re-checked on-device against
  the signed playbook's `allowedDomains` / `allowedActions` / `allowedCredentialKeys`
  before it runs. A hallucinated `navigate` to an off-fence domain is **blocked**, not
  executed.
- **The model never sees secrets.** `fillSecret` carries only a `credentialKey`; the
  value is resolved from the on-device vault. And capture minimization means the model
  receives element *labels*, not field *values* — so a password on the page is never in
  the context you send to Claude.

So a misbehaving or prompt-injected model can, at worst, fail the task — it cannot
exfiltrate a credential or escape the playbook's domains.
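
As an illustration of the kind of check described above, here is a minimal sketch of a fence test. Only the three field names (`allowedDomains`, `allowedActions`, `allowedCredentialKeys`) come from the playbook. The `Fence` and `Verdict` types, the `checkFence` function, and the subdomain-matching rule are assumptions made for this example. The real enforcement lives in the extension and may differ.

```typescript
// Hypothetical fence shape, using the three field names from the playbook.
type Fence = {
  allowedDomains: string[];
  allowedActions: string[];
  allowedCredentialKeys: string[];
};

// Pared-down command type (assumption; the real one lives in @obep/protocol).
type Command =
  | { action: "navigate"; url: string }
  | { action: "click"; ref: string }
  | { action: "fillSecret"; ref: string; credentialKey: string };

type Verdict = { ok: true } | { ok: false; reason: string };

function checkFence(cmd: Command, fence: Fence): Verdict {
  if (!fence.allowedActions.includes(cmd.action)) {
    return { ok: false, reason: `action not allowed: ${cmd.action}` };
  }
  if (cmd.action === "navigate") {
    let host: string;
    try {
      host = new URL(cmd.url).hostname;
    } catch {
      return { ok: false, reason: "unparseable url" };
    }
    // Exact host or subdomain match; the real matching rules may differ.
    const allowed = fence.allowedDomains.some((d) => host === d || host.endsWith(`.${d}`));
    if (!allowed) return { ok: false, reason: `domain not allowed: ${host}` };
  }
  if (cmd.action === "fillSecret" && !fence.allowedCredentialKeys.includes(cmd.credentialKey)) {
    return { ok: false, reason: `credential key not allowed: ${cmd.credentialKey}` };
  }
  return { ok: true };
}
```

Because the check takes only the command and the fence as input, nothing the model writes in its reasoning or output can widen what is allowed.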

## Model choice for a per-step loop

`run()` calls your decider **once per step**, so latency compounds over a flow. The
example above uses `claude-opus-4-8` (most capable). If step latency matters more than
per-step reasoning depth, switch to a faster tier (`claude-haiku-4-5` or
`claude-sonnet-4-6`) by changing the `model` string. For flows that need real reasoning
(ambiguous pages, recovery), add adaptive thinking with `thinking: { type: "adaptive" }`,
and raise `max_tokens` to leave room for it, since thinking tokens count toward that limit.
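
One way to ground the model choice in numbers is to time each decision. The sketch below wraps any decider and records per-step latency. `withTiming` and `Decider` are hypothetical helper names, not part of the SDK.

```typescript
// Generic over context and command types so it can wrap any decider.
type Decider<Ctx, Cmd> = (ctx: Ctx) => Promise<Cmd>;

// Returns a decider that behaves like `decide` but appends the wall-clock
// duration of every call (in milliseconds) to `timingsMs`, including calls that throw.
function withTiming<Ctx, Cmd>(
  decide: Decider<Ctx, Cmd>,
  timingsMs: number[],
): Decider<Ctx, Cmd> {
  return async (ctx) => {
    const start = Date.now();
    try {
      return await decide(ctx);
    } finally {
      timingsMs.push(Date.now() - start);
    }
  };
}
```

Passing `withTiming(makeClaudeDecider(goal), timings)` as `decide` and summing `timings` after the run shows how much of the flow's duration was spent waiting on the model.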

## Keeping it cheap: caching across steps

Each step re-sends the whole conversation so far (the goal + every prior page + action),
so the same prefix is processed on every call. Two caching levers:

- **The conversation prefix is the real win.** Put a `cache_control` breakpoint on the
  **last block of the most recent turn** each step; the next step reads the cached prefix
  and only pays full price for the new page. This is the standard multi-turn pattern and
  it compounds as the flow grows.
- **The system prompt** caches too — *but only if it's large enough.* The minimum
  cacheable prefix on Opus-tier models is **4096 tokens** (2048 on Sonnet/Haiku); a short
  instruction block like the one above is below that, so its `cache_control` marker is a
  silent no-op. It pays off when your system prompt is large (detailed policy, few-shot
  examples). Verify with `usage.cache_read_input_tokens` — if it stays 0, nothing cached.
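
The decider shown earlier marks only the system prompt. A minimal sketch of the moving conversation breakpoint follows. The `TextBlock` and `Message` types are local stand-ins that mirror the Messages API content format so the example stands alone (real code would use `Anthropic.MessageParam`), and `withCacheBreakpoint` and `cacheWasRead` are hypothetical helper names.

```typescript
// Local shapes mirroring the Messages API content format (stand-ins for the SDK types).
type TextBlock = { type: "text"; text: string; cache_control?: { type: "ephemeral" } };
type Message = { role: "user" | "assistant"; content: string | TextBlock[] };

// Returns a copy of `history` with one cache breakpoint on the last block of the
// most recent turn. The stored history is left unmarked, so breakpoints do not
// pile up from step to step (a request accepts only a small number of them).
function withCacheBreakpoint(history: Message[]): Message[] {
  const last = history[history.length - 1];
  if (!last) return [];
  const blocks: TextBlock[] =
    typeof last.content === "string"
      ? [{ type: "text", text: last.content }]
      : last.content.map((b) => ({ ...b }));
  const tail = blocks.pop();
  if (!tail) return [...history]; // empty content array: nothing to mark
  blocks.push({ ...tail, cache_control: { type: "ephemeral" } });
  return [...history.slice(0, -1), { role: last.role, content: blocks }];
}

// True when the response read at least some input tokens from the cache.
function cacheWasRead(usage: { cache_read_input_tokens?: number | null }): boolean {
  return (usage.cache_read_input_tokens ?? 0) > 0;
}
```

In the decider this would mean sending `withCacheBreakpoint(history)` as `messages` instead of `history`, then checking `cacheWasRead(res.usage)` from the second step onward.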

## Notes

- `messages.parse` + `output_config.format` forces the model's output to match
  `CommandSchema`, so you get a validated `Command` with no brittle JSON parsing.
- Keep `decide` deterministic-ish: address by `ref`, and let the extension's enforcement
  (not your prompt) be the security boundary.
- This is "bring your own LLM" — the same `decide(ctx) => Command` contract works with any
  provider; only the call inside `decide` changes.
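
To show that the contract does not depend on any provider, here is a sketch of a decider with no LLM at all. It can be handy as a stub in tests. The types are pared-down stand-ins for the protocol types, and `makeLabelDecider` is a hypothetical name.

```typescript
// Pared-down stand-ins for the protocol types (assumptions, not the real definitions).
type PageElement = { ref: string; type: string; label: string };
type PageContext = { url: string; interactiveElements: PageElement[] };
type Command =
  | { action: "click"; ref: string }
  | { action: "done"; result?: Record<string, unknown> };

// Clicks the first element whose label matches (case-insensitive), once,
// then reports done. Same decide(ctx) => Command shape as the Claude decider.
function makeLabelDecider(label: string) {
  const wanted = label.toLowerCase();
  let clicked = false;
  return async function decide(ctx: PageContext): Promise<Command> {
    const target = ctx.interactiveElements.find((e) => e.label.toLowerCase() === wanted);
    if (target && !clicked) {
      clicked = true;
      return { action: "click", ref: target.ref };
    }
    return { action: "done" };
  };
}
```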
