Quickstart: driving a task with an LLM

OpenErrand is "the pipe" — you bring the intelligence. The SDK's decide(ctx) callback turns each PageContext into the next Command; back it with whatever LLM you want. This guide shows it with Claude.

When you need this. A signed playbook with deterministic steps needs no LLM for the happy path — the steps drive it. Reach for an LLM decider for the cold-start / fallback path: a flow you haven't recorded yet, or a step that broke because the page changed.

#The shape

client.run({ url, userId, decide }) calls your decide(ctx) once per step. The extension sends up the page as a stripped element list (refs + labels + types — no values, no screenshot by default); your decider picks one action; the extension enforces it against the signed playbook fence and executes it. Loop until done.
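To make the contract concrete, here is a minimal sketch of a decider with no LLM behind it. The PageContext and Command types below are simplified local stand-ins, inferred from the fields this guide uses (url, and interactiveElements with ref, type, label). The real definitions live in @obep/protocol and may carry more fields. The helper name decideByLabel and the sample page are hypothetical.

```typescript
// Simplified stand-ins for the @obep/protocol types (assumption: the real
// types may have more fields; only the ones used in this guide appear here).
interface PageElement { ref: string; type: string; label: string }
interface PageContext { url: string; interactiveElements: PageElement[] }
type Command =
  | { action: "click"; ref: string }
  | { action: "done"; result?: Record<string, unknown> };

// A rule-based decider: click the first element whose label matches
// (case-insensitively), otherwise declare the task finished.
function decideByLabel(ctx: PageContext, wanted: string): Command {
  const hit = ctx.interactiveElements.find(
    (e) => e.label.toLowerCase() === wanted.toLowerCase(),
  );
  return hit ? { action: "click", ref: hit.ref } : { action: "done" };
}

// Hypothetical page, shaped like the stripped element list described above.
const page: PageContext = {
  url: "https://portal.example.com/login",
  interactiveElements: [
    { ref: "e1", type: "input", label: "Email" },
    { ref: "e2", type: "button", label: "Sign in" },
  ],
};

const next = decideByLabel(page, "sign in");
```

The decider passed to client.run is async and takes only ctx. The synchronous two-argument form here just keeps the sketch easy to read and test.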

#Install

npm install @anthropic-ai/sdk zod

#A Claude-backed decider

import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";
import { zodOutputFormat } from "@anthropic-ai/sdk/helpers/zod";
import type { Command, PageContext } from "@obep/protocol";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY

// The closed OBEP action surface, as a schema the model MUST emit (structured output).
const CommandSchema = z.discriminatedUnion("action", [
  z.object({ action: z.literal("navigate"), url: z.string() }),
  z.object({ action: z.literal("click"), ref: z.string() }),
  z.object({ action: z.literal("fill"), ref: z.string(), value: z.string() }),
  z.object({ action: z.literal("fillSecret"), ref: z.string(), credentialKey: z.string() }),
  z.object({ action: z.literal("upload"), ref: z.string(), file: z.string() }),
  z.object({ action: z.literal("wait"), ref: z.string().optional(), timeoutMs: z.number().optional() }),
  z.object({ action: z.literal("extract"), ref: z.string(), as: z.string() }),
  z.object({ action: z.literal("done"), result: z.record(z.unknown()).optional() }),
]);

const SYSTEM = `You drive a web task in the user's own browser, one step at a time.
Each turn you receive the current page as a list of interactive elements (ref, type, label).
Choose exactly ONE next action from the allowed set to make progress toward the goal.

Rules:
- Address elements only by their "ref". Never invent a ref that isn't listed.
- To enter a saved login, use fillSecret with the credentialKey — never type a
  password as a value. You never see secret values.
- Call "done" when the goal is complete, with any extracted result.
- Prefer the smallest action that makes progress; do not guess at off-page navigation.`;

export function makeClaudeDecider(goal: string) {
  // Per-task memory: the goal, plus every page seen and action taken so far.
  const history: Anthropic.MessageParam[] = [{ role: "user", content: `Goal: ${goal}` }];

  return async function decide(ctx: PageContext): Promise<Command> {
    history.push({
      role: "user",
      content: `URL: ${ctx.url}\nElements:\n${ctx.interactiveElements
        .map((e) => `- ${e.ref} <${e.type}> ${e.label}`)
        .join("\n")}`,
    });

    const res = await anthropic.messages.parse({
      model: "claude-opus-4-8",
      max_tokens: 1024, // output is one small command
      system: [{ type: "text", text: SYSTEM, cache_control: { type: "ephemeral" } }],
      messages: history,
      output_config: { format: zodOutputFormat(CommandSchema) },
    });

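    const command = res.parsed_output; // validated against CommandSchema
    if (!command) {
      // No parseable command came back (for example a refusal or a truncated response).
      throw new Error(`No valid command returned (stop_reason: ${res.stop_reason})`);
    }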
    history.push({ role: "assistant", content: JSON.stringify(command) });
    return command as Command;
  };
}

Wire it into a task:

import { RelayClient } from "@obep/sdk";
import { WebSocket } from "ws";

const client = new RelayClient({ url: RELAY_WS, apiKey: API_KEY, WebSocketImpl: WebSocket });

const result = await client.run({
  url: "https://portal.example.com/login",
  userId: "dana",
  decide: makeClaudeDecider("Log in, upload the claim document, then read the confirmation number."),
});

#Why this is safe even though the model is "untrusted"

The enforcement layer treats the LLM as adversarial — which is exactly right:

  • The fence still wins. Whatever the model emits is re-checked on-device against the signed playbook's allowedDomains / allowedActions / allowedCredentialKeys before it runs. A hallucinated navigate to an off-fence domain is blocked, not executed.
  • The model never sees secrets. fillSecret carries only a credentialKey; the value is resolved from the on-device vault. And capture minimization means the model receives element labels, not field values — so a password on the page is never in the context you send to Claude.

So a misbehaving or prompt-injected model can, at worst, fail the task — it cannot exfiltrate a credential or escape the playbook's domains.
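The fence check described above can be sketched roughly as follows. This is an illustration of the idea, not the extension's actual implementation: the Fence and Command shapes, the checkFence name, and the exact-hostname matching rule are all assumptions made for this example.

```typescript
// Hypothetical fence shape, named after the playbook fields mentioned above.
interface Fence {
  allowedDomains: string[];
  allowedActions: string[];
  allowedCredentialKeys: string[];
}

// Simplified command shape (assumption): only the fields the check reads.
interface Command {
  action: string;
  url?: string;
  credentialKey?: string;
}

type Verdict = { ok: true } | { ok: false; reason: string };

function checkFence(cmd: Command, fence: Fence): Verdict {
  if (!fence.allowedActions.includes(cmd.action)) {
    return { ok: false, reason: `action "${cmd.action}" is not allowed` };
  }
  if (cmd.action === "navigate") {
    let host: string;
    try {
      host = new URL(cmd.url ?? "").hostname;
    } catch {
      return { ok: false, reason: "navigate url is not a valid absolute URL" };
    }
    // Assumption: exact hostname match; a real fence may treat subdomains differently.
    if (!fence.allowedDomains.includes(host)) {
      return { ok: false, reason: `domain "${host}" is off-fence` };
    }
  }
  if (cmd.action === "fillSecret") {
    const key = cmd.credentialKey;
    if (!key || !fence.allowedCredentialKeys.includes(key)) {
      return { ok: false, reason: "credentialKey is not allowed" };
    }
  }
  return { ok: true };
}

// Hypothetical fence for the portal example used in this guide.
const fence: Fence = {
  allowedDomains: ["portal.example.com"],
  allowedActions: ["navigate", "click", "fill", "fillSecret", "done"],
  allowedCredentialKeys: ["portal-login"],
};
```

The point of the sketch is that every rejection depends only on the command and the signed fence, never on anything the model says about itself.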

#Model choice for a per-step loop

run() calls your decider once per step, so model latency compounds over a flow. The example above defaults to claude-opus-4-8 (most capable). If step latency matters more than per-step reasoning depth, switch to a faster tier (claude-haiku-4-5 or claude-sonnet-4-6) by changing the model string. For flows that need real reasoning (ambiguous pages, recovery), add adaptive thinking to the request: thinking: { type: "adaptive" }.
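One way to keep that choice in a single place is a small helper that returns the request fields that vary. This is a sketch only: the tier names, the stepParams helper, and the mapping of Haiku to "fast" and Sonnet to "balanced" are assumptions of this example. Whether each tier accepts adaptive thinking is not stated in this guide, so confirm that before combining them.

```typescript
// Hypothetical tier labels; the model strings are the ones named in this guide.
type Tier = "fast" | "balanced" | "capable";

const MODEL_BY_TIER: Record<Tier, string> = {
  fast: "claude-haiku-4-5",
  balanced: "claude-sonnet-4-6",
  capable: "claude-opus-4-8",
};

interface StepParams {
  model: string;
  thinking?: { type: "adaptive" };
}

// Build the per-step request fields that depend on the latency/depth trade-off.
function stepParams(tier: Tier, needsReasoning: boolean): StepParams {
  const params: StepParams = { model: MODEL_BY_TIER[tier] };
  if (needsReasoning) {
    params.thinking = { type: "adaptive" };
  }
  return params;
}
```

The result can be spread into the request, for example ...stepParams("fast", false) next to max_tokens and messages.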

#Keeping it cheap: caching across steps

Each step re-sends the whole conversation so far (the goal + every prior page + action), so the same prefix is processed on every call. Two caching levers:

  • The conversation prefix is the real win. Put a cache_control breakpoint on the last block of the most recent turn each step; the next step reads the cached prefix and only pays full price for the new page. This is the standard multi-turn pattern and it compounds as the flow grows.
  • The system prompt caches too — but only if it's large enough. The minimum cacheable prefix on Opus-tier models is 4096 tokens (2048 on Sonnet/Haiku); a short instruction block like the one above is below that, so its cache_control marker is a silent no-op. It pays off when your system prompt is large (detailed policy, few-shot examples). Verify with usage.cache_read_input_tokens — if it stays 0, nothing cached.
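The decider shown earlier does not set a conversation breakpoint, so here is a minimal sketch of the first lever. The TextBlock and Message types are simplified local stand-ins for the SDK's message types, and the helper names withCacheBreakpoint and cacheWasRead are hypothetical. Only the cache_control: { type: "ephemeral" } marker and the usage.cache_read_input_tokens field come from this guide.

```typescript
// Local stand-ins for the SDK's message types (assumption: simplified).
interface TextBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}
interface Message {
  role: "user" | "assistant";
  content: string | TextBlock[];
}

// Return a copy of the history whose final message ends in a cache breakpoint.
// The stored history is left untouched, so stale markers never accumulate.
function withCacheBreakpoint(history: Message[]): Message[] {
  if (history.length === 0) return [];
  const last = history[history.length - 1];
  const blocks: TextBlock[] =
    typeof last.content === "string"
      ? [{ type: "text", text: last.content }]
      : last.content.map((b) => ({ ...b }));
  if (blocks.length === 0) return history.slice(); // nothing to mark
  const tail = blocks[blocks.length - 1];
  blocks[blocks.length - 1] = { ...tail, cache_control: { type: "ephemeral" } };
  return [...history.slice(0, -1), { role: last.role, content: blocks }];
}

// True when the response usage shows the cached prefix was actually read.
function cacheWasRead(usage: { cache_read_input_tokens?: number | null }): boolean {
  return (usage.cache_read_input_tokens ?? 0) > 0;
}
```

In the decider this would mean passing messages: withCacheBreakpoint(history) instead of messages: history, then checking cacheWasRead(res.usage) from the second step onward.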

#Notes

  • messages.parse + output_config.format forces the model's output to match CommandSchema, so you get a validated Command with no brittle JSON parsing.
  • Keep decide as deterministic as you can: address elements by ref, and let the extension's enforcement (not your prompt) be the security boundary.
  • This is "bring your own LLM" — the same decide(ctx) => Command contract works with any provider; only the call inside decide changes.
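Because the contract is provider-independent, small checks can wrap any decider. The sketch below rejects a command whose ref was not in the page the model was shown. The types are simplified local stand-ins for the @obep/protocol ones, and targetsKnownRef and withRefCheck are hypothetical helpers. This is a convenience for catching invented refs early, not a replacement for the extension's enforcement, and how client.run reacts to a decider that throws is not covered in this guide.

```typescript
// Simplified stand-ins for the @obep/protocol types (assumption).
interface PageElement { ref: string; type: string; label: string }
interface PageContext { url: string; interactiveElements: PageElement[] }
interface Command { action: string; ref?: string }

// True when the command targets no element, or targets a ref listed on the page.
function targetsKnownRef(ctx: PageContext, cmd: Command): boolean {
  if (cmd.ref === undefined) return true;
  return ctx.interactiveElements.some((e) => e.ref === cmd.ref);
}

// Wrap any decider (Claude-backed or otherwise) so an invented ref fails fast.
function withRefCheck(
  decide: (ctx: PageContext) => Promise<Command>,
): (ctx: PageContext) => Promise<Command> {
  return async (ctx) => {
    const cmd = await decide(ctx);
    if (!targetsKnownRef(ctx, cmd)) {
      throw new Error(`decider returned unknown ref "${cmd.ref}"`);
    }
    return cmd;
  };
}
```

Usage would look like decide: withRefCheck(makeClaudeDecider(goal)), with the real Command and PageContext types in place of the stand-ins.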