Trace-to-Code for Browser Agents: Compiling LLM Plans into Verified Playwright with Drift Monitors and Agent Fallback

Modern browser agents are good at exploration and brittle at repetition. They will happily find a button that a human would miss, but they will just as happily stumble over the smallest UI change. Teams want the creativity of agents during discovery and the reliability of code during delivery. The practical way forward is a trace-to-code compiler: record an agent’s successful plan, synthesize a typed Playwright flow with contracts and assertions, monitor UI and API drift, attempt safe auto-repair when changes land, and only then fall back to the agent.

This piece lays out the full system: the compilation model, the flow DSL, pre and post contracts, DOM and network assertions, idempotent writes, drift monitors, auto-repair strategies, and an execution engine that prefers compiled paths while preserving agent flexibility. The result is a pipeline that turns one-off agent improvisation into maintainable, verifiable automation.

Why trace-to-code now

Cost and latency: Compiled flows skip the per-step model call, so for routine work they can run 10–100x cheaper and faster than loop-in-the-LLM agents.
Reliability: Contracts, assertions, and idempotency rules move us from vibes to verification.
Maintainability: Code with typed boundaries integrates with CI, reviews, and coverage. It is audit-friendly.
Observability: Drift monitors quantify change; auto-repair suggests patches, closing the loop.

Definitions we will use

Trace: A structured log of DOM actions, network I/O, and higher-level intentions captured from an agent’s successful run.
Compiler: A program that converts traces and plans into typed Playwright flows with contracts and assertions.
Contracts: Pre and post conditions that specify acceptable world states at flow boundaries.
Drift monitor: A component that quantifies UI and API changes and decides whether auto-repair is safe.
Auto-repair: Heuristic rewrites of selectors, waits, and paths that preserve contracts.
Fallback: When repair fails or risk is high, the orchestrator returns control to the agent to re-explore.

Architecture overview

Capture: Instrument the agent to emit a high-fidelity, typed trace: intents, DOM actions, DOM snapshots, network requests and responses, layout metrics, and timing.
Normalize: Canonicalize selectors, strip unstable attributes, compute DOM and API signatures, and deduplicate steps.
Synthesize: Map intents to a flow DSL built on Playwright. Introduce types, contracts, assertions, and idempotent write guards.
Verify: Run the synthesized flow in a sandbox with drift monitors on. Enforce contracts and produce a machine report.
Ship: Commit the flow to a codebase with CI. On change, run drift monitors, try auto-repair, and only then fall back to the agent.
Learn: Every fallback produces a new trace; diff it against the last compiled version to propose minimal patches.

Data structures for traces

A robust trace is structured, typed, and loss-aware. Useful fields include:

Step: intent, dom_action, network_action, observation
DOM: selector candidates, roles, names, ARIA, inner text tokens, computed styles, rectangle, ancestors, mutation timestamps
Network: method, url template, query map, headers, body AST, response status, response schema snapshot
Timing: start, end, waits, animation frames
Oracles: human or model-level assertions that judged success during the original run

A minimal JSON-like sketch (in a TS-friendly shape) might look like:

ts
type TraceStep =
  | { kind: 'intent'; goal: string; rationale?: string }
  | { kind: 'dom'; action: 'click' | 'fill' | 'select' | 'press'; selectorHints: string[]; role?: string; name?: string; textTokens?: string[]; framePath?: string[]; rect?: { x: number; y: number; w: number; h: number } }
  | { kind: 'net'; action: 'request' | 'response'; method: string; url: string; urlTemplate?: string; headers: Record<string, string>; bodyAst?: unknown; status?: number; responseSchema?: unknown }
  | { kind: 'observe'; what: 'dom' | 'net' | 'url' | 'screenshot'; signature: string; details?: unknown };

interface Trace {
  steps: TraceStep[];
  successOracle: { postUrl?: string; mustContainText?: string[]; networkInvariants?: { method: string; urlPrefix: string; minStatus: number }[] };
}
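
The urlTemplate field is easiest to understand by example. The sketch below shows one way the Normalize stage might derive it from a concrete URL. The function name toUrlTemplate and the {id} and {uuid} placeholders are hypothetical, and the rule set assumes that numeric ids and UUIDs are the only volatile path segments.

```typescript
// Hypothetical normalizer: turns a concrete URL into a path template.
// Assumption: only numeric ids and UUIDs are volatile; real apps may need more rules.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC_RE = /^\d+$/;

function toUrlTemplate(url: string): string {
  // Drop scheme and host, then the query string and fragment, keeping only the path.
  const noOrigin = url.replace(/^[a-z][a-z0-9+.-]*:\/\/[^/]+/i, '');
  const path = noOrigin.split(/[?#]/)[0];
  return path
    .split('/')
    .map((seg) => {
      if (NUMERIC_RE.test(seg)) return '{id}';
      if (UUID_RE.test(seg)) return '{uuid}';
      return seg;
    })
    .join('/');
}
```

Two requests that differ only in the order id then share one template, which keeps network signatures comparable across runs.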

Compiler goals and invariants

Determinism: Make timing and waiting deterministic; decouple them from animation and layout thrash.
Contract-first: Everything has pre and post contracts. The flow refuses to run if preconditions fail.
Idempotency: Writes use read-before-write, conditional semantics, or safe retries based on server responses.
Observability: Every step emits measurable artifacts: DOM signature deltas, assertion checks, and network schema diffs.

A typed Playwright flow DSL with contracts

Vanilla Playwright is expressive but unopinionated. We wrap it in a small DSL that carries types and contracts. That allows the compiler to target a constrained surface area and to attach assertions cleanly.

Example DSL and helpers

ts
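// deps: @playwright/test, zod (optional), a hashing helper for signatures
import { expect, Page, request } from '@playwright/test';
import type { DriftMonitor } from './drift'; // assumed module path for the DriftMonitor class sketched later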

// A contract-aware operation returns a result and emits drift data
export type Op<T> = (ctx: FlowContext) => Promise<T>;

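export interface FlowContext {
  page: Page;
  baseUrl: string;
  drift: DriftMonitor;
  log: (msg: string, data?: unknown) => void;
  // Flow input, set by the orchestrator before the flow runs. Loosely typed in this
  // sketch; a generic FlowContext<I> would be stricter.
  input?: any;
}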

export interface Contract {
  pre: Op<void>;
  post: Op<void>;
}

export function withContracts<T>(op: Op<T>, contract: Contract): Op<T> {
  return async (ctx) => {
    await contract.pre(ctx);
    const out = await op(ctx);
    await contract.post(ctx);
    return out;
  };
}

// Assertions for DOM and network
export async function assertVisible(ctx: FlowContext, locator: string): Promise<void> {
  await expect(ctx.page.locator(locator)).toBeVisible({ timeout: 5000 });
  await ctx.drift.captureDomSignature(locator, 'visible');
}

export async function click(ctx: FlowContext, locator: string): Promise<void> {
  await assertVisible(ctx, locator);
  await ctx.page.locator(locator).click();
  await ctx.drift.captureDomSignature(locator, 'clicked');
}

export async function fill(ctx: FlowContext, locator: string, value: string): Promise<void> {
  await assertVisible(ctx, locator);
  await ctx.page.locator(locator).fill(value);
  await ctx.drift.captureDomSignature(locator, 'filled');
}

// Network assertion helper
export async function assertLastRequest(
  ctx: FlowContext,
  predicate: (req: { method: string; url: string; headers: Record<string, string>; body?: unknown }) => void
): Promise<void> {
  const last = await ctx.drift.lastRequest();
  expect(last).toBeTruthy();
  predicate(last!);
  await ctx.drift.captureNetSignature(last!);
}

Compiler patterns for selectors and waits

Prefer role and name based locators: page.getByRole('button', { name: 'Add to cart' }) over brittle CSS.
Fall back to hybrid selectors: role-name first, then data-testid, then CSS with stable attributes.
Waits: prefer expect-based waits to fixed sleeps. Use expect(locator).toBeEnabled() rather than raw waits.
Frames: encode framePath in the locator builder to avoid frame mismatch.
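
As a rough illustration of the fallback order above, the sketch below turns recorded hints into an ordered list of selector strings. The LocatorHints shape and the buildSelectorChain name are hypothetical; the role= strings follow the same form used in the synthesized flow later in this piece.

```typescript
// Hypothetical hint shape; field names are illustrative and not part of Playwright.
interface LocatorHints {
  role?: string;
  name?: string;
  testId?: string;
  css?: string;
}

// Escape backslashes and double quotes so the value is safe inside [attr="..."].
function quoteAttr(value: string): string {
  return '"' + value.replace(/\\/g, '\\\\').replace(/"/g, '\\"') + '"';
}

// Returns selector strings in preference order:
// role (with name when known), then data-testid, then CSS.
function buildSelectorChain(hints: LocatorHints): string[] {
  const chain: string[] = [];
  if (hints.role && hints.name) {
    chain.push('role=' + hints.role + '[name=' + quoteAttr(hints.name) + ']');
  } else if (hints.role) {
    chain.push('role=' + hints.role);
  }
  if (hints.testId) chain.push('[data-testid=' + quoteAttr(hints.testId) + ']');
  if (hints.css) chain.push(hints.css);
  return chain;
}
```

A runner could try each entry with page.locator(selector) in order and record which one matched, so that later repairs know which alternates are still viable.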

A concrete synthesized flow

Suppose an agent successfully completes: search for a product, add to cart, and ensure the cart contains it. The compiler can synthesize:

ts
// product_cart_flow.ts
import { test, expect } from '@playwright/test';
import { withContracts, click, fill, assertVisible } from './dsl';
import { z } from 'zod';

const ProductSchema = z.object({ id: z.string(), name: z.string(), price: z.number() });

export interface AddToCartInput { query: string; productName: string }

// withContracts<T> is parameterized by the op's return type; the input travels on ctx.input
export const addToCart = withContracts<void>(
  async (ctx) => {
    const { page, baseUrl, log } = ctx;

    await page.goto(`${baseUrl}/`);
    await assertVisible(ctx, 'role=searchbox[name="Search"]'); // DSL helpers take selector strings, not Locator objects

    await fill(ctx, 'role=searchbox[name="Search"]', ctx.input.query);
    await click(ctx, 'role=button[name="Search"]');

    // Prefer role+name for product card
    const productCard = `role=article[name=${JSON.stringify(ctx.input.productName)}]`;
    await assertVisible(ctx, productCard);

    // Idempotent guard: if already in cart, do not click add again
    const alreadyInCart = await page.locator(`${productCard} >> text=In cart`).isVisible().catch(() => false);
    if (!alreadyInCart) {
      await click(ctx, `${productCard} >> role=button[name="Add to cart"]`);
    } else {
      log('Skip add: product already in cart');
    }

    // Network assertion: confirm a POST /cart event occurred or was not needed
    await ctx.drift.expectNetworkEvent(
      {
        method: 'POST',
        urlPrefix: '/api/cart',
      },
      { optional: alreadyInCart }
    );

    // Schema check on cart API: register the interceptor before opening the cart,
    // otherwise the cart response has already gone by when the handler is attached
    await ctx.drift.interceptJson('/api/cart', (json) => {
      const parsed = z.object({ items: z.array(ProductSchema) }).safeParse(json);
      expect(parsed.success).toBeTruthy();
      const names = (parsed.success ? parsed.data.items : []).map((i) => i.name);
      expect(names).toContain(ctx.input.productName);
    });

    // Postcondition: open cart, verify product present
    await click(ctx, 'role=button[name="Cart"]');
    const cartItem = `role=listitem[name=${JSON.stringify(ctx.input.productName)}]`;
    await expect(page.locator(cartItem)).toBeVisible();
  },
  {
    pre: async (ctx) => {
      // Precondition: site reachable, not authenticated or authenticated consistently
      await ctx.page.goto(`${ctx.baseUrl}/health`);
      await expect(ctx.page.getByText('OK')).toBeVisible();
      await ctx.drift.reset();
    },
    post: async (ctx) => {
      // Postcondition: cart badge count >= 1 and DOM stable
      const badge = ctx.page.getByTestId('cart-badge');
      await expect(badge).toBeVisible();
      const count = parseInt(await badge.textContent() || '0', 10);
      expect(count).toBeGreaterThanOrEqual(1);
      await ctx.drift.assertDomStability({ tolerance: 0.1 });
    }
  }
);

// Use in a test
test('add to cart flow', async ({ page }) => {
  // makeFlowContext is an assumed factory (not shown) that wires up the DriftMonitor and logger
  const ctx = makeFlowContext({ page, baseUrl: process.env.BASE_URL! });
  ctx.input = { query: 'wireless headphones', productName: 'Acme Headphones X' } as AddToCartInput; // set by orchestrator
  await addToCart(ctx);
});

Notes on the compiled flow

Role and name locators dominate, consistent with accessibility best practices and higher cross-browser stability.
A simple idempotent guard avoids double-adding.
Network assertion is optional when we find evidence the item was already in cart.
Postcondition enforces both semantic success and a stability signal.

Building drift monitors

Drift monitors quantify the difference between the last verified run and the current DOM or API. They should be low-cost, composable, and explainable.

DOM drift features to track

Structural: role path from root to element, depth, sibling index histogram.
Semantic: accessible name, aria-label, aria-expanded, placeholder text.
Text tokens: a bag-of-words with stopword filtering, casing normalized.
Visual: bounding box and relative position within viewport (normalized coordinates); optional perceptual hash of cropped element.
Attributes: data-testid, data-qa, rel, href, type, value presence.

An implementation sketch

ts
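import { expect, Page } from '@playwright/test';

export class DriftMonitor {
  private domSigs: Record<string, string[]> = {};
  private netSigs: string[] = [];
  private lastReq?: { method: string; url: string; headers: Record<string, string>; body?: unknown };

  constructor(private page: Page) {
    // Record every outgoing request so lastRequest() and expectNetworkEvent() have data to inspect
    page.on('request', (req) => {
      void this.captureNetSignature({ method: req.method(), url: req.url(), headers: req.headers(), body: req.postData() ?? undefined });
    });
  }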

  async captureDomSignature(locator: string, phase: string) {
    const loc = this.page.locator(locator);
    const handle = await loc.elementHandle();
    if (!handle) return;
    const sig = await handle.evaluate((el) => {
      const role = (el.getAttribute('role') || '').toLowerCase();
      const name = (el.getAttribute('aria-label') || el.textContent || '').trim().slice(0, 64);
      const rect = el.getBoundingClientRect();
      const parentRoles: string[] = [];
      let p = el.parentElement;
      for (let i = 0; i < 5 && p; i++) { parentRoles.push((p.getAttribute('role') || p.tagName).toLowerCase()); p = p.parentElement; }
      return JSON.stringify({ role, name, rect: [rect.x, rect.y, rect.width, rect.height].map((v) => Math.round(v)), parentRoles });
    });
    this.domSigs[locator] = [...(this.domSigs[locator] || []), `${phase}:${sig}`];
  }

  async assertDomStability(opts: { tolerance: number }) {
    // In a real impl, compare against a baseline stored per locator
    // Here we just ensure we captured signatures and they are not wildly empty
    for (const [loc, sigs] of Object.entries(this.domSigs)) {
      expect(sigs.length).toBeGreaterThan(0);
    }
  }

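  async interceptJson(urlPrefix: string, checker: (json: unknown) => void) {
    await this.page.route('**/*', async (route) => {
      const req = route.request();
      const matches = req.url().includes(urlPrefix) && ['GET', 'POST', 'PUT', 'PATCH'].includes(req.method());
      if (!matches) {
        await route.continue();
        return;
      }
      // A route can be handled only once: fetch the real response, then fulfill with it
      const resp = await route.fetch();
      let json: unknown;
      try {
        json = await resp.json();
      } catch (_) {
        json = undefined; // non-JSON body: nothing to check
      }
      await route.fulfill({ response: resp });
      // Run the checker outside the try so assertion failures are not silently swallowed;
      // a production monitor would likely collect them and rethrow at the end of the flow
      if (json !== undefined) checker(json);
    });
  }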

  async lastRequest() { return this.lastReq; }

  async expectNetworkEvent(spec: { method: string; urlPrefix: string }, opts?: { optional?: boolean }) {
    // naive: only inspects signatures recorded so far and does not wait for in-flight requests
    if (opts?.optional) return;
    expect(this.netSigs.find((s) => s.includes(`${spec.method}:${spec.urlPrefix}`))).toBeTruthy();
  }

  async captureNetSignature(req: { method: string; url: string; headers: Record<string, string>; body?: unknown }) {
    const sig = `${req.method}:${new URL(req.url, 'http://x').pathname}`;
    this.netSigs.push(sig);
    this.lastReq = req;
  }

  async reset() { this.domSigs = {}; this.netSigs = []; this.lastReq = undefined; }
}
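
The sketch above stores signatures but never compares them, while the repair code later in this piece calls ctx.drift.similarity. One plausible way to score two signatures is shown below. It is a minimal sketch: the weights are illustrative assumptions, not tuned values, and the rectangle is ignored for brevity.

```typescript
// Mirrors the JSON shape produced by captureDomSignature.
interface DomSignature {
  role: string;
  name: string;
  rect: number[]; // [x, y, width, height]; unused in this simplified score
  parentRoles: string[];
}

function nameTokens(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);
}

// Jaccard overlap of two token lists, treated as sets.
function jaccard(a: string[], b: string[]): number {
  const ua = a.filter((t, i) => a.indexOf(t) === i);
  const ub = b.filter((t, i) => b.indexOf(t) === i);
  if (ua.length === 0 && ub.length === 0) return 1;
  const shared = ua.filter((t) => ub.indexOf(t) !== -1).length;
  return shared / (ua.length + ub.length - shared);
}

// Fraction of ancestor positions (nearest first) where the roles agree.
function ancestorAgreement(a: string[], b: string[]): number {
  const longest = Math.max(a.length, b.length);
  if (longest === 0) return 1;
  let same = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    if (a[i] === b[i]) same++;
  }
  return same / longest;
}

// Weighted blend of role match, name overlap, and ancestor path agreement.
function signatureSimilarity(a: DomSignature, b: DomSignature): number {
  const roleScore = a.role === b.role ? 1 : 0;
  const nameScore = jaccard(nameTokens(a.name), nameTokens(b.name));
  const pathScore = ancestorAgreement(a.parentRoles, b.parentRoles);
  return 0.4 * roleScore + 0.4 * nameScore + 0.2 * pathScore;
}
```

With these weights, a button whose label changes completely but whose role and ancestors are unchanged scores about 0.6, so whether a rename counts as a safe repair depends heavily on how the weights and the acceptance threshold are chosen.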

Network assertions and idempotent writes

Agents often power-click their way into duplicate actions. Compiled flows guard writes with read-before-write and network-level oracles.

Patterns for safe writes

Read-before-write: GET or HEAD to check existence; proceed only if needed.
Conditional updates: Use If-Match with ETag or pass a version token; retry on conflict.
Semantic checks: After POST, GET to confirm state; tolerate 200 or 201 depending on idempotency.
Schema guards: zod or io-ts to validate response shape.

Example: idempotent user creation

ts
import { expect, request } from '@playwright/test';
import type { FlowContext } from './dsl';

interface UserInput { email: string; name: string }

async function ensureUser(ctx: FlowContext, input: UserInput) {
  const r = await request.newContext({ baseURL: ctx.baseUrl });
  const byEmail = `/api/users?email=${encodeURIComponent(input.email)}`;
  try {
    // Read-before-write: skip the POST if the user is already there
    const lookup = await r.get(byEmail);
    if (lookup.ok()) {
      const data = await lookup.json();
      if (Array.isArray(data) && data.some((u: any) => u.email === input.email)) {
        ctx.log('User exists, skipping create');
        return;
      }
    }
    const create = await r.post('/api/users', { data: input });
    expect([200, 201, 409]).toContain(create.status());
    if (create.status() === 409) ctx.log('Conflict on create, likely exists');
    // Semantic check: confirm the final state whichever branch ran
    const confirm = await r.get(byEmail);
    expect(confirm.ok()).toBeTruthy();
    const json = await confirm.json();
    expect(Array.isArray(json)).toBeTruthy();
    expect(json.some((u: any) => u.email === input.email)).toBeTruthy();
  } finally {
    await r.dispose(); // release the API request context
  }
}
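
The conditional-update pattern from the list above can be illustrated without a real server. The sketch below uses an in-memory stand-in whose writeIfMatch mimics an If-Match check and answers 412 (Precondition Failed) on a stale version. FakeStore and ensureValue are hypothetical names, and the integer etag is a simplification of a real ETag header.

```typescript
// In-memory stand-in for a server that supports ETag-style versioning.
// Everything here is hypothetical; it only illustrates the retry-on-conflict loop.
interface Versioned<T> { value: T; etag: number }

class FakeStore<T> {
  private current: Versioned<T>;
  constructor(initial: T) { this.current = { value: initial, etag: 1 }; }

  read(): Versioned<T> {
    return { value: this.current.value, etag: this.current.etag };
  }

  // Mimics If-Match: the write succeeds only when the caller saw the latest version.
  writeIfMatch(next: T, etag: number): { status: 200 | 412 } {
    if (etag !== this.current.etag) return { status: 412 };
    this.current = { value: next, etag: etag + 1 };
    return { status: 200 };
  }
}

// Read, skip if the desired state already holds, otherwise write conditionally.
// On 412 the loop re-reads and tries again, up to maxAttempts.
function ensureValue<T>(store: FakeStore<T>, desired: T, maxAttempts = 3): 'unchanged' | 'written' | 'gave-up' {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const seen = store.read();
    if (seen.value === desired) return 'unchanged';
    if (store.writeIfMatch(desired, seen.etag).status === 200) return 'written';
  }
  return 'gave-up';
}
```

Calling ensureValue twice with the same desired value writes once and then reports 'unchanged', which is the property a compiled flow needs before it can retry safely.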

Synthesizing contracts from traces

The compiler learns what to assert from the successful run.

Precondition candidates: specific route responds 200; presence of a nav bar; sign-in state.
Postcondition candidates: target element visible; URL pattern; network call succeeded and response valid.
Generalization: emit role-based locators and moderate string matching. Avoid overfitting to absolute text.

Algorithm sketch

From the trace, choose for each action the most stable locator based on a score: role match > aria-label > data-testid > CSS; penalize dynamic classes and nth-child.
From successful timings, replace raw waits with semantic expects: use toBeVisible, toBeEnabled, or waitForResponse with URL predicates.
For each write, identify the network pair (the request and its response) and emit a matching idempotent pattern: a read to check current state, the write itself, and a confirming read.
Collect post-run oracles: URLs, text tokens, and API responses, and synthesize postconditions guarded by tolerance thresholds.
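
The first step of the algorithm, choosing the most stable locator, might be scored along these lines. The base scores and penalty sizes are illustrative assumptions, and the pattern used to spot generated class names is a heuristic rather than a definitive rule.

```typescript
type SelectorKind = 'role' | 'aria' | 'testid' | 'css';
interface Candidate { kind: SelectorKind; value: string }

// Base scores encode the ordering role > aria-label > data-testid > CSS.
const BASE_SCORE: Record<SelectorKind, number> = { role: 1.0, aria: 0.8, testid: 0.6, css: 0.4 };

// Class tokens that look generated, e.g. ".css-1x2y3z" or ".Button_a8f3c2":
// a separator followed by five or more alphanumerics that include a digit.
const HASHED_CLASS = /\.[\w-]*[_-](?=[a-z0-9]*\d)[a-z0-9]{5,}\b/i;

function scoreCandidate(c: Candidate): number {
  let score = BASE_SCORE[c.kind];
  if (HASHED_CLASS.test(c.value)) score -= 0.3; // generated class names churn on rebuilds
  if (c.value.indexOf(':nth-child') !== -1) score -= 0.2; // positional selectors break on reordering
  return score;
}

// Returns the highest-scoring candidate, or undefined for an empty list.
function pickMostStable(candidates: Candidate[]): Candidate | undefined {
  return candidates.slice().sort((a, b) => scoreCandidate(b) - scoreCandidate(a))[0];
}
```

Keeping the losing candidates alongside the winner is what later gives auto-repair its list of alternates.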

Auto-repair when drift is detected

Minor UI changes should not force an agent fallback. Repair strategies include:

Selector upgrade: switch to an alternate locator already recorded in the trace (role to aria, aria to data-testid, etc.).
Selector rewrite: adjust text normalization, ignore case, trim whitespace, or use partial match.
Wait semantics: replace timeout with toBeEnabled; extend to stable network waits (wait for desired response).
Nearest neighbor: use DOM signature similarity to choose a new element candidate; cap the allowed distance to avoid wild hops.
Multi-path generalization: if two or more users complete the same task via different buttons or menus, record both as acceptable.

A safe repair engine is declarative: define a small rewrite grammar and a cost model. Each transformation must preserve contracts and keep execution within a safety envelope.
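
A minimal sketch of such a grammar follows, assuming each rule is a predicate, a rewrite, and a cost. The rule names, costs, and budget semantics are hypothetical. The planner only proposes a rewritten selector; the contracts would still need to be rerun before the change is accepted.

```typescript
// A tiny declarative repair grammar. Rule names and costs are illustrative.
interface RewriteRule {
  name: string;
  cost: number;
  applies: (selector: string) => boolean;
  rewrite: (selector: string) => string;
}

const RULES: RewriteRule[] = [
  {
    name: 'collapse-whitespace',
    cost: 1,
    applies: (s) => /\s{2,}/.test(s),
    rewrite: (s) => s.replace(/\s{2,}/g, ' '),
  },
  {
    name: 'drop-nth-child',
    cost: 3,
    applies: (s) => /:nth-child\(\d+\)/.test(s),
    rewrite: (s) => s.replace(/:nth-child\(\d+\)/g, ''),
  },
];

// Applies the cheapest applicable rules first and stops before the budget is exceeded.
function planRepairs(selector: string, budget: number): { selector: string; applied: string[]; spent: number } {
  const ordered = RULES.slice().sort((a, b) => a.cost - b.cost);
  let current = selector;
  let spent = 0;
  const applied: string[] = [];
  for (const rule of ordered) {
    if (!rule.applies(current)) continue;
    if (spent + rule.cost > budget) break;
    current = rule.rewrite(current);
    spent += rule.cost;
    applied.push(rule.name);
  }
  return { selector: current, applied, spent };
}
```

Because every rule is data, the list of applied rule names can go straight into the patch description that a reviewer sees.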

Repair pseudo-implementation

ts
interface RepairContext { page: Page; drift: DriftMonitor; }

interface SelectorVariant { kind: 'role' | 'aria' | 'testid' | 'css'; value: string; score: number }

function generateVariants(fromTrace: { selectors: SelectorVariant[] }): SelectorVariant[] {
  // Sort and produce alternates with relaxed text matching
  const base = [...fromTrace.selectors].sort((a, b) => b.score - a.score);
  const relaxed = base.flatMap((s) => s.kind === 'role' ? [{ ...s, value: s.value.replace(/\s+/g, ' ').toLowerCase(), score: s.score - 0.1 }] : []);
  return [...base, ...relaxed];
}

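async function tryRepair(ctx: RepairContext, target: string, candidates: SelectorVariant[]): Promise<string | null> {
  for (const c of candidates) {
    const locator = buildLocator(c); // assumed helper: SelectorVariant -> selector string
    try {
      const first = ctx.page.locator(locator).first(); // first() is synchronous, no await needed
      // isVisible() does not wait, so use waitFor to give the element a short grace period
      const visible = await first.waitFor({ state: 'visible', timeout: 1500 }).then(() => true, () => false);
      if (!visible) continue;
      const sim = await ctx.drift.similarity(target, locator); // assumed DriftMonitor method, not part of the sketch above
      if (sim >= 0.85) return locator; // threshold tuned by data
    } catch (_) {}
  }
  return null;
}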

Execution engine: prefer compiled, fall back on failure

The orchestrator should run compiled flows first and only involve the agent when contracts cannot be satisfied or the drift distance exceeds the repair budget.

ts
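import type { FlowContext, Op } from './dsl';

enum Outcome { Success, RepairedAndSuccess, NeedsAgent, HardFail }

interface OrchestratorInput { flow: Op<any>; ctx: FlowContext; repairBudget: number }

// Assumed helpers implemented elsewhere in the orchestrator; signatures are illustrative
declare function attemptRepairs(ctx: FlowContext, opts: { budget: number }): Promise<{ ok: boolean }>;
declare function runAgent(ctx: FlowContext): Promise<{ success: boolean; trace?: unknown }>;
declare function proposePatch(ctx: FlowContext, trace: unknown): Promise<void>;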

export async function runWithFallback({ flow, ctx, repairBudget }: OrchestratorInput): Promise<Outcome> {
  try {
    await flow(ctx);
    return Outcome.Success;
  } catch (e: any) {
    ctx.log('Flow failed, attempting repair', { message: String(e) });

    const repairResult = await attemptRepairs(ctx, { budget: repairBudget });
    if (repairResult.ok) {
      try {
        await flow(ctx);
        return Outcome.RepairedAndSuccess;
      } catch (e2) {
        ctx.log('Post-repair flow still failing, escalating to agent', { message: String(e2) });
      }
    }

    const agentResult = await runAgent(ctx);
    if (agentResult.success) {
      await proposePatch(ctx, agentResult.trace);
      return Outcome.NeedsAgent;
    }
    return Outcome.HardFail;
  }
}

Security and privacy notes

Secrets and tokens: route-based interceptors should redact Authorization, cookies, and PII in traces and drift logs.
Least privilege: compiled flows and agents should run with separate credentials. Agents often need more scopes for discovery; compiled flows should not.
Test data: flows that mutate data should operate in sandbox tenants or use ephemeral IDs; idempotent writes reduce risk but do not eliminate it.
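
A redaction pass over captured headers and text could look roughly like this. The list of sensitive header names and the email pattern are assumptions that a real deployment would extend, and other classes of PII would need their own rules.

```typescript
// Header names treated as sensitive; the list is an assumption, not exhaustive.
const SENSITIVE_HEADERS = ['authorization', 'cookie', 'set-cookie', 'x-api-key'];
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Returns a copy of the headers with sensitive values replaced.
function redactHeaders(headers: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  Object.keys(headers).forEach((key) => {
    const sensitive = SENSITIVE_HEADERS.indexOf(key.toLowerCase()) !== -1;
    out[key] = sensitive ? '[REDACTED]' : headers[key];
  });
  return out;
}

// Masks email addresses in free text before it reaches traces or drift logs.
function redactText(text: string): string {
  return text.replace(EMAIL_RE, '[EMAIL]');
}
```

Running redaction at capture time, before anything is written to disk, means baselines and reports never hold the raw values.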

CI and developer experience

Code review: generated flows live in git, with readable diffs and type checks. The compiler should emit comments explaining synthesizer choices.
Test runners: run flows as Playwright tests, collect traces, and publish HTML reports that include drift deltas and repair attempts.
Golden baselines: store DOM and network signatures alongside the test to version the contract with the code.
Dashboards: a change feed for high-drift selectors; a repair queue that proposes code diffs for approval.

Evaluation metrics

Reliability: percent of runs that succeed without agent fallback; mean time to failure (MTTF) for compiled paths.
Repair efficacy: fraction of failures fixed by single-step repair; average edit distance of fixed selectors.
Latency: p50 and p95 run time compared to agent-only execution.
Cost: credits or API dollars saved by compiled-first strategy.
Safety: count of idempotent-guarded writes; number of accidental duplicates prevented.
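
Several of these metrics fall out of a simple log of run records. The sketch below assumes a hypothetical RunRecord shape with an outcome and a duration, and uses the nearest-rank definition of a percentile.

```typescript
type RunOutcome = 'success' | 'repaired' | 'agent' | 'fail';
interface RunRecord { outcome: RunOutcome; durationMs: number }

// Nearest-rank percentile on a sorted copy of the data; p is in (0, 100].
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

function summarize(runs: RunRecord[]) {
  const count = (o: RunOutcome) => runs.filter((r) => r.outcome === o).length;
  const compiledFailures = count('repaired') + count('agent') + count('fail');
  const durations = runs.map((r) => r.durationMs);
  return {
    // Share of runs that finished without involving the agent.
    noFallbackRate: runs.length ? (count('success') + count('repaired')) / runs.length : 0,
    // Share of compiled-path failures that auto-repair fixed.
    repairEfficacy: compiledFailures ? count('repaired') / compiledFailures : 0,
    p50Ms: percentile(durations, 50),
    p95Ms: percentile(durations, 95),
  };
}
```

Computing the same summary over agent-only runs gives the baseline for the latency and cost comparisons.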

Empirical tips from the field

Annotate the DOM: if you own the app, add data-testid liberally. A small addition pays off hugely in selector stability.
Prefer role-based locators even when data-testid exists; roles make your flows robust to cosmetic changes.
Keep contracts small and layered: cheap preconditions guard expensive operations; postconditions do not overspecify layout.
Cache network schemas: drift at the API layer often explains UI oddities; use schema diffs to drive priority triage.
Visual checks sparingly: pixel diffs are flaky. Use perceptual hash at low resolution or defer to DOM semantics.

Limitations and open problems

Cross-tenant variability: content-heavy pages with personalized slots reduce selector stability. Consider plan-time masking.
Big-bang redesigns: repair breaks down when entire flow changes. Rapid agent fallback and new synthesis are necessary.
CAPTCHAs and anti-automation: compiled flows cannot solve these by design; the orchestrator must route around or obtain test exemptions.
Multi-tab, multi-window interactions: flows must encode window handles and modals precisely; drift monitors need window-aware signatures.
Long-lived sessions: preserving authenticated state requires careful cookie and storage isolation in CI.

How the compiler handles non-trivial patterns

File uploads: traces include input[type=file] and drag-drop events; synthesized flows resolve absolute file paths in CI containers and assert Content-Type on network upload.
Rich text editors: clicks and input at contentEditable nodes are synthesized with selection-based typing and assert serialization via API hooks.
Infinite lists and virtualizers: flows precompute scroll targets and use IntersectionObserver signatures; repair uses nearest neighbor tokens.
Internationalization: the compiler resolves accessible names via translation dictionaries; contracts assert semantic roles not raw text.

Working example end-to-end

Agent explores and buys a plan: search, open plan, add to cart, checkout. The trace collects:
- DOM actions with 3–5 selector candidates each.
- Network: POST /cart, GET /cart, POST /checkout.
- Oracles: final URL ends with /order/thank-you, text contains Order confirmed.
Compiler emits addToCart and checkout flows with:
- Role-based locators and data-testid fallback.
- Pre: health endpoint check and sign-in status.
- Post: order id parsed from DOM and verified via GET /orders/{id}.
- Idempotent checkout: if order already exists for session, skip POST /checkout.
Drift monitors baseline:
- DOM signatures for buttons and forms; API schemas for cart and checkout.
A week later, design tweaks the cart button label from Cart to Bag and changes hierarchy.
- First run fails the Cart locator; drift monitor suggests a role=button[name=/bag/i] candidate with 0.88 similarity.
- Auto-repair applies and the flow passes; a patch PR updates the selector with a relaxed match.
A month later, checkout switches to a new endpoint /v2/checkout.
- The network assertion fails; drift shows 301 redirect to v2. Auto-repair updates the URL prefix in the assertion; contracts still pass.
A quarter later, the checkout flow adds a mandatory address form. Contracts fail at the new form precondition. The drift distance is too large for repair, so the orchestrator falls back to the agent, and the new trace includes the address fill steps. The compiler synthesizes a new step group gated by a precondition that checks for the address pane. Review lands the patch.

Why this is better than agent-only

Lower variance: code’s timing and assertions tame web flakiness.
Clear failure modes: contracts localize faults. Agents often fail silently or for ambiguous reasons.
Cooperative: agents still do the unglamorous work of exploration, but only when necessary.
Auditable: compliance, security, and QA can read contracts and proofs instead of sifting logs.

Implementation checklist for teams

Agent instrumentation: emit traces with DOM and network payloads; define a stable schema.
Compiler MVP: choose a target subset of Playwright; wrap with a small TypeScript DSL for contracts and drift capture.
Locator scorer: implement a weighted selector choice model; prefer role and name.
Drift engine: a signature library for DOM and network; similarity metrics and thresholds.
Repair rules: a minimal grammar with safety checks; a PR generator that edits code in small deltas.
Orchestrator: compiled-first runner with budgets, telemetry, and explicit agent fallback.
CI: baselines and reports; automatic runs on pull requests and nightly against staging.

Appendix: heuristics that work well

Strip dynamic attributes: remove class tokens that look like hashes; ignore nth-child.
Tokenize text by words; ignore case and punctuation; keep 3–7 tokens for matching.
When choosing between multiple matches, prefer the element with the closest ancestor role path to baseline; tie-break by distance in viewport.
Make waits compositional: prefer wait for a network response that contains a JSON key you care about, not entire page loads.
Timebox auto-repair: if no fix in 2–5 seconds, bail to the agent to avoid spiraling costs.
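
The first two heuristics can be sketched in a few lines. The pattern for generated class names and the stopword list are assumptions, and the tokenizer handles ASCII text only; localized sites would need a Unicode-aware split.

```typescript
// Heuristic: a class token looks generated if it ends in a separator followed by
// five or more alphanumerics that include at least one digit.
const GENERATED_CLASS = /[_-](?=[a-z0-9]*\d)[a-z0-9]{5,}$/i;

// Keeps only class tokens that look hand-written.
function stableClasses(classAttr: string): string[] {
  return classAttr.split(/\s+/).filter((c) => c.length > 0 && !GENERATED_CLASS.test(c));
}

const STOPWORDS = ['a', 'an', 'the', 'to', 'of', 'and', 'or', 'in', 'on', 'for'];

// Lowercases, strips punctuation, drops stopwords, and keeps at most maxTokens words.
function matchTokens(text: string, maxTokens = 7): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0 && STOPWORDS.indexOf(t) === -1)
    .slice(0, maxTokens);
}
```

The default cap of seven tokens follows the upper end of the range suggested above; very short labels simply yield fewer tokens.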

Closing thoughts

Trace-to-code reframes browser automation as a compilation problem, not a prompting one. LLMs are excellent at creating and editing code; they are less excellent at serving as a perpetual runtime. By capturing a successful plan, encoding it as a typed Playwright flow with contracts and assertions, and installing drift monitors with a safe auto-repair loop, you get the best of both worlds: the creativity of agents and the reliability of tests.

You can start small. Instrument your agent. Synthesize one flow. Add a couple of contracts. Put the flow in CI. Once the first repair lands automatically, you will not go back to ad hoc agent runs.