Building Deterministic Browser Agents: DOM-Aware Planning, Action Validation, and Recovery Loops

The failure usually doesn’t look dramatic in logs. It looks like a successful click.

A browser agent running a retail checkout flow finds a button labeled “Continue,” clicks it, gets a 200 OK on a background request, and then just sits there until the step timeout expires. The screenshot after the click looks almost identical to the screenshot before the click. The trace shows no JavaScript exception. The DOM still contains the button, and the action engine marks the click as completed.

In production, this kind of failure is far more common than the obvious ones.

The root problem is that most browser agents are nondeterministic in places where product teams assume determinism. They assume a target is stable because it was visible once. They assume a click succeeded because the automation framework emitted no error. They assume the page state changed because a request happened. They assume replayed selectors from a successful run will keep working tomorrow. Those assumptions break quickly on modern websites where rendering is asynchronous, accessibility layers are inconsistent, frontends recycle DOM nodes, overlays intercept pointer events, and headless execution diverges from headed execution in subtle but important ways.

If you want browser agents to reliably complete multi-step tasks like shopping, booking, onboarding, or data extraction, you need to stop thinking of automation as “find element and click.” A production-grade agent is closer to a transactional system. It needs state modeling, target scoring, action preconditions, post-action validation, recovery paths, and evidence collection strong enough to explain why a step was considered complete.

This article is about building that system.

I’ll focus on deterministic execution rather than open-ended autonomy: using DOM-aware planning, action validation, and recovery loops to make browser agents behave predictably under production conditions. I’ll also cover how to generate training and evaluation data from human demonstrations and replay logs, because most teams eventually realize that prompting alone doesn’t fix flaky action selection.

A real production failure: the button was real, the click was real, the action was not

Consider a booking flow with this simplified page structure:

html
<div id="checkout-root">
  <section aria-label="Traveler details">
    <form>
      <input name="firstName" />
      <input name="lastName" />
      <button type="button" class="primary">Continue</button>
    </form>
  </section>

  <div class="sticky-footer">
    <button type="button" class="primary">Continue</button>
  </div>

  <div class="loading-mask hidden"></div>
</div>

The visible UI includes both an in-form “Continue” button and a sticky footer “Continue” button. On desktop they are synchronized. On narrower layouts, the sticky button becomes the actionable control and the in-form button remains visible but is no longer wired to the current validation path.

A naive agent uses one of these targeting strategies:

first button whose text contains Continue
highest z-index visible button
XPath based on the container seen in a prior run
LLM-chosen target from raw HTML

All of those can fail.

Here is the kind of log a Playwright-based agent produces:

text
[step=7] goal=submit_traveler_details
[locate] candidates=4 query="button:has-text('Continue')"
[action] click selector=button.primary nth=0
[action] success click emitted
[post] waited_for=networkidle timeout=5000ms
[post] screenshot_diff=0.8%
[post] url_unchanged=true
[post] required_field_errors=2
[post] step_result=timeout waiting for state transition

Nothing “failed” at the browser API level. But the business action failed.

This is the distinction that matters in browser agents:

interaction success: the browser automation library delivered the intended low-level event
state transition success: the application moved to the expected next state

Most flaky systems only measure the first.
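
One way to keep the two notions apart in code is to record them as separate fields on every step result. The sketch below is illustrative only: `StepResult` and its field names are assumptions, not part of Playwright or any other library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StepResult:
    """Hypothetical record that keeps the two kinds of success apart."""
    interaction_ok: bool  # the automation library delivered the event
    observed_outcomes: List[str] = field(default_factory=list)
    expected_outcomes: List[str] = field(default_factory=list)

    @property
    def transition_ok(self) -> bool:
        # The step only counts if at least one expected outcome was observed.
        return any(o in self.expected_outcomes for o in self.observed_outcomes)

    @property
    def silent_failure(self) -> bool:
        # The case from the log above: the click was emitted, nothing changed.
        return self.interaction_ok and not self.transition_ok


result = StepResult(
    interaction_ok=True,
    observed_outcomes=[],
    expected_outcomes=["url_contains:/payment"],
)
print(result.silent_failure)
```

A dashboard built on `silent_failure` rather than on raw click errors surfaces exactly the class of failure described in the log.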

Root cause: ambiguous targets plus weak state semantics

The root cause of unreliable browser agents is usually a combination of four issues.

1. The target representation is too shallow

Raw HTML or screenshots alone are not enough. HTML gives structural detail but misses rendered visibility, computed interactivity, and user-perceived grouping. Screenshots show layout but not semantic relationships. Accessibility trees help with names and roles but can be incomplete or stale on some frameworks.

A robust agent target should be defined using a fused representation:

DOM node identity and ancestry
accessibility role and accessible name
visual bounding box and occlusion status
nearby labels, headings, and container semantics
inferred actionability state: visible, enabled, stable, editable, selected

2. The planner confuses label matching with intent resolution

“Click Continue” is not the same as “activate the control that advances the traveler details form after validation passes.” Intent resolution requires context: which region, which form, which step of the funnel, what errors are present, and what transition should happen.

3. The executor lacks transactional checks

Before acting, it should verify preconditions. After acting, it should validate observable outcomes. If no state transition occurred, it must branch into recovery rather than blindly retry the same event.

4. The system tracks too little state across steps

Production flows require memory of:

current task stage
fields already attempted
validation errors seen
navigation history
modal interruptions
per-site selector reliability
element fingerprints for stale element recovery

Without that, recovery loops become random walks.

Why naive approaches fail

Teams usually start with one of three strategies.

Strategy 1: Pure selector automation

This is the classic scripted approach: define CSS/XPath selectors and hardcode the flow.

It works until any of the following happens:

class names are regenerated
DOM nesting changes
A/B variants reorder controls
sticky mobile controls duplicate visible actions
iframes or shadow DOM are introduced
localized text changes accessible names

Selectors are still necessary, but selector-only systems encode assumptions too early and too rigidly.

Strategy 2: Screenshot-first agents

Vision-heavy agents can pick visually obvious elements, but they often struggle with:

offscreen elements
hidden overlays intercepting events
labels detached from inputs in responsive layouts
precise text entry into the correct field among visually similar fields
post-action validation when visible change is minimal

Screenshots are evidence, not sufficient state.

Strategy 3: Raw-HTML LLM planning

Feeding the page HTML to a model and asking what to click sounds appealing, but raw HTML contains too much irrelevant detail and too little runtime truth. Hidden templates, duplicate nodes, stale SSR markup, feature flags, and detached nodes all pollute the context. The model may choose nodes that exist in source but are not actionable in the rendered page.

The failure mode is especially bad when the chosen action is semantically plausible but operationally wrong. That produces silent failures that are harder to detect than exceptions.

The architecture that works better

A deterministic browser agent should separate the problem into explicit layers:

Page state extraction
Target candidate generation
Intent-conditioned target ranking
Action planning with preconditions
Execution with instrumentation
Post-action validation
Recovery loop and replanning
State persistence and evaluation logging

A useful mental model is a small transaction engine running inside the browser session.

State model

At minimum, define these entities:

python
from dataclasses import dataclass, field
from typing import Optional, List, Dict, Any

@dataclass
class ElementFingerprint:
    tag: str
    role: Optional[str]
    name: Optional[str]
    text: Optional[str]
    dom_path: List[str]
    attributes: Dict[str, str]
    bbox: Dict[str, float]
    nearby_labels: List[str] = field(default_factory=list)

@dataclass
class AgentStep:
    step_id: str
    intent: str
    target_hint: Optional[str]
    expected_outcomes: List[str]
    retries: int = 0

@dataclass
class PageSnapshot:
    url: str
    title: str
    dom_hash: str
    screenshot_path: str
    elements: List[ElementFingerprint]
    forms: List[Dict[str, Any]]
    modals: List[Dict[str, Any]]
    errors: List[str]

@dataclass
class ExecutionState:
    task_id: str
    current_step: AgentStep
    page: PageSnapshot
    history: List[Dict[str, Any]] = field(default_factory=list)
    seen_validation_errors: List[str] = field(default_factory=list)
    site_profile: Dict[str, Any] = field(default_factory=dict)

This is not glamorous, but it’s where reliability comes from. The agent needs structured memory, not just prompt context.

Building a DOM-aware page model

The page model should combine DOM, accessibility, and visual facts. Playwright gives you enough primitives to build this.

Extract interactive elements with browser-side inspection

In practice, I recommend injecting a browser-side collector that walks the DOM and returns compact action-oriented metadata.

javascript
function collectInteractiveElements() {
  const isVisible = (el) => {
    const style = window.getComputedStyle(el);
    if (style.visibility === 'hidden' || style.display === 'none') return false;
    const rect = el.getBoundingClientRect();
    return rect.width > 0 && rect.height > 0;
  };

  const roleOf = (el) => {
    const explicit = el.getAttribute('role');
    if (explicit) return explicit;
    const tag = el.tagName.toLowerCase();
    if (tag === 'button') return 'button';
    if (tag === 'a' && el.href) return 'link';
    if (tag === 'input') {
      const type = (el.getAttribute('type') || 'text').toLowerCase();
      if (['submit', 'button'].includes(type)) return 'button';
      if (['checkbox'].includes(type)) return 'checkbox';
      if (['radio'].includes(type)) return 'radio';
      return 'textbox';
    }
    if (tag === 'select') return 'combobox';
    if (tag === 'textarea') return 'textbox';
    return null;
  };

  const accessibleName = (el) => {
    const aria = el.getAttribute('aria-label');
    if (aria) return aria.trim();
    const labelledBy = el.getAttribute('aria-labelledby');
    if (labelledBy) {
      const text = labelledBy
        .split(/\s+/)
        .map(id => document.getElementById(id)?.innerText || '')
        .join(' ')
        .trim();
      if (text) return text;
    }
    if (el.labels && el.labels.length) {
      const labelText = Array.from(el.labels).map(l => l.innerText).join(' ').trim();
      if (labelText) return labelText;
    }
    return (el.innerText || el.value || el.getAttribute('placeholder') || '').trim();
  };

  const interactiveSelector = [
    'button',
    'a[href]',
    'input',
    'select',
    'textarea',
    '[role="button"]',
    '[role="link"]',
    '[role="textbox"]',
    '[role="checkbox"]',
    '[role="radio"]',
    '[tabindex]'
  ].join(',');

  const nodes = Array.from(document.querySelectorAll(interactiveSelector));

  return nodes
    .filter(isVisible)
    .map((el, index) => {
      const rect = el.getBoundingClientRect();
      return {
        uid: `el_${index}`,
        tag: el.tagName.toLowerCase(),
        role: roleOf(el),
        name: accessibleName(el),
        text: (el.innerText || '').trim().slice(0, 200),
        id: el.id || null,
        classes: Array.from(el.classList).slice(0, 6),
        disabled: el.disabled || el.getAttribute('aria-disabled') === 'true',
        checked: el.checked ?? null,
        href: el.getAttribute('href'),
        type: el.getAttribute('type'),
        bbox: {
          x: rect.x,
          y: rect.y,
          width: rect.width,
          height: rect.height,
        },
        xpathHint: (() => {
          const parts = [];
          let node = el;
          while (node && node.nodeType === Node.ELEMENT_NODE && parts.length < 6) {
            let part = node.tagName.toLowerCase();
            if (node.id) {
              part += `#${node.id}`;
              parts.unshift(part);
              break;
            }
            const cls = Array.from(node.classList).slice(0, 2).join('.');
            if (cls) part += `.${cls}`;
            parts.unshift(part);
            node = node.parentElement;
          }
          return parts.join(' > ');
        })(),
      };
    });
}

From Playwright, after the collector has been registered in the page (for example with page.add_init_script):

python
elements = await page.evaluate("collectInteractiveElements()")

This collection step should be cheap enough to run before every major action and after every meaningful state change.

Add accessibility tree data

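Playwright can expose an accessibility snapshot. It is not perfect, and the page.accessibility API is marked as deprecated in Playwright's documentation, so pin your version or plan for an alternative. It is still very useful for de-duplicating semantically equivalent targets and identifying user-facing names.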

python
ax_tree = await page.accessibility.snapshot()

In practice, I flatten the tree into records like:

python
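def flatten_ax(node, out=None, path=None):
    # Use explicit None checks: `out or []` would silently replace an empty
    # list passed in by the caller with a brand-new one.
    if out is None:
        out = []
    if path is None:
        path = []
    if not node:
        return out
    out.append({
        "role": node.get("role"),
        "name": node.get("name"),
        "value": node.get("value"),
        "path": list(path),  # child indexes from the root to this node
    })
    for i, child in enumerate(node.get("children") or []):
        flatten_ax(child, out, path + [i])
    return out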

The key is not to trust any single representation. Use AX names to improve ranking, not as the sole source of truth.
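
A minimal sketch of using AX data as a ranking signal rather than as ground truth: each DOM element is annotated with whether the flattened AX records agree on its role and name. The `ax_confirmed` key and the `normalize` helper are assumptions made for this example.

```python
def normalize(text):
    """Lowercase and collapse whitespace so names compare loosely."""
    return " ".join((text or "").lower().split())


def annotate_with_ax(elements, ax_records):
    """Mark each DOM element with whether the AX tree confirms its role and name.

    `elements` follows the collector output above; `ax_records` follows the
    flattened AX records. `ax_confirmed` is a hypothetical field name.
    """
    ax_keys = {(r.get("role"), normalize(r.get("name"))) for r in ax_records}
    annotated = []
    for el in elements:
        key = (el.get("role"), normalize(el.get("name")))
        annotated.append({**el, "ax_confirmed": key in ax_keys})
    return annotated


elements = [
    {"uid": "el_0", "role": "button", "name": "Continue"},
    {"uid": "el_1", "role": "button", "name": "Apply"},
]
ax_records = [{"role": "button", "name": "  continue "}]
annotated = annotate_with_ax(elements, ax_records)
```

An unconfirmed element is not discarded here. It simply loses one supporting signal when candidates are scored.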

Add visual and occlusion checks

A visible element is not always clickable. Overlays, sticky headers, cookie banners, and loading masks often intercept pointer events.

This browser-side helper catches many of those cases:

javascript
function actionabilityProbe(selector) {
  const el = document.querySelector(selector);
  if (!el) return { ok: false, reason: 'not_found' };
  const rect = el.getBoundingClientRect();
  const cx = rect.left + rect.width / 2;
  const cy = rect.top + rect.height / 2;
  const topEl = document.elementFromPoint(cx, cy);
  const visible = rect.width > 0 && rect.height > 0;
  const enabled = !el.disabled && el.getAttribute('aria-disabled') !== 'true';
  const unobstructed = topEl === el || el.contains(topEl);
  return {
    ok: visible && enabled && unobstructed,
    visible,
    enabled,
    unobstructed,
    topTag: topEl?.tagName?.toLowerCase() || null,
    topClass: topEl?.className || null,
  };
}

This is far more useful than “element exists.”

Stable target selection: treat it as ranking, not exact match

The agent should generate multiple candidate targets and score them against the current intent.

For a step like “continue from traveler details,” candidate scoring can combine:

text/name similarity to “continue”
role match (button, link)
same form/section as relevant inputs
proximity to validation messages or form container
enabled state
unobstructed status
historical reliability for this site/flow
whether activating it previously led to the expected state transition

Here is a simple scoring example in Python:

python
from rapidfuzz import fuzz

def score_candidate(candidate, intent, context):
    score = 0
    name = (candidate.get("name") or candidate.get("text") or "").lower()
    role = candidate.get("role")

    score += fuzz.partial_ratio(name, intent.lower()) * 0.3

    if role == "button":
        score += 20

    if candidate.get("disabled"):
        score -= 50

    if context.get("required_container") and context["required_container"] in (candidate.get("xpathHint") or ""):
        score += 25

    if candidate.get("unobstructed"):
        score += 15

    reliability = context.get("site_selector_reliability", {}).get(candidate.get("xpathHint"), 0)
    score += reliability * 10

    return score

The production lesson here is important: selector resolution should produce a ranked list with evidence, not a single brittle answer.
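
To make "ranked list with evidence" concrete, here is a small sketch that keeps the contributing signals next to each score. The weights mirror the simple scorer above and are illustrative; the text-similarity term is left out so the example needs only the standard library, and the sample candidates are invented to match the booking page.

```python
def score_with_evidence(candidate, required_container=None):
    """Return (total, evidence), where evidence lists each signal and its points."""
    evidence = []
    if candidate.get("role") == "button":
        evidence.append(("role_is_button", 20))
    if candidate.get("disabled"):
        evidence.append(("disabled", -50))
    if required_container and required_container in (candidate.get("xpathHint") or ""):
        evidence.append(("in_required_container", 25))
    if candidate.get("unobstructed"):
        evidence.append(("unobstructed", 15))
    total = sum(points for _, points in evidence)
    return total, evidence


def rank_candidates(candidates, required_container=None):
    """Sort candidates best-first and keep the evidence next to each score."""
    ranked = []
    for c in candidates:
        total, evidence = score_with_evidence(c, required_container)
        ranked.append({"uid": c.get("uid"), "score": total, "evidence": evidence})
    return sorted(ranked, key=lambda r: r["score"], reverse=True)


candidates = [
    {"uid": "el_0", "role": "button", "unobstructed": True,
     "xpathHint": "div#checkout-root > section > form > button.primary"},
    {"uid": "el_1", "role": "button", "unobstructed": True,
     "xpathHint": "div#checkout-root > div.sticky-footer > button.primary"},
]
ranked = rank_candidates(candidates, required_container="sticky-footer")
```

When a step later fails, the stored evidence shows which signal pushed the wrong candidate to the top.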

Precondition checks: never execute blind

Before every action, run precondition checks specific to the action type.

For click actions

Check:

element still exists
visible within viewport or can be scrolled into view
enabled
not covered
no loading mask active
page not mid-navigation

For text entry

Check:

field is editable
field has focus after click/focus event
input method events are accepted
any masking or formatting script has attached
existing value handling is defined (append vs replace)

For select/combo interactions

Check:

native select vs custom component
whether options render in portal/outside container
whether typing triggers async search
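
The click checklist above can be expressed as a function that returns the reasons a click should not proceed. This is a sketch under assumptions: `probe` follows the shape returned by the actionability probe shown earlier, and `page_flags` is a hypothetical dict the caller fills in.

```python
def click_precondition_failures(probe, page_flags):
    """Return the list of failed click preconditions (empty means go ahead).

    `page_flags` is assumed to carry `loading_mask_active` and `navigating`
    booleans gathered by the caller; those names are not from any library.
    """
    if probe.get("reason") == "not_found":
        return ["not_found"]
    failures = []
    if not probe.get("visible"):
        failures.append("not_visible")
    if not probe.get("enabled"):
        failures.append("disabled")
    if not probe.get("unobstructed"):
        failures.append("covered")
    if page_flags.get("loading_mask_active"):
        failures.append("loading_mask_active")
    if page_flags.get("navigating"):
        failures.append("mid_navigation")
    return failures


probe = {"ok": False, "visible": True, "enabled": True, "unobstructed": False}
failures = click_precondition_failures(probe, {"loading_mask_active": True})
```

Returning named reasons instead of a single boolean lets the recovery loop pick a branch without re-probing the page.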

Example Playwright wrapper:

python
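async def safe_click(page, locator, expected=None, timeout=5000):
    # Attached comes first: the other checks are meaningless on a detached node.
    await locator.wait_for(state="attached", timeout=timeout)
    await locator.scroll_into_view_if_needed(timeout=timeout)
    await locator.wait_for(state="visible", timeout=timeout)

    if not await locator.is_enabled():
        raise RuntimeError("click_precondition_failed: disabled")

    # Record pre-action evidence so validation can compare against it.
    before_url = page.url
    before_title = await page.title()

    box = await locator.bounding_box()
    if not box:
        raise RuntimeError("click_precondition_failed: no_bounding_box")

    # trial=True runs Playwright's actionability checks without clicking,
    # so an intercepting overlay fails here instead of swallowing the click.
    await locator.click(trial=True, timeout=timeout)
    await locator.click(timeout=timeout)

    return {
        "before_url": before_url,
        "before_title": before_title,
        "clicked_box": box,
        "expected": expected or [],
    }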

That’s still incomplete without post-action validation, but it’s better than raw click().

Post-action validation: define what success means

Every action should have explicit expected outcomes.

For example, after clicking a checkout “Continue” button, acceptable outcomes might be:

URL changed to /payment
heading changed to “Payment”
a payment form appears
form validation errors appear and block progression

The last case matters. Validation errors are not necessarily execution failures. They may be valid state transitions that should hand control back to the planner.

Here is a validation function pattern:

python
async def validate_outcomes(page, expected_outcomes, timeout=5000, before_url=None):
    observed = []
    # Prefer the URL captured before the action. Reading page.url here happens
    # after the click, so on its own it can miss a navigation that already finished.
    start = before_url or page.url

    try:
        # Give a pending navigation a brief chance to settle.
        await page.wait_for_load_state("domcontentloaded", timeout=min(timeout, 1500))
    except Exception:
        pass

    for o in expected_outcomes:
        if o.startswith("url_contains:"):
            needle = o.split(":", 1)[1]
            if needle in page.url:
                observed.append(o)

    if any(o.startswith("heading:") for o in expected_outcomes):
        headings = await page.locator("h1, h2, [role='heading']").all_inner_texts()
        joined = " | ".join(headings).lower()
        for o in expected_outcomes:
            if o.startswith("heading:"):
                needle = o.split(":", 1)[1].lower()
                if needle in joined:
                    observed.append(o)

    # Fields flagged aria-invalid carry no inner text, and empty alert regions
    # are common, so count invalid fields and non-empty messages separately.
    invalid_fields = await page.locator("[aria-invalid='true']").count()
    messages = await page.locator(".error, [role='alert']").all_inner_texts()
    errors = [m.strip() for m in messages if m.strip()]
    if invalid_fields or errors:
        observed.append("validation_error_present")

    return {
        "success": len(observed) > 0,
        "observed": observed,
        "url_changed": page.url != start,
        "errors": errors,
    }

The key production practice is to define business-complete success per step, not generic browser success.

Action pipeline design

A good execution loop looks something like this:

python
async def execute_step(page, planner, state):
    step = state.current_step

    snapshot = await planner.capture_snapshot(page)
    candidates = planner.generate_candidates(snapshot, step)
    ranked = planner.rank_candidates(candidates, step, state)

    last_error = None

    for candidate in ranked[:5]:
        try:
            locator = planner.resolve_locator(page, candidate)
            planner.check_preconditions(candidate, snapshot, step)

            action_meta = await planner.perform_action(page, locator, step)
            validation = await validate_outcomes(page, step.expected_outcomes)

            state.history.append({
                "step_id": step.step_id,
                "candidate": candidate,
                "action_meta": action_meta,
                "validation": validation,
            })

            if validation["success"]:
                return validation

            recovered = await planner.attempt_recovery(page, state, validation)
            if recovered:
                return recovered

        except Exception as e:
            last_error = str(e)
            state.history.append({
                "step_id": step.step_id,
                "candidate": candidate,
                "error": last_error,
            })

    raise RuntimeError(f"step_failed: {step.step_id} last_error={last_error}")

This is intentionally explicit. When systems get flaky, you want logs that answer:

which candidates were considered?
why was one chosen?
were preconditions met?
what exact event was sent?
what outcome was observed?
what recovery branch ran?

Generating training data from human demonstrations and replay logs

If you are using an ML model to rank targets, choose actions, or predict recovery strategies, the best data usually comes from two sources:

human demonstrations
replay telemetry from automated runs

Human demonstrations

Instrument a browser session and record:

DOM snapshots before each action
accessibility snapshot
screenshot
viewport size and user agent
action type, coordinates, target element fingerprint
keypress sequence for text entry
resulting DOM delta and navigation events

From these demonstrations, derive supervised examples like:

given current page state and intent, rank the acted-on element highest
given current field and value, generate the correct entry operation
given post-click state, classify outcome: success, validation_error, no_effect, blocked_modal, navigation_change

A practical schema:

json
{
  "task_id": "booking_1842",
  "step_id": "traveler_continue",
  "intent": "continue from traveler details",
  "page_url": "https://site.example/checkout/travelers",
  "viewport": {"width": 1440, "height": 900},
  "elements": [{"uid": "el_17", "role": "button", "name": "Continue"}],
  "chosen_uid": "el_17",
  "action": {"type": "click"},
  "outcome": {"class": "success", "url": "/payment"}
}
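
As a sketch of how a record in this schema might become supervised ranking data, the function below emits one labeled example per element: the acted-on element is the positive and every other element on the page is a negative. The feature set is deliberately tiny and is an assumption of this example.

```python
def to_ranking_examples(record):
    """Turn one demonstration record into (features, label) pairs.

    Field names follow the schema above; the chosen features are illustrative.
    """
    examples = []
    for el in record["elements"]:
        features = {
            "intent": record["intent"],
            "role": el.get("role"),
            "name": el.get("name"),
        }
        label = 1 if el["uid"] == record["chosen_uid"] else 0
        examples.append((features, label))
    return examples


record = {
    "intent": "continue from traveler details",
    "elements": [
        {"uid": "el_17", "role": "button", "name": "Continue"},
        {"uid": "el_18", "role": "button", "name": "Back"},
    ],
    "chosen_uid": "el_17",
}
examples = to_ranking_examples(record)
```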

Replay logs

Replay data is just as valuable because it captures failure distributions humans don’t create intentionally.

Log every automated step with:

candidate list and scores
final chosen target
locator actually used
actionability probe result
browser events observed
validation results
recovery strategy triggered
final step classification

This data helps train or tune:

candidate ranking
retry budgets
site-specific selector priors
modal detection heuristics
timeout policies

Replay logs are especially good for hard negative examples: elements that looked correct but led to no state transition.
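
A rough sketch of mining those hard negatives from replay logs follows. The step keys (`intent`, `chosen`, `score`, `final_class`) and the score threshold are hypothetical; adapt them to whatever your logger actually emits.

```python
def mine_hard_negatives(replay_steps, min_score=50):
    """Collect targets that scored well but produced no state transition."""
    negatives = []
    for step in replay_steps:
        if step["final_class"] == "no_effect" and step["score"] >= min_score:
            negatives.append({"intent": step["intent"], "target": step["chosen"]})
    return negatives


replay_steps = [
    {"intent": "continue", "chosen": "el_0", "score": 72, "final_class": "no_effect"},
    {"intent": "continue", "chosen": "el_1", "score": 80, "final_class": "success"},
    {"intent": "continue", "chosen": "el_4", "score": 10, "final_class": "no_effect"},
]
hard_negatives = mine_hard_negatives(replay_steps)
```

Low-scoring no-effect targets are skipped on purpose: the ranker already considered them unlikely, so they teach it little.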

Headless-specific issues you need to engineer around

Headless runs behave differently. If you ignore that, your evaluation results will lie to you.

Rendering drift

Layout can shift between headed and headless due to fonts, GPU differences, viewport defaults, animation timing, and reduced compositor behavior.

Mitigations:

pin viewport size
pin browser version
install deterministic fonts in the runtime image
disable or reduce animations when possible
compare bounding-box distributions between environments

Example:

python
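browser = await playwright.chromium.launch(
    headless=True,
    args=[
        # Does not affect rendering; it changes how automation is exposed to the page.
        "--disable-blink-features=AutomationControlled",
        # Disable font hinting to reduce text-layout differences between hosts.
        "--font-render-hinting=none",
    ],
)
context = await browser.new_context(
    viewport={"width": 1440, "height": 900},  # pinned viewport
    device_scale_factor=1,  # pinned pixel ratio
    locale="en-US",
    timezone_id="America/New_York",
    reduced_motion="reduce",  # asks pages to honor prefers-reduced-motion
)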
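
For the last mitigation in the list, comparing bounding boxes between environments, a small helper can be enough to start with. This sketch assumes each run has been reduced to a mapping from a shared identifier (for example the `xpathHint`) to a bbox dict; that reduction step is not shown.

```python
def bbox_drift(headed, headless):
    """Per-element pixel drift between two runs, keyed by a shared identifier.

    Elements present in only one run are reported separately rather than
    silently dropped, since a missing element is itself a parity problem.
    """
    drift = {}
    for key in headed.keys() & headless.keys():
        a, b = headed[key], headless[key]
        drift[key] = max(abs(a[f] - b[f]) for f in ("x", "y", "width", "height"))
    missing = sorted(headed.keys() ^ headless.keys())
    return drift, missing


headed = {
    "button.primary": {"x": 10, "y": 20, "width": 100, "height": 40},
    "input.first": {"x": 10, "y": 80, "width": 200, "height": 30},
}
headless = {
    "button.primary": {"x": 10, "y": 26, "width": 100, "height": 40},
    "input.email": {"x": 10, "y": 140, "width": 200, "height": 30},
}
drift, missing = bbox_drift(headed, headless)
```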

Timing and async hydration

Modern apps often render a visible control before event handlers are attached. A click can land during hydration and do nothing.

Naive fix: add sleep(2).

Better fix:

wait for element to be actionable, not merely visible
detect framework idle signals where available
retry only after proving no state transition occurred

You can often detect hydration races by logging whether the same click works after a short delay without any page changes in between.
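
That check can be run offline over replay logs. The heuristic below is a sketch: it flags a step when the same target failed and then succeeded with an identical DOM hash before both attempts. The attempt keys are hypothetical names for fields your logger would need to record.

```python
def looks_like_hydration_race(attempts):
    """True if an identical click failed, then worked, with no DOM change between.

    `attempts` is a chronological list of dicts with `target`,
    `dom_hash_before`, and `transitioned` keys.
    """
    for first, second in zip(attempts, attempts[1:]):
        same_target = first["target"] == second["target"]
        same_dom = first["dom_hash_before"] == second["dom_hash_before"]
        if same_target and same_dom and not first["transitioned"] and second["transitioned"]:
            return True
    return False


attempts = [
    {"target": "el_1", "dom_hash_before": "abc", "transitioned": False},
    {"target": "el_1", "dom_hash_before": "abc", "transitioned": True},
]
is_race = looks_like_hydration_race(attempts)
```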

Anti-bot friction

This isn’t only CAPTCHAs. Common friction includes:

hidden honeypot inputs
stricter rate limits in headless
challenge pages on suspicious navigation sequences
blocked clipboard/paste paths
form submission rejected without full user event chain

Mitigations are contextual and policy-sensitive, but technically you should:

model challenge detection as a first-class state
distinguish between site failure and anti-bot gating
collect screenshots, DOM markers, and response patterns for challenge pages

Missing user events

Some sites require a richer event sequence than fill() or synthetic value assignment.

For masked inputs, date pickers, and certain React/Vue controlled components, success may require:

focus
click
keydown/keypress/input/change/blur sequence
tab navigation to trigger validators

Example for more human-like controlled input handling:

python
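async def enter_text_like_user(locator, value):
    # Click to focus so focus-dependent handlers run.
    await locator.click()
    # Select-all then delete replaces any existing value.
    # On macOS the select-all shortcut is Meta+A rather than Control+A.
    await locator.press("Control+A")
    await locator.press("Backspace")
    # press_sequentially sends keydown/keypress/input/keyup per character
    # (available in recent Playwright releases; older ones used type()).
    await locator.press_sequentially(value, delay=35)
    # Blur so on-blur validators and formatters fire.
    await locator.blur()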

This is slower than fill(), but often more reliable for high-friction forms.

Recovery loops: the difference between resilient and flaky systems

Recovery should be explicit and categorized. Don’t just “retry three times.”

A useful recovery taxonomy:

stale target
occluded target
unexpected modal
navigation divergence
partial form failure
anti-bot/challenge state
no-effect action
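
One way to encode this taxonomy so that retries are budgeted per category is sketched below. The enum names follow the list above; the budget values are placeholders chosen for illustration, not recommendations from measured data.

```python
from enum import Enum


class FailureCategory(Enum):
    STALE_TARGET = "stale_target"
    OCCLUDED_TARGET = "occluded_target"
    UNEXPECTED_MODAL = "unexpected_modal"
    NAVIGATION_DIVERGENCE = "navigation_divergence"
    PARTIAL_FORM_FAILURE = "partial_form_failure"
    CHALLENGE = "challenge"
    NO_EFFECT = "no_effect"


# Illustrative budgets only; real values should come from replay data.
RETRY_BUDGET = {
    FailureCategory.STALE_TARGET: 2,
    FailureCategory.OCCLUDED_TARGET: 2,
    FailureCategory.UNEXPECTED_MODAL: 3,
    FailureCategory.NAVIGATION_DIVERGENCE: 1,
    FailureCategory.PARTIAL_FORM_FAILURE: 2,
    FailureCategory.CHALLENGE: 0,  # never retry into a challenge page
    FailureCategory.NO_EFFECT: 1,
}


def should_retry(category, retry_counts):
    """Allow a retry only while the category's own budget is not exhausted."""
    used = retry_counts.get(category, 0)
    return used < RETRY_BUDGET.get(category, 0)


counts = {FailureCategory.STALE_TARGET: 1}
can_retry_stale = should_retry(FailureCategory.STALE_TARGET, counts)
```

Counting retries per category keeps one noisy failure type, such as repeated modals, from consuming the budget that another type needs.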

1. Stale element recovery

Frontend frameworks frequently re-render nodes, invalidating prior handles.

Symptoms:

detached element errors
click succeeds on an outdated node reference
post-action checks show unchanged state

Recovery:

reacquire from current snapshot using fingerprint similarity
prefer semantic re-resolution over handle reuse

python
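async def reacquire_by_fingerprint(page, fp):
    candidates = page.locator(fp["tag"])
    count = await candidates.count()
    wanted = (fp.get("name") or "").strip().lower()
    best = None
    best_score = 0  # require at least one matching signal
    for i in range(count):
        loc = candidates.nth(i)
        if not await loc.is_visible():
            continue
        text = (await loc.inner_text()).strip().lower()
        score = 0
        if wanted and wanted in text:
            score += 20
        if wanted and wanted == text:
            score += 10  # an exact match beats a substring match
        if score > best_score:
            best = loc
            best_score = score
    # None when nothing resembles the fingerprint, so callers can branch
    # instead of clicking an arbitrary element with the same tag.
    return best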

2. Unexpected modals

Cookie banners, sign-in prompts, promo popups, and location dialogs derail agents constantly.

Treat modal detection as part of every precondition check.

python
async def dismiss_known_interruptions(page):
    modal_selectors = [
        "[role='dialog']",
        ".modal",
        ".cookie-banner",
        "#onetrust-banner-sdk",
    ]
    labels = ["Accept", "Close", "Not now", "Dismiss", "Continue without"]
    for sel in modal_selectors:
        modal = page.locator(sel).first
        # is_visible() returns False for no match, so hidden or absent modals are skipped.
        if not await modal.is_visible():
            continue
        for label in labels:
            # Scope the lookup to the modal so a page-level "Accept" or
            # "Continue" button is never clicked by accident.
            btn = modal.get_by_role("button", name=label)
            try:
                if await btn.count() > 0:
                    await btn.first.click(timeout=1000)
                    return True
            except Exception:
                pass
    return False

This should be paired with site-specific handlers because generic modal logic only gets you part of the way.

3. Navigation divergence

Sometimes the click works but lands on the wrong route, login wall, or upsell page.

Recovery requires route classification, not generic retry.

Maintain a small page classifier using URL patterns, headings, and key DOM markers.

python
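def classify_page(url, headings, body_text):
    url = url.lower()
    body = body_text.lower()
    heads = [h.lower() for h in headings]
    # Check for a challenge first so one shown on a login URL is not misclassified.
    if "captcha" in url or "verify you are human" in body:
        return "challenge"
    # Match "sign in" against headings only: a "Sign in" link in the site
    # header would otherwise mark almost every page as a login gate.
    if "login" in url or any("sign in" in h or "log in" in h for h in heads):
        return "login_gate"
    if any("payment" in h for h in heads):
        return "payment"
    return "unknown"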

If the classifier says login_gate, don’t retry the click that got you there. Branch the plan.

4. Partial form failures

A common production case: several fields were accepted, one failed validation, and the step didn’t advance.

The wrong response is to refill everything or restart the page.

Better response:

parse visible validation errors
map errors to fields
update only affected fields
preserve accepted values

python
async def collect_field_errors(page):
    errors = []
    alerts = page.locator("[role='alert'], .error-message, .field-error")
    for i in range(await alerts.count()):
        text = await alerts.nth(i).inner_text()
        errors.append(text.strip())
    return errors

Then map error strings to labels/inputs using container proximity.
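
A minimal sketch of that proximity mapping, under a stated assumption: errors and fields have both been collected with bounding boxes, and an error message renders below the field it belongs to. That layout is common but not universal, so treat the result as a ranking hint. The dict shapes here are hypothetical.

```python
def map_errors_to_fields(errors, fields):
    """Assign each error to the nearest field positioned above it.

    `errors` are dicts with `text` and `bbox`; `fields` are dicts with
    `name` and `bbox`. Only the vertical position is compared.
    """
    mapping = {}
    for err in errors:
        err_top = err["bbox"]["y"]
        above = [f for f in fields if f["bbox"]["y"] <= err_top]
        if not above:
            continue  # nothing above this message; leave it unmapped
        nearest = max(above, key=lambda f: f["bbox"]["y"])
        mapping.setdefault(nearest["name"], []).append(err["text"])
    return mapping


fields = [
    {"name": "firstName", "bbox": {"y": 100}},
    {"name": "lastName", "bbox": {"y": 160}},
]
errors = [
    {"text": "Last name is required", "bbox": {"y": 190}},
    {"text": "Page-level notice", "bbox": {"y": 40}},
]
field_errors = map_errors_to_fields(errors, fields)
```

With this mapping the agent can re-enter only `lastName` and leave the accepted `firstName` value untouched.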

Selector design tradeoffs

There is no universally stable selector strategy. You need layered fallbacks.

explicit test IDs if you control the site
accessible role + accessible name
label-to-input relationships
stable semantic attributes (name, type, autocomplete)
container-constrained text selectors
structural CSS/XPath only as a last resort

Why not just use text selectors everywhere?

Because text is duplicated, localized, and context dependent. “Continue,” “Apply,” “Save,” and “Select” appear everywhere.

Why not just use XPath from replay?

Because replayed structure tends to be brittle across redesigns and experiments. XPath is useful as one fingerprint signal, not as canonical identity.

A better pattern: store selector bundles

Instead of one selector, persist a bundle:

json
{
  "role_name": {"role": "button", "name": "Continue"},
  "css": "button.primary",
  "container": "section[aria-label='Traveler details']",
  "attributes": {"type": "button"},
  "text": "Continue"
}

At runtime, resolve through the bundle and score the matches.
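
Scoring live candidates against a stored bundle might look like the sketch below. It counts how many of the bundle's signals a candidate still matches. The `in_container` flag is an assumption: it stands for a browser-side check that the element sits inside the bundle's container selector, which is not shown here.

```python
def score_against_bundle(candidate, bundle):
    """Count how many signals in a selector bundle a live candidate matches."""
    score = 0
    role_name = bundle.get("role_name", {})
    if candidate.get("role") == role_name.get("role"):
        score += 1
    cand_name = (candidate.get("name") or "").strip().lower()
    if cand_name and cand_name == (role_name.get("name") or "").strip().lower():
        score += 1
    if bundle.get("text") and bundle["text"].lower() in (candidate.get("text") or "").lower():
        score += 1
    for attr, value in bundle.get("attributes", {}).items():
        if candidate.get(attr) == value:
            score += 1
    if candidate.get("in_container"):
        score += 2  # container agreement is weighted higher (illustrative)
    return score


bundle = {
    "role_name": {"role": "button", "name": "Continue"},
    "attributes": {"type": "button"},
    "text": "Continue",
}
in_form = {"role": "button", "name": "Continue", "text": "Continue",
           "type": "button", "in_container": True}
sticky = {"role": "button", "name": "Continue", "text": "Continue",
          "type": "button", "in_container": False}
scores = (score_against_bundle(in_form, bundle), score_against_bundle(sticky, bundle))
```

The two "Continue" buttons tie on every signal except the container, which is the kind of ambiguity a single text selector cannot resolve.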

State tracking tradeoffs

A deterministic agent needs enough state to recover, but too much state can make behavior opaque and expensive.

Track these consistently:

current canonical page class
current task step and expected outcomes
last successful action target fingerprint
known interruptions handled
field values already submitted
validation errors seen and resolved
per-domain timing profile
retry counters by failure category

Avoid storing arbitrary full prompts as your source of truth. Store structured state and derive prompts from it if needed.

Evaluation metrics that reflect real production reliability

Success rate alone is misleading.

A browser agent that completes 80% of tasks but, on the other 20%, silently mis-clicks for 20 seconds before failing is operationally much worse than one that fails fast and classifies the issue correctly.

Measure at least:

task completion rate
step completion rate
state transition precision: fraction of actions considered successful that truly advanced or correctly changed state
target selection precision@k
mean recovery depth: how many recovery branches before success/failure
time to completion
unexpected navigation rate
modal interruption recovery rate
partial form repair success rate
headless vs headed parity

For extraction pipelines, also measure:

field-level accuracy
duplicate extraction rate
pagination continuity
stale content rate after navigation

For shopping/booking:

cart consistency after each step
price drift detection
inventory/availability revalidation rate
confirmation page precision
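
Two of the metrics above are simple to compute once steps carry ground-truth labels. The sketch below assumes hypothetical step keys (`marked_success`, `truly_advanced`) and reads target selection precision@k as "the correct target appears in the top k", an interpretation the list does not spell out.

```python
def state_transition_precision(steps):
    """Of the actions the agent marked successful, the fraction that truly advanced."""
    claimed = [s for s in steps if s["marked_success"]]
    if not claimed:
        return None  # undefined when the agent claimed no successes
    return sum(1 for s in claimed if s["truly_advanced"]) / len(claimed)


def hit_at_k(ranked_uids, correct_uid, k):
    """1.0 if the correct target is among the top k ranked candidates, else 0.0."""
    return 1.0 if correct_uid in ranked_uids[:k] else 0.0


steps = [
    {"marked_success": True, "truly_advanced": True},
    {"marked_success": True, "truly_advanced": False},
    {"marked_success": True, "truly_advanced": True},
    {"marked_success": False, "truly_advanced": False},
]
precision = state_transition_precision(steps)
top1 = hit_at_k(["el_0", "el_1", "el_2"], "el_1", k=1)
top2 = hit_at_k(["el_0", "el_1", "el_2"], "el_1", k=2)
```

A gap between step completion rate and state transition precision is a direct measure of how often the agent declares success on a click that changed nothing.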

Practical Playwright patterns for production

A few patterns consistently help.

Use locator APIs, but do not outsource reasoning to them

Playwright locators are excellent for execution robustness, but you still need your own target ranking and validation logic.

python
button = page.get_by_role("button", name="Continue")
await button.click()

This is good execution syntax, not a complete action policy.

Capture evidence on every failure branch

python
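import os
import time

async def capture_debug_bundle(page, prefix="failure"):
    # Create the output directory up front; open() below fails if it is missing.
    os.makedirs("artifacts", exist_ok=True)
    ts = int(time.time() * 1000)
    screenshot = f"artifacts/{prefix}_{ts}.png"
    html = f"artifacts/{prefix}_{ts}.html"
    await page.screenshot(path=screenshot, full_page=True)
    # Save the serialized DOM alongside the screenshot for the same moment.
    content = await page.content()
    with open(html, "w", encoding="utf-8") as f:
        f.write(content)
    return {"screenshot": screenshot, "html": html}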

When debugging flaky agents, screenshots without HTML or HTML without screenshots are often insufficient. Save both.

Distinguish timeout classes

A timeout waiting for a selector to become visible is not the same as a timeout waiting for a state transition after an action.

Log them differently:

text
error_class=target_not_visible
error_class=post_action_no_state_change
error_class=navigation_diverged
error_class=modal_blocking
error_class=challenge_detected

This classification is essential for prioritizing engineering work.
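
A sketch of how the classes above might be assigned and aggregated. The boolean inputs to `classify_timeout` and the order in which it checks them are my assumptions; the class names come from the log lines above.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, List, Tuple


class ErrorClass(str, Enum):
    TARGET_NOT_VISIBLE = "target_not_visible"
    POST_ACTION_NO_STATE_CHANGE = "post_action_no_state_change"
    NAVIGATION_DIVERGED = "navigation_diverged"
    MODAL_BLOCKING = "modal_blocking"
    CHALLENGE_DETECTED = "challenge_detected"


def classify_timeout(
    *,
    action_dispatched: bool,
    challenge_present: bool,
    modal_present: bool,
    url_matches_expected: bool,
) -> ErrorClass:
    """Check environmental causes first, before blaming the target or the action."""
    if challenge_present:
        return ErrorClass.CHALLENGE_DETECTED
    if modal_present:
        return ErrorClass.MODAL_BLOCKING
    if not action_dispatched:
        return ErrorClass.TARGET_NOT_VISIBLE
    if not url_matches_expected:
        return ErrorClass.NAVIGATION_DIVERGED
    return ErrorClass.POST_ACTION_NO_STATE_CHANGE


def format_log_line(error_class: ErrorClass, **fields) -> str:
    """Render a key=value log line with error_class first and other fields sorted."""
    parts = [f"error_class={error_class.value}"]
    parts += [f"{key}={value}" for key, value in sorted(fields.items())]
    return " ".join(parts)


def rank_failure_classes(errors: Iterable[ErrorClass]) -> List[Tuple[str, int]]:
    """Most frequent failure classes first, to guide where engineering time goes."""
    return Counter(e.value for e in errors).most_common()
```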

Putting it together: a deterministic step executor

Here is a condensed example tying the pieces together.

python
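class DeterministicAgent:
    def __init__(self, planner):
        self.planner = planner

    async def run_step(self, page, state):
        step = state.current_step
        snapshot = await self.planner.capture_snapshot(page)

        # Dismissing a modal or banner changes the page, so the first
        # snapshot is stale and must be captured again.
        if await dismiss_known_interruptions(page):
            snapshot = await self.planner.capture_snapshot(page)

        candidates = self.planner.generate_candidates(snapshot, step)
        ranked = self.planner.rank_candidates(candidates, step, state)

        # Bound the search: only the top-ranked candidates are attempted.
        for candidate in ranked[:5]:
            try:
                locator = self.planner.resolve_locator(page, candidate)

                # Precondition check: never act on a target that fails the probe.
                pre = await self.planner.probe_actionability(page, candidate)
                if not pre["ok"]:
                    continue

                await self.planner.perform(page, locator, step)

                # Success is decided by validated state, not by perform() returning.
                result = await validate_outcomes(page, step.expected_outcomes)
                if result["success"]:
                    return {"status": "success", "result": result}

                # Partial form failure: hand the field errors to the repair loop.
                field_errors = await collect_field_errors(page)
                if field_errors:
                    return {"status": "needs_repair", "errors": field_errors}

                # Navigation divergence: the page is recognizable but not the expected one.
                page_text = await page.locator("body").inner_text()
                headings = await page.locator("h1, h2, [role='heading']").all_inner_texts()
                page_class = classify_page(page.url, headings, page_text)
                if page_class != "unknown":
                    return {"status": "branch", "page_class": page_class}

                # No recognizable outcome: fall through and try the next candidate.

            except Exception as e:
                # Keep the evidence paths with the error so the failure can be explained later.
                evidence = await capture_debug_bundle(page, prefix="step_error")
                state.history.append(
                    {"candidate": candidate, "error": str(e), "evidence": evidence}
                )

        # Every candidate was exhausted: capture evidence on this branch too.
        evidence = await capture_debug_bundle(page, prefix="step_failed")
        return {"status": "failed", "evidence": evidence}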

This isn’t everything you need, but it demonstrates the core design principle: the agent decides success based on validated state, not on whether the low-level action API returned normally.
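
The status values returned by `run_step` are meant to be consumed by an outer loop. Below is a sketch of such a driver under stated assumptions: `run_with_recovery`, the `repair_fields` callback, and the `retry_budget_exhausted` status are hypothetical names, and the retry budget of three is arbitrary.

```python
import asyncio
from typing import Awaitable, Callable


async def run_with_recovery(
    run_step: Callable[[], Awaitable[dict]],
    repair_fields: Callable[[list], Awaitable[None]],
    max_attempts: int = 3,
) -> dict:
    """Drive one step to a terminal status within a bounded number of attempts.

    run_step: async callable returning a dict with a "status" key.
    repair_fields: async callable that tries to fix the reported field errors.
    """
    trail = []
    for attempt in range(1, max_attempts + 1):
        outcome = await run_step()
        trail.append(outcome["status"])

        if outcome["status"] == "needs_repair":
            # Fix the form, then let the next attempt re-validate from scratch.
            await repair_fields(outcome["errors"])
            continue

        # "success", "branch", and "failed" are terminal for this step:
        # the caller advances, replans for the new page class, or escalates.
        return {"final": outcome, "attempts": attempt, "trail": trail}

    return {
        "final": {"status": "retry_budget_exhausted"},
        "attempts": max_attempts,
        "trail": trail,
    }
```

Keeping the budget explicit means a form that can never be repaired ends in a classified failure instead of an unbounded loop.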

Production considerations for shopping, booking, and extraction systems

Shopping flows

Key issues:

variant selection changes DOM structure
price and stock mutate asynchronously
upsell modals appear between steps
cart pages include multiple semantically similar CTAs

Focus on:

cart state verification after each mutation
SKU/variant identity tracking
explicit quantity validation
confirmation that add-to-cart actually changed cart count or line items
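
Cart verification after a mutation can be expressed as a diff between two cart readings. This is a sketch, assuming the cart has already been parsed into a SKU-to-quantity mapping; `verify_cart_mutation` is a hypothetical helper.

```python
from typing import Dict


def verify_cart_mutation(
    before: Dict[str, int],
    after: Dict[str, int],
    sku: str,
    quantity: int,
) -> dict:
    """Check that exactly the expected SKU changed by the expected amount."""
    expected = dict(before)
    expected[sku] = expected.get(sku, 0) + quantity

    all_skus = set(before) | set(after) | {sku}
    ok = all(after.get(k, 0) == expected.get(k, 0) for k in all_skus)

    # Any other line item that moved is reported as (before, after).
    unexpected = {
        k: (before.get(k, 0), after.get(k, 0))
        for k in all_skus
        if k != sku and before.get(k, 0) != after.get(k, 0)
    }
    return {
        "ok": ok,
        "target_quantity": after.get(sku, 0),
        "unexpected_changes": unexpected,
    }
```

An add-to-cart click that did nothing shows up here as `ok` being false with `target_quantity` unchanged, which is exactly the silent failure a bare click cannot reveal.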

Booking flows

Key issues:

heavily dynamic forms
date pickers and passenger selectors as custom widgets
session expiration mid-flow
duplicated continue/next controls for responsive layouts

Focus on:

field-level validation parsing
widget-specific action adapters
page classification for each funnel stage
route and heading-based transition validation
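
Route- and heading-based transition validation can be sketched as two independent signals. The function name and the expected path prefix and heading terms are illustrative. Single-page funnels sometimes keep the same URL across stages, so both signals are reported separately rather than collapsed into one.

```python
from typing import Sequence
from urllib.parse import urlparse


def validate_stage_transition(
    url: str,
    headings: Sequence[str],
    expected_path_prefix: str,
    expected_heading_terms: Sequence[str],
) -> dict:
    """Report whether the route and the visible headings agree with the expected stage."""
    route_ok = urlparse(url).path.startswith(expected_path_prefix)

    lowered = [h.strip().lower() for h in headings]
    heading_ok = any(
        term.lower() in heading
        for heading in lowered
        for term in expected_heading_terms
    )
    return {"ok": route_ok and heading_ok, "route_ok": route_ok, "heading_ok": heading_ok}
```

When the two signals disagree, that is worth logging as its own case, because it often points to a session expiry redirect or a stage rendered in place.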

Data extraction flows

Key issues:

lazy-loaded lists
stale content after filter changes
pagination controls that rerender in place
rate limiting and challenge pages

Focus on:

content freshness checks after filters
duplicate detection across pages
extraction schema validation
challenge-state detection before blaming selectors
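
The last two focus items can be combined into one check that runs before any selector is blamed. A minimal sketch: the marker phrases are illustrative only and should be replaced with phrases collected from your own logs. An unchanged fingerprint means suspected staleness, since a filter can legitimately return the same rows.

```python
import hashlib
from typing import Sequence

# Illustrative phrases only; real challenge pages vary by vendor.
CHALLENGE_MARKERS = ("verify you are human", "unusual traffic", "captcha")


def content_fingerprint(rows: Sequence[str]) -> str:
    """Hash the visible row texts so two list states can be compared cheaply."""
    joined = "\n".join(row.strip() for row in rows)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def classify_extraction_state(
    page_text: str,
    rows_before: Sequence[str],
    rows_after: Sequence[str],
) -> str:
    """Classify the page after a filter or pagination action."""
    text = page_text.lower()
    # Check for a challenge page first: selectors are not at fault there.
    if any(marker in text for marker in CHALLENGE_MARKERS):
        return "challenge_detected"
    if not rows_after:
        return "empty_result"
    if content_fingerprint(rows_before) == content_fingerprint(rows_after):
        return "stale_content"
    return "fresh"
```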

Takeaways

Reliable browser agents are not built by making one selector smarter or one prompt longer. They get better when you move from optimistic interaction to verified state transitions.

The engineering pattern that holds up in production is straightforward:

build a fused page model from DOM, accessibility, and visual/actionability cues
generate multiple candidates and rank them against intent and context
run action-specific precondition checks
define post-action success in terms of expected state transitions
classify failures by category, not just exception type
implement explicit recovery loops for stale elements, modals, navigation changes, and partial form failures
log enough evidence to train better rankers and debug silent failures
evaluate with metrics that reflect business correctness, not just whether click() returned

If you do those things, your browser agent becomes much more deterministic, and determinism is what lets you operate these systems at production scale.

That doesn’t eliminate uncertainty. Websites still change, headless behavior still drifts, and anti-bot systems still interfere. But with DOM-aware planning, action validation, and recovery loops, failures become explainable, bounded, and fixable. That’s the difference between a demo agent and an operational one.
