Curriculum Learning for Browser Agents: Mining Repairable Failures from Real DOM Traces to Improve Step Reliability

Most browser agents do not fail because the task is fundamentally impossible. They fail because one step lands on the wrong node, at the wrong time, with the wrong assumptions.

In production, that usually looks boring rather than dramatic:

a click targets an element that was valid 200 ms ago but has been re-rendered
a button is “visible” by DOM rules but is covered by a sticky header
a list item exists in the underlying data but is not mounted in the DOM because the container is virtualized
the agent finds the right label text, but the underlying node is stale after hydration
the action succeeds mechanically, but the intended postcondition never happens

If you are building browser agents, generic benchmark success rates are not enough. What matters is step reliability under realistic UI failure modes. The fastest path to improvement is not inventing bigger planning loops. It is mining your own real execution traces, extracting repairable failures, and turning them into supervised training data and runtime recovery policies.

This article is about how to do that in practice.

I’ll focus on an implementation pattern that has worked well in browser automation systems:

collect step-level traces from real runs
annotate intent, preconditions, action targets, and postconditions
classify repairable failure modes
generate a curriculum from easy clean interactions to noisy multi-step flows
train or tune a policy that selects actions and repairs
run a step executor that validates postconditions and invokes targeted recoveries instead of full replans

The theme here is simple: treat failures as data, not just incidents.

A Real Failure: “Click Submit” Works in Replay, Fails in Production

Let’s start with a representative trace.

The task is mundane: sign in, fill a form, click Submit.

The agent produced this action:

json
{
  "step_id": 18,
  "intent": "submit form",
  "action": {
    "type": "click",
    "selector_strategy": "text+role",
    "selector": {"role": "button", "name": "Submit"}
  }
}

In local replay, this often works. In CI and headless production, it flakes.

Playwright error:

text
TimeoutError: locator.click: Timeout 5000ms exceeded.
Call log:
  - waiting for getByRole('button', { name: 'Submit' })
    - locator resolved to <button class="btn btn-primary">Submit</button>
  - attempting click action
    - waiting for element to be visible, enabled and stable
    - element is visible, enabled and stable
    - scrolling into view if needed
    - done scrolling
    - <div class="toast-container">…</div> intercepts pointer events
  - retrying click action
    - waiting 20ms
    - <div class="toast-container">…</div> intercepts pointer events
  - retrying click action
    - waiting 100ms
    - <header class="sticky">…</header> intercepts pointer events

The screenshot shows a sticky header and a transient toast. The DOM snapshot says the button exists and is visible. The accessibility tree says it is an enabled button named Submit. The node is correct.

But the action still fails.

Root cause

The failure is not target selection. It is interaction invalidation at execution time. The chosen node is semantically correct, yet physically unclickable in the current viewport.

If your agent training data only records successful steps, you never learn this distinction. You teach the model that locating a good node is enough. In real browser environments, it isn’t.

The repair is usually small:

wait for toast dismissal
scroll to center rather than nearest edge
verify clickable point with hit-testing
retry after layout stabilizes
if covered by persistent chrome, use keyboard navigation or form submit fallback

This is exactly the kind of failure you should mine and label.

Why Naive Approaches Fail

There are a few common designs that look reasonable in demos and degrade badly in production.

1. One-shot planning with weak execution verification

A lot of agents do this:

parse page
ask model for next action
run action
continue if no hard exception

That is too weak. Many browser failures are silent semantic failures:

typing went into the wrong field after focus drift
click hit an overlay
select option opened a dropdown but did not commit choice
route changed but form state did not update

If you do not verify postconditions after every step, you accumulate latent errors until the task is irrecoverable.

2. Full replan on any exception

Another common design: any failed click triggers a full “re-read the page and plan again” loop.

That works for some cases, but it is expensive and often unstable. Many failures are local and repairable:

detached node after React re-render
iframe not switched
virtualized row not mounted
stale text after async refresh
hydration race before listeners attach

A full replan can actually make things worse by changing context and introducing new mistakes.

3. Training on synthetic clean trajectories

Synthetic datasets tend to be unrealistically clean:

static DOM
no overlays
no sticky chrome
no hydration gaps
no network slowness
no virtualization
no nested browsing contexts

Agents trained on these traces learn idealized web interaction, not browser reality.

4. Over-indexing on screenshot-only policies

Vision matters, but screenshot-only control struggles with repair classes that are obvious in DOM/network/accessibility context:

element detached between candidate selection and action
text changed because stale server response was reconciled
button present but aria-disabled=true during pending mutation
row exists semantically but is not mounted due to virtualization window
target is inside cross-origin iframe

For robust step execution, you want multi-view state: DOM, AX tree, viewport geometry, and event/network timing.

The Architecture: Failure Mining as a Training and Runtime Primitive

The system I recommend has two loops sharing the same trace format.

Offline loop

instrument browser runs
collect rich step traces
detect and cluster failures
label repairable examples
produce curriculum datasets
train models or heuristics for candidate scoring and recovery policy selection
evaluate offline against failure-heavy trace sets

Online runtime loop

incrementally parse current page state
select candidate elements using structural + semantic features
execute action with guardrails
verify postconditions
if failure is repairable, invoke targeted recovery
only escalate to replan when local recovery budget is exhausted

The key is that both loops speak the same language: step intent, preconditions, target candidates, action attempt, observed outcome, failure class, repair.

What to Collect in a Step-Level Trace

A useful trace is not just “before screenshot” and “after screenshot.” It needs enough context to replay selection and diagnose why the step failed.

For each step, capture:

task metadata
page/frame URL and origin
DOM snapshot around candidate targets
accessibility tree slice
viewport geometry and scroll offsets
screenshots with bounding boxes
network activity and pending requests
console logs and page errors
action intent and natural-language instruction
candidate element set and ranking features
action execution logs
postcondition checks and outcome

A practical schema might look like this:

json
{
  "trace_id": "run_2026_03_18_114233",
  "step_id": 18,
  "timestamp": 1710764551.123,
  "task": {
    "goal": "Submit reimbursement form",
    "current_subgoal": "Click submit button"
  },
  "page": {
    "url": "https://app.example.com/expense/new",
    "title": "New Expense",
    "viewport": {"width": 1440, "height": 900, "scrollX": 0, "scrollY": 1180},
    "main_frame_id": "frame_main",
    "active_frame_id": "frame_main"
  },
  "dom": {
    "snapshot_ref": "dom_step18.bin",
    "candidate_node_ids": [4412, 8831, 5520]
  },
  "ax": {
    "snapshot_ref": "ax_step18.json",
    "candidate_ax_ids": [912, 1440]
  },
  "visual": {
    "screenshot_ref": "step18.png",
    "candidate_boxes": [
      {"node_id": 4412, "x": 991, "y": 732, "w": 128, "h": 36}
    ]
  },
  "network": {
    "inflight_requests": 2,
    "recent": [
      {"url": "/api/toasts", "status": 200, "ts": 1710764550.992},
      {"url": "/api/form/validate", "status": 202, "ts": 1710764551.011}
    ]
  },
  "intent": {
    "action_type": "click",
    "semantic_target": "submit form",
    "constraints": ["must submit current expense form"]
  },
  "selection": {
    "strategy": "candidate_ranker_v4",
    "top_candidates": [
      {
        "node_id": 4412,
        "score": 0.91,
        "features": {
          "role": "button",
          "name": "Submit",
          "text_similarity": 0.98,
          "clickable": true,
          "centerpoint_visible": false,
          "z_intersection_risk": 0.73
        }
      }
    ]
  },
  "execution": {
    "attempt": 1,
    "playwright_call": "getByRole('button', { name: 'Submit' }).click()",
    "error": "pointer_intercepted"
  },
  "postcondition": {
    "expected": ["route changes or success toast appears or form enters submitted state"],
    "observed": []
  }
}

This looks heavyweight, and it is. But you do not need to store full-fidelity blobs forever. Later I’ll cover compression and retention.

Instrumenting Playwright to Capture the Right State

Below is a Python-oriented trace collector using Playwright. The point is not that this exact code is enough, but that your runtime should record browser-native evidence, not just model prompts.

python
import asyncio
import json
import time
from pathlib import Path
from typing import Any, Dict, List
from playwright.async_api import async_playwright, Page, Frame, Error as PlaywrightError

TRACE_DIR = Path("./traces")
TRACE_DIR.mkdir(exist_ok=True)

JS_DOM_SNAPSHOT = r'''
() => {
  function nodeToObj(node, depth = 0) {
    if (!node || depth > 5) return null;
    const rect = node.getBoundingClientRect ? node.getBoundingClientRect() : null;
    const style = window.getComputedStyle ? getComputedStyle(node) : null;
    return {
      tag: node.tagName || null,
      id: node.id || null,
      classes: node.className || null,
      text: (node.innerText || node.textContent || '').slice(0, 500),
      role: node.getAttribute ? node.getAttribute('role') : null,
      name: node.getAttribute ? (node.getAttribute('aria-label') || node.getAttribute('name')) : null,
      disabled: node.disabled || (node.getAttribute && node.getAttribute('aria-disabled') === 'true') || false,
      href: node.href || null,
      rect: rect ? {x: rect.x, y: rect.y, width: rect.width, height: rect.height} : null,
      visible: !!(rect && rect.width > 0 && rect.height > 0),
      pointerEvents: style ? style.pointerEvents : null,
      zIndex: style ? style.zIndex : null,
      children: Array.from(node.children || []).slice(0, 20).map(c => nodeToObj(c, depth + 1))
    };
  }

  const root = document.body;
  return {
    url: location.href,
    title: document.title,
    viewport: {
      width: window.innerWidth,
      height: window.innerHeight,
      scrollX: window.scrollX,
      scrollY: window.scrollY
    },
    activeElement: document.activeElement ? {
      tag: document.activeElement.tagName,
      id: document.activeElement.id,
      text: (document.activeElement.innerText || '').slice(0, 200)
    } : null,
    body: nodeToObj(root)
  };
}
'''

JS_HIT_TEST = r'''
(el) => {
  const rect = el.getBoundingClientRect();
  const cx = rect.left + rect.width / 2;
  const cy = rect.top + rect.height / 2;
  const topEl = document.elementFromPoint(cx, cy);
  return {
    rect: {x: rect.x, y: rect.y, width: rect.width, height: rect.height},
    center: {x: cx, y: cy},
    topTag: topEl ? topEl.tagName : null,
    topId: topEl ? topEl.id : null,
    topClass: topEl ? topEl.className : null,
    isTargetOrDescendant: topEl ? (topEl === el || el.contains(topEl)) : false
  };
}
'''

class StepTracer:
    def __init__(self, run_id: str):
        self.run_id = run_id
        self.events: List[Dict[str, Any]] = []

    async def snapshot_page(self, page: Page, step_id: int) -> Dict[str, Any]:
        ts = time.time()
        dom = await page.evaluate(JS_DOM_SNAPSHOT)
        screenshot_path = TRACE_DIR / f"{self.run_id}_step{step_id}.png"
        await page.screenshot(path=str(screenshot_path), full_page=False)

        event = {
            "run_id": self.run_id,
            "step_id": step_id,
            "ts": ts,
            "page": dom,
            "screenshot": str(screenshot_path)
        }
        self.events.append(event)
        return event

    def write(self):
        out = TRACE_DIR / f"{self.run_id}.json"
        out.write_text(json.dumps(self.events, indent=2))

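async def robust_click(page: Page, locator, tracer: StepTracer, step_id: int):
    event = await tracer.snapshot_page(page, step_id)
    try:
        await locator.scroll_into_view_if_needed()
        await locator.click(timeout=5000)
        return {"ok": True}
    except PlaywrightError as e:
        # Record the failure and hit-test evidence in the trace for offline classification.
        event["error"] = str(e)
        try:
            event["hit_test"] = await locator.evaluate(JS_HIT_TEST, timeout=1000)
        except PlaywrightError:
            event["hit_test"] = None  # target detached or no longer resolvable
        return {"ok": False, "error": str(e)}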

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page(viewport={"width": 1440, "height": 900})
        tracer = StepTracer(run_id="demo_run")
        await page.goto("https://example.com")
        result = await robust_click(page, page.get_by_role("button", name="Submit"), tracer, 1)
        print(result)
        tracer.write()
        await browser.close()

if __name__ == "__main__":
    asyncio.run(main())

For real systems, extend this to:

capture frame tree and per-frame snapshots
log request/response metadata
hook console/pageerror
capture accessibility tree where possible
record action retries and candidate selection metadata
save DOM deltas rather than full DOM each step

JavaScript network and error hooks

javascript
const trace = { requests: [], console: [], pageErrors: [] };

page.on('request', req => {
  trace.requests.push({
    t: Date.now(),
    kind: 'request',
    url: req.url(),
    method: req.method(),
    resourceType: req.resourceType()
  });
});

page.on('response', async res => {
  trace.requests.push({
    t: Date.now(),
    kind: 'response',
    url: res.url(),
    status: res.status()
  });
});

page.on('console', msg => {
  trace.console.push({
    t: Date.now(),
    type: msg.type(),
    text: msg.text()
  });
});

page.on('pageerror', err => {
  trace.pageErrors.push({
    t: Date.now(),
    message: err.message,
    stack: err.stack
  });
});

These hooks matter because many “DOM failures” are really timing failures visible in network and console events.

Labeling Action Intents and Preconditions

A repair dataset is much more useful if it represents what the step was trying to achieve, not only what method was called.

For each step, label at least:

intent: click primary submit, open menu, choose list item, focus field, enter text, confirm modal, switch tab
target semantics: submit form, search products, choose shipping option
preconditions: visible, attached, enabled, in correct frame, focusable, text stable, option mounted
postconditions: route changed, modal closed, field value set, list expanded, toast appeared, DOM state updated

This is the bridge between brittle selector-level replay and a generalizable policy.

Example labeled step:

json
{
  "intent": "select_option",
  "target_semantics": "choose 'United States' in billing country dropdown",
  "preconditions": [
    "country combobox exists",
    "combobox expanded or expandable",
    "option text available or searchable",
    "target frame active"
  ],
  "postconditions": [
    "combobox value == 'United States'",
    "country-dependent fields revalidated"
  ]
}

You can bootstrap these labels from heuristics plus reviewer tooling:

infer action type from automation call
infer semantic target from nearby text, form labels, ARIA name, instruction span, and task metadata
infer preconditions from target properties and action contracts
infer postconditions from expected state transitions by action type

The goal is not perfect ontology purity. The goal is making failures repairable and learnable.
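
As a minimal sketch of the heuristic bootstrap, draft labels can be derived from the automation call and then corrected by reviewers. The `DEFAULT_CONTRACTS` table, the `CALL_TO_ACTION` mapping, and the raw step fields (`method`, `aria_name`, `nearby_label`, `instruction`) are all hypothetical names chosen for illustration, not part of any library.

```python
from typing import Dict, List

# Hypothetical default contracts per action type; tune these for your own apps.
DEFAULT_CONTRACTS: Dict[str, Dict[str, List[str]]] = {
    "click": {
        "preconditions": ["attached", "visible", "enabled", "in correct frame"],
        "postconditions": ["route changed or DOM state updated"],
    },
    "fill": {
        "preconditions": ["attached", "visible", "enabled", "focusable"],
        "postconditions": ["field value set"],
    },
    "select_option": {
        "preconditions": ["attached", "visible", "enabled", "option mounted"],
        "postconditions": ["selected value updated"],
    },
}

# Illustrative mapping from automation method names to action types.
CALL_TO_ACTION = {
    "click": "click",
    "fill": "fill",
    "type": "fill",
    "select_option": "select_option",
}


def bootstrap_label(raw_step: dict) -> dict:
    """Produce a draft label for human review from a raw step record."""
    action = CALL_TO_ACTION.get(raw_step["method"], "unknown")
    contract = DEFAULT_CONTRACTS.get(action, {"preconditions": [], "postconditions": []})
    # Prefer the ARIA name, then a nearby form label, then the instruction text.
    semantics = (
        raw_step.get("aria_name")
        or raw_step.get("nearby_label")
        or raw_step.get("instruction", "")
    )
    return {
        "intent": action,
        "target_semantics": semantics,
        "preconditions": list(contract["preconditions"]),
        "postconditions": list(contract["postconditions"]),
        # Flag anything the heuristics could not ground so a reviewer sees it first.
        "needs_review": action == "unknown" or not semantics,
    }
```

The `needs_review` flag keeps reviewer time focused on steps where the heuristics had nothing to anchor on.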

Failure Taxonomy That Actually Helps Runtime Recovery

A good failure taxonomy should map to distinct recovery behavior. Here are the classes worth tracking.

1. Detached node

Symptoms

Playwright: element is not attached to the DOM
action target existed during ranking, disappeared during execution
often after React/Vue rerender, optimistic update, route transition

Recovery

re-resolve candidate from stable anchors
avoid stale handles; store selector features and semantic signature
retry within same local context before replan

2. Occlusion / pointer interception

Symptoms

intercepting element in call log
elementFromPoint at center is not target
sticky headers, toasts, cookie banners, modals, loading masks

Recovery

center-scroll or offset-scroll
wait for transient overlay disappearance
dismiss known overlays
keyboard submit or Enter if semantically equivalent

3. Hydration race / listeners not attached yet

Symptoms

button visible and enabled but first click does nothing
network and console show bundle load or hydration completion near failure
repeated click after short delay succeeds

Recovery

wait for app idle heuristic, not just DOM loaded
require text and layout stability over a short window
retry with postcondition verification

4. Stale text / semantic drift

Symptoms

matching text is present but no longer refers to intended object
list order changed, labels updated, server-rendered placeholder replaced
agent clicks old “Edit” for wrong row

Recovery

rank by structural anchors, not only text
include nearby key-value context and ancestry features
verify object identity after action

5. Virtualized content

Symptoms

target row known from data/task but not mounted in DOM
AX tree may expose partial semantics, DOM query misses target
scrolling changes DOM membership

Recovery

detect virtual scrollers
scroll/search progressively
use list container semantics and row index/data attributes
verify mount before action

6. Iframe boundary issues

Symptoms

target not found in main frame but visible on screen
clicks appear to hit frame element, not internal target
cross-origin constraints block direct DOM traversal

Recovery

identify frame ownership during candidate selection
switch frame context explicitly
use frame-local selectors and screenshots
maintain frame tree in trace and candidate features

7. Disabled/pending state

Symptoms

aria-disabled, disabled attribute, loading spinners, submit button gated by validation
action mechanically possible via force click but semantically invalid

Recovery

do not force click as default
satisfy missing prerequisites
inspect validation errors and required fields

8. Focus drift / keyboard target mismatch

Symptoms

typed text appears nowhere or in wrong input
modal steals focus
async validation shifts focus

Recovery

verify active element before typing
prefer direct fill on intended control
re-focus and re-check value postcondition

The point of this taxonomy is that each class leads to a targeted repair policy. That is what makes local recovery work.
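
To make the mapping from symptoms to classes concrete, here is a rough rule-based sketch. The function name `classify_from_evidence`, its parameters, and the `silent_no_effect` bucket are assumptions for illustration. The matched substrings follow the error messages quoted in this article; Playwright's exact wording can vary between versions, so treat them as a starting point rather than a stable contract.

```python
from typing import Optional


def classify_from_evidence(
    error_text: str,
    hit_test: Optional[dict] = None,
    postcondition_met: Optional[bool] = None,
    aria_disabled: bool = False,
) -> str:
    """Map raw failure evidence to a coarse failure class."""
    msg = (error_text or "").lower()

    # Exception text is the cheapest signal, so check it first.
    if "not attached to the dom" in msg or "detached" in msg:
        return "detached_node"
    if "intercepts pointer events" in msg:
        return "occlusion"

    # Hit-test evidence: the topmost element at the center is not the target.
    if hit_test is not None and not hit_test.get("isTargetOrDescendant", True):
        return "occlusion"

    if aria_disabled or "not enabled" in msg:
        return "disabled_or_pending"

    # No exception, but the expected transition never happened. This bucket
    # needs further evidence (timing, focus) to separate hydration races
    # from focus drift.
    if not msg and postcondition_met is False:
        return "silent_no_effect"

    return "unknown"
```

A learned classifier can replace these rules later; the rules are still useful as weak labels and as a fallback when the model is unsure.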

Building Curriculum Stages from Failure-Mined Data

Curriculum learning is often discussed abstractly. For browser agents, it should be concrete and operational.

You already have trace data. Use it to define stages that increase difficulty along the dimensions your runtime actually struggles with.

Stage 0: Clean single-step interactions

Examples:

click clearly labeled visible button
fill obvious text input
select dropdown option already mounted

Characteristics:

static DOM
no overlays
single frame
no virtualized lists
immediate postcondition

Purpose:

train base candidate selection and postcondition checking

Stage 1: Clean actions with distractors

Examples:

multiple buttons named “Save”
multiple inputs with similar labels
repeated rows and actions in tables

Characteristics:

target requires structural grounding
nearest text is insufficient

Purpose:

train semantic + structural ranking

Stage 2: Timing noise and transient invalidation

Examples:

hydration races
delayed enabling
spinners and toasts
route transitions

Purpose:

train anti-flake timing and local retries

Stage 3: DOM churn and stale targets

Examples:

detached nodes after render
list reorder after filter
text replaced after async load

Purpose:

train semantic re-resolution from stable anchors

Stage 4: Viewport and occlusion complexity

Examples:

sticky headers/footers
responsive layout changes
offscreen targets
nested scroll containers

Purpose:

train geometry-aware execution and hit-test verification

Stage 5: Virtualization and search-in-list behavior

Examples:

lazy rows
infinite scroll tables
combobox options mounted on demand

Purpose:

train scroll/search/mount loops

Stage 6: Iframes and multi-context flows

Examples:

embedded payment forms
auth widgets
editor inside iframe

Purpose:

train frame-aware targeting and action routing

Stage 7: Multi-step flows with compounding local failures

Examples:

checkout
enterprise admin forms
dashboards with tabbed workflows

Purpose:

train bounded local recovery without full task collapse

The key is not just ordering from easy to hard. It is preserving the repair labels so the learner sees both failure state and successful recovery action.
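
One possible way to bucket mined examples into these stages is a small ordered rule table. The example fields used here (`failure_classes`, `frame_depth`, `distractor_count`, `steps_in_flow`) are hypothetical; substitute whatever your trace schema records.

```python
from typing import Callable, Dict, List, Tuple

# Rules are ordered hardest-first so an example lands in the most demanding
# stage it qualifies for. Field names are illustrative assumptions.
STAGE_RULES: List[Tuple[int, Callable[[dict], bool]]] = [
    (7, lambda ex: ex.get("steps_in_flow", 1) > 1 and len(ex.get("failure_classes", [])) > 1),
    (6, lambda ex: ex.get("frame_depth", 0) > 0),
    (5, lambda ex: "virtualized_content" in ex.get("failure_classes", [])),
    (4, lambda ex: "occlusion" in ex.get("failure_classes", [])),
    (3, lambda ex: bool({"detached_node", "stale_text"} & set(ex.get("failure_classes", [])))),
    (2, lambda ex: "hydration_race" in ex.get("failure_classes", [])),
    (1, lambda ex: ex.get("distractor_count", 0) > 0),
]


def assign_stage(example: dict) -> int:
    """Return the curriculum stage for one example; stage 0 is the clean default."""
    for stage, rule in STAGE_RULES:
        if rule(example):
            return stage
    return 0


def build_curriculum(examples: List[dict]) -> Dict[int, List[dict]]:
    """Group examples by stage, ordered from easiest to hardest."""
    stages: Dict[int, List[dict]] = {}
    for ex in examples:
        stages.setdefault(assign_stage(ex), []).append(ex)
    return dict(sorted(stages.items()))
```

Because each example keeps its original fields, the repair labels travel with it into whichever stage it is assigned.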

Mining Repairable Failures from Real Traces

You need a pipeline that converts raw runs into supervised examples.

Step 1: Segment runs into attempts

For each action step:

identify intended target and action
collect all retries within a local time window
attach browser logs, DOM delta, and postcondition observations

Step 2: Determine whether the failure was repairable

A failure is repairable if a local change within the same subgoal later succeeded, for example:

re-query same semantic target and click succeeded
same input after refocus succeeded
switched frame and target became interactable
scrolled virtualized container until row mounted, then click succeeded
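
A simple way to apply this definition is to scan the attempts recorded for one step and look for a failure followed by a success that did not require a replan. This is a sketch under assumptions: the attempt fields (`ts`, `attempt`, `subgoal`, `outcome`, `after_replan`, `repair_policy`) are hypothetical, and all attempts passed in are assumed to belong to the same step.

```python
from typing import List, Optional


def find_repair(attempts: List[dict]) -> Optional[dict]:
    """Return a failure/repair pair if a failed attempt was later fixed locally."""
    ordered = sorted(attempts, key=lambda a: a["ts"])
    for i, failed in enumerate(ordered):
        if failed["outcome"] != "failed":
            continue
        for later in ordered[i + 1:]:
            same_subgoal = later["subgoal"] == failed["subgoal"]
            local = not later.get("after_replan", False)
            if same_subgoal and local and later["outcome"] == "success":
                return {
                    "failure_class": failed.get("failure_class", "unknown"),
                    "repair_policy": later.get("repair_policy"),
                    "attempts_to_success": later["attempt"] - failed["attempt"],
                }
    # Either nothing failed, or the failure was only resolved by a replan.
    return None
```

Steps that return `None` despite containing a failure are still worth keeping: they are the negative examples that show where local recovery was not enough.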

Step 3: Extract failure → repair pairs

Example:

json
{
  "failure_state": {
    "intent": "click submit",
    "failure_class": "occlusion",
    "evidence": {
      "interceptor": "div.toast-container",
      "centerpoint_visible": false,
      "target_role": "button",
      "target_name": "Submit"
    }
  },
  "repair_action": {
    "policy": "wait_overlay_then_center_scroll_then_retry",
    "args": {"max_wait_ms": 1500}
  },
  "outcome": "success"
}

Step 4: Cluster near-duplicates

You do not want a dataset dominated by one app’s same toast overlay repeated 20,000 times.

Cluster on:

failure class
DOM ancestry signature
app/page template
action type
target role/name
repair policy

Then sample for diversity.
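
A minimal sketch of that cluster-then-sample step, assuming each example is a flat dict carrying the fields listed above under hypothetical key names (`ancestry_signature`, `template`, and so on):

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical field names mirroring the clustering dimensions above.
CLUSTER_FIELDS = (
    "failure_class",
    "ancestry_signature",
    "template",
    "action_type",
    "target_role",
    "target_name",
    "repair_policy",
)


def cluster_key(example: dict) -> Tuple:
    """Examples with the same key are treated as near-duplicates."""
    return tuple(example.get(field) for field in CLUSTER_FIELDS)


def sample_diverse(examples: List[dict], per_cluster: int, seed: int = 0) -> List[dict]:
    """Keep at most per_cluster examples from each cluster."""
    rng = random.Random(seed)  # seeded so dataset builds are reproducible
    clusters: Dict[Tuple, List[dict]] = defaultdict(list)
    for ex in examples:
        clusters[cluster_key(ex)].append(ex)

    sampled: List[dict] = []
    for members in clusters.values():
        if len(members) > per_cluster:
            members = rng.sample(members, per_cluster)
        sampled.extend(members)
    return sampled
```

Exact-key clustering is the crudest option. If near-duplicates differ in small ways, a fuzzier signature (for example, a truncated ancestry chain) may group them better.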

Step 5: Build train/validation/test splits by site and template

Avoid leakage. If the same page template appears in train and test, your offline metrics are inflated.

Prefer splits by:

domain/app
page family/template hash
workflow type

This matters more than benchmark ideology. You want evidence the system generalizes to new UI instances, not just repeated pages.
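
One leakage-resistant way to do this is to assign whole groups to a split by hashing a group identifier, so every example from the same site and template lands in the same split. The field names `domain` and `template_hash` and the 80/10/10 proportions are assumptions for illustration.

```python
import hashlib
from typing import Dict, List


def split_for(group_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically map a group id to train, validation, or test."""
    # A content hash (not Python's salted hash()) keeps assignment stable
    # across runs and machines.
    bucket = int(hashlib.sha256(group_id.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


def split_examples(examples: List[dict]) -> Dict[str, List[dict]]:
    """Split by (domain, template) group so no group straddles two splits."""
    splits: Dict[str, List[dict]] = {"train": [], "validation": [], "test": []}
    for ex in examples:
        group_id = f'{ex["domain"]}|{ex["template_hash"]}'
        splits[split_for(group_id)].append(ex)
    return splits
```

With few distinct groups the realized proportions can be far from the targets, so check split sizes after the fact. For a stricter test, group by `domain` alone.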

Incremental DOM Parsing at Runtime

A runtime should not rebuild full world state from scratch on every tiny step if the page only changed locally.

Use an incremental parser with invalidation boundaries.

Track:

frame tree
node ids and ancestry
role/name/text summaries
bounding boxes
scroll containers
mutation timestamps
stable signatures for candidate re-resolution

A useful stable signature includes:

role
accessible name
normalized text
ancestor chain of landmarks/forms/sections
sibling labels and nearby headings
data-* attributes when present
frame path
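
To illustrate how such a signature supports re-resolution after a re-render, here is a small sketch. The `Signature` class, the integer weights, and the threshold are illustrative assumptions, not tuned values; nodes are assumed to be plain dicts with `role`, `name`, `text`, `landmarks`, and `frame_path` keys.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Signature:
    role: str
    name: str
    text: str
    landmarks: Tuple[str, ...]   # ancestor chain of landmarks/forms/sections
    frame_path: Tuple[str, ...]


def _norm(s: Optional[str]) -> str:
    return re.sub(r"\s+", " ", (s or "").strip().lower())


def make_signature(node: dict) -> Signature:
    return Signature(
        role=_norm(node.get("role")),
        name=_norm(node.get("name")),
        text=_norm(node.get("text")),
        landmarks=tuple(node.get("landmarks", [])),
        frame_path=tuple(node.get("frame_path", [])),
    )


def match_score(a: Signature, b: Signature) -> int:
    """Score out of 10. Role and frame must agree; the rest adds evidence."""
    if a.frame_path != b.frame_path or a.role != b.role:
        return 0
    score = 4
    score += 3 if a.name == b.name else 0
    score += 2 if a.landmarks == b.landmarks else 0
    score += 1 if a.text == b.text else 0
    return score


def re_resolve(old: Signature, fresh_nodes: List[dict], threshold: int = 7) -> Optional[dict]:
    """Pick the fresh node that best matches the old signature, if any is close enough."""
    scored = [(match_score(old, make_signature(n)), n) for n in fresh_nodes]
    best_score, best_node = max(scored, key=lambda pair: pair[0], default=(0, None))
    return best_node if best_score >= threshold else None
```

Weighting name and landmarks above raw text reflects the stale-text failure class: visible text is the feature most likely to change between ranking and execution.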

Python sketch for candidate extraction

python
from dataclasses import dataclass
from typing import List, Optional
import re

@dataclass
class Candidate:
    node_id: str
    frame_id: str
    tag: str
    role: Optional[str]
    name: str
    text: str
    bbox: dict
    enabled: bool
    visible: bool
    ancestors: List[str]
    score: float = 0.0


def normalize(s: str) -> str:
    return re.sub(r'\s+', ' ', (s or '').strip().lower())


def score_candidate(intent: str, c: Candidate) -> float:
    score = 0.0
    target = normalize(intent)
    name = normalize(c.name)
    text = normalize(c.text)

    if 'submit' in target and (name == 'submit' or text == 'submit'):
        score += 0.5
    if c.role == 'button' or c.tag == 'BUTTON':
        score += 0.2
    if c.visible:
        score += 0.1
    if c.enabled:
        score += 0.1
    if any(a in ('FORM', 'MAIN', 'SECTION') for a in c.ancestors):
        score += 0.05
    if c.bbox and c.bbox.get('width', 0) > 20 and c.bbox.get('height', 0) > 20:
        score += 0.05
    return score


def rank_candidates(intent: str, candidates: List[Candidate]) -> List[Candidate]:
    for c in candidates:
        c.score = score_candidate(intent, c)
    return sorted(candidates, key=lambda x: x.score, reverse=True)

In production, the score should use more than string matching:

lexical similarity to task and current subgoal
role-action compatibility
geometry features
ancestry landmarks
label/control association
historical success rates for pattern families
hit-test validity
frame confidence
mutation recency penalty

The model can be learned, but the runtime still needs deterministic guardrails.

Verifying Postconditions After Every Action

This is where many agents become reliable or stay flaky.

An action is not successful because Playwright didn’t throw. It is successful because the expected state transition happened.

Postconditions should be specific by action type.

Click examples

modal opened or closed
route changed
accordion expanded
submit triggered network request and form entered pending/submitted state

Fill examples

input value equals expected normalized string
dependent validation state updated
masked input matches canonical value

Select examples

selected option text/value updated on control
dependent region rerendered

A runtime loop:

python
async def execute_step_with_verification(step, page, selector_engine, recoveries):
    candidates = await selector_engine.find_candidates(page, step.intent)
    target = candidates[0] if candidates else None
    if not target:
        return {"status": "failed", "reason": "no_candidate"}

    result = await perform_action(page, step, target)
    if result["status"] == "ok":
        ok = await verify_postcondition(page, step.postconditions)
        if ok:
            return {"status": "ok", "used_recovery": False}
        result = {"status": "failed", "reason": "postcondition_not_met"}

    failure_class = await classify_failure(page, step, target, result)
    repair = recoveries.choose(failure_class, step, target)
    if not repair:
        return {"status": "failed", "reason": failure_class}

    repaired = await repair.apply(page, step, target)
    if repaired:
        ok = await verify_postcondition(page, step.postconditions)
        if ok:
            return {"status": "ok", "used_recovery": True, "recovery": failure_class}

    return {"status": "failed", "reason": failure_class}

This is the core execution discipline: act, verify, recover locally, verify again, then escalate.

Targeted Recovery Policies Instead of Full Replans

A repair policy should be scoped, cheap, and evidence-driven.

Example recovery table

python
RECOVERY_TABLE = {
    "detached_node": [
        "requery_by_semantic_signature",
        "retry_click"
    ],
    "occlusion": [
        "wait_transient_overlay",
        "scroll_center",
        "hit_test_then_click"
    ],
    "hydration_race": [
        "wait_ui_stable_window",
        "retry_click_with_postcondition_check"
    ],
    "virtualized_content": [
        "identify_scroll_container",
        "progressive_scroll_search",
        "requery_target"
    ],
    "iframe_boundary": [
        "switch_frame_context",
        "requery_in_frame"
    ]
}
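
A table like this still needs something to walk it with a bounded budget. The sketch below is one way to do that: it runs the listed policies in order, verifies the postcondition after each, and signals escalation once the budget is spent. `run_recovery`, the `policies` registry, and the `verify` callable are hypothetical names; the policies themselves are assumed to be async callables you implement elsewhere.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Policy = Callable[[], Awaitable[bool]]    # applies one repair; True if it ran cleanly
Verifier = Callable[[], Awaitable[bool]]  # True if the step postcondition now holds


async def run_recovery(
    failure_class: str,
    table: Dict[str, List[str]],
    policies: Dict[str, Policy],
    verify: Verifier,
    budget: int = 3,
) -> dict:
    """Apply recovery policies in order until the postcondition holds or budget runs out."""
    tried: List[str] = []
    for name in table.get(failure_class, [])[:budget]:
        tried.append(name)
        applied = await policies[name]()
        # Verify after every policy so recovery stops as soon as the step succeeds.
        if applied and await verify():
            return {"status": "ok", "tried": tried}
    # Local recovery exhausted: hand control back to the planner.
    return {"status": "escalate_to_replan", "tried": tried}
```

Returning the `tried` list makes each recovery attempt a trace record in its own right, which feeds directly back into failure mining.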

JavaScript example: hit-test before click

javascript
async function clickWithHitTest(locator) {
  const handle = await locator.elementHandle();
  if (!handle) throw new Error('missing element handle');

  const hit = await handle.evaluate((el) => {
    const r = el.getBoundingClientRect();
    const x = r.left + r.width / 2;
    const y = r.top + r.height / 2;
    const top = document.elementFromPoint(x, y);
    return {
      width: r.width,
      height: r.height,
      x,
      y,
      ok: !!top && (top === el || el.contains(top)),
      topTag: top?.tagName,
      topClass: top?.className || null,
      topId: top?.id || null,
    };
  });

  if (!hit.ok) {
    await locator.scrollIntoViewIfNeeded();
    await locator.evaluate((el) => {
      el.scrollIntoView({ block: 'center', inline: 'center', behavior: 'instant' });
    });
  }

  await locator.click({ timeout: 3000 });
}

Example: virtualized list recovery

python
import asyncio

async def find_in_virtualized_list(container_locator, text, max_scrolls=20):
    for _ in range(max_scrolls):
        item = container_locator.get_by_text(text, exact=True)
        if await item.count() > 0:
            return item.first
        # Scroll by 80% of the container height so consecutive windows overlap.
        await container_locator.evaluate("el => { el.scrollTop += el.clientHeight * 0.8; }")
        # Give the virtualizer a moment to mount the next window of rows.
        await asyncio.sleep(0.15)
    return None

Example: iframe-aware targeting

python
async def find_button_any_frame(page, name: str):
    for frame in page.frames:
        locator = frame.get_by_role("button", name=name)
        try:
            if await locator.count() > 0:
                return frame, locator.first
        except Exception:
            pass
    return None, None

These policies are not glamorous, but they are exactly what turns step reliability into something you can measure and improve.

Headless Environments and Anti-Flake Timing

Headless is not just headed without pixels. You will see meaningful differences:

font/render timing differences
viewport defaults and responsive breakpoints
animation timing interactions
focus behavior under CI load
slower JS execution under shared runners

A few practical rules help a lot.

1. Use deterministic viewport and user agent

Do not let browser defaults drift across environments.

python
page = await browser.new_page(
    viewport={"width": 1440, "height": 900},
    user_agent="Mozilla/5.0 ... browser-agent-runtime/1.0"
)

2. Prefer stability windows over arbitrary sleeps

Avoid sleep(2) after every action. Instead, wait for a short interval in which the critical signals stay unchanged:

no DOM mutations in target subtree
layout box stable
no overlay at hit point
no pending app-critical request class

Stability probe injected into page

javascript
() => {
  if (window.__agentStableProbeInstalled) return true;
  window.__agentStableProbeInstalled = true;
  window.__agentMutations = 0;
  const obs = new MutationObserver(() => window.__agentMutations++);
  obs.observe(document.documentElement, { childList: true, subtree: true, attributes: true });
  return true;
}

Then poll mutation count over a short window rather than sleeping blindly.
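
The polling side might look like the sketch below. It takes any async callable that returns the current counter, so it is not tied to a specific browser API. The function name `wait_for_stable` and its default timings are assumptions to adjust for your apps.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def wait_for_stable(
    read_counter: Callable[[], Awaitable[int]],
    quiet_ms: int = 300,
    timeout_ms: int = 3000,
    poll_ms: int = 50,
) -> bool:
    """Return True once the counter is unchanged for quiet_ms, False on timeout."""
    deadline = time.monotonic() + timeout_ms / 1000
    last = await read_counter()
    quiet_since = time.monotonic()

    while time.monotonic() < deadline:
        await asyncio.sleep(poll_ms / 1000)
        current = await read_counter()
        now = time.monotonic()
        if current != last:
            # Something mutated: restart the quiet window.
            last, quiet_since = current, now
        elif (now - quiet_since) * 1000 >= quiet_ms:
            return True
    return False


# With Playwright for Python and the probe above installed, usage could be:
#   stable = await wait_for_stable(
#       lambda: page.evaluate("() => window.__agentMutations")
#   )
```

A `False` result is itself evidence: a page that never goes quiet usually has a polling widget or animation that should be excluded from the observed subtree.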

3. Disable or reduce known flaky animations when possible

For internal apps or controlled test environments, inject CSS:

javascript
await page.addStyleTag({
  content: `
    *, *::before, *::after {
      transition-duration: 0s !important;
      animation-duration: 0s !important;
      scroll-behavior: auto !important;
    }
  `
});

Do this carefully; some apps rely on transition events to sequence state changes. But for many enterprise UIs, this removes a common source of spurious flakes.

4. Never default to force click

force=True is useful for diagnostics, not as a primary recovery. It skips the actionability checks, which are exactly the evidence you need to judge whether the action was semantically valid.

Trace Compression and Storage Strategy

Raw traces get large quickly. If you are collecting step-level DOM, screenshots, AX, and network logs, cost becomes real.

A practical strategy:

Keep full fidelity for:

failed steps
repaired steps
a sampled subset of successful steps
first occurrence of a page/template/version
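One way to express this retention rule is a small decision function. The sketch below is an assumption-laden illustration: the StepSummary fields, the status labels, and the 5% default sampling rate are hypothetical. Sampling is hash-based so the same step always gets the same decision across reprocessing.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class StepSummary:
    run_id: str
    step_id: int
    status: str              # "ok", "repaired", or "failed" (assumed labels)
    template_signature: str  # hash identifying page/template/version


def in_sample(run_id: str, step_id: int, rate: float) -> bool:
    """Deterministic sampling: hash the step key into [0, 1) and compare."""
    digest = hashlib.sha256(f"{run_id}:{step_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def keep_full_fidelity(
    step: StepSummary, seen_templates: set[str], success_rate: float = 0.05
) -> bool:
    """Decide whether to store the full trace for this step."""
    first_seen = step.template_signature not in seen_templates
    seen_templates.add(step.template_signature)
    if step.status in ("failed", "repaired"):
        return True
    if first_seen:
        return True
    return in_sample(step.run_id, step.step_id, success_rate)
```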

Store compressed representations for everything else:

DOM delta against previous step
subtree around target and top-k candidates only
screenshot crops plus page-level thumbnail
network summaries instead of bodies
hashed template signatures

Useful compression techniques:

deduplicate repeated DOM subtrees by content hash
store text separately from structure
keep normalized accessibility nodes rather than full protocol dumps
persist selector features instead of all raw attributes

You want enough evidence to train and debug, not a forensic archive of every pixel forever.
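Deduplicating subtrees by content hash can be sketched as a content-addressed store. The node shape used here (tag, attrs, children) is a simplified assumption, not a real DOM serialization; adapt it to whatever your tracer emits.

```python
import hashlib
import json


def store_subtree(node: dict, store: dict[str, dict]) -> str:
    """Store `node` bottom-up and return its content hash.

    `node` is an assumed simplified shape:
    {"tag": str, "attrs": dict, "children": [node, ...]}.
    Identical subtrees hash to the same key and are stored once.
    """
    child_hashes = [store_subtree(child, store) for child in node.get("children", [])]
    record = {
        "tag": node["tag"],
        "attrs": node.get("attrs", {}),
        "children": child_hashes,  # children are referenced by hash, not inlined
    }
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    store.setdefault(digest, record)
    return digest
```

A step snapshot then reduces to a single root hash, and repeated table rows or list items cost one stored record each rather than one per occurrence.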

Offline Evaluation That Reflects Production Reality

Do not rely on broad benchmark averages as your primary signal. Build an offline suite from your failure-mined traces.

Evaluate at three levels.

1. Candidate selection quality

Given trace state and intent:

is the intended target in top-k?
rank position of correct target
failure-class-specific recall
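These three measurements can be computed from ranked candidate lists. The sketch below assumes each evaluation example is a dict with hypothetical keys ranked_ids, target_id, and failure_class; it reports recall at k, mean reciprocal rank, and recall at k per failure class.

```python
from collections import defaultdict


def rank_of_target(ranked_ids, target_id):
    """1-based rank of the intended target, or None if it was not retrieved."""
    try:
        return ranked_ids.index(target_id) + 1
    except ValueError:
        return None


def ranking_metrics(examples, k=3):
    if not examples:
        raise ValueError("no examples to evaluate")
    ranks = [rank_of_target(e["ranked_ids"], e["target_id"]) for e in examples]

    per_class = defaultdict(lambda: [0, 0])  # failure_class -> [hits, total]
    for example, rank in zip(examples, ranks):
        hit = rank is not None and rank <= k
        per_class[example["failure_class"]][0] += int(hit)
        per_class[example["failure_class"]][1] += 1

    total = len(examples)
    return {
        "recall_at_k": sum(hits for hits, _ in per_class.values()) / total,
        # A missing target contributes 0 to the reciprocal-rank sum.
        "mrr": sum(1.0 / r for r in ranks if r is not None) / total,
        "recall_at_k_by_class": {c: h / t for c, (h, t) in per_class.items()},
    }
```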

2. Step execution quality

Given target and page state:

postcondition success rate
retries per successful step
false success rate where no exception occurred but postcondition failed
mean time to recover
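A sketch of these step-level metrics, assuming each record carries hypothetical fields action_ok (no exception on the final attempt), verified (postcondition held), retries, and recover_ms. The denominator for false success rate is a design choice; here it is the set of steps where no exception occurred.

```python
def step_execution_metrics(records):
    if not records:
        raise ValueError("no records to evaluate")

    succeeded = [r for r in records if r["verified"]]
    no_exception = [r for r in records if r["action_ok"]]
    # Silent failures: the action raised nothing, but the postcondition failed.
    false_success = [r for r in no_exception if not r["verified"]]
    recover_times = [
        r["recover_ms"]
        for r in succeeded
        if r["retries"] > 0 and r.get("recover_ms") is not None
    ]

    return {
        "postcondition_success_rate": len(succeeded) / len(records),
        "retries_per_successful_step": (
            sum(r["retries"] for r in succeeded) / len(succeeded) if succeeded else 0.0
        ),
        "false_success_rate": (
            len(false_success) / len(no_exception) if no_exception else 0.0
        ),
        "mean_time_to_recover_ms": (
            sum(recover_times) / len(recover_times) if recover_times else None
        ),
    }
```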

3. Recovery quality

Condition on repairable failures:

repair success by failure class
repair latency
unnecessary full replan rate
degradation when multiple failure types co-occur

A good dashboard slices by:

app/template
action type
failure class
headed vs headless
browser version
network condition profile

That gives you a real engineering loop. You can answer questions like:

Did the new ranker improve stale-text rows but hurt iframe targeting?
Did anti-occlusion logic reduce pointer interception without increasing latency too much?
Are hydration race recoveries helping only on React pages and not server-rendered flows?
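Questions like these come down to comparing per-slice success rates between two runs. A minimal sketch, assuming each record is a dict with a verified flag plus whatever slice dimensions you log (the key names are hypothetical):

```python
from collections import defaultdict


def slice_rates(records, keys):
    """Postcondition success rate per slice, keyed by a tuple of `keys` values."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [verified, total]
    for record in records:
        slice_key = tuple(record[k] for k in keys)
        totals[slice_key][0] += int(record["verified"])
        totals[slice_key][1] += 1
    return {s: ok / n for s, (ok, n) in totals.items()}


def slice_deltas(baseline, candidate, keys):
    """Change in success rate per slice (candidate minus baseline).

    Only slices present in both runs are compared.
    """
    base = slice_rates(baseline, keys)
    cand = slice_rates(candidate, keys)
    return {s: cand[s] - base[s] for s in base.keys() & cand.keys()}
```

A positive delta on ("stale_text",) alongside a negative delta on ("iframe_boundary",) is exactly the kind of trade-off a single aggregate number hides.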

Using Failure-Mined Data to Improve Robustness

There are several ways to use the dataset, depending on your stack.

1. Train a candidate ranker

Input features:

instruction embedding / lexical features
role/name/text features
ancestry and landmark features
geometry and visibility features
mutation recency
frame identity
hit-test and occlusion risk

Label:

clicked-and-verified target is positive
confusable candidates are hard negatives

This alone often yields a substantial improvement in step reliability.
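The labeling scheme can be sketched as a function that turns one verified step into ranker training rows. Everything about the record shape is an assumption for illustration: candidates are dicts with id, role, and name, and a candidate counts as confusable when it shares role and normalized name with the verified target. The extra weight on hard negatives is likewise a hypothetical default.

```python
def build_ranker_rows(step, hard_negative_weight=3.0):
    """Emit one labeled row per candidate for a clicked-and-verified step."""
    by_id = {c["id"]: c for c in step["candidates"]}
    positive = by_id.get(step["verified_target_id"])
    if positive is None:
        # The verified target was never retrieved; nothing to rank against.
        return []

    def norm(text):
        return text.strip().lower()

    rows = []
    for cand in step["candidates"]:
        is_positive = cand["id"] == positive["id"]
        confusable = (
            not is_positive
            and cand["role"] == positive["role"]
            and norm(cand["name"]) == norm(positive["name"])
        )
        rows.append({
            "group": step["step_id"],  # rows from one step are ranked together
            "candidate_id": cand["id"],
            "label": 1 if is_positive else 0,
            "hard_negative": confusable,
            "weight": hard_negative_weight if confusable else 1.0,
        })
    return rows
```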

2. Train a failure classifier

Input:

execution error text
DOM features of target
recent network/console state
viewport geometry
postcondition observation

Output:

detached_node
occlusion
hydration_race
stale_text
virtualized_content
iframe_boundary
disabled_pending
focus_drift

A decent classifier lets you choose recovery policies much more effectively than generic retry loops.
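Before training anything, a rules-first baseline is a useful reference point and a source of weak labels. The sketch below is hypothetical throughout: the signal names, the 500 ms threshold, and the error-text patterns are assumptions, and driver error messages vary by version, so treat the patterns as placeholders to replace with strings from your own traces.

```python
import re


def classify_failure(signals: dict) -> str:
    """Map observed signals to one of the failure classes, or "unknown"."""
    error = signals.get("error") or ""

    if signals.get("target_in_other_frame"):
        return "iframe_boundary"
    if re.search(r"not attached|detached", error, re.IGNORECASE):
        return "detached_node"
    if "intercepts pointer events" in error or signals.get("hit_test_blocked"):
        return "occlusion"
    if re.search(r"not enabled|disabled", error, re.IGNORECASE):
        return "disabled_pending"
    if signals.get("candidate_count") == 0 and signals.get("container_virtualized"):
        return "virtualized_content"

    # Silent failure: no exception, but the postcondition did not hold.
    if not error and not signals.get("postcondition_ok", False):
        if signals.get("ms_since_last_mutation", float("inf")) < 500:
            return "hydration_race"
        if signals.get("focus_moved"):
            return "focus_drift"
        if signals.get("target_text_changed"):
            return "stale_text"
    return "unknown"
```

Rule order matters here: earlier rules win, so the most specific evidence (frame mismatch, explicit error text) is checked before the weaker silent-failure heuristics.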

3. Fine-tune a repair policy selector

Input:

failure state
top candidate metadata
browser logs
prior retries

Output:

wait_and_retry
requery_same_signature
switch_frame
progressive_scroll_search
dismiss_overlay
escalate_to_replan

This can be a learned policy or a rules-first policy with learned ranking.
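A rules-first version can be as small as an ordered preference table per failure class. The mapping below is an illustrative assumption rather than a tuned policy; a learned ranker could later reorder each list.

```python
# Ordered recovery preferences per failure class (hypothetical defaults).
POLICY_ORDER = {
    "detached_node": ["requery_same_signature", "wait_and_retry"],
    "occlusion": ["dismiss_overlay", "progressive_scroll_search"],
    "hydration_race": ["wait_and_retry", "requery_same_signature"],
    "stale_text": ["requery_same_signature"],
    "virtualized_content": ["progressive_scroll_search"],
    "iframe_boundary": ["switch_frame"],
    "disabled_pending": ["wait_and_retry"],
    "focus_drift": ["requery_same_signature", "wait_and_retry"],
}


def choose_policy(failure_class, prior_retries, max_local_repairs=2):
    """Pick the next untried recovery, or escalate once options run out.

    `prior_retries` is the list of policy names already attempted for this step.
    """
    if len(prior_retries) >= max_local_repairs:
        return "escalate_to_replan"
    for policy in POLICY_ORDER.get(failure_class, []):
        if policy not in prior_retries:
            return policy
    return "escalate_to_replan"
```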

4. Improve prompts if you are using an LLM in the loop

Failure-mined examples are excellent few-shot material because they encode:

local context
failure evidence
minimal repair
successful outcome

That is far more useful than generic benchmark tasks because it matches your runtime, your sites, and your observed failure modes.
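As a rough sketch, a failure-mined example can be rendered into a few-shot block with those four parts. The dict keys and the plain-text layout are assumptions; use whatever format your prompt already follows.

```python
def format_few_shot(example: dict) -> str:
    """Render one failure-mined example as a compact text block."""
    return "\n".join([
        f"Intent: {example['intent']}",
        f"Local context: {example['local_context']}",
        f"Failure evidence: {example['failure_evidence']}",
        f"Minimal repair: {example['repair']}",
        f"Outcome: {example['outcome']}",
    ])


def build_few_shot_block(examples, failure_class, limit=3):
    """Select up to `limit` examples of the given failure class, in input order."""
    chosen = [e for e in examples if e["failure_class"] == failure_class][:limit]
    return "\n\n".join(format_few_shot(e) for e in chosen)
```

Selecting examples by the classified failure mode keeps the prompt short and matched to the failure the model is being asked to repair.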

A Production-Oriented Step Executor

Here is a more complete sketch of how the runtime can be structured.

python
class StepExecutor:
    def __init__(self, selector_engine, verifier, classifier, recovery_manager, tracer):
        self.selector_engine = selector_engine
        self.verifier = verifier
        self.classifier = classifier
        self.recovery_manager = recovery_manager
        self.tracer = tracer

    async def run_step(self, page, step, max_local_repairs=2):
        await self.tracer.record_pre_state(page, step)

        candidates = await self.selector_engine.find_candidates(page, step)
        if not candidates:
            return await self._fail(page, step, "no_candidate")

        target = candidates[0]
        action_result = await self._perform(page, step, target)
        verified = await self.verifier.check(page, step)

        if action_result["ok"] and verified:
            await self.tracer.record_success(page, step, target, candidates)
            return {"status": "ok", "repairs": 0}

        repairs = 0
        last_reason = None
        while repairs < max_local_repairs:
            failure = await self.classifier.classify(page, step, target, action_result, verified)
            last_reason = failure
            await self.tracer.record_failure(page, step, target, failure)

            policy = self.recovery_manager.choose(failure, step, target, candidates)
            if not policy:
                break

            changed = await policy.apply(page, step, target)
            repairs += 1
            if not changed:
                break

            candidates = await self.selector_engine.find_candidates(page, step)
            if not candidates:
                break
            target = candidates[0]

            action_result = await self._perform(page, step, target)
            verified = await self.verifier.check(page, step)
            if action_result["ok"] and verified:
                await self.tracer.record_recovery_success(page, step, target, failure, policy.name)
                return {"status": "ok", "repairs": repairs, "recovery": policy.name}

        return await self._fail(page, step, last_reason or "unknown_failure")

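    async def _perform(self, page, step, target):
        action_type = step["action_type"]
        # Reject unsupported actions up front so a configuration error is not
        # swallowed below and misclassified as a repairable UI failure.
        if action_type not in ("click", "fill"):
            raise ValueError(f"unsupported action {action_type}")
        try:
            if action_type == "click":
                await target.locator.click(timeout=3000)
            else:
                await target.locator.fill(step["value"], timeout=3000)
            return {"ok": True}
        except Exception as e:
            # Keep the raw error text; the failure classifier uses it as a feature.
            return {"ok": False, "error": str(e)}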

    async def _fail(self, page, step, reason):
        await self.tracer.record_terminal_failure(page, step, reason)
        return {"status": "failed", "reason": reason}

This is intentionally conservative. It keeps repairs local and bounded. That tends to outperform broad replanning in noisy production UIs.

Practical Lessons from Real Systems

A few lessons are worth stating plainly.

1. Most improvements come from boring data hygiene

Richer traces, correct postconditions, and honest failure labels usually help more than swapping model architectures.

2. Repairability is the right unit of learning

Not every failure deserves a local fix. But many do. If you can identify the repairable subset well, both runtime stability and training signal improve.

3. Structural context beats raw text in repeated UIs

Tables, settings forms, dashboards, and admin apps are full of repeated labels. Learn and store ancestry, landmarks, and sibling context.

4. Browser-native evidence matters

Playwright call logs, frame trees, hit-tests, and network timing are not incidental debugging artifacts. They are core features for reliability.

5. Silent failures are more dangerous than thrown exceptions

A click that throws is easy to detect. A click that “worked” but changed nothing is what corrupts long workflows.

6. Full replans should be a last resort

If every failure restarts planning, your system becomes expensive, inconsistent, and hard to debug.

Takeaways

If you want browser agents that survive real websites, stop treating failures as noise around success metrics. Failures are the dataset.

The practical recipe is:

Collect step-level traces with DOM, accessibility, viewport, screenshots, frame tree, network, and execution logs.
Label intent, preconditions, and postconditions so examples reflect task semantics rather than only selectors.
Classify failures into repairable modes like detached nodes, occlusion, hydration races, stale text, virtualization, and iframe boundary issues.
Mine failure → repair pairs from real runs and build a curriculum from clean single-step interactions to noisy multi-step workflows.
Parse DOM incrementally at runtime, rank candidates with structural and semantic features, and verify postconditions after every action.
Invoke targeted recovery policies before escalating to full replans.
Evaluate offline on your own failure-heavy traces, not just generic agent benchmarks.

That approach does not make browser automation easy. But it does make it engineerable.

And that is the real shift: from hoping a general agent figures out the web, to building a browser agent system that learns from its own real, repairable mistakes.