Browser agents are colliding with an inconvenient truth: the public web is a terrible training ground. It is dynamic, brittle, and adversarial; it hides affordances behind infinite scrolls, analytics beacons, paywalls, CAPTCHAs, cookie banners, and personalization. What you need for robust, repeatable learning is not the live web but a web factory: a system that can procedurally generate diverse, deterministic, instrumented websites and pair them with precise task specifications and reward simulators.
This article proposes a concrete blueprint for building such a synthetic web factory. It covers:
- A procedural website generator that emits rich, varied, semantically grounded sites.
- A task DSL that compiles to ground-truth affordances and subgoals.
- Reward simulators that provide dense, unambiguous feedback for RL and evaluation.
- Automated trace labeling and deterministic replays for CI, benchmarking, and offline RL.
- Engineering practices for determinism, reproducibility, and scale.
The core thesis is straightforward: if you want browser agents that generalize, you must control the data-generating process. The factory approach gives you orthogonal axes of variation, principled supervision signals, and the ability to stress-test skills you actually care about (navigation, extraction, form completion, planning under uncertainty) without the noise of a chaotic web.
Architecture Overview: The Factory Loop
At a high level, the factory is a generator–simulator–evaluator loop:
- Sample a seed and configuration that defines site families, data schemas, and UX variations.
- Generate a website (assets, routes, components, content, behaviors) plus latent semantics.
- Compile a set of task specifications (DSL) against the generated site, producing ground-truth affordances and success criteria.
- Launch runs: agent(s) interact with the site in an instrumented browser harness.
- Simulate rewards continuously from the compiled task model; auto-label the trace.
- Export deterministic replays and artifacts for RL, CI, and reproducibility.
This closed loop lets you: (a) tune difficulty and coverage; (b) measure causal effects of UI or content perturbations; and (c) keep evaluation sealed from training data by holding out seeds or site families.
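As a sketch, the loop can be driven so that every stage artifact is addressable from the master seed; `FactoryConfig`, `derive`, and `factory_loop` below are illustrative names, not a real API:

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class FactoryConfig:
    seed: str
    site_family: str

def derive(cfg: FactoryConfig, stage: str) -> str:
    """Deterministic per-stage key: every artifact is addressable from the seed."""
    key = f"{cfg.seed}/{cfg.site_family}/{stage}"
    return hashlib.sha256(key.encode()).hexdigest()[:12]

def factory_loop(cfg: FactoryConfig) -> dict:
    site = {"build_id": derive(cfg, "generate")}           # 2. generate site + semantics
    task = {"compiled_id": derive(cfg, "compile")}         # 3. compile DSL against the site
    trace = {"trace_id": derive(cfg, "run")}               # 4-5. run agent, simulate reward
    return {"site": site, "task": task, "trace": trace}    # 6. export hermetic artifacts
```

Because every identifier is a pure function of the config, two machines running the same seed produce byte-identical artifact keys.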
Procedural Website Generation
Design goals
- Semantic richness over visual noise: Agents must learn real skills (search, sorting, pagination, CRUD, auth, scheduling, filtering), not click-chasing.
- Diversity with control: Vary layout, styling, widget types, copy, error states, latency, and content schemas along explicit axes (not just random CSS).
- Determinism with seeds: Every site and run must be reproducible from a seed and a versioned generator.
- Instrumentation at the component level: Components carry typed “affordance” metadata that the DSL compiler and reward simulator can resolve to DOM nodes precisely.
Site families and widgets
Instead of monolithic page templates, think in terms of composable widget families with explicit semantics:
- Data-backed widgets: tables (sortable, filterable), lists with infinite scroll, detail views, calendars, maps.
- Input widgets: text fields with validation, password/2FA, radio/checkbox groups, dropdowns, typeahead, date pickers, file upload, WYSIWYG editors.
- Navigation: multi-step wizards, nested menus, breadcrumbs, modal workflows, tabs, accordions.
- E-commerce primitives: catalogs, carts, coupons, checkout, payment flows, shipment tracking.
- Productivity primitives: kanban boards, task CRUD, due dates, labels, comments, attachments.
Each widget is annotated with machine-readable affordance specs (e.g., "clickable", "fillable(email)", "navigate(route=/orders/[id])"). The generator composes widgets into sitemap graphs with typed edges (e.g., "list->detail", "cart->checkout").
Deterministic content and behavior
- Seeded PRNG: One master seed fans out to content, layout, and behavior seeds. Fast-forwardable streams guarantee consistency even if generation order changes.
- Content factories: Generate product catalogs, user directories, forum threads, etc., from grammar-based or template-based synthesizers. Ensure referential integrity and invariant checks (e.g., no duplicate SKUs unless testing duplicates explicitly).
- Behavior variants: Feature flags select API delay distributions, failure modes, optimistic UI, sync vs async validation, debounce timings, keyboard accessibility, and ARIA labeling.
- Offline determinism: Serve assets locally; freeze timezones, locales, font rasterization, and device metrics. Disable nondeterministic animations.
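A minimal sketch of the seeded fan-out in Python: child seeds are hash-derived from the master seed plus a stream name, so adding or reordering streams never perturbs the others (names are illustrative):

```python
import hashlib
import random

def child_seed(master: str, stream: str) -> int:
    """Derive an independent, order-insensitive child seed for a named stream."""
    digest = hashlib.sha256(f"{master}:{stream}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def stream_rng(master: str, stream: str) -> random.Random:
    return random.Random(child_seed(master, stream))

# Content, layout, and behavior each get their own stream; generation order
# in one stream has no effect on the values drawn from another.
content_rng = stream_rng("42", "content")
layout_rng = stream_rng("42", "layout")
```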
Example: TypeScript site generation skeleton
```ts
import seedrandom from "seedrandom";
import { buildSite, Widget, Affordance } from "@synthetic-web/factory";

interface SiteConfig {
  seed: string;
  families: ("catalog" | "account" | "support")[];
  theme: "light" | "dark" | "high-contrast";
  behavior: {
    apiLatencyMs: { mean: number; jitter: number };
    authMode: "password" | "password+otp" | "sso";
    validation: "client" | "server" | "hybrid";
  };
}

export function generateSite(cfg: SiteConfig) {
  const rng = seedrandom(cfg.seed);
  const productCatalog = makeCatalog({
    rng,
    categories: sample(rng, ["Home", "Garden", "Electronics", "Toys"], 3),
    nItems: 200 + Math.floor(rng() * 200)
  });
  const widgets: Widget[] = [
    makeNavbar({ rng }),
    makeProductList({
      rng,
      items: productCatalog,
      affordances: [
        Affordance.ClickableItem({ by: "title" }),
        Affordance.Filter({ by: ["price", "category"] }),
        Affordance.Sort({ keys: ["price", "rating"] })
      ]
    }),
    makeCart({ rng }),
    makeCheckout({ rng, authMode: cfg.behavior.authMode })
  ];
  return buildSite({
    rng,
    routes: composeRoutes(widgets),
    theme: cfg.theme,
    behavior: cfg.behavior,
    metadata: { catalog: productCatalog }
  });
}
```
The point is not the specific API but the pattern: widgets embed typed affordances, generation is seeded, and the final build returns both the deployable site and a metadata graph the rest of the pipeline can consume.
Semantic instrumentation
- Affordance tags: Render DOM nodes with stable IDs and data-affordance attributes carrying semantic roles. Example: `<button data-affordance="buy" data-item-id="SKU123">Add to cart</button>`.
- ID stability: Generate content IDs from semantic hashes (e.g., `hash("SKU:Red-Mug")`) so edits don’t invalidate ground truth.
- Accessibility and structure: Use ARIA roles and landmarks; ensure that semantics agree across visual and DOM signals. This helps train agents that leverage both vision and structure.
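A sketch of semantic-hash IDs in Python (the `el-` prefix and truncation length are arbitrary choices):

```python
import hashlib

def semantic_id(*parts: str) -> str:
    """Stable element ID derived from semantic content, not render order.
    Editing unrelated content leaves this ID (and the ground truth) intact."""
    return "el-" + hashlib.sha256("|".join(parts).encode()).hexdigest()[:10]

# e.g. a catalog item keyed by its SKU string:
mug_id = semantic_id("SKU", "Red-Mug")
```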
A Task DSL That Compiles to Ground Truth
Natural language task descriptions are great for humans and LMs, but they are terrible as a primary supervision source: ambiguous, hard to compile, and brittle. A task DSL gives you a crisp, declarative contract that can be compiled against the site’s affordance graph to produce precise subgoals and success conditions.
Design goals
- Compositional: Build tasks out of reusable primitives (navigate, click, fill, select, assert, loop).
- Referential: Bind operations to semantic targets (entity keys, attributes) instead of brittle CSS selectors.
- Compilable: Resolve to concrete witness sets of DOM nodes and state predicates for a particular site instance.
- Observable: Emit an evaluable plan graph for reward simulation and trace labeling.
A minimal grammar (sketch)
```ebnf
Task       ::= "task" Ident Params? "{" Step+ "}" ;
Step       ::= Action | Branch | Loop | Assert ;
Action     ::= (GoTo | Click | Fill | Select | Submit | Wait) ";" ;
GoTo       ::= "goto" RouteExpr ;
Click      ::= "click" TargetExpr ;
Fill       ::= "fill" TargetExpr "with" ValueExpr ;
Select     ::= "select" TargetExpr "option" ValueExpr ;
Assert     ::= "assert" StateExpr ";" ;
Branch     ::= "if" StateExpr "{" Step+ "}" ("else" "{" Step+ "}")? ;
Loop       ::= "for" Ident "in" QueryExpr "{" Step+ "}" ;
TargetExpr ::= Kind "(" AttrPred ("," AttrPred)* ")" ;
Kind       ::= Ident ;  (* e.g., Item, Field, Button, Link *)
AttrPred   ::= Ident RelOp Value ;
RelOp      ::= "=" | "~=" | ">" | "<" ;
Value      ::= String | Number | Ident ;
```
Targets refer to semantic objects, not DOM selectors. During compilation, the DSL resolver queries the site’s affordance graph and metadata to produce a witness set: the exact DOM node IDs and state predicates that satisfy the target.
Example task (DSL)
```
// Goal: Buy the cheapest red mug and ship to Alice at 123 Market St
task buy_red_mug {
  goto route("/shop");
  select Filter(name="color") option "Red";
  select Sort(key="price") option "asc";
  click Item(type="product", title~="Mug");
  click Button(role="add-to-cart");
  goto route("/checkout");
  fill Field(name="full_name") with "Alice Nguyen";
  fill Field(name="address_line1") with "123 Market St";
  fill Field(name="city") with "San Francisco";
  fill Field(name="postal_code") with "94105";
  submit Form(name="shipping");
  assert State(order.status = "placed");
}
```
Compilation resolves, for a given generated site, the concrete elements: which filter widget is the color filter, which sort dropdown controls price, which item is a product whose title fuzzy-matches "Mug" and whose color attribute is Red, and so on. Importantly, it projects a subgoal graph: each step becomes a predicate on observable state changes (e.g., the cart now contains the SKU of the cheapest red mug) plus an expected DOM action set.
Compiler outputs
- Witness sets: For each TargetExpr, a set of candidate DOM node IDs (often size 1 if the site’s semantics are consistent) and fallback alternatives if multiple nodes qualify.
- Subgoal graph: A DAG mapping steps to required predicates on app state, plus ordering constraints when necessary.
- Oracle bindings: Pointers to site metadata for fast evaluation (e.g., which SKU is the cheapest red mug under current pricing rules).
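To make the resolution step concrete, here is a toy resolver over a hand-rolled affordance graph; the data shapes, node IDs, and the `("contains", …)` encoding of the `~=` operator are assumptions for illustration:

```python
# Toy affordance graph: DOM node id -> {kind, attrs}.
AFFORDANCES = {
    "btn-17": {"kind": "Button", "attrs": {"role": "add-to-cart"}},
    "itm-03": {"kind": "Item", "attrs": {"type": "product", "title": "Red Mug"}},
    "itm-04": {"kind": "Item", "attrs": {"type": "product", "title": "Teapot"}},
}

def resolve(kind: str, preds: dict) -> list[str]:
    """Resolve a TargetExpr to its witness set of DOM node ids.
    Equality predicates are plain values; '~=' is a ('contains', value) tuple."""
    out = []
    for node_id, node in AFFORDANCES.items():
        if node["kind"] != kind:
            continue
        ok = True
        for attr, pred in preds.items():
            val = node["attrs"].get(attr)
            if isinstance(pred, tuple) and pred[0] == "contains":
                ok = ok and val is not None and pred[1] in val
            else:
                ok = ok and val == pred
        if ok:
            out.append(node_id)
    return out
```

A witness set of size one means the target is unambiguous; larger sets become the fallback alternatives mentioned above.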
Safety and non-cheating
The reward simulator and evaluator can use the compiled oracle to detect invalid shortcuts. For example, if the site exposes a hidden admin link that flips order.status directly, reward must not spike for that path unless the DSL grants permission to use such affordances. The compiler can enforce a policy: only actions linked to witness sets from DSL steps count as progress.
Reward Simulators: Dense, Deterministic, and Aligned
With the task compiler producing witness sets and state predicates, you can implement dense, unambiguous reward functions. Shaped rewards accelerate RL and provide richer diagnostics in evaluation.
Reward design principles
- Incremental progress: Partial credit for reaching subgoals (e.g., cart contains the right SKU, address validated) even before final success.
- Step penalties: Small per-step negative reward to encourage efficiency and discourage thrashing.
- Invalid action penalties: Larger negative reward for actions outside permitted affordances (e.g., clicking behind a modal, filling read-only fields).
- Time-aware but deterministic: Penalize wall-clock or simulated time, not system noise; use fixed time units per action or per simulated latency interval.
- Idempotence and reversals: Do not double-reward repeated clicks; penalize undoing progress (e.g., removing item from cart after adding when the goal is to purchase).
A reference reward function (sketch)
```python
from dataclasses import dataclass

@dataclass
class RewardSpec:
    step_penalty: float = -0.01
    invalid_penalty: float = -0.2
    subgoal_reward: float = 0.2
    success_reward: float = 1.0

class RewardSimulator:
    def __init__(self, compiled_task, spec: RewardSpec):
        self.task = compiled_task  # has subgoals, witness sets, oracle
        self.spec = spec
        self.satisfied = set()

    def step(self, event, app_state):
        r = self.spec.step_penalty
        # Validate action against current allowed affordances
        if not self.task.is_valid_action(event):
            return r + self.spec.invalid_penalty, False
        # Check which subgoals are newly satisfied
        newly = [g for g in self.task.subgoals
                 if g.id not in self.satisfied and g.predicate(app_state)]
        if newly:
            for g in newly:
                self.satisfied.add(g.id)
                r += self.spec.subgoal_reward
        done = self.task.is_success(app_state)
        if done:
            r += self.spec.success_reward
        return r, done
```
This simulator uses the compiled oracle, not heuristics, to evaluate predicates. It is deterministic given a seed and versioned app code.
Consider side-effect-heavy components (e.g., asynchronous validators). The simulator should model timing explicitly: success of a validation subgoal is contingent on observing the corresponding state transition (the server accepted the input), not just a UI toast. You can either run the real app logic (preferred, and cheaper to keep in sync) or emulate it in the simulator when the app code is decoupled from the GUI.
Avoiding wireheading
If agents can inspect internal oracle state, they can wirehead. Keep the oracle internal to the reward simulator. Observation space (DOM, text, screenshots) must not expose hidden metadata like data-affordance labels that do not exist on real sites unless you intend to train on them. A practical compromise is to maintain two DOMs:
- Instrumented DOM: with data-affordance attributes, used only by the simulator and trace labeler, never delivered to the agent.
- Agent DOM: produced by stripping instrumentation attributes from the document before rendering to the agent context.
Auto-Label Traces for Supervised and Offline RL
The factory can automatically annotate every interaction with action semantics, subgoal progress, and error diagnostics. This creates high-quality datasets for SFT, behavior cloning, and offline RL without manual labeling.
What to log per event
- Observations snapshot: compact DOM serialization (text, structure, visibility), screenshot hash, and derived features (e.g., accessibility tree).
- Action: discrete schema (click, type, select, submit, keypress) with target DOM ID and parameters.
- Validity: whether the action matched a permitted affordance (from compiled witness set) at that step.
- Reward: immediate reward, cumulative reward, and the subgoals satisfied by this event.
- App deltas: selected state changes (e.g., cart, auth state, route), not the full store.
- Timing: logical time step, simulated latency applied.
Trace schema example (JSON)
```json
{
  "run_id": "seed-42/buy_red_mug/agent-v3",
  "site_build": { "factory_version": "0.9.1", "seed": "42", "hash": "sha256:..." },
  "task": { "name": "buy_red_mug", "dsl_version": "1.2.0", "compiled_hash": "sha256:..." },
  "events": [
    {
      "t": 0,
      "obs": { "dom": "<html>...", "screen_hash": "..." },
      "action": { "type": "goto", "route": "/shop" },
      "valid": true,
      "reward": { "r": -0.01, "R": -0.01, "subgoals": [] },
      "deltas": { "route": "/shop" }
    },
    {
      "t": 1,
      "obs": { "dom": "<html>..." },
      "action": { "type": "select", "target": "filter-color", "value": "Red" },
      "valid": true,
      "reward": { "r": 0.19, "R": 0.18, "subgoals": ["filtered_by_color"] },
      "deltas": { "filters": { "color": "Red" } }
    }
  ]
}
```
Because the task compiler knows subgoal predicates, it can label whether a given action contributed to a predicate becoming true. If multiple actions were required (e.g., fill three fields before submit), only the final predicate satisfaction event will be credited, yet all contributing actions can be marked as prerequisite progress.
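A minimal sketch of that back-labeling pass (the event shape and field names are hypothetical):

```python
def credit_prerequisites(events, satisfied_at, fields):
    """Once a form-submission predicate fires at index `satisfied_at`,
    mark the earlier fill actions on the relevant fields as prerequisite
    progress, plus the satisfying event itself."""
    credited = []
    for i, e in enumerate(events[:satisfied_at + 1]):
        a = e["action"]
        if a["type"] == "fill" and a["target"] in fields:
            credited.append(i)
        elif i == satisfied_at:
            credited.append(i)  # the event that satisfied the predicate
    return credited
```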
Evaluation from traces
- Accuracy: success rate at task completion.
- Efficiency: steps to success, cumulative reward, unnecessary detours.
- Robustness: performance under perturbations (layout shuffles, copy paraphrases, latency changes) with same DSL.
- Attribution: which subgoals fail most often; which widgets are bottlenecks; what error messages the agent ignored.
Deterministic Replays and CI Integration
Reproducibility is non-negotiable. Your factory should export hermetic artifacts for each run so you can:
- Reproduce a failure locally with one command.
- Commit a replay to your CI, gating merges on non-regression.
- Package offline datasets for multiple learners and metrics pipelines.
Hermetic artifact bundle
- Site image: a Docker image or tarball with exact static assets, API handlers, locale/timezone, font set, and browser version pins.
- Seed and generator manifest: the factory version, seeds, flags, and build hash.
- Task bundle: DSL source, compiled witness sets, and hash of the compiler version.
- Trace: full interaction log and seed for random policy (if using exploration).
- Replay script: a deterministic harness that consumes the above and replays actions to verify results.
Replay harness tips
- Freeze time: override Date.now/Performance.now with a deterministic clock that advances per action and per simulated latency.
- Seed randomness: patch Math.random, crypto.getRandomValues in the agent DOM.
- Disable nondeterminism: turn off CSS animations, requestAnimationFrame loops, and GPU rasterization differences if possible.
- Network determinism: stub all network calls with recorded responses in the site image. Favor service worker-based recording to ensure coverage.
Example: Playwright-based replay harness (TypeScript)
```ts
import { chromium } from "playwright";
import { loadBundle } from "@synthetic-web/replay";

(async () => {
  const bundle = await loadBundle(process.argv[2]);
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext({
    locale: bundle.locale,
    timezoneId: bundle.timezone,
    viewport: { width: 1280, height: 800 }
  });
  const page = await context.newPage();
  await bundle.prepare(page); // installs time/PRNG shims, service worker, routes
  for (const e of bundle.trace.events) {
    await bundle.applyAction(page, e.action);
    await bundle.assertState(page, e.expected_state_after);
  }
  await browser.close();
})();
```
With such a harness, CI can re-run thousands of tasks across agents nightly, producing stable trendlines and catching regressions.
Observation and Action Spaces for RL
If you intend to train RL policies directly, wrap the factory in a standard environment API (e.g., Gymnasium-like) with well-defined observations and actions.
Observation options
- DOM tokens: a graph or sequence of nodes with attributes (tag, role, visible text), bounding boxes, and a pointer to focus/hover states. Pros: compact and semantic. Cons: modeling layout can be tricky.
- Vision: screenshots (full or cropped) plus optional auxiliary channels (accessibility heatmaps). Pros: aligns with multimodal LMs. Cons: heavy.
- Hybrid: a structured DOM serialization plus low-res screenshot for spatial hints.
Action schema
- Discrete intents: click(target_id), type(target_id, text), select(target_id, option), submit(form_id), keypress(key), scroll(x, y), goto(route|url).
- Pointer-level actions: x/y clicks, drag-and-drop. Needed if training pure visual agents; otherwise, typed affordance-level actions accelerate learning.
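A typed record for the affordance-level action schema might look like the following sketch (field names are illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Action:
    """Affordance-level action; pointer-level x/y actions would be a
    separate variant reserved for purely visual agents."""
    type: str                       # "click" | "type" | "select" | "submit" | ...
    target_id: Optional[str] = None
    value: Optional[str] = None

def encode(a: Action) -> dict:
    """Serialize for the trace log; unset fields are dropped."""
    return {k: v for k, v in a.__dict__.items() if v is not None}
```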
Gym-style wrapper (Python sketch)
```python
class WebTaskEnv:
    def __init__(self, site_bundle, task_bundle, obs_mode="dom+image"):
        self.site = site_bundle
        self.task = task_bundle
        self.sim = RewardSimulator(task_bundle.compiled, RewardSpec())
        self.browser = None

    def reset(self, seed=None):
        self.browser = launch_browser(self.site, seed)
        obs = self._observe()
        self.sim.reset()
        return obs, {}

    def step(self, action):
        evt = self._apply_action(action)
        obs = self._observe()
        r, done = self.sim.step(evt, self._app_state())
        info = {"valid": self.task.is_valid_action(evt)}
        return obs, r, done, False, info
```
You can expose curriculum knobs in reset: sample tasks from easier families first (simple forms) and advance to complex ones (multi-step wizards, ambiguous copy) as policies improve.
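One simple way to implement such a curriculum knob (family names and unlock thresholds are illustrative):

```python
import random

# Task families ordered easy -> hard, each with the success rate
# the policy must reach before the family is unlocked.
CURRICULUM = [
    ("simple_forms", 0.0),
    ("search_and_filter", 0.6),
    ("multi_step_wizard", 0.8),
]

def sample_family(success_rate: float, rng: random.Random) -> str:
    """Sample uniformly among the families unlocked at the current skill level."""
    unlocked = [family for family, threshold in CURRICULUM
                if success_rate >= threshold]
    return rng.choice(unlocked)
```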
Variation, Domain Randomization, and Generalization
The advantage of a factory is orthogonal controllability. You should encode variation axes and schedule them deliberately.
Recommended variation axes
- Visual reskins: typography, spacing, color palettes, iconography, dark mode, right-to-left.
- Layout: sidebar vs topbar navigation, tab orders, dense vs spacious lists.
- Copy: synonyms and paraphrases for button texts and labels; add or remove descriptive helper text.
- Widget choices: dropdown vs radio group; modal vs inline form; infinite scroll vs pagination.
- Behavior: latency distributions, error injection (5xx, 4xx), optimistic UI toggles, retry behaviors.
- Accessibility and hints: presence of ARIA labels, tooltips, keyboard shortcuts.
Generalization tests
- Counterfactuals: Hold out one axis (e.g., switch dropdowns to radio groups) and evaluate zero-shot transfer.
- Hidden families: Keep a site family (e.g., travel booking) completely unseen during training.
- Copy-only changes: Swap all user-facing strings with paraphrases while holding structure fixed.
- Stress tests: Add distractor widgets that look similar but are irrelevant (to test affordance grounding).
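Holding out seeds is easiest to keep leak-proof when the split itself is a pure function of the seed, as in this sketch:

```python
import hashlib

def split(seed: str, holdout_frac: float = 0.1) -> str:
    """Assign a seed to train/eval by hashing, so the split is stable
    across machines and eval seeds can never drift into training."""
    h = int(hashlib.sha256(seed.encode()).hexdigest(), 16)
    return "eval" if (h % 1000) < holdout_frac * 1000 else "train"
```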
Anti-Ghosting: Prevent Agents From Exploiting Synthetic Artifacts
A common criticism of synthetic environments is overfitting to artifacts. Mitigate it by:
- Stripping data-affordance attributes from the agent DOM. Keep semantics only in the simulator DOM.
- Randomizing non-semantic IDs and orderings while keeping witness sets stable.
- Mixing in real web evaluation (e.g., curated static mirrors) downstream to detect sim-to-real gaps.
- Varying textual hints and introducing realistic noise: typos, abbreviations, inconsistent capitalization.
- Limiting observation of URLs or routes if they trivially leak answers.
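A sketch of the stripping pass that produces the agent DOM (a production pipeline would use a real HTML parser; a regex is tolerable here only because the markup is generator-controlled):

```python
import re

def strip_instrumentation(html: str) -> str:
    """Remove data-affordance/data-item-id attributes before the document
    reaches the agent context; the simulator keeps the instrumented copy."""
    return re.sub(r'\s+data-(?:affordance|item-id)="[^"]*"', "", html)

agent_dom = strip_instrumentation(
    '<button data-affordance="buy" data-item-id="SKU123">Add to cart</button>'
)
```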
Implementation Details That Matter in Practice
- DevTools integration: Use Chrome DevTools Protocol (CDP) to capture network, performance, and DOM snapshots; to intercept input events; and to inject deterministic shims.
- Fast resets: Avoid full browser restarts between episodes. Reset app state via API calls and local storage/session storage clearing.
- Memory and CPU: Screenshot-heavy observation pipelines can dominate cost. Cache and delta-compress; consider low-resolution capture with super-resolution decoding on the policy side.
- Security: Run sites in isolated containers, block external network egress, sanitize any LLM-generated content, and avoid remote code execution vulnerabilities in widget generators.
- Versioning: Hash every generator and compiler artifact. A task should carry a tuple (factory_version, seed, compiler_version, task_hash) for exact identity.
- Testability: Unit test each widget’s affordance tags; fuzz form validators; property-test the compiler (e.g., for any generated site, compiled witness sets resolve to at least one valid DOM node for each step).
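The identity tuple can be canonicalized by hashing a sorted JSON encoding; field names below follow the tuple above, the rest is a sketch:

```python
import hashlib
import json

def run_identity(factory_version: str, seed: str,
                 compiler_version: str, task_hash: str) -> str:
    """Canonical identity for a task instance: two runs compare equal
    iff every pinned component matches."""
    payload = json.dumps({
        "factory_version": factory_version,
        "seed": seed,
        "compiler_version": compiler_version,
        "task_hash": task_hash,
    }, sort_keys=True)
    return "sha256:" + hashlib.sha256(payload.encode()).hexdigest()
```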
Worked Example: Travel Booking Site Family
Let’s walk through a small, focused family: a travel booking site with flights and hotels.
- Widgets: search forms (origin, destination, date pickers), results lists (sortable by price, duration), detail pages with fare rules, cart/itinerary, checkout with passenger info and payment.
- Variations: one-way vs round-trip; calendar popovers vs inline date pickers; filters by airline, stops; “best” vs “cheapest” ranking; 2FA on checkout.
Example DSL task: "Book the cheapest round-trip from SFO to JFK departing next Monday, returning Friday, for one adult."
```
task book_flight_roundtrip {
  goto route("/flights");
  select Toggle(name="roundtrip") option true;
  fill Field(name="origin") with "SFO";
  fill Field(name="destination") with "JFK";
  fill DatePicker(name="depart_date") with next("monday");
  fill DatePicker(name="return_date") with next("friday");
  submit Form(name="flight_search");
  select Sort(key="price") option "asc";
  click Result(type="flight", rank=1);
  click Button(role="add-to-itinerary");
  goto route("/checkout");
  fill Field(name="first_name") with "Sam";
  fill Field(name="last_name") with "Lee";
  fill Field(name="email") with "sam.lee@example.com";
  submit Form(name="passenger");
  assert State(itinerary.status = "booked");
}
```
Compilation uses the site’s flight inventory and pricing engine (seeded). The reward simulator awards subgoals for: performing search with valid IATA codes and dates, sorting ascending by price, selecting rank-1 flight, adding to itinerary, and final booking success.
Edge cases you can emit in generation:
- Airport ambiguity (e.g., New York area with JFK/LGA/EWR) to test disambiguation.
- Validation errors when return date precedes departure.
- API failures at confirmation step to test retry/robustness.
Data Packaging and Interop
A healthy ecosystem needs interoperable bundles and plain schemas. Recommended:
- Site bundle: OCI image reference plus manifest of routes, locales, and behavior flags.
- Task bundle: DSL files, compiled witness sets (protobuf/JSON), and compiler version.
- Trace bundle: newline-delimited JSON events with schema version and checksums.
- Replay harness CLI: `swf replay --bundle <path> --verify`.
- Leaderboard protocol: standardized metrics JSON per run (success_rate, avg_steps, reward, robustness indices) to compare agents apples-to-apples.
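Computing the per-run metrics JSON from trace summaries can be as simple as the following sketch (record fields are illustrative; robustness indices omitted):

```python
def summarize(runs: list[dict]) -> dict:
    """Aggregate per-run records {'success': bool, 'steps': int, 'reward': float}
    into the leaderboard metrics JSON. avg_steps counts successful runs only."""
    n = len(runs)
    succ = [r for r in runs if r["success"]]
    return {
        "success_rate": len(succ) / n,
        "avg_steps": sum(r["steps"] for r in succ) / max(len(succ), 1),
        "avg_reward": sum(r["reward"] for r in runs) / n,
    }
```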
What About Training on the Real Web?
Train on both, evaluate on both—but start with synthetic. Synthetic gives you:
- Precisely labeled subgoals that unlock credit assignment.
- Control over covariates to measure generalization.
- Safety and compliance: no scraping TOS issues, no PII, no CAPTCHAs.
Complement with curated real-web tasks (mirrored or permissioned) to measure sim-to-real gaps. As agents mature, increase real-web exposure—but keep the factory as your regression bedrock.
Common Pitfalls and How to Avoid Them
- Overfitting to instrumentation: If agents can see data-affordance attributes or predictable IDs, they will exploit them. Strip or randomize in the agent DOM.
- Insufficient variation: Random colors do not equal diversity. Vary semantics and workflows.
- DSL drift: If the DSL grows ad hoc, compilers become brittle. Maintain a versioned spec and strict compatibility tests.
- Reward hacking: Ensure rewards are a function of task predicates, not incidental UI state (e.g., toasts, local storage flags).
- Flaky determinism: Fonts, timezone, and GPU differences cause rendering drift. Pin browser versions, pack fonts, freeze locales, and prefer software rendering in CI.
Roadmap: From Factory to Community Standard
To unlock cross-lab progress, we should converge on:
- An open Task DSL spec with witness set semantics.
- A common trace schema and deterministic replay API.
- A curated set of site families and seeds for public leaderboards.
- Bridges to popular frameworks (Playwright, Puppeteer, Selenium, WebDriver BiDi) and RL libraries (Gymnasium, Ray RLlib, CleanRL).
The synthetic web factory is not a toy. It is the equivalent of a wind tunnel for browser intelligence: a controlled, instrumented environment where you can learn the fundamentals before flying into the storm of the open web.
Appendix: Practical Checklists
Factory readiness checklist
- Seeded generation produces identical sites across runs and machines.
- Affordance metadata exists for all interactive widgets and maps to unique DOM nodes.
- Compiler resolves all task targets to non-empty witness sets.
- Reward simulator passes unit tests for typical and adversarial traces.
- Replay harness can verify traces headlessly in CI.
- Variation knobs are documented and covered by tests.
Metrics beyond success rate
- Coverage: fraction of widgets and affordance types exercised per run.
- Error taxonomy: distribution of validation vs navigation vs extraction errors.
- Resilience: degradation under injected latency and failure.
- Calibration: agent’s self-reported confidence vs actual success.
Selected references and related work
- MiniWoB++: Liu et al., "Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration", which introduced MiniWoB++ (extending Shi et al.'s World of Bits/MiniWoB) (https://arxiv.org/abs/1711.09667)
- WebShop: Yao et al., "WebShop: Towards Scalable Web Interaction with Grounded Language Agents" (https://arxiv.org/abs/2207.01206)
- Mind2Web: Deng et al., "Mind2Web: Towards a Generalist Agent for the Web" (https://arxiv.org/abs/2306.06070)
- WebArena: Zhou et al., "WebArena: A Realistic Web Environment for Building Autonomous Agents" (https://arxiv.org/abs/2307.13854)
- BrowserGym and BrowserEnv initiatives (various, see e.g., https://browsergym.org/)
- Playwright (https://playwright.dev/) and Chrome DevTools Protocol (https://chromedevtools.github.io/devtools-protocol/)
Opinionated closing note
If your browser agent pipeline does not include: (1) a task DSL that compiles to affordances; (2) a reward simulator grounded in that DSL; and (3) deterministic replays, you are doing evaluation by vibes. Build the factory. Train on it. Then ship to the real web with confidence that your fundamentals are sound.
