WebGPU Agentic Browser: On-Device Auto-Agent AI, UA/Client-Hints Routing, and a Privacy-First Pipeline
Browser-native AI is crossing a threshold. With WebGPU and emerging WebNN support, we can now execute nontrivial models in-page, schedule tools, and orchestrate autonomous agent loops entirely on device—no heavyweight backend required for the common case. Yet if we want reliable performance across the long tail of hardware and network variability, we need a smart runtime that can route between on-device inference and privacy-preserving fallbacks with deterministic replay, all while minimizing data exposure.
This article presents a concrete, technical blueprint for building an auto-agent AI browser experience that:
- Runs models on-device via WebGPU/WebNN when feasible.
- Selects a routing plan using User-Agent (UA) reduction–era Client Hints along with GPU probes.
- Partitions models for hybrid execution where helpful.
- Preserves user privacy by design and default.
- Uses deterministic logging so agent runs can be faithfully replayed and audited.
- Falls back to servers through privacy-preserving channels and content-addressed assets.
The result is a developer-friendly runtime that meets modern constraints: performance without exfiltration, adaptability without fingerprinting, and reliability without brittle heuristics.
Why an Agentic Browser Runtime Now?
- WebGPU is broadly available in Chromium-based browsers and other engines, providing GPU compute, portable shader pipelines (WGSL), and predictable memory semantics.
- WebNN is emerging as a standard for high-level ML graph execution, enabling vendor-optimized backends with a stable API.
- UA string reduction pushes routing logic away from static regexes and toward Client Hints and in-page capability probes, driving better privacy and feature detection.
- Edge cases abound: low-memory mobiles, constrained network links, varying GPU capabilities, strict enterprise policies. A runtime must adapt without leaking.
Architecture Overview
Key components:
- Capability discovery
  - UA Client Hints: platform, model, architecture, bitness, mobile flag.
  - In-page GPU probes: feature detection, limits, microbenchmarks.
  - Network/environment signals: effective connection type, save-data preference, battery.
- Planner and router
  - Decide: on-device, split (client/server partition), or server-only.
  - Select specific model variants (quantization, size, context length).
  - Record the decision with its inputs for deterministic replay.
- Execution backends
  - WebGPU kernels and/or ONNX Runtime Web (WebGPU backend) or WebNN where available.
  - WASM SIMD as a conservative fallback.
- Privacy-first logging and persistence
  - Store only necessary state locally.
  - No raw DOM or user inputs leave the device by default.
  - Deterministic run logs (seed, inputs, decisions) for reproducibility.
- Secure server fallback
  - Content-addressed model artifacts.
  - Oblivious, proxy-mediated transport (e.g., OHTTP) or a privacy proxy.
  - Server-side deterministic execution to match client replay.
- Guardrails and constraints
  - Permission prompts for agent actions that modify state.
  - Fine-grained tool capabilities, sandboxed execution contexts.
  - Strict memory hygiene and no accidental exfiltration.
The rest of this article explores these pieces with code and practical guidance.
1) Capability Discovery: Client Hints and In-Page Probes
UA strings are being reduced; the recommended approach is Client Hints plus on-page APIs.
Server: Request Only the Hints You Need
Avoid asking for every hint—each hint increases fingerprinting surface. Start small and justify each expansion.
Example server response headers:
```http
Accept-CH: Sec-CH-UA, Sec-CH-UA-Platform, Sec-CH-UA-Model, Sec-CH-UA-Arch, Sec-CH-UA-Bitness
Critical-CH: Sec-CH-UA-Platform
Vary: Sec-CH-UA, Sec-CH-UA-Platform, Sec-CH-UA-Model, Sec-CH-UA-Arch, Sec-CH-UA-Bitness
```
Notes:
- Vary ensures caches distinguish different device classes.
- Critical-CH indicates the resource’s processing depends on the hint (avoid misroutes).
- Keep the set tight; consider gradual rollout based on observed need.
Client: Read NavigatorUAData
Use high-entropy values selectively, then combine with probes for a nuanced picture.
```ts
async function readClientHints() {
  const uaData = (navigator as any).userAgentData;
  const base = uaData
    ? { brands: uaData.brands, mobile: uaData.mobile, platform: uaData.platform }
    : {};
  let high: Record<string, unknown> = {};
  if (uaData?.getHighEntropyValues) {
    high = await uaData.getHighEntropyValues([
      'architecture',    // e.g., arm, x86
      'bitness',         // e.g., 64
      'model',           // model identifier on some platforms
      'platformVersion',
      'fullVersionList',
    ]);
  }
  return { ...base, ...high };
}
```
Privacy guidance:
- Request only values relevant to routing, and only on the first visit where you need them.
- Cache minimal, aggregated info (e.g., a short "capability tier" string) rather than raw hints.
- Avoid joining hints with other quasi-identifiers unless needed for a single decision.
In-Page GPU Probes via WebGPU
Browser hints won’t tell you compute throughput. A short microbenchmark yields better routing decisions without revealing details to servers.
Feature detection:
```ts
function hasWebGPU(): boolean {
  return typeof navigator !== 'undefined' && 'gpu' in navigator;
}

async function getAdapter() {
  if (!hasWebGPU()) return null;
  try {
    return await navigator.gpu.requestAdapter({ powerPreference: 'high-performance' });
  } catch {
    return null;
  }
}
```
Inspect limits and features (do not log vendor/device IDs for privacy):
```ts
async function readWebGPULimits() {
  const adapter = await getAdapter();
  if (!adapter) return null;
  return {
    limits: adapter.limits,
    features: Array.from(adapter.features.values()),
  };
}
```
Run a tiny compute test: multiply two small matrices in WGSL, measure elapsed time. Use conservative sizes to avoid jank; this is a probe, not a benchmark suite.
```ts
const wgsl = `
struct Matrix { data: array<f32> }

@group(0) @binding(0) var<storage, read> A: Matrix;
@group(0) @binding(1) var<storage, read> B: Matrix;
@group(0) @binding(2) var<storage, read_write> C: Matrix;

@compute @workgroup_size(8, 8)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  let N: u32 = 128u; // 128x128 for the probe
  if (gid.x >= N || gid.y >= N) { return; }
  var sum: f32 = 0.0;
  for (var k: u32 = 0u; k < N; k = k + 1u) {
    sum = sum + A.data[gid.y * N + k] * B.data[k * N + gid.x];
  }
  C.data[gid.y * N + gid.x] = sum;
}
`;

async function webgpuProbe() {
  const adapter = await getAdapter();
  if (!adapter) return { supported: false };
  const device = await adapter.requestDevice();
  const N = 128;
  const size = N * N * 4; // f32
  const usage = GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST;

  function makeBuffer(init?: Float32Array) {
    const buffer = device.createBuffer({ size, usage, mappedAtCreation: !!init });
    if (init) {
      new Float32Array(buffer.getMappedRange()).set(init);
      buffer.unmap();
    }
    return buffer;
  }

  const A = makeBuffer(new Float32Array(N * N).fill(1));
  const B = makeBuffer(new Float32Array(N * N).fill(1));
  const C = makeBuffer();

  const module = device.createShaderModule({ code: wgsl });
  const pipeline = await device.createComputePipelineAsync({
    layout: 'auto',
    compute: { module, entryPoint: 'main' },
  });
  const bind = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: A } },
      { binding: 1, resource: { buffer: B } },
      { binding: 2, resource: { buffer: C } },
    ],
  });

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bind);
  pass.dispatchWorkgroups(Math.ceil(N / 8), Math.ceil(N / 8));
  pass.end();

  const t0 = performance.now();
  device.queue.submit([encoder.finish()]);
  await device.queue.onSubmittedWorkDone();
  const t1 = performance.now();

  // Clean up to avoid leaks.
  A.destroy();
  B.destroy();
  C.destroy();

  return {
    supported: true,
    millis: t1 - t0,
    limits: adapter.limits,
    features: Array.from(adapter.features.values()),
  };
}
```
Interpretation guideline:
- Sub-10ms for a 128×128 GEMM suggests a "fast lane" for medium models.
- 10–30ms suggests smaller or quantized models.
- Over 30ms suggests hybrid or server fallback.
Avoid reporting raw probe numbers to the server. Instead, map to a coarse tier locally and only send the tier if absolutely necessary for fallback selection.
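A minimal sketch of that local mapping, assuming the probe result shape from `webgpuProbe` above; the threshold values are illustrative and should be calibrated against your own model workloads:

```ts
// Map a raw probe result to a coarse capability tier. Only this label is
// retained (and, if necessary, transmitted); raw milliseconds stay on device.
type GpuTier = 'fast' | 'ok' | 'slow' | 'none';

function tierFromProbe(probe: { supported: boolean; millis?: number }): GpuTier {
  if (!probe.supported || probe.millis === undefined) return 'none';
  if (probe.millis < 10) return 'fast';   // fast lane for medium models
  if (probe.millis <= 30) return 'ok';    // smaller or quantized models
  return 'slow';                          // hybrid or server fallback
}
```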
Network and Power Signals
- Network Information API: `navigator.connection?.effectiveType`, `downlink`, `rtt`.
- The Save-Data request header and `navigator.connection?.saveData` indicate user preferences.
- Battery status is privacy-sensitive and often restricted; avoid it unless the user opts in.
Use these only to select between local model sizes and whether to defer downloads.
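For example, a small policy function in this spirit (the field names follow the Network Information API; the policy itself is an illustrative assumption):

```ts
// Decide whether to download a model artifact now, defer it, or skip it,
// based only on coarse network signals and the user's save-data preference.
type DownloadPolicy = 'now' | 'defer' | 'skip';

function downloadPolicy(net: { effectiveType?: string; saveData?: boolean }): DownloadPolicy {
  if (net.saveData) return 'skip';                          // respect user preference
  if (['slow-2g', '2g'].includes(net.effectiveType ?? '')) return 'defer';
  return 'now';
}
```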
2) Planner and Router: Deterministic Yet Adaptive
Your planner evaluates inputs:
- UA/Client Hints (platform tiering, mobile/desktop, arch/bitness).
- GPU probe tier.
- Network conditions and user preferences.
- Local cache inventory (which model artifacts are present).
It returns a routing plan that is:
- Deterministic for the same inputs and seed.
- Privacy-minimal (no raw hints; only derived tier labels retained).
Example planner:
```ts
type Plan = {
  mode: 'local' | 'split' | 'remote';
  model: 'tiny-q4' | 'base-q4' | 'base-q8' | 'large-q8';
  backend: 'webgpu' | 'webnn' | 'wasm';
  notes?: string[]; // audit info
};

function planRoute(inputs: {
  ua: { platform?: string; mobile?: boolean; architecture?: string; bitness?: string };
  gpu: { supported: boolean; millis?: number };
  net: { effectiveType?: string; saveData?: boolean };
  cache: { hasBase: boolean; hasTiny: boolean };
}): Plan {
  const fastGPU = inputs.gpu.supported && (inputs.gpu.millis ?? 999) < 12;
  const okGPU = inputs.gpu.supported && (inputs.gpu.millis ?? 999) < 28;
  const slowNet = ['slow-2g', '2g'].includes(inputs.net.effectiveType || '');

  if (fastGPU && inputs.cache.hasBase && !inputs.ua.mobile) {
    return { mode: 'local', model: 'base-q8', backend: 'webgpu', notes: ['fastGPU'] };
  }
  if (okGPU && inputs.cache.hasTiny) {
    return { mode: 'local', model: 'tiny-q4', backend: 'webgpu', notes: ['okGPU'] };
  }
  if (okGPU && !slowNet) {
    return { mode: 'split', model: 'base-q4', backend: 'webgpu', notes: ['split'] };
  }
  return { mode: 'remote', model: 'tiny-q4', backend: 'wasm', notes: ['fallback-remote'] };
}
```
Determinism tips:
- Normalize inputs (e.g., map platformVersion to a coarse bucket).
- Seed any random choices, and include the seed in the run log.
- Avoid wall-clock time in decisions; if needed, record it.
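The first tip, normalizing inputs into coarse buckets, can be sketched like this (the bucket boundaries are illustrative assumptions):

```ts
// Bucket platformVersion majors into coarse ranges so the same device class
// always produces the same planner input, and raw versions are never retained.
function coarsePlatformVersion(version: string | undefined): string {
  if (!version) return 'unknown';
  const major = parseInt(version.split('.')[0], 10);
  if (Number.isNaN(major)) return 'unknown';
  if (major >= 13) return 'recent';
  if (major >= 10) return 'mid';
  return 'old';
}
```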
3) Execution Backends: WebGPU, WebNN, and WASM Fallback
WebGPU with ONNX Runtime Web
ONNX Runtime Web provides battle-tested kernels with a WebGPU backend. This is the fastest path to production for many models.
```ts
import * as ort from 'onnxruntime-web/webgpu';

async function runOnnxWebGPU(modelBytes: ArrayBuffer, inputs: Record<string, ort.Tensor>) {
  const session = await ort.InferenceSession.create(modelBytes, {
    executionProviders: ['webgpu'],
    graphOptimizationLevel: 'all',
  });
  return session.run(inputs);
}
```
Notes:
- Use quantized models where possible (int8, int4) to fit memory and bandwidth budgets.
- Packages often support prefetching and caching via IndexedDB.
WebNN
If available, WebNN provides a high-level graph API that can map to hardware-specific ML accelerators. Detect and prefer when appropriate:
```ts
async function tryWebNN(graphBuilder: (ctx: MLContext) => Promise<MLGraph>) {
  const ml = (navigator as any).ml;
  if (!ml) return null;
  const ctx = await ml.createContext();
  const graph = await graphBuilder(ctx);
  return { ctx, graph };
}
```
Status varies by platform; feature-detect and be ready to fall back.
WASM SIMD Fallback
For constrained or policy-limited environments, WASM with SIMD can provide deterministic and portable execution, albeit slower.
- Use cross-origin isolation only if required (e.g., for SharedArrayBuffer). If you do, set COOP/COEP headers.
- Keep memory footprints modest; stream model weights and use tiling.
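If you do enable cross-origin isolation for SharedArrayBuffer-backed WASM threads, a typical header pairing looks like this; note that every cross-origin subresource must then opt in via CORS or `Cross-Origin-Resource-Policy`:

```http
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
```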
4) Model Partitioning: Split Execution Without Leaking
Split computing (running early layers on device and later layers on the server) helps when devices can do feature extraction but not heavy heads or long contexts.
Risks:
- Intermediate activations can leak information about inputs. Avoid transmitting raw activations by default.
Mitigations (practical today):
- Distill models so that the client-side encoder produces a privacy-preserving representation (task-specific, compressed, and lossy).
- Add calibrated noise or quantization to activations, trading utility for privacy. Quantization to 8-bit (or 4-bit with dithering) is a simple baseline.
- Use transport that hides network metadata (e.g., OHTTP). Avoid joining with user identifiers.
Simple pattern:
- Client runs tokenizer + shallow encoder locally.
- Client transmits a low-dimension, quantized vector with a per-session ephemeral key.
- Server completes the decode and returns only the next token or action plan.
Example encoder-decoder partition using ONNX Runtime Web (client) and server ONNX (or TensorRT) for head:
```ts
// Client-side pseudocode
const { encoderOutputs } = await localEncoder.run({ tokens });
const q = quantize8(encoderOutputs.hidden); // local quantization
const res = await relayPOST('/decode', { q, session: ephemeralSessionId });
// Receive logits or planned actions
```
Quantization function (toy example):
```ts
function quantize8(arr: Float32Array): Uint8Array {
  // Clip and scale to 0..255. Calibrate the range per layer offline.
  const min = -3.0, max = 3.0;
  const out = new Uint8Array(arr.length);
  for (let i = 0; i < arr.length; i++) {
    const v = Math.max(min, Math.min(max, arr[i]));
    out[i] = Math.round(((v - min) / (max - min)) * 255);
  }
  return out;
}
```
Note: This is simplistic; better approaches include per-channel scales, learned quantizers, and privacy-aware training.
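For completeness, a matching dequantizer the server side might use, under the same fixed-range assumption as the toy `quantize8`; the round-trip error is bounded by half a quantization step, (max − min)/255/2 ≈ 0.0118 for the [-3, 3] range:

```ts
// Invert the toy 8-bit quantization: map 0..255 back into [-3, 3].
function dequantize8(arr: Uint8Array): Float32Array {
  const min = -3.0, max = 3.0;
  const out = new Float32Array(arr.length);
  for (let i = 0; i < arr.length; i++) {
    out[i] = (arr[i] / 255) * (max - min) + min;
  }
  return out;
}
```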
5) Safe Memory and Resource Hygiene
GPU memory bugs or leaks can degrade the experience and risk exposure.
Guidelines:
- Zero-initialize buffers where possible. Some implementations guarantee robust resource initialization, but do not rely on it for sensitive data lifecycle.
- Destroy GPU resources as soon as possible: `buffer.destroy()`, `texture.destroy()`.
- Avoid unbounded caches. Use content-addressed keys and LRU expiration in IndexedDB.
- Keep per-tab memory budgets and enforce them; backpressure the agent if nearing limits.
- Prefer streaming model loading (e.g., range requests) over monolithic downloads when supported by your runtime.
- Avoid sending raw input data off-device; if the user opts in to cloud features, apply minimization and transformations first.
Example IndexedDB cache with content addressing:
```ts
// Assumes the 'idb' wrapper library, which exposes a promise-based
// transaction API (tx.done); raw IndexedDB uses events instead.
async function cachePut(db, keyHash, bytes) {
  const tx = db.transaction('models', 'readwrite');
  await tx.objectStore('models').put(bytes, keyHash);
  await tx.done;
}

async function cacheGet(db, keyHash) {
  const tx = db.transaction('models', 'readonly');
  const bytes = await tx.objectStore('models').get(keyHash);
  await tx.done;
  return bytes;
}
```
Compute a content hash client-side (e.g., SHA-256) and verify server-delivered artifacts against it. Use Subresource Integrity (SRI) for static model artifacts when feasible.
6) Deterministic Replay: Make Agent Runs Auditable
Agentic systems must be explainable. Deterministic replay requires capturing:
- Initial inputs: prompt, DOM snapshot hash, model IDs, versions.
- Planner decisions and their inputs (coarsened tiers).
- Random seeds.
- Tool invocations and results (normalized, without raw PII unless user opts in).
- Timing metadata (optional, bounded precision).
Basic run log schema (JSON Lines stored locally by default):
```json
{"v":1,"ts":1690000000000,"seed":12345,"planner":{"gpuTier":"fast","net":"4g"}}
{"action":"readDom","selector":"#price","resultHash":"sha256-..."}
{"action":"model","id":"base-q8","backend":"webgpu","inputHash":"sha256-...","outputHash":"sha256-..."}
{"action":"click","selector":"#buy","userConfirmed":true}
```
Keep full, raw content off-device by default. If the user opts in to bug reporting, upload only the minimal redacted subset plus content hashes.
Deterministic RNG:
```ts
class RNG {
  private state: number;
  constructor(seed: number) {
    this.state = seed >>> 0;
  }
  next() {
    // xorshift32
    let x = this.state;
    x ^= x << 13;
    x ^= x >>> 17;
    x ^= x << 5;
    this.state = x >>> 0;
    return this.state / 0xffffffff;
  }
}
```
Use the RNG instance for any randomized sampling, temperature, or tool shuffling. Record the seed once per run.
For GPU deterministic math, note that floating-point minutiae vary across hardware and drivers. If exact bitwise replay is required, run verification on a CPU/WASM fallback or use integer-only kernels where possible. Another option is to store the model outputs (hashes and optionally compressed deltas) and verify against them during replay.
7) Privacy-First Fallback: Oblivious Transport and Minimal Metadata
When the planner chooses remote or split execution, preserve privacy:
- Hide IP and user agents from the ML service using a relay or OHTTP (Oblivious HTTP; IETF RFC 9458). The relay sees IP but not payload; the server sees payload but not IP.
- Strip all high-entropy headers; send only a coarse capability tier if necessary.
- Use short-lived, anonymous session keys.
- Encrypt payloads end-to-end between client and ML service.
Pseudo-relay usage:
```ts
async function relayPOST(path: string, body: any) {
  const payload = new TextEncoder().encode(JSON.stringify(body));
  // Wrap the payload using OHTTP or the relay's E2EE scheme.
  const wrapped = await ohttpWrap(payload, relayConfig);
  const res = await fetch(RELAY_URL + path, { method: 'POST', body: wrapped });
  const unwrapped = await ohttpUnwrap(new Uint8Array(await res.arrayBuffer()), relayConfig);
  return JSON.parse(new TextDecoder().decode(unwrapped));
}
```
Model assets:
- Serve via content-addressed URLs (e.g., `/models/sha256-<digest>.onnx`) and cache aggressively.
- Use `Cache-Control: immutable` and `ETag` with strong validators.
- When downloading sensitive adapters or LoRA deltas, encrypt at rest in IndexedDB using a key derived per browser profile.
Compliance notes:
- Document the privacy model: what leaves the device, under what user controls, with what protections.
- Do not tie agent sessions to logged-in accounts by default; keep it anonymous unless users opt in.
8) Chrome UA/Client Hints Routing in Practice
Tie it together with a minimal backend and frontend.
Backend (Node/Express) to Advertise Hints and Route Assets
```ts
import express from 'express';

const app = express();

app.use((req, res, next) => {
  res.setHeader('Accept-CH', [
    'Sec-CH-UA',
    'Sec-CH-UA-Platform',
    'Sec-CH-UA-Model',
    'Sec-CH-UA-Arch',
    'Sec-CH-UA-Bitness',
  ].join(', '));
  res.setHeader('Critical-CH', 'Sec-CH-UA-Platform');
  res.setHeader('Vary', 'Sec-CH-UA, Sec-CH-UA-Platform, Sec-CH-UA-Model, Sec-CH-UA-Arch, Sec-CH-UA-Bitness');
  next();
});

app.get('/route', (req, res) => {
  // High-entropy hints may be included if the browser has granted them for this origin.
  const platform = req.get('Sec-CH-UA-Platform');
  const arch = req.get('Sec-CH-UA-Arch');
  const bitness = req.get('Sec-CH-UA-Bitness');
  const model = req.get('Sec-CH-UA-Model');

  // Map to coarse tiers only; avoid logging raw values in production.
  // Structured-header values arrive quoted (e.g., "64"); strip quotes first.
  const unquote = (v?: string) => v?.replace(/^"|"$/g, '');
  let tier = 'generic';
  if (platform?.includes('Windows') && arch?.includes('x86') && unquote(bitness) === '64') tier = 'win-x64';
  if (platform?.includes('Android')) tier = 'android';
  if (platform?.includes('Chrome OS')) tier = 'cros';

  // Provide a recommended default model family for the first run.
  const suggestion = tier === 'android' ? 'tiny-q4' : 'base-q4';
  res.json({ tier, suggestion });
});

app.listen(3000);
```
Store only tier in your logs; drop model/arch to reduce fingerprinting risk.
Frontend Capability Pull, Probe, and Plan
```ts
async function decidePlan() {
  const ua = await readClientHints();
  const gpu = await webgpuProbe();
  const net = {
    effectiveType: (navigator as any).connection?.effectiveType,
    saveData: !!(navigator as any).connection?.saveData,
  };
  const cache = {
    hasBase: !!(await cacheGet(db, 'sha256-BASE')),
    hasTiny: !!(await cacheGet(db, 'sha256-TINY')),
  };

  // Optionally fetch a server suggestion for first-time users (privacy-minimal).
  let suggestion: any = null;
  try {
    suggestion = await (await fetch('/route')).json();
  } catch {}

  const plan = planRoute({ ua, gpu, net, cache });
  return { plan, suggestion };
}
```
9) Agent Loop: Tools, DOM, and Safety
An agent in the browser has powerful capabilities; they must be constrained.
- Scope tools to the current origin by default.
- Require user confirmation for actions that commit (e.g., purchase, delete, send).
- Use a sandboxed iframe for tool execution with a strict Permissions Policy.
- Record tool invocations to the run log with hashes, not raw content.
Skeleton agent loop:
```ts
type Tool = (input: any, ctx: any) => Promise<any>;

const tools: Record<string, Tool> = {
  readDom: async ({ selector }) => {
    const el = document.querySelector(selector);
    const text = el?.textContent || '';
    return { text, hash: await sha256(text) };
  },
  click: async ({ selector }) => {
    const el = document.querySelector(selector) as HTMLElement;
    el?.click();
    return { ok: true };
  },
  fetchJSON: async ({ url }) => {
    if (!url.startsWith(location.origin)) throw new Error('cross-origin blocked');
    const res = await fetch(url);
    const json = await res.json();
    return { jsonHash: await sha256(JSON.stringify(json)) };
  },
};

async function agentRun(plan: Plan, seed: number) {
  const rng = new RNG(seed);
  const log: any[] = [];
  function record(entry: any) {
    log.push({ ts: Date.now(), ...entry });
  }

  // Example: read the price, decide, click Buy with confirmation.
  const domRes = await tools.readDom({ selector: '#price' }, {});
  record({ action: 'readDom', selector: '#price', resultHash: domRes.hash });

  const prompt = `User wants cheapest option under $100. Price hash: ${domRes.hash}`;
  const output = await runModel(plan, prompt, { rng });
  record({
    action: 'model',
    id: plan.model,
    backend: plan.backend,
    inputHash: await sha256(prompt),
    outputHash: await sha256(output.text),
  });

  if (/buy/i.test(output.text)) {
    const ok = await confirmAction('The agent wants to click Buy. Proceed?');
    record({ action: 'confirm', ok });
    if (ok) {
      await tools.click({ selector: '#buy' }, {});
      record({ action: 'click', selector: '#buy' });
    }
  }
  return log;
}
```
Use a UI that makes the agent’s plan and justification visible. Allow users to step or pause.
10) Deterministic Server Execution for Fallback
For full reproducibility when falling back to a server:
- Use the same model versions and quantization parameters as the client.
- Accept the seed and plan decisions from the client; do not introduce server-side randomness without logging it.
- Return output along with an output hash and a short proof of determinism (e.g., content hash of the used model and config).
Server response example:
```json
{
  "text": "Buy now.",
  "model": "base-q4",
  "modelHash": "sha256-...",
  "seed": 12345,
  "outputHash": "sha256-..."
}
```
Verify on the client and include in the run log.
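A verification sketch, assuming a `sha256()` helper that returns `sha256-<hex>` strings (passed in here to keep the function self-contained):

```ts
// Recompute the output hash locally and check the echoed seed; never trust
// the server's claimed hash without verifying it.
async function verifyServerResult(
  res: { text: string; outputHash: string; seed: number },
  expectedSeed: number,
  sha256: (s: string) => Promise<string>,
): Promise<boolean> {
  if (res.seed !== expectedSeed) return false; // server must echo our seed
  const actual = await sha256(res.text);
  return actual === res.outputHash;
}
```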
11) Performance and Stability Tactics
- Warm up the backend right after the planner decides "local" (e.g., compile pipelines, pre-allocate small buffers).
- Use progressive model loading: first run a very small model to produce a coarse plan, then switch to a larger model if needed.
- Pin GPU workgroup sizes after probing to avoid re-compiles across runs.
- Cache pipeline states keyed by WGSL code hash + adapter limits.
- Avoid blocking the main thread; use Web Workers for model and planning work to keep UI responsive.
12) Security Hardening
- Content Security Policy (CSP): disallow `unsafe-eval`; restrict `connect-src` to your relay and model endpoints.
- Permissions Policy: disable features the agent frame does not need (camera, microphone, geolocation).
- Cross-origin isolation only if necessary; if enabled, set COOP/COEP explicitly and test third-party integrations.
- Input validation for any agent-provided URLs or selectors; whitelist within origin.
- Do not expose filesystem or OS integration by default; gate behind explicit user opt-in.
13) Testing and Telemetry Without Exfiltration
- Local telemetry: measure latency and memory, aggregate into coarse histograms, and only transmit with opt-in and k-anonymity thresholds.
- Synthetic fixtures: use mocked DOMs and canned network responses to stress test decisions deterministically.
- Fuzz tool inputs within sandboxed frames using seeded RNG and record outcomes.
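A sketch of the first point: bucket raw latency samples locally and release a bucket only when it clears a k-anonymity threshold. The bucket edges and the k value are illustrative assumptions.

```ts
// Aggregate latency samples into coarse histogram buckets, suppressing any
// bucket whose count falls below k so small populations are never reported.
function latencyHistogram(samplesMs: number[], k = 5): Record<string, number> {
  const edges = [50, 200, 1000]; // ms bucket boundaries
  const labels = ['<50ms', '50-200ms', '200ms-1s', '>1s'];
  const counts = new Array(labels.length).fill(0);
  for (const s of samplesMs) {
    let i = edges.findIndex((e) => s < e);
    if (i === -1) i = labels.length - 1;
    counts[i]++;
  }
  const out: Record<string, number> = {};
  labels.forEach((label, i) => {
    if (counts[i] >= k) out[label] = counts[i]; // suppress small counts
  });
  return out;
}
```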
14) Example End-to-End Flow
- First load
  - Server advertises minimal Client Hints; client fetches hints and probes the GPU.
  - Planner decides "local base-q4" with WebGPU.
  - Model artifacts are pulled from a content-addressed CDN, verified, and cached.
- Agent session
  - Agent reads a DOM hash, builds the prompt, runs the local model.
  - For a complex step, the planner switches to split mode transiently (encoder local, decoder remote via relay). Only quantized activations leave the device, via OHTTP.
  - User approves a click action; the agent records the confirmation and performs the action.
- Replay
  - Developer replays the run locally with the same seed, model hashes, and logs. Outputs match within tolerance. Any remote steps are re-requested with the same seed and payload through a test relay.
15) What to Avoid
- Requesting excessive Client Hints or logging raw hints long-term.
- Transmitting raw DOM, prompts, or user content to servers without explicit consent.
- Relying solely on UA/Client Hints for routing; always complement with in-page probes.
- Non-deterministic randomness or hidden heuristics that can’t be reproduced.
- Large, unbounded caches that accumulate sensitive artifacts.
References and Pointers
- WebGPU: W3C specification and MDN guides; measure with small, bounded compute.
- WebNN: W3C Community Group; feature-detect and prefer when available.
- ONNX Runtime Web: webgpu and wasm backends, model zoo and quantization tools.
- UA Reduction and Client Hints: Chromium blog and developer docs; NavigatorUAData API.
- Oblivious HTTP (RFC 9458): IETF standard for privacy-preserving request relaying.
- Split Computing and Privacy: research on representation leakage and quantization-based mitigation.
Conclusion
An agentic browser runtime that runs on-device first, plans with privacy-preserving signals, and falls back through oblivious transport is not only possible—it’s practical today. By combining restrained Client Hints, in-page GPU probes, deterministic planning, and disciplined memory hygiene, you get responsiveness without regressions, adaptability without fingerprinting, and privacy without sacrificing functionality.
Treat determinism and privacy as first-class features. Cache by content, record just enough to replay, and minimize every piece of data that leaves the device. With these principles, your WebGPU/WebNN agent becomes a trustworthy companion that respects users and performs across the hardware spectrum.