Agentic Browser On-Device Inference: WebGPU/WASM, Split Execution, and User‑Agent–Aware Fallbacks
An agentic browser should do more than render pages and run scripts. It should reason, plan, and act. But if it does those things by round-tripping everything to the cloud, we miss the point of a modern client runtime: low-latency compute with privacy by default.
This article is a pragmatic, opinionated blueprint for shipping small language models and perception components inside the browser: executing with WebGPU or WASM, orchestrating split cloud/edge planning, and using the browser user agent and Client Hints to pick models and gate fallbacks. We will also cover verifying those assumptions with what-is-my-browser-agent style checks for performance, privacy, and continuous integration.
TL;DR:
- Use WebGPU when available for LLM/KV-cache-heavy inference; WASM+SIMD+threads otherwise.
- Split execution so short-horizon planning and most tool calls happen locally; escalate to cloud for long-horizon reasoning or heavy models.
- Prefer feature detection to UA sniffing; use UA-CH and navigator.userAgentData primarily to preload the right assets and pick safe defaults.
- Gate fallbacks with hard caps on latency, energy, and memory; never silently ship 8B models to low-end devices.
- Verify your assumptions in CI with headless browsers, user-agent checks, and performance budgets.
1) Why do this on-device now?
Three trends make on-device agentic browsing viable:
- WebGPU is broadly shipping (Chrome/Edge stable; Safari on recent releases; Firefox desktop increasingly enabled), bringing compute shaders, fast storage buffers, and decent adapter limits to the web.
- Quantization for transformers (AWQ, GPTQ, SmoothQuant, group-wise 4-bit/3-bit with FP16 activations) reduces LLM memory footprints by roughly 3–8x with manageable accuracy loss; KV-cache quantization and paged attention further cut runtime latency and memory overhead.
- Mature runtimes exist: ONNX Runtime Web with WebGPU/WASM EPs, MLC WebLLM, TensorFlow.js with WebGPU backend, and WebNN in early but promising shape on some platforms.
The result: A 1–3B parameter instruction-tuned model in 4-bit can run entirely in the browser at interactive rates on modern laptops and some high-end mobiles. For example, published data points include:
- ONNX Runtime Web reported 10–19x speedups on transformer inference using WebGPU vs WASM for selected models, with further gains when fusing attention kernels and using cooperative matrix ops when available.
- The WebLLM team shows Llama-derived 3–8B models generating in the 8–20 tok/s range in-browser on Apple M-series and recent discrete GPUs with a paged KV cache; smaller models (e.g., ~1B) reach tens of tok/s on mid-range devices.
These are directional, not guaranteed. Your mileage will vary by adapter limits, driver quality, and model choice. But the orders of magnitude are real enough to justify pushing more of the agent stack into the browser.
2) Platform reality check: WebGPU, WASM, and WebNN
- WebGPU: Available by default in Chromium 113+, shipping in recent Safari releases, and enabled or progressing on Firefox desktop. It exposes compute via WGSL, modern resource binding, and optional features such as timestamp queries for profiling. Adapter limits (e.g., maxStorageBufferBindingSize) vary widely; discrete GPUs are generous, integrated/mobile GPUs more constrained.
- WASM + SIMD + threads: Ubiquitous fallback with surprisingly good performance for small to mid models when combined with operator fusion and memory-friendly layouts. Requires cross-origin isolation (COOP/COEP) to enable SharedArrayBuffer for threading.
- WebNN: A high-level neural network API that can map to platform accelerators (e.g., DirectML on Windows). It is not uniformly shipped, but where available it can be a high-quality fallback or primary path.
Key takeaways:
- Always prefer feature detection, not user agent sniffing, to pick the execution backend.
- Cross-origin isolation is non-negotiable if you want thread-enabled WASM and certain perf-sensitive patterns; set COOP and COEP headers.
- Build a capability grid: WebGPU features, WASM SIMD/threads, memory headroom (navigator.deviceMemory for hinting), and adapter limits.
3) Model formats, quantization, and sizes
You do not need to ship frontier models to get agentic utility. A practical stack:
- Instruction-tuned 1–3B LLM for routing, brief planning, and short-form generation (Q4 or Q3 weights with FP16 activations). Think Llama 3.2 1B/3B, the Phi-3 mini family, or Gemma-2B-class models where licensing permits.
- Specialized extractors or rerankers (e.g., MiniLM, E5-small) for retrieval, selection, and tool arbitration.
- Tiny policy modules for DOM navigation, page segmentation, and data extraction.
Format and tooling options:
- ONNX format + ONNX Runtime Web (WebGPU and WASM EPs). Good operator coverage, stable tooling, familiar graph optimizations.
- MLC WebLLM format (MLC-compiled graphs with quantized weights, paged KV-cache, and optimized WebGPU kernels). Excellent for pure browser LLMs.
- TensorFlow.js for certain vision and small models; WebGPU backend is improving.
Quantization choices:
- Post-training: GPTQ, AWQ, and group-wise quantization are commonly used. 4-bit weight-only quantization with FP16/FP32 activations is a sweet spot. For chatty agents, quantize KV cache too (e.g., 8-bit or 4-bit KV) and use paged cache to bound memory.
- Operator fusion and layout: Fused attention, rotary embeddings baked-in, and pre-rotated position encodings reduce kernel count and overhead.
Model size budgeting:
- Use adapter limits and navigator.deviceMemory to pick between 1B, 2B, and 3B variants. A 1B model in 4-bit might be ~0.6–1.2 GB including embeddings and extra buffers; a 3B can be 2–3 GB. That is close to or beyond what many mobiles and low-end laptops can spare.
- Gate download size and memory with hard caps. Never assume you can push a 2+ GB asset to a random mobile user.
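A minimal sketch of such a gate, assuming Chromium-only navigator.connection, the Storage API, and illustrative tier sizes (the 200 MB metered cap mirrors the budget in section 9):

```js
// Sketch: gate model downloads by network conditions and storage headroom before fetching.
const TIER_BYTES = {
  'tiny': 80 * 1024 * 1024,
  '1b-q4': 900 * 1024 * 1024,
  '3b-q4': 2.5 * 1024 * 1024 * 1024,
};

async function canDownloadTier(tier) {
  const bytes = TIER_BYTES[tier];
  const conn = navigator.connection; // undefined outside Chromium
  const metered = !!conn && (conn.saveData || /2g|3g/.test(conn.effectiveType || ''));
  if (metered && bytes > 200 * 1024 * 1024) return false; // respect the metered-network budget

  if (navigator.storage?.estimate) {
    const { quota = 0, usage = 0 } = await navigator.storage.estimate();
    if (quota - usage < bytes * 1.2) return false; // keep ~20% slack for KV cache and updates
  }
  return true;
}
```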
4) Split execution: planning at the edge, escalation to cloud
Fully local is ideal but not always feasible. A sensible split looks like this:
- Local planner and fast reflexes: Use a small on-device LLM to interpret user intent, propose short action sequences, and perform DOM/JS tool calls. Keep token budgets small to avoid KV explosion.
- Local value function / critic: A distilled scorer to evaluate plan quality, uncertainty, or risk, optionally using cheap heuristics and shallow models.
- Cloud escalations: Long-horizon reasoning, long-context summarization, heavy structured extraction, or when device capability is insufficient. Use explicit user consent, and display an indicator when a remote call is made.
This is akin to model cascades and mixture-of-experts across device/cloud. The on-device planner is latency-critical; the cloud path is accuracy-capable. Design for determinism where possible (temperature ~0) to stabilize agents.
An example flow:
- The user asks: Summarize this 6,000-word article and extract a table of key dates.
- Local policy: Detects context is long; local LLM proposes a chunking plan, runs a lightweight extractor locally on each chunk to produce candidate facts.
- Confidence threshold not met; escalate: Send only minimized structured notes (not full text) to a cloud LLM for high-accuracy consolidation. If user opts out of cloud, fallback to best-effort local summarization.
Measure end-to-end latency against explicit budgets: e.g., P95 under 600 ms for small local tasks, and under 2–3 s when one escalation is involved.
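A minimal sketch of wiring those budgets into code with performance.mark/measure; the task names and thresholds are illustrative, and localPlanner in the usage line is the app's local planning call (it appears again in the section 8 sketch):

```js
// Sketch: wrap agent tasks in performance marks and flag budget violations.
const BUDGET_MS = { local_plan: 600, escalated_plan: 3000 }; // mirrors the budgets above

async function withBudget(name, fn) {
  performance.mark(`${name}:start`);
  try {
    return await fn();
  } finally {
    performance.mark(`${name}:end`);
    const m = performance.measure(name, `${name}:start`, `${name}:end`);
    if (BUDGET_MS[name] && m.duration > BUDGET_MS[name]) {
      // Feed this into fallback gating (section 6): drop a model tier or offer escalation.
      console.warn(`${name} exceeded budget: ${m.duration.toFixed(0)} ms`);
    }
  }
}

// Usage: const plan = await withBudget('local_plan', () => localPlanner(userInput, tier));
```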
5) Capability detection, Client Hints, and user agent awareness
Prefer runtime feature detection to UA sniffing. However, UA and UA-CH are useful for proactive asset selection and CI.
Runtime capability probes:
```js
async function detectCapabilities() {
  const caps = {
    webgpu: false,
    webgpuAdapter: null,
    webgpuLimits: null,
    // Validating a minimal module only proves WASM support; use wasm-feature-detect
    // (or a module containing SIMD opcodes) for a real SIMD probe.
    wasmSimd: typeof WebAssembly === 'object' &&
      WebAssembly.validate(new Uint8Array([0, 97, 115, 109, 1, 0, 0, 0])),
    wasmThreads: crossOriginIsolated && typeof SharedArrayBuffer !== 'undefined',
    deviceMemory: navigator.deviceMemory || 4,
    userAgentData: navigator.userAgentData || null,
  };
  if (navigator.gpu) {
    try {
      const adapter = await navigator.gpu.requestAdapter();
      if (adapter) {
        const device = await adapter.requestDevice();
        caps.webgpu = true;
        caps.webgpuAdapter = adapter;
        caps.webgpuLimits = device.limits;
        device.destroy();
      }
    } catch (e) {
      console.warn('WebGPU present but device request failed', e);
    }
  }
  return caps;
}
```
Server-side hints with UA-CH:
- Request high-entropy Client Hints for platform/arch to pick model size and binary variants.
- Headers to send from server (to get hints on subsequent requests):
```
Accept-CH: Sec-CH-UA, Sec-CH-UA-Platform, Sec-CH-UA-Platform-Version, Sec-CH-UA-Arch, Sec-CH-UA-Bitness, Sec-CH-UA-Model
Permissions-Policy: ch-ua=(self), ch-ua-platform=(self), ch-ua-platform-version=(self), ch-ua-arch=(self), ch-ua-bitness=(self), ch-ua-model=(self)
```
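As a sketch, here is how an Express app (matching the echo endpoint later in this article) might advertise those hints; the middleware shape is an assumption, and the hints only arrive on subsequent requests (Critical-CH exists if you need them on the first one):

```js
// Sketch: advertise UA-CH on every response so hints arrive on later requests.
const express = require('express');
const app = express();

app.use((req, res, next) => {
  res.set('Accept-CH',
    'Sec-CH-UA, Sec-CH-UA-Platform, Sec-CH-UA-Platform-Version, ' +
    'Sec-CH-UA-Arch, Sec-CH-UA-Bitness, Sec-CH-UA-Model');
  res.set('Permissions-Policy',
    'ch-ua=(self), ch-ua-platform=(self), ch-ua-platform-version=(self), ' +
    'ch-ua-arch=(self), ch-ua-bitness=(self), ch-ua-model=(self)');
  next();
});
```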
In JS, you can also request high-entropy values when available:
```js
async function getUAHighEntropy() {
  if (!navigator.userAgentData?.getHighEntropyValues) return null;
  return navigator.userAgentData.getHighEntropyValues([
    'platform', 'platformVersion', 'architecture', 'bitness', 'model', 'uaFullVersion'
  ]);
}
```
Using UA-CH wisely:
- Use it to prefetch the right asset family (e.g., a smaller model for low-memory or 32-bit Android devices); see the sketch after this list.
- Do not rely on it for feature gating; still feature-detect WebGPU/WASM at runtime.
- Respect privacy budgets; do not collect high-entropy hints cross-site without justification.
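A minimal sketch of that prefetch path, assuming the getUAHighEntropy helper above and a hypothetical /models/&lt;tier&gt;/manifest.json layout; the tier mapping is illustrative, and runtime feature detection still makes the final call:

```js
// Sketch: warm the HTTP cache for the most likely model tier based on UA-CH and device memory.
async function prefetchLikelyModel() {
  const hints = await getUAHighEntropy(); // null if UA-CH is unavailable
  let tier = '1b-q4';
  if (hints?.platform === 'Android' || hints?.bitness === '32') tier = 'tiny';
  else if ((navigator.deviceMemory || 4) >= 8) tier = '3b-q4';

  const link = document.createElement('link');
  link.rel = 'prefetch';
  link.as = 'fetch';
  link.crossOrigin = 'anonymous';
  link.href = `/models/${tier}/manifest.json`; // hypothetical chunk manifest (see section 7)
  document.head.appendChild(link);
  return tier;
}
```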
User agent awareness for CI:
- Record the reported UA string and UA-CH on every test run and attach them to perf baselines.
- Use a what-is-my-browser-agent endpoint to echo the UA and Client Hints back; assert that your detection logic yields the expected backend and model tier.
6) Fallback gating and safety switches
A robust agentic browser must not degrade into poor UX on weak devices. Implement hard caps and clear fallbacks:
- Latency budget caps (e.g., cancel local inference if tokens/sec falls below a threshold after warmup); see the watchdog sketch after this list.
- Memory cap based on deviceMemory, adapter limits, and measured peak resident set; if estimate exceeds cap, choose a smaller model or cloud path.
- Energy cap for mobile (approximate using heuristics: long tasks, thermal state if available, and visibility).
- Privacy cap: never auto-escalate to cloud without prior consent; persist that choice per origin and show clear UI.
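A minimal watchdog sketch for the latency cap; the onToken hook and AbortController wiring are assumptions about how your generation loop is structured, and the thresholds mirror the budgets in section 9:

```js
// Sketch: abort local generation when sustained throughput stays below a floor after warmup.
function makeThroughputWatchdog({ minTokS = 3, warmupTokens = 16 } = {}) {
  let tokens = 0;
  let t0 = 0;
  const controller = new AbortController();
  return {
    signal: controller.signal, // pass to the generation loop if it supports AbortSignal
    onToken() {
      tokens += 1;
      if (tokens === warmupTokens) t0 = performance.now();
      const elapsedMs = performance.now() - t0;
      if (tokens > warmupTokens && elapsedMs > 2000) {
        const tokS = (tokens - warmupTokens) / (elapsedMs / 1000);
        if (tokS < minTokS) {
          controller.abort(new Error(`local decode too slow: ${tokS.toFixed(1)} tok/s`));
        }
      }
    },
  };
}
// On abort, fall back to a smaller tier or (with consent) the cloud path from section 4.
```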
Pseudocode for model tiering:
```js
function pickModelTier(caps) {
  // Basic heuristics; adjust with empirical data
  const mem = caps.deviceMemory || 4; // GB, approximate
  const hasWebGPU = caps.webgpu;
  const maxBuf = hasWebGPU && caps.webgpuLimits
    ? caps.webgpuLimits.maxStorageBufferBindingSize
    : 128 * 1024 * 1024;

  // Storage buffer >= 256 MB and deviceMemory >= 8 GB => 3B is OK
  if (hasWebGPU && mem >= 8 && maxBuf >= 256 * 1024 * 1024) return '3b-q4';

  // WebGPU or WASM+SIMD with >= 4 GB => 1–2B
  if ((hasWebGPU || caps.wasmSimd) && mem >= 4) return '1b-q4';

  // Otherwise tiny model or cloud
  return 'tiny-or-cloud';
}
```
7) Packaging and caching models for the web
You cannot just drop a multi-GB file on users and hope caching saves you. Treat models like large media:
- Chunked assets: Segment weights into 4–16 MB chunks with content-addressed filenames. This enables range fetching, partial updates, and parallelization.
- Streaming decode: Use fetch streaming and transform streams to upload directly into GPU buffers or WASM memory as chunks arrive (see the sketch after the service worker example below).
- Service Worker: Version models via URL fingerprints; prefetch when plugged-in or on Wi‑Fi; evict old versions aggressively.
- Integrity: Use Subresource Integrity (SRI) for small binaries and a signed manifest for chunk hashes.
- Storage: Cache in IndexedDB or the Cache Storage API; do not abuse localStorage. Provide a model-management UI for clearing the cache.
Example: a simple service worker to cache model chunks.
```js
// sw.js
const MODEL_CACHE = 'model-cache-v3';

self.addEventListener('install', (e) => {
  self.skipWaiting();
});

self.addEventListener('activate', (e) => {
  e.waitUntil((async () => {
    const keys = await caches.keys();
    await Promise.all(keys.filter(k => k !== MODEL_CACHE).map(k => caches.delete(k)));
    await self.clients.claim();
  })());
});

self.addEventListener('fetch', (e) => {
  const url = new URL(e.request.url);
  if (url.pathname.startsWith('/models/')) {
    e.respondWith((async () => {
      const cache = await caches.open(MODEL_CACHE);
      const hit = await cache.match(e.request);
      if (hit) return hit;
      const resp = await fetch(e.request, { integrity: url.searchParams.get('sri') || undefined });
      if (resp.ok) cache.put(e.request, resp.clone());
      return resp;
    })());
  }
});
```
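And a sketch of the streaming-decode idea from the list above: reading a weight chunk off the network and writing it straight into a WebGPU storage buffer. It assumes chunk byte lengths come from a (hypothetical) signed manifest and are 4-byte aligned, since GPUQueue.writeBuffer requires aligned offsets and sizes:

```js
// Sketch: stream one weight chunk from fetch() into a GPU storage buffer as bytes arrive.
async function streamChunkToGpu(device, url, byteLength) {
  const buffer = device.createBuffer({
    size: byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  });

  const resp = await fetch(url);
  if (!resp.ok || !resp.body) throw new Error(`fetch failed: ${url}`);

  const reader = resp.body.getReader();
  let offset = 0;
  let pending = new Uint8Array(0); // carry bytes so every write stays 4-byte aligned
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    const data = new Uint8Array(pending.length + value.length);
    data.set(pending);
    data.set(value, pending.length);
    const aligned = data.length - (data.length % 4);
    if (aligned > 0) {
      device.queue.writeBuffer(buffer, offset, data, 0, aligned);
      offset += aligned;
    }
    pending = data.subarray(aligned);
  }
  if (pending.length !== 0) throw new Error('chunk was not 4-byte aligned');
  return buffer; // bind as a storage buffer in the model's WebGPU kernels
}
```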
8) WebGPU and WASM implementation patterns
Backend selection and session creation with ONNX Runtime Web:
```js
import * as ort from 'onnxruntime-web';

async function createOnnxSession(modelUrl, caps) {
  // WASM backend options are configured globally via ort.env before session creation.
  ort.env.wasm.numThreads = caps.wasmThreads ? (navigator.hardwareConcurrency || 4) : 1;
  ort.env.wasm.simd = caps.wasmSimd;

  const executionProviders = [];
  if (caps.webgpu) executionProviders.push('webgpu');
  executionProviders.push('wasm'); // always keep a fallback EP

  return ort.InferenceSession.create(modelUrl, {
    executionProviders,
    graphOptimizationLevel: 'all',
    enableMemPattern: true,
  });
}
```
If you want a pure browser LLM with WebLLM (MLC):
```js
import { CreateMLCEngine } from '@mlc-ai/web-llm';

// Map our tiers to WebLLM model records; URLs and model_lib artifacts are placeholders.
const appConfig = {
  model_list: [
    { model_id: 'llama3-1b-q4', model: 'https://cdn.example.com/mlc/llama3-1b-q4f16/',
      model_lib: 'https://cdn.example.com/mlc/llama3-1b-q4f16-webgpu.wasm' },
    { model_id: 'llama3-3b-q4', model: 'https://cdn.example.com/mlc/llama3-3b-q4f16/',
      model_lib: 'https://cdn.example.com/mlc/llama3-3b-q4f16-webgpu.wasm' },
  ],
};

async function loadLLM(modelKey) {
  // Downloads (or reuses cached) weights and compiles the WebGPU kernels.
  return CreateMLCEngine(modelKey, {
    appConfig,
    initProgressCallback: (p) => console.log(p.text),
  });
}

async function generate(engine, prompt, maxTokens = 128) {
  // OpenAI-style streaming chat API; option names can vary across web-llm versions.
  const chunks = await engine.chat.completions.create({
    messages: [{ role: 'user', content: prompt }],
    max_tokens: maxTokens,
    temperature: 0,
    stream: true,
  });
  let output = '';
  for await (const chunk of chunks) {
    output += chunk.choices[0]?.delta?.content || '';
  }
  return output;
}
```
Split planner/actor across device/cloud:
```js
async function planAndAct(userInput, caps, prefs) {
  const tier = pickModelTier(caps);
  const localOnly = prefs.offline === true;

  const plan = await localPlanner(userInput, tier); // small local LLM proposes steps
  const ok = await localCritic(plan);               // cheap scorer / uncertainty check
  if (ok || localOnly) return localExecutor(plan);

  // Escalate with minimal context only
  const redacted = redact(plan.context);
  const cloudPlan = await fetch('/cloud/plan', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ user_input: userInput, plan: plan.steps, context: redacted }),
  }).then(r => r.json());

  return localExecutor(cloudPlan);
}
```
Headers for cross-origin isolation (required for WASM threads and to avoid subtle perf cliffs):
```
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
Content-Security-Policy: script-src 'self' 'wasm-unsafe-eval'; worker-src 'self'; connect-src 'self' https://cdn.example.com
```
WASM thread pool with cross-origin isolation check:
```js
function initThreadedWasm() {
  if (!crossOriginIsolated) {
    console.warn('Not cross-origin isolated; falling back to single-thread WASM');
    return { threads: 1 };
  }
  const threads = Math.max(2, Math.min(navigator.hardwareConcurrency || 4, 8));
  return { threads };
}
```
9) Measuring and optimizing latency
Profiling in the browser:
- WebGPU timestamp queries: Use a GPUQuerySet of type 'timestamp' to bracket passes and compute GPU time deltas; see the sketch after this list.
- JS markers: performance.mark/measure around tokenization, graph runtime, and post-processing.
- Energy proxies: Document visibility, battery API (limited), and thermal throttling events where available.
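A minimal sketch of timing one compute pass with timestamp queries. It assumes the device was created with the 'timestamp-query' feature (check adapter.features first), and recordPass is a caller-supplied function that encodes the dispatches:

```js
// Sketch: bracket a compute pass with timestamp queries and return GPU time in milliseconds.
async function timeComputePass(device, recordPass) {
  const querySet = device.createQuerySet({ type: 'timestamp', count: 2 });
  const resolveBuf = device.createBuffer({
    size: 16, usage: GPUBufferUsage.QUERY_RESOLVE | GPUBufferUsage.COPY_SRC,
  });
  const readBuf = device.createBuffer({
    size: 16, usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
  });

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass({
    timestampWrites: { querySet, beginningOfPassWriteIndex: 0, endOfPassWriteIndex: 1 },
  });
  recordPass(pass); // setPipeline / setBindGroup / dispatchWorkgroups
  pass.end();
  encoder.resolveQuerySet(querySet, 0, 2, resolveBuf, 0);
  encoder.copyBufferToBuffer(resolveBuf, 0, readBuf, 0, 16);
  device.queue.submit([encoder.finish()]);

  await readBuf.mapAsync(GPUMapMode.READ);
  const [start, end] = new BigUint64Array(readBuf.getMappedRange());
  readBuf.unmap();
  return Number(end - start) / 1e6; // timestamps are reported in nanoseconds
}
```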
Optimization playbook:
- Warmup: Run a short dry-run prompt to compile pipelines and upload weights; cache pipeline layouts.
- KV-cache paging: Use paged attention to avoid unbounded memory growth with longer prompts.
- Batch small tasks: Coalesce tool calls and I/O; reduce context-churn between agent steps.
- Keep the graph hot: Avoid tearing down devices/sessions; reuse across tabs with a shared worker.
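A sketch of the "keep the graph hot" idea using a SharedWorker that owns one engine per origin; loadLLM and generate are the WebLLM-style helpers from the earlier example, and WebGPU availability inside shared workers still varies by browser (a dedicated worker plus BroadcastChannel is a fallback):

```js
// llm-worker.js (hypothetical): one warm engine shared by every tab on the origin.
let enginePromise = null;

self.onconnect = (event) => {
  const port = event.ports[0];
  port.onmessage = async (msg) => {
    enginePromise ??= loadLLM('llama3-1b-q4'); // load weights once, reuse across tabs
    const engine = await enginePromise;
    const text = await generate(engine, msg.data.prompt);
    port.postMessage({ id: msg.data.id, text });
  };
};

// In each tab:
// const worker = new SharedWorker('/llm-worker.js', { type: 'module' });
// worker.port.postMessage({ id: 1, prompt: 'Plan the next step' });
// worker.port.onmessage = (e) => console.log(e.data.text);
```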
Budget example:
- Target P50 local plan <= 200 ms; P95 <= 600 ms.
- Token generation rate target: 10 tok/s for short completions; set cutoffs if below 3 tok/s.
- Max download budget: 50–200 MB on metered or mobile networks unless user consents to larger bundles.
10) Privacy by design
On-device inference is not just a performance hack; it is a privacy stance.
- Default to local; escalate only with explicit consent. Display a clear indicator (e.g., in omnibox or toolbar) when the agent uses cloud.
- Data minimization: When escalating, redact and summarize locally first; send only what is necessary (see the sketch after this list).
- Client Hints discipline: Request high-entropy UA-CH only when needed; scope via Permissions-Policy; avoid cross-site leakage.
- Telemetry: Opt-in. Aggregate latency, success, and backend selection; do not log prompts.
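A minimal sketch of the local redaction step (the redact() helper referenced in the split-execution code in section 8); the regex patterns are illustrative, not exhaustive, and a production agent would pair them with a small on-device NER model:

```js
// Sketch: scrub obvious identifiers before anything leaves the device.
function redact(text) {
  return text
    .replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, '[email]')     // email addresses
    .replace(/\b\d{1,3}(?:\.\d{1,3}){3}\b/g, '[ip]')    // IPv4 addresses
    .replace(/\b\d{13,19}\b/g, '[card?]')               // long digit runs (possible PANs)
    .replace(/\+?\d[\d\s().-]{7,}\d/g, '[phone]');      // phone-number-like runs
}
```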
Regulatory trendlines (GDPR, DMA, state privacy laws) are aligned with this approach. Being proactive here saves rework later.
11) Verification in CI: what-is-my-browser-agent checks and perf budgets
Build your CI around the reality of browsers in the wild:
- Use Playwright or WebDriver BiDi to spin up Chrome, Edge, Safari, and Firefox with WebGPU enabled where possible.
- Record user-agent and UA-CH values, confirm backend decisions, and check that feature detection yields the expected path.
- Regress perf budgets; fail the build if P95 exceeds your SLA.
Sample Playwright test:
```ts
import { test, expect } from '@playwright/test';

// Enable WebGPU in headless Chromium where flags are needed (exact flags vary by version).
test.use({
  launchOptions: {
    args: ['--enable-unsafe-webgpu', '--enable-features=Vulkan'],
  },
});

test('backend selection and UA echo', async ({ page }) => {
  await page.goto('https://localhost:8443');

  const ua = await page.evaluate(() => navigator.userAgent);
  expect(ua).toBeTruthy();

  const ch = await page.evaluate(async () =>
    (navigator as any).userAgentData?.getHighEntropyValues
      ? await (navigator as any).userAgentData.getHighEntropyValues(['platform', 'architecture', 'bitness'])
      : null
  );

  // detectCapabilities() is assumed to be exposed globally by the app bundle.
  const caps = await page.evaluate(async () => await (window as any).detectCapabilities());
  expect(caps.webgpu || caps.wasmSimd).toBeTruthy();

  await page.goto('https://localhost:8443/what-is-my-agent');
  const reported = await page.locator('#ua-json').textContent();
  expect(reported).toContain('userAgent');
});
```
A trivial what-is-my-browser-agent endpoint:
```js
// Express.js snippet
app.get('/what-is-my-agent', (req, res) => {
  res.set('Content-Type', 'text/html');
  const data = {
    userAgent: req.get('User-Agent'),
    ch: {
      ua: req.get('Sec-CH-UA'),
      platform: req.get('Sec-CH-UA-Platform'),
      platformVersion: req.get('Sec-CH-UA-Platform-Version'),
      arch: req.get('Sec-CH-UA-Arch'),
      bitness: req.get('Sec-CH-UA-Bitness'),
      model: req.get('Sec-CH-UA-Model'),
    },
  };
  res.send(`<pre id='ua-json'>${JSON.stringify(data, null, 2)}</pre>`);
});
```
12) Opinionated playbook and pitfalls
My opinions, honed by painful experience:
- Feature detect first, UA-CH second, UA string last. UA strings are both sparse and misleading in modern browsers.
- Ship the smallest model that gives you reliable planning. Use cloud for the rare heavy cases, not by default.
- Never silently fall back to a slow path that degrades UX. If the WebGPU path fails mid-session, show a clear toast and switch to a tiny model or ask for permission to escalate.
- Avoid giant monolithic model files. Chunk, stream, and verify.
- Quantize aggressively, but measure. Some 3-bit schemes look great on paper and fall apart on long prompts with tool chatter.
- Invest in tokenization speed. A slow tokenizer can erase WebGPU gains. Precompile, use SIMD, and cache.
- Be transparent with users. Show a network indicator whenever a model call leaves the device.
Common pitfalls:
- Failing to set COOP/COEP and then wondering why threads are disabled and perf tanks.
- Storing models in localStorage or blowing past storage quotas without a cleanup story.
- Assuming GPU memory equals system memory; mobile iGPUs have tight budgets and drivers may silently evict buffers.
- Using UA-CH as a gate for correctness decisions; always re-validate at runtime.
- Not testing on battery. Thermal throttling can turn a 10 tok/s device into 2 tok/s in minutes.
13) Roadmap and what to watch
- WebGPU cooperative matrices and subgroup features: Expect further speedups for attention and matmul-heavy workloads as these land more broadly.
- Memory64 for WASM: Removes the 4 GB barrier; still rolling out but will eventually enable larger graphs.
- WebNN: As it gains adoption and backends, it may become a simpler high-level API for many models.
- KV cache offloading: Smarter paging and compression for long conversations on constrained devices.
- Hardware queries: Better privacy-preserving hints about device class could make initial model selection safer.
References and further reading
- WebGPU spec and status: W3C GPU for the Web
- ONNX Runtime Web docs and WebGPU EP blogs by Microsoft
- MLC AI WebLLM repository and blog on paged KV cache and WebGPU kernels
- Chrome Client Hints (UA-CH) developer guide and privacy model
- Cross-origin isolation (COOP/COEP) best practices in MDN
- Playwright docs for browser automation and performance testing
These resources evolve quickly; rely on them for current details and examples.
Conclusion: An agentic browser that runs small LLMs locally with WebGPU/WASM, orchestrates split execution, and selects models with user-agent–aware heuristics is not sci-fi; it is a tractable engineering project. The wins in latency, privacy, and resilience are substantial. The primary challenge is discipline: ship small, measure relentlessly, and design fallbacks that respect users. With that, you get an AI browser that feels instantaneous, private, and trustworthy.