Executive summary
Agentic browsers—the autonomous browsing agents that drive modern AI workflows—need fleet operations, not ad‑hoc scripts. At scale, the problems look like any multi‑tenant distributed system: isolation, scheduling, quotas, SLOs, policy, observability, and security. But there is a twist: many websites drive access control, personalization, and anti‑abuse from the browser fingerprint itself, increasingly based on User‑Agent Client Hints (UA‑CH) rather than the legacy User‑Agent string.
This article proposes a concrete control‑plane design for multi‑tenant agentic browser fleets. It includes:
- Kubernetes-based scheduling and pool management
- Per‑tenant sandboxes with strong isolation
- Quotas, SLOs, and error budgets focused on browser identity correctness
- A policy‑driven Browser Agent Switcher (User‑Agent string + Client Hints, locale, timezone, device class)
- A "what is my browser agent" telemetry service and SLOs
- Geo/locale pools for region and language fidelity
- Real‑time browser agent security risk budgets with enforcement actions
The intended audience is engineers operating large‑scale automation fleets, browser testing providers, and AI platform teams.
Why agentic browsers need fleet ops
Single instances of Puppeteer or Playwright work for demos. Production demands:
- Multi‑tenancy: One platform, many tenants. Hard guarantees that agents, data, and budgets don’t bleed across tenants.
- Determinism: A target site’s behavior often depends on UA/CH, locale, timezone, viewport, hardware class. You must be able to set and verify those.
- Policy: Tenant‑ and destination‑aware policies decide which browser agent profile to present.
- Risk management: Automated browsing touches security controls (WAFs, bot defenses, CAPTCHAs). You need a budgeted model for risky actions.
- Observability: You can’t improve what you can’t measure. “What browser am I?” is not trivia; it’s an SLI.
Opinion: Context-level isolation inside a single headless browser is not enough for multi‑tenant production. Use process‑level or VM‑level isolation per tenant (or even per session at high sensitivity). This drives the architecture below.
Design goals and non‑goals
Goals
- Strong tenant isolation without sacrificing density and cost control
- Policy‑driven browser identity with auditable decisions
- First‑class SLOs for agent identity fidelity and target‑site reachability
- Real‑time risk budgets and safe‑mode fallbacks
- Clean extensibility: adding a new agent profile, region, or policy is declarative
Non‑goals
- Circumventing site defenses. We focus on compliant automation: correctness, reliability, and risk limits, not evasion.
- Full‑stack antifingerprint research. We reference standards (W3C UA‑CH, Chromium UA Reduction) and operational best practices.
References for identity and hints
- RFC 9110 (HTTP Semantics) describes the legacy User‑Agent header
- Chromium’s User‑Agent Reduction initiative (shipped in stable Chrome): shifts detailed identity from the UA string to Client Hints
- W3C User‑Agent Client Hints: Sec‑CH‑UA, Sec‑CH‑UA‑Platform, Sec‑CH‑UA‑Mobile, Sec‑CH‑UA‑Full‑Version‑List, etc.
Architecture overview
High‑level components
- Control plane
- Scheduler and operator: creates BrowserSession pods in the right geo pool with the approved BrowserAgentClass
- Policy engine (OPA/Gatekeeper): validates tenant policy, agent selection, and risk budgets
- Quota/SLO subsystem: Prometheus, Alertmanager, error budget tooling
- "What is my browser agent" telemetry service: authoritative echo + validation
- Risk engine: real‑time counters and rules; enforces safe mode
- Data plane
- Browser pods: Playwright/Puppeteer/ChromeDriver under gVisor/Firecracker/Kata
- Proxy / egress tier: per‑geo IP pools; TLS pinning limits; DNS control
- Storage: ephemeral workspace + optional tenant‑scoped, encrypted object store
Simplified flow
- Tenant requests a BrowserSession with intent (target domain), policies, and SLO class.
- Policy picks a BrowserAgentClass (UA/CH profile, locale, timezone, viewport), geo pool, and security posture.
- Scheduler starts the session in the geo pool node group with the required sandbox.
- Browser boots, calls the telemetry service to verify presented identity; failure triggers switcher adjustments or session fail.
- Risk engine monitors actions (downloads, file uploads, login forms, unusual JS APIs) and burns against a per‑tenant budget.
- On budget breach or high burn rate, control plane enforces safe mode or blocks.
ASCII diagram
```
Tenant → Control API → Policy Engine → Scheduler/Operator → k8s
              ↓              ↓                ↓               ↓
         Risk Engine   Agent Switcher     Geo/Pool      Browser Pods
              ↓              ↓                ↓               ↓
       Telemetry + SLOs ← "What is my browser agent" ← Egress/Proxy
```
Kubernetes scheduling and pools
Key k8s primitives and settings
- Namespaces per tenant; ResourceQuota + LimitRange for CPU/RAM/GPU
- Node pools (node groups) per geo/locale, labeled for selection
- Taints/tolerations to keep certain tenants in hardened pools (microVM‑backed)
- PriorityClass for preemption and reserved capacity for SLO‑critical workloads
- PodTopologySpread to avoid noisy neighbor and zone‑level risk concentration
- RuntimeClass: gVisor or Kata (Firecracker) per security tier
- PodSecurity level: restricted; seccomp, AppArmor, dropped capabilities
Pool topology
- pool-us-east, pool-eu-central, pool-ap-sg with node labels like region=us-east, geo=us, locale=en-US
- Hardened pool for high‑risk tenants or targets: runtimeClassName: kata, egress via isolated IP range
Example: Node labels and selection in a BrowserSession
```yaml
apiVersion: agent.example.com/v1
kind: BrowserSession
metadata:
  name: session-123
  namespace: tenant-a
spec:
  targetDomain: "docs.example.org"
  agentClass: "chrome-stable-desktop-122"
  geoPool: "us-east"
  runtimeClassName: kata
  nodeSelector:
    agent.geo/region: us-east
    agent.geo/locale: en-US
  tolerations:
    - key: "agent.pool/hardened"
      operator: "Exists"
  resources:
    requests:
      cpu: "500m"
      memory: "1Gi"
    limits:
      cpu: "2"
      memory: "4Gi"
```
Per‑tenant sandboxes and isolation
Isolation ladder (choose per risk profile):
- Process isolation: one pod per tenant with multiple browser contexts (lowest isolation)
- PID/user‑ns + gVisor: syscall interception, good density and strong kernel isolation
- MicroVM (Kata/Firecracker): VM boundary per pod; highest isolation
Hardening checklist
- Non‑root containers; immutable root FS; drop all capabilities
- Seccomp: default runtime or custom profile; AppArmor enforcement
- Read‑only /home; tmpfs workdir; no hostPath mounts
- Egress via eBPF/Cilium NetworkPolicies; deny‑all default
- No secret env vars; use projected service account tokens; secret mounts with fsGroup
Example PodSecurityContext and container securityContext
```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 10001
  runAsGroup: 10001
  fsGroup: 10001
  seccompProfile:
    type: RuntimeDefault
---
containers:
  - name: browser
    image: ghcr.io/example/agent-browser:122
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
```
Why not multiplex tenants in one browser process? Side‑channels and defects in browser isolation are real. Given the operational cost of an incident, per‑tenant pods with gVisor or Kata are a defensible default.
Quotas, SLOs, and error budgets
Define per‑tenant quotas
- Concurrency: max sessions per tenant
- Runtime: max minutes per day
- Egress: max outbound bytes per day and per domain
- Feature flags: downloads, file dialogs, WebUSB/WebSerial disabled by default
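These categories reduce to a small admission-time check. A minimal sketch in Python — all field and quota names here are illustrative, not part of any real API:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TenantQuota:
    max_concurrent_sessions: int
    max_runtime_minutes_per_day: int
    max_egress_bytes_per_day: int
    downloads_allowed: bool = False  # risky features off by default

@dataclass
class TenantUsage:
    concurrent_sessions: int
    runtime_minutes_today: int
    egress_bytes_today: int

def admit_session(quota: TenantQuota, usage: TenantUsage) -> Tuple[bool, str]:
    """Return (admitted, reason) for a new session request."""
    if usage.concurrent_sessions >= quota.max_concurrent_sessions:
        return False, "concurrency_quota_exceeded"
    if usage.runtime_minutes_today >= quota.max_runtime_minutes_per_day:
        return False, "runtime_quota_exceeded"
    if usage.egress_bytes_today >= quota.max_egress_bytes_per_day:
        return False, "egress_quota_exceeded"
    return True, "ok"
```

In practice the usage counters would come from the metrics pipeline described below; the point is that the decision itself stays a pure, auditable function.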
SLIs for agentic browsers (examples)
- Identity fidelity: percentage of sessions where the echo service confirms expected UA/CH/locale/timezone within 30s
- Reachability: percentage of sessions reaching the target domain homepage within 60s without 4xx/5xx
- Stability: session completion without crash/timeout
- Safety: percentage of sessions with no high‑risk API usage
SLOs and error budgets
- Identity SLO: 99.5% 30‑second confirmation
- Reachability SLO: 99.0% 1‑minute success
- Stability SLO: 99.2%
- Safety SLO: 99.9% no high‑risk events
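The error budget falls out of these targets arithmetically. A sketch of the bookkeeping, assuming "good" and "total" session counts from the metrics above:

```python
def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget left over a window.

    slo_target: e.g. 0.995 for the identity SLO
    good/total: confirmed vs. started sessions in the window
    Returns 1.0 when nothing is burned; 0.0 or negative when exhausted.
    """
    if total == 0:
        return 1.0
    allowed_failures = (1.0 - slo_target) * total
    actual_failures = total - good
    if allowed_failures == 0:
        return 0.0 if actual_failures else 1.0
    return 1.0 - actual_failures / allowed_failures

# Example: 99.5% SLO over 10,000 sessions allows 50 failures;
# 25 failures means half the budget is burned.
```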
Prometheus recording + alerting (burn rate example)
```yaml
# SLI: identity_confirmed
- record: sli:identity_confirmed:ratio
  expr: |
    sum(rate(agent_identity_confirmed_total[5m]))
    /
    sum(rate(agent_sessions_started_total[5m]))
---
# SLO: 99.5%. Multi-window burn rate alerts (fast/slow)
- alert: IdentitySLOFastBurn
  expr: (1 - sli:identity_confirmed:ratio) > (1 - 0.995) * 14
  for: 5m
  labels: {severity: critical}
  annotations:
    summary: Fast burn for identity SLO
- alert: IdentitySLOSlowBurn
  expr: (1 - sli:identity_confirmed:ratio) > (1 - 0.995) * 6
  for: 1h
  labels: {severity: warning}
  annotations:
    summary: Slow burn for identity SLO
```
Policy‑driven Browser Agent Switcher
The switcher picks the presented identity at runtime:
- UA string: legacy header used by many sites
- UA‑CH: Sec‑CH‑UA, Sec‑CH‑UA‑Platform, Sec‑CH‑UA‑Mobile, Sec‑CH‑UA‑Arch, etc. (subject to Accept‑CH)
- Accept‑Language, timezone, geolocation, viewport, device pixel ratio
- Cookie mode, storage partitioning, third‑party cookie policy
Policy inputs
- Tenant policy and risk tier
- Target domain constraints (compatibility matrix, e.g., some sites break on reduced UA)
- Geo/locale requirement
- SLO history (e.g., site X has 2% mismatch; choose hardened profile)
- Risk budget remaining (if low, downgrade to safe profile)
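These inputs compose into a small, auditable decision function. A sketch, assuming a TenantPolicy-like dict shaped as in the examples in this section:

```python
from fnmatch import fnmatch

def choose_agent_class(policy: dict, target_domain: str,
                       budget_remaining_pct: float) -> str:
    """Pick an agent class: risk downgrade first, then per-target
    override, then the tenant default."""
    downgrade = policy.get("riskDowngrade", {})
    if budget_remaining_pct < downgrade.get("onBudgetBelowPct", 0.0):
        return downgrade["safeAgentClass"]
    for override in policy.get("targetOverrides", []):
        if any(fnmatch(target_domain, pat) for pat in override["domains"]):
            return override["agentClass"]
    return policy["defaultAgentClass"]
```

Keeping the precedence explicit in code makes the switcher's choices easy to log and replay during an audit.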
Representing agent classes
```yaml
apiVersion: agent.example.com/v1
kind: BrowserAgentClass
metadata:
  name: chrome-stable-desktop-122
spec:
  engine: chromium
  version: "122.0"
  userAgent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
  uaClientHints:
    brands:
      - brand: "Chromium"
        version: "122"
      - brand: "Not(A:Brand)"
        version: "24"
    fullVersion: "122.0.6261.0"
    platform: "Windows"
    platformVersion: "15.0"
    architecture: "x86"
    model: ""
    mobile: false
  viewport:
    width: 1366
    height: 768
    deviceScaleFactor: 1
  timezoneId: "America/New_York"
  locale: "en-US"
  thirdPartyCookies: "block"
  features:
    downloads: false
    webusb: false
```
OPA policy to allow only declared agent classes per tenant
```rego
package agent.policy

# input: { tenant, requestedClass, targetDomain }
allowed_classes := {"chrome-stable-desktop-122", "chrome-stable-mac-122"}

violation[msg] {
    not allowed_classes[input.requestedClass]
    msg := sprintf("Agent class %v not allowed for tenant %v", [input.requestedClass, input.tenant])
}
```
Playwright example: set UA, UA‑CH, locale, timezone, viewport
```ts
import { chromium } from 'playwright';

(async () => {
  const context = await chromium.launchPersistentContext('', {
    headless: true,
    userAgent:
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
    locale: 'en-US',
    timezoneId: 'America/New_York',
    viewport: { width: 1366, height: 768 },
    permissions: [],
    extraHTTPHeaders: {
      // Some environments support UA‑CH override; Playwright passes through headers
      'Sec-CH-UA': '"Chromium";v="122", "Not(A:Brand)";v="24"',
      'Sec-CH-UA-Platform': '"Windows"',
      'Sec-CH-UA-Platform-Version': '"15.0"',
      'Sec-CH-UA-Mobile': '?0',
      'Accept-Language': 'en-US,en;q=0.9'
    }
  });
  const page = await context.newPage();
  await page.goto('https://what-is-my-browser-agent.example.com/echo');
  console.log(await page.textContent('pre#headers'));
  await context.close();
})();
```
Chromium CDP allows deeper UA‑CH control. Puppeteer example:
```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();
  const client = await page.target().createCDPSession();
  await client.send('Network.setUserAgentOverride', {
    userAgent:
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
    acceptLanguage: 'en-US',
    platform: 'Windows',
    userAgentMetadata: {
      brands: [
        { brand: 'Chromium', version: '122' },
        { brand: 'Not(A:Brand)', version: '24' }
      ],
      fullVersion: '122.0.6261.0',
      platform: 'Windows',
      platformVersion: '15.0',
      architecture: 'x86',
      model: '',
      mobile: false
    }
  });
  await page.goto('https://what-is-my-browser-agent.example.com/echo');
  await browser.close();
})();
```
Note: UA‑CH headers are formally sent in response to server Accept‑CH opt‑in and user privacy preferences; CDP overrides simulate metadata for automation and testing.
Decision logic sketch (switcher)
```yaml
apiVersion: agent.example.com/v1
kind: TenantPolicy
metadata:
  name: tenant-a
spec:
  defaultAgentClass: chrome-stable-desktop-122
  targetOverrides:
    - domains: ["*.bank.example"]
      agentClass: chrome-stable-desktop-122
      thirdPartyCookies: block
      hardened: true
    - domains: ["*.legacy.example"]
      agentClass: chrome-legacy-desktop-114
  riskDowngrade:
    onBudgetBelowPct: 0.2
    safeAgentClass: chrome-stable-desktop-122-safe
```
"What is my browser agent" telemetry and SLOs
Purpose
- Authoritative echo of what the outside world sees: UA, UA‑CH, Accept‑Language, IP/ASN, TLS properties, timezone via JS, screen/viewport
- Compare against the selected BrowserAgentClass and geo/locale pool
- Emit metrics and traces for SLOs and debugging
Echo service design
- Stateless HTTP service; surfaces request headers, IP, resolved geo, and a JS payload computing timezone, devicePixelRatio, screen size
- Returns a signed JSON blob so the browser agent can include it in its trace
Example echo response (truncated)
```json
{
  "ip": "203.0.113.10",
  "asn": 64496,
  "geo": {"country": "US", "region": "VA", "city": "Ashburn"},
  "headers": {
    "user-agent": "Mozilla/5.0 ... Chrome/122.0.0.0 Safari/537.36",
    "sec-ch-ua": "\"Chromium\";v=\"122\", \"Not(A:Brand)\";v=\"24\"",
    "sec-ch-ua-platform": "\"Windows\"",
    "accept-language": "en-US,en;q=0.9"
  },
  "js": {"timezone": "America/New_York", "dpr": 1, "screen": "1366x768"},
  "ts": 1735689600
}
```
Validation logic
- Compare UA/CH fields with agent class
- Check geo pool match (country/region) and ASN allowlist
- Check timezone and locale
- Produce a pass/fail and reason codes
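The comparison itself can be a pure function over the echo blob and the selected agent class. A sketch — field names follow the examples in this section but are otherwise illustrative:

```python
from typing import List, Tuple

def validate_identity(echo: dict, expected: dict) -> Tuple[bool, List[str]]:
    """Compare an echo-service response against the selected agent class.

    Returns (passed, reason_codes); an empty reason list means a clean pass.
    """
    reasons = []
    headers = echo.get("headers", {})
    if headers.get("user-agent") != expected["userAgent"]:
        reasons.append("ua_mismatch")
    # Sec-CH-UA-Platform is a quoted sf-string, e.g. '"Windows"'
    if expected["platform"] not in headers.get("sec-ch-ua-platform", ""):
        reasons.append("ch_platform_mismatch")
    if echo.get("geo", {}).get("country") != expected["country"]:
        reasons.append("geo_mismatch")
    if echo.get("js", {}).get("timezone") != expected["timezoneId"]:
        reasons.append("timezone_mismatch")
    return (not reasons), reasons
```

Reason codes feed directly into the failure counters below, so dashboards can break mismatches down by cause.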
OpenTelemetry metrics/traces
```python
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource

resource = Resource.create({"service.name": "agent-echo"})
meter = MeterProvider(resource=resource).get_meter("agent-echo")

confirmed = meter.create_counter("agent_identity_confirmed_total")
failed = meter.create_counter("agent_identity_failed_total")

# When validating a session (valid, tenant, etc. come from the validation step)
if valid:
    confirmed.add(1, {"tenant": tenant, "agentClass": agent_class, "geo": geo})
else:
    failed.add(1, {"tenant": tenant, "reason": reason})
```
SLO check: Identity must be confirmed within 30 seconds of session start; otherwise the switcher retries once with an alternate agent class or fails fast for deterministic behavior.
Geo/locale pools
Reasons
- Sites localize content and access based on IP and Accept‑Language
- Latency matters for anti‑automation timing thresholds
- Legal compliance: data residency and export control
Kubernetes placement
```yaml
apiVersion: agent.example.com/v1
kind: GeoPool
metadata:
  name: us-east
spec:
  nodeSelector:
    agent.geo/region: us-east
  egressCIDRs: ["198.51.100.0/24"]
  locales: ["en-US"]
```
Node affinity in sessions
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: agent.geo/region
              operator: In
              values: ["us-east"]
```
Browser locale and timezone in Playwright
```ts
const context = await chromium.launchPersistentContext('', {
  locale: 'en-US',
  timezoneId: 'America/New_York',
});
```
Optional: geolocation mock (when permitted) for sites that rely on the Geolocation API
```ts
await context.grantPermissions(['geolocation']);
await context.setGeolocation({ latitude: 40.7128, longitude: -74.0060 });
```
Real‑time browser agent security risk budgets
Define budget categories
- Target risk: domains with stricter ToS or sensitive surfaces
- Action risk: file downloads/uploads, authentication flows, payment flows
- API risk: WebUSB/WebSerial, getUserMedia, clipboard write, cross‑origin iframes with storage access
- Anomaly risk: high request rate, repeated form submissions, WAF blocks, captcha triggers
Model
- Each event carries a risk weight (w)
- Per tenant budget B over rolling window T (e.g., 24h); track burn R = sum(w)
- Enforcement thresholds: warn at 50%; degrade at 80%; block at 100%
Go snippet for a budget service
```go
import (
	"context"
	"fmt"
	"strconv"
	"strings"
	"time"

	"github.com/redis/go-redis/v9"
)

type Budget struct {
	Window time.Duration
	Limit  float64
}

type Event struct {
	Tenant string
	Weight float64
	Ts     time.Time
}

// Events live in a Redis sorted set per tenant, scored by timestamp. The
// member encodes a unique nanosecond ID plus the weight so that identical
// weights don't collapse into a single set member.
func AddEvent(ctx context.Context, rdb *redis.Client, e Event) error {
	key := fmt.Sprintf("risk:%s", e.Tenant)
	member := fmt.Sprintf("%d:%f", e.Ts.UnixNano(), e.Weight)
	return rdb.ZAdd(ctx, key, redis.Z{Score: float64(e.Ts.Unix()), Member: member}).Err()
}

func GetBurn(ctx context.Context, rdb *redis.Client, tenant string, window time.Duration) (float64, error) {
	now := time.Now()
	key := fmt.Sprintf("risk:%s", tenant)
	// Trim events older than the rolling window.
	if err := rdb.ZRemRangeByScore(ctx, key, "-inf", fmt.Sprint(now.Add(-window).Unix())).Err(); err != nil {
		return 0, err
	}
	// Sum the weights encoded in the remaining members.
	members, err := rdb.ZRange(ctx, key, 0, -1).Result()
	if err != nil {
		return 0, err
	}
	burn := 0.0
	for _, m := range members {
		parts := strings.SplitN(m, ":", 2)
		if w, err := strconv.ParseFloat(parts[len(parts)-1], 64); err == nil {
			burn += w
		}
	}
	return burn, nil
}

func Enforce(burn, limit float64) string {
	switch pct := burn / limit; {
	case pct >= 1.0:
		return "block"
	case pct >= 0.8:
		return "degrade"
	case pct >= 0.5:
		return "warn"
	default:
		return "ok"
	}
}
```
Integration with the switcher
- ok: normal profile
- warn: raise telemetry level; keep profile
- degrade: switch to safe agent class (no downloads, stricter cookies, hardened pool)
- block: fail session creation for sensitive targets; require manual override
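Wiring those verdicts into session creation is then a small dispatch. A Python sketch — the spec fields and `-safe` naming convention are illustrative, mirroring the safe agent class example below:

```python
from typing import Optional

def apply_enforcement(action: str, session_spec: dict,
                      target_is_sensitive: bool) -> Optional[dict]:
    """Translate a risk verdict into a concrete session spec.

    Returns the (possibly modified) spec, or None when creation is blocked.
    """
    if action == "block" and target_is_sensitive:
        return None  # require a manual override out of band
    if action == "degrade":
        spec = dict(session_spec)
        spec["agentClass"] = spec["agentClass"] + "-safe"  # safe variant
        spec["hardened"] = True
        return spec
    if action == "warn":
        spec = dict(session_spec)
        spec["telemetryLevel"] = "verbose"  # raise telemetry, keep profile
        return spec
    return session_spec
```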
Prometheus counters for risk categories
```yaml
- record: tenant:risk_burn:sum_rate5m
  expr: sum by (tenant) (rate(agent_risk_weight_sum[5m]))
```
Burn‑rate alerts tied to budgets:
```yaml
- alert: RiskBudgetFastBurn
  expr: tenant:risk_burn:sum_rate5m > (tenant_risk_budget_limit * 0.2)
  for: 10m
  labels: {severity: critical}
  annotations:
    summary: Tenant is burning risk budget too fast
```
Safe agent class example
```yaml
apiVersion: agent.example.com/v1
kind: BrowserAgentClass
metadata:
  name: chrome-stable-desktop-122-safe
spec:
  userAgent: "Mozilla/5.0 ... Chrome/122.0.0.0 Safari/537.36"
  uaClientHints:
    mobile: false
  features:
    downloads: false
    webusb: false
    clipboardWrite: false
  thirdPartyCookies: "block"
  runtimeClassName: kata
```
Observability and control
Metrics
- agent_sessions_started_total{tenant,agentClass,geo}
- agent_identity_confirmed_total{tenant,agentClass}
- agent_target_reach_success_total{tenant,domain}
- agent_risk_weight_sum{tenant,category}
- browser_crash_total{tenant,agentClass}
Traces
- One trace per session: start → switcher decision → echo check → target navigation → actions → teardown
Logs
- Structured JSON; redact PII; correlate with trace IDs (W3C TraceContext)
Dashboards
- Identity SLO per tenant and per agent class
- Reachability by domain with quick regression detection
- Risk burn and enforcement events over time
Security posture
Network and egress
- Default‑deny NetworkPolicies; explicit allow for DNS and proxy
- Egress IP ranges per geo pool; static ASN allowlist if required
- TLS: disable MITM for session content unless explicit DLP mode is enabled with tenant consent
Container supply chain
- Minimal base images; pinned, verified digests; SBOMs
- Image signing (Sigstore/cosign) and admission checks
- Regular patching; CI scanning for CVEs
Secrets and identity
- SPIFFE IDs per pod; mTLS between pods and control plane
- Kubernetes projected service account tokens with short TTL
- Per‑tenant KMS keys; AES‑GCM for at‑rest storage
Browser safety defaults
- Disable dangerous APIs unless explicitly allowed
- No persistent cookies across sessions unless tenant‑scoped and encrypted
- Clear storage and cache on teardown
Operational playbooks
Rollouts and canaries
- Introduce new BrowserAgentClass as canary; route 5% traffic for low‑risk tenants
- Monitor identity SLO and WAF block rate deltas; progressive rollout with auto‑rollback
Chaos tests
- Randomly kill browser pods; ensure rescheduling and idempotent workflows
- Inject UA‑CH mismatch to verify echo SLO detection and automatic switcher correction
Incident response
- Sudden increase in WAF blocks for domain X: freeze high‑risk actions, switch to safe agent class, open investigation with captured traces
- Identity SLO burn: pause new sessions; roll back to last good agent class
Cost controls
- Pre‑warm small pools; scale to zero idle nodes per region off‑peak
- Use VPA or autoscaling per agent class density
Example custom resources and operator sketch
CRD definitions (abbreviated)
```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: browsersessions.agent.example.com
spec:
  group: agent.example.com
  names:
    kind: BrowserSession
    plural: browsersessions
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                targetDomain: { type: string }
                agentClass: { type: string }
                geoPool: { type: string }
                runtimeClassName: { type: string }
                resources: { type: object }
```
Operator responsibilities
- Validate against TenantPolicy with an admission webhook (or Gatekeeper)
- Resolve geoPool → nodeSelector and egress
- Resolve agentClass → container env/flags/launch‑args
- Inject sidecar for telemetry (optional)
- Emit Kubernetes Events for session lifecycle
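The resolution steps above reduce to lookups over the declared custom resources. A reconcile sketch in Python — resource shapes and env var names are illustrative:

```python
def resolve_session(session: dict, geo_pools: dict, agent_classes: dict) -> dict:
    """Expand a BrowserSession spec into concrete pod settings."""
    pool = geo_pools[session["geoPool"]]          # GeoPool CR by name
    agent = agent_classes[session["agentClass"]]  # BrowserAgentClass CR by name
    return {
        "nodeSelector": pool["nodeSelector"],
        "runtimeClassName": session.get("runtimeClassName", "gvisor"),
        "env": {
            "AGENT_UA": agent["userAgent"],
            "AGENT_LOCALE": agent["locale"],
            "AGENT_TZ": agent["timezoneId"],
        },
    }
```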
Practical nuances of UA and Client Hints
- Chromium UA Reduction: many sites rely on UA‑CH; test if target expects Accept‑CH and full hints; the switcher can do a priming request
- Some CDNs vary responses on high‑entropy hints such as Sec‑CH‑UA‑Full‑Version‑List; ensure consistency with userAgentMetadata
- Accept‑Language drives server content; align it with locale
- Timezone and geo: for some sites, mismatch between IP geo and JS timezone is a risk signal; your echo service should catch it
- Viewport and DPR: use sensible defaults (1366×768, DPR 1 or 2) and stay consistent per agent class
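A priming request is simple in outline: fetch the target once, read its Accept‑CH response header, then send only the requested hints on the real visit. A parser sketch — the always‑sent set reflects the low‑entropy hints browsers emit regardless of opt‑in:

```python
def hints_to_send(accept_ch_header: str, available: dict) -> dict:
    """Given an Accept-CH header value and the hints our agent class can
    provide, return the hint headers for the follow-up request."""
    # Low-entropy hints are sent by real browsers without server opt-in.
    always = {"Sec-CH-UA", "Sec-CH-UA-Mobile", "Sec-CH-UA-Platform"}
    requested = {h.strip() for h in accept_ch_header.split(",") if h.strip()}
    return {name: value for name, value in available.items()
            if name in always or name in requested}
```

Keeping the sent-hint set minimal and consistent per agent class avoids the cross-header contradictions the echo service is designed to catch.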
Putting it together: end‑to‑end flow
- Tenant A requests 100 sessions to crawl docs.example.org in US‑East, English.
- Operator selects chrome‑stable‑desktop‑122, geo pool us‑east, hardened=false.
- Pods launch in us‑east nodes, gVisor runtime, Playwright context configured.
- Each session hits the echo service; 99 confirm in < 30s; one mismatch triggers switcher fallback and succeeds.
- Risk engine notes two download attempts; budget at 5%—no action.
- SLOs stay green; rollout continues; traces captured for audit.
Future work
- Dynamic agent generation with differential privacy for large fleets while staying standards‑compliant
- WASM sandboxed extensions for specialized site automation, verified by policy
- Hardware‑backed attestation of sandbox runtime (e.g., confidential computing) for highly regulated tenants
Conclusion
Agentic browsers are first‑class distributed systems with identity, policy, and security as core runtime concerns. A multi‑tenant control plane on Kubernetes with a policy‑driven browser agent switcher, a "what is my browser agent" SLO, geo/locale pools, and real‑time risk budgets provides the operational substrate you need. The patterns above emphasize verifiable identity, explicit policy, controlled risk, and strong isolation. With these in place, your AI browser fleet can scale reliably—and predictably—across tenants, regions, and use cases.