Executive summary
Agentic browsers—the autonomous browsing agents that drive modern AI workflows—need fleet operations, not ad‑hoc scripts. At scale, the problems look like any multi‑tenant distributed system: isolation, scheduling, quotas, SLOs, policy, observability, and security. But there is a twist: many websites drive access control, personalization, and anti‑abuse from the browser fingerprint itself, increasingly based on User‑Agent Client Hints (UA‑CH) rather than the legacy User‑Agent string.
This article proposes a concrete control‑plane design for multi‑tenant agentic browser fleets. It includes:
- Kubernetes-based scheduling and pool management
- Per‑tenant sandboxes with strong isolation
- Quotas, SLOs, and error budgets focused on browser identity correctness
- A policy‑driven Browser Agent Switcher (User‑Agent string + Client Hints, locale, timezone, device class)
- A "what is my browser agent" telemetry service and SLOs
- Geo/locale pools for region and language fidelity
- Real‑time browser agent security risk budgets with enforcement actions
The intended audience is engineers operating large‑scale automation fleets, browser testing providers, and AI platform teams.
Why agentic browsers need fleet ops
Single instances of Puppeteer or Playwright work for demos. Production demands:
- Multi‑tenancy: One platform, many tenants. Hard guarantees that agents, data, and budgets don’t bleed across tenants.
- Determinism: A target site’s behavior often depends on UA/CH, locale, timezone, viewport, hardware class. You must be able to set and verify those.
- Policy: Tenant‑ and destination‑aware policies decide which browser agent profile to present.
- Risk management: Automated browsing touches security controls (WAFs, bot defenses, CAPTCHAs). You need a budgeted model for risky actions.
- Observability: You can’t improve what you can’t measure. “What browser am I?” is not trivia; it’s an SLI.
Opinion: Context-level isolation inside a single headless browser is not enough for multi‑tenant production. Use process‑level or VM‑level isolation per tenant (or even per session at high sensitivity). This drives the architecture below.
Design goals and non‑goals
Goals
- Strong tenant isolation without sacrificing density and cost control
- Policy‑driven browser identity with auditable decisions
- First‑class SLOs for agent identity fidelity and target‑site reachability
- Real‑time risk budgets and safe‑mode fallbacks
- Clean extensibility: adding a new agent profile, region, or policy is declarative
Non‑goals
- Circumventing site defenses. We focus on compliant automation: correctness, reliability, and risk limits, not evasion.
- Full‑stack antifingerprint research. We reference standards (W3C UA‑CH, Chromium UA Reduction) and operational best practices.
References for identity and hints
- RFC 9110 (HTTP Semantics) describes the legacy User‑Agent header
- Chromium’s User‑Agent Reduction initiative (shipped in stable Chrome): shifts detailed identity from the UA string to Client Hints
- W3C User‑Agent Client Hints: Sec‑CH‑UA, Sec‑CH‑UA‑Platform, Sec‑CH‑UA‑Mobile, Sec‑CH‑UA‑Full‑Version‑List, etc.
Architecture overview
High‑level components
- Control plane
- Scheduler and operator: creates BrowserSession pods in the right geo pool with the approved BrowserAgentClass
- Policy engine (OPA/Gatekeeper): validates tenant policy, agent selection, and risk budgets
- Quota/SLO subsystem: Prometheus, Alertmanager, error budget tooling
- "What is my browser agent" telemetry service: authoritative echo + validation
- Risk engine: real‑time counters and rules; enforces safe mode
- Data plane
- Browser pods: Playwright/Puppeteer/ChromeDriver under gVisor/Firecracker/Kata
- Proxy / egress tier: per‑geo IP pools; TLS pinning limits; DNS control
- Storage: ephemeral workspace + optional tenant‑scoped, encrypted object store
Simplified flow
- Tenant requests a BrowserSession with intent (target domain), policies, and SLO class.
- Policy picks a BrowserAgentClass (UA/CH profile, locale, timezone, viewport), geo pool, and security posture.
- Scheduler starts the session in the geo pool node group with the required sandbox.
- Browser boots, calls the telemetry service to verify presented identity; failure triggers switcher adjustments or session fail.
- Risk engine monitors actions (downloads, file uploads, login forms, unusual JS APIs) and burns against a per‑tenant budget.
- On budget breach or high burn rate, control plane enforces safe mode or blocks.
ASCII diagram
```
Tenant → Control API → Policy Engine → Scheduler/Operator → k8s
              ↓              ↓                ↓               ↓
         Risk Engine   Agent Switcher     Geo/Pool      Browser Pods
              ↓              ↓                ↓               ↓
       Telemetry + SLOs ← "What is my browser agent" ← Egress/Proxy
```
Kubernetes scheduling and pools
Key k8s primitives and settings
- Namespaces per tenant; ResourceQuota + LimitRange for CPU/RAM/GPU
- Node pools (node groups) per geo/locale, labeled for selection
- Taints/tolerations to keep certain tenants in hardened pools (microVM‑backed)
- PriorityClass for preemption and reserved capacity for SLO‑critical workloads
- PodTopologySpread to avoid noisy neighbor and zone‑level risk concentration
- RuntimeClass: gVisor or Kata (Firecracker) per security tier
- PodSecurity level: restricted; seccomp, AppArmor, dropped capabilities
Pool topology
- pool-us-east, pool-eu-central, pool-ap-sg with node labels like region=us-east, geo=us, locale=en-US
- Hardened pool for high‑risk tenants or targets: runtimeClassName: kata, egress via isolated IP range
Example: Node labels and selection in a BrowserSession
```yaml
apiVersion: agent.example.com/v1
kind: BrowserSession
metadata:
  name: session-123
  namespace: tenant-a
spec:
  targetDomain: "docs.example.org"
  agentClass: "chrome-stable-desktop-122"
  geoPool: "us-east"
  runtimeClassName: kata
  nodeSelector:
    agent.geo/region: us-east
    agent.geo/locale: en-US
  tolerations:
    - key: "agent.pool/hardened"
      operator: "Exists"
  resources:
    requests:
      cpu: "500m"
      memory: "1Gi"
    limits:
      cpu: "2"
      memory: "4Gi"
```
Per‑tenant sandboxes and isolation
Isolation ladder (choose per risk profile):
- Process isolation: one pod per tenant with multiple browser contexts (lowest isolation)
- PID/user‑ns + gVisor: syscall interception, good density and strong kernel isolation
- MicroVM (Kata/Firecracker): VM boundary per pod; highest isolation
Hardening checklist
- Non‑root containers; immutable root FS; drop all capabilities
- Seccomp: default runtime or custom profile; AppArmor enforcement
- Read‑only /home; tmpfs workdir; no hostPath mounts
- Egress via eBPF/Cilium NetworkPolicies; deny‑all default
- No secret env vars; use projected service account tokens; secret mounts with fsGroup
Example PodSecurityContext and container securityContext
```yaml
securityContext:
  runAsNonRoot: true
  runAsUser: 10001
  runAsGroup: 10001
  fsGroup: 10001
  seccompProfile:
    type: RuntimeDefault
---
containers:
  - name: browser
    image: ghcr.io/example/agent-browser:122
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
```
Why not multiplex tenants in one browser process? Side‑channels and defects in browser isolation are real. Given the operational cost of an incident, per‑tenant pods with gVisor or Kata are a defensible default.
Quotas, SLOs, and error budgets
Define per‑tenant quotas
- Concurrency: max sessions per tenant
- Runtime: max minutes per day
- Egress: max outbound bytes per day and per domain
- Feature flags: downloads, file dialogs, WebUSB/WebSerial disabled by default
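These categories reduce to a small admission-time check. A minimal sketch in Python — all field and quota names here are illustrative, not part of any real API:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TenantQuota:
    max_concurrent_sessions: int
    max_runtime_minutes_per_day: int
    max_egress_bytes_per_day: int
    downloads_allowed: bool = False  # risky features off by default

@dataclass
class TenantUsage:
    concurrent_sessions: int
    runtime_minutes_today: int
    egress_bytes_today: int

def admit_session(quota: TenantQuota, usage: TenantUsage) -> Tuple[bool, str]:
    """Return (admitted, reason) for a new session request."""
    if usage.concurrent_sessions >= quota.max_concurrent_sessions:
        return False, "concurrency_quota_exceeded"
    if usage.runtime_minutes_today >= quota.max_runtime_minutes_per_day:
        return False, "runtime_quota_exceeded"
    if usage.egress_bytes_today >= quota.max_egress_bytes_per_day:
        return False, "egress_quota_exceeded"
    return True, "ok"
```

In practice the usage counters would come from the metrics pipeline described below; the point is that the decision itself stays a pure, auditable function.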
SLIs for agentic browsers (examples)
- Identity fidelity: percentage of sessions where the echo service confirms expected UA/CH/locale/timezone within 30s
- Reachability: percentage of sessions reaching the target domain homepage within 60s without 4xx/5xx
- Stability: session completion without crash/timeout
- Safety: percentage of sessions with no high‑risk API usage
SLOs and error budgets
- Identity SLO: 99.5% 30‑second confirmation
- Reachability SLO: 99.0% 1‑minute success
- Stability SLO: 99.2%
- Safety SLO: 99.9% no high‑risk events
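The error budget falls out of these targets arithmetically. A sketch of the bookkeeping, assuming "good" and "total" session counts from the metrics above:

```python
def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget left over a window.

    slo_target: e.g. 0.995 for the identity SLO
    good/total: confirmed vs. started sessions in the window
    Returns 1.0 when nothing is burned; 0.0 or negative when exhausted.
    """
    if total == 0:
        return 1.0
    allowed_failures = (1.0 - slo_target) * total
    actual_failures = total - good
    if allowed_failures == 0:
        return 0.0 if actual_failures else 1.0
    return 1.0 - actual_failures / allowed_failures

# Example: 99.5% SLO over 10,000 sessions allows 50 failures;
# 25 failures means half the budget is burned.
```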
Prometheus recording + alerting (burn rate example)
```yaml
# SLI: identity_confirmed
- record: sli:identity_confirmed:ratio
  expr: |
    sum(rate(agent_identity_confirmed_total[5m]))
    /
    sum(rate(agent_sessions_started_total[5m]))
---
# SLO: 99.5%. Multi-window burn rate alerts (fast/slow)
- alert: IdentitySLOFastBurn
  expr: (1 - sli:identity_confirmed:ratio) > (1 - 0.995) * 14
  for: 5m
  labels: {severity: critical}
  annotations:
    summary: Fast burn for identity SLO
- alert: IdentitySLOSlowBurn
  expr: (1 - sli:identity_confirmed:ratio) > (1 - 0.995) * 6
  for: 1h
  labels: {severity: warning}
  annotations:
    summary: Slow burn for identity SLO
```
Policy‑driven Browser Agent Switcher
The switcher picks the presented identity at runtime:
- UA string: legacy header used by many sites
- UA‑CH: Sec‑CH‑UA, Sec‑CH‑UA‑Platform, Sec‑CH‑UA‑Mobile, Sec‑CH‑UA‑Arch, etc. (subject to Accept‑CH)
- Accept‑Language, timezone, geolocation, viewport, device pixel ratio
- Cookie mode, storage partitioning, third‑party cookie policy
Policy inputs
- Tenant policy and risk tier
- Target domain constraints (compatibility matrix, e.g., some sites break on reduced UA)
- Geo/locale requirement
- SLO history (e.g., site X has 2% mismatch; choose hardened profile)
- Risk budget remaining (if low, downgrade to safe profile)
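These inputs compose into a small, auditable decision function. A sketch, assuming a TenantPolicy-like dict shaped as in the examples in this section:

```python
from fnmatch import fnmatch

def choose_agent_class(policy: dict, target_domain: str,
                       budget_remaining_pct: float) -> str:
    """Pick an agent class: risk downgrade first, then per-target
    override, then the tenant default."""
    downgrade = policy.get("riskDowngrade", {})
    if budget_remaining_pct < downgrade.get("onBudgetBelowPct", 0.0):
        return downgrade["safeAgentClass"]
    for override in policy.get("targetOverrides", []):
        if any(fnmatch(target_domain, pat) for pat in override["domains"]):
            return override["agentClass"]
    return policy["defaultAgentClass"]
```

Keeping the precedence explicit in code makes the switcher's choices easy to log and replay during an audit.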
Representing agent classes
```yaml
apiVersion: agent.example.com/v1
kind: BrowserAgentClass
metadata:
  name: chrome-stable-desktop-122
spec:
  engine: chromium
  version: "122.0"
  userAgent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"
  uaClientHints:
    brands:
      - brand: "Chromium"
        version: "122"
      - brand: "Not(A:Brand)"
        version: "24"
    fullVersion: "122.0.6261.0"
    platform: "Windows"
    platformVersion: "15.0"
    architecture: "x86"
    model: ""
    mobile: false
  viewport:
    width: 1366
    height: 768
    deviceScaleFactor: 1
  timezoneId: "America/New_York"
  locale: "en-US"
  thirdPartyCookies: "block"
  features:
    downloads: false
    webusb: false
```
OPA policy to allow only declared agent classes per tenant
```rego
package agent.policy

# input: { tenant, requestedClass, targetDomain }
allowed_classes := {"chrome-stable-desktop-122", "chrome-stable-mac-122"}

violation[msg] {
    not allowed_classes[input.requestedClass]
    msg := sprintf("Agent class %v not allowed for tenant %v", [input.requestedClass, input.tenant])
}
```
Playwright example: set UA, UA‑CH, locale, timezone, viewport
```ts
import { chromium } from 'playwright';

(async () => {
  const context = await chromium.launchPersistentContext('', {
    headless: true,
    userAgent:
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
    locale: 'en-US',
    timezoneId: 'America/New_York',
    viewport: { width: 1366, height: 768 },
    permissions: [],
    extraHTTPHeaders: {
      // Some environments support UA‑CH override; Playwright passes through headers
      'Sec-CH-UA': '"Chromium";v="122", "Not(A:Brand)";v="24"',
      'Sec-CH-UA-Platform': '"Windows"',
      'Sec-CH-UA-Platform-Version': '"15.0"',
      'Sec-CH-UA-Mobile': '?0',
      'Accept-Language': 'en-US,en;q=0.9'
    }
  });
  const page = await context.newPage();
  await page.goto('https://what-is-my-browser-agent.example.com/echo');
  console.log(await page.textContent('pre#headers'));
  await context.close();
})();
```
Chromium CDP allows deeper UA‑CH control. Puppeteer example:
```js
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();
  const client = await page.target().createCDPSession();
  await client.send('Network.setUserAgentOverride', {
    userAgent:
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36',
    acceptLanguage: 'en-US',
    platform: 'Windows',
    userAgentMetadata: {
      brands: [
        { brand: 'Chromium', version: '122' },
        { brand: 'Not(A:Brand)', version: '24' }
      ],
      fullVersion: '122.0.6261.0',
      platform: 'Windows',
      platformVersion: '15.0',
      architecture: 'x86',
      model: '',
      mobile: false
    }
  });
  await page.goto('https://what-is-my-browser-agent.example.com/echo');
  await browser.close();
})();
```
Note: UA‑CH headers are formally sent in response to server Accept‑CH opt‑in and user privacy preferences; CDP overrides simulate metadata for automation and testing.
Decision logic sketch (switcher)
```yaml
apiVersion: agent.example.com/v1
kind: TenantPolicy
metadata:
  name: tenant-a
spec:
  defaultAgentClass: chrome-stable-desktop-122
  targetOverrides:
    - domains: ["*.bank.example"]
      agentClass: chrome-stable-desktop-122
      thirdPartyCookies: block
      hardened: true
    - domains: ["*.legacy.example"]
      agentClass: chrome-legacy-desktop-114
  riskDowngrade:
    onBudgetBelowPct: 0.2
    safeAgentClass: chrome-stable-desktop-122-safe
```
"What is my browser agent" telemetry and SLOs
Purpose
- Authoritative echo of what the outside world sees: UA, UA‑CH, Accept‑Language, IP/ASN, TLS properties, timezone via JS, screen/viewport
- Compare against the selected BrowserAgentClass and geo/locale pool
- Emit metrics and traces for SLOs and debugging
Echo service design
- Stateless HTTP service; surfaces request headers, IP, resolved geo, and a JS payload computing timezone, devicePixelRatio, screen size
- Returns a signed JSON blob so the browser agent can include it in its trace
Example echo response (truncated)
```json
{
  "ip": "203.0.113.10",
  "asn": 64496,
  "geo": {"country": "US", "region": "VA", "city": "Ashburn"},
  "headers": {
    "user-agent": "Mozilla/5.0 ... Chrome/122.0.0.0 Safari/537.36",
    "sec-ch-ua": "\"Chromium\";v=\"122\", \"Not(A:Brand)\";v=\"24\"",
    "sec-ch-ua-platform": "\"Windows\"",
    "accept-language": "en-US,en;q=0.9"
  },
  "js": {"timezone": "America/New_York", "dpr": 1, "screen": "1366x768"},
  "ts": 1735689600
}
```
Validation logic
- Compare UA/CH fields with agent class
- Check geo pool match (country/region) and ASN allowlist
- Check timezone and locale
- Produce a pass/fail and reason codes
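The comparison itself can be a pure function over the echo blob and the selected agent class. A sketch — field names follow the examples in this section but are otherwise illustrative:

```python
from typing import List, Tuple

def validate_identity(echo: dict, expected: dict) -> Tuple[bool, List[str]]:
    """Compare an echo-service response against the selected agent class.

    Returns (passed, reason_codes); an empty reason list means a clean pass.
    """
    reasons = []
    headers = echo.get("headers", {})
    if headers.get("user-agent") != expected["userAgent"]:
        reasons.append("ua_mismatch")
    # Sec-CH-UA-Platform is a quoted sf-string, e.g. '"Windows"'
    if expected["platform"] not in headers.get("sec-ch-ua-platform", ""):
        reasons.append("ch_platform_mismatch")
    if echo.get("geo", {}).get("country") != expected["country"]:
        reasons.append("geo_mismatch")
    if echo.get("js", {}).get("timezone") != expected["timezoneId"]:
        reasons.append("timezone_mismatch")
    return (not reasons), reasons
```

Reason codes feed directly into the failure counters below, so dashboards can break mismatches down by cause.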
OpenTelemetry metrics/traces
```python
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource

resource = Resource.create({"service.name": "agent-echo"})
meter = MeterProvider(resource=resource).get_meter("agent-echo")

confirmed = meter.create_counter("agent_identity_confirmed_total")
failed = meter.create_counter("agent_identity_failed_total")

# When validating a session (valid, tenant, etc. come from the validation step)
if valid:
    confirmed.add(1, {"tenant": tenant, "agentClass": agent_class, "geo": geo})
else:
    failed.add(1, {"tenant": tenant, "reason": reason})
```
SLO check: Identity must be confirmed within 30 seconds of session start; otherwise the switcher retries once with an alternate agent class or fails fast for deterministic behavior.
Geo/locale pools
Reasons
- Sites localize content and access based on IP and Accept‑Language
- Latency matters for anti‑automation timing thresholds
- Legal compliance: data residency and export control
Kubernetes placement
```yaml
apiVersion: agent.example.com/v1
kind: GeoPool
metadata:
  name: us-east
spec:
  nodeSelector:
    agent.geo/region: us-east
  egressCIDRs: ["198.51.100.0/24"]
  locales: ["en-US"]
```
Node affinity in sessions
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: agent.geo/region
              operator: In
              values: ["us-east"]
```
Browser locale and timezone in Playwright
```ts
const context = await chromium.launchPersistentContext('', {
  locale: 'en-US',
  timezoneId: 'America/New_York',
});
```
Optional: geolocation mock (when permitted) for sites that rely on the Geolocation API
```ts
await context.grantPermissions(['geolocation']);
await context.setGeolocation({ latitude: 40.7128, longitude: -74.0060 });
```
Real‑time browser agent security risk budgets
Define budget categories
- Target risk: domains with stricter ToS or sensitive surfaces
- Action risk: file downloads/uploads, authentication flows, payment flows
- API risk: WebUSB/WebSerial, getUserMedia, clipboard write, cross‑origin iframes with storage access
- Anomaly risk: high request rate, repeated form submissions, WAF blocks, captcha triggers
Model
- Each event carries a risk weight (w)
- Per tenant budget B over rolling window T (e.g., 24h); track burn R = sum(w)
- Enforcement thresholds: warn at 50%; degrade at 80%; block at 100%
Go snippet for a budget service
```go
import (
	"context"
	"fmt"
	"strconv"
	"strings"
	"time"

	"github.com/redis/go-redis/v9"
)

type Budget struct {
	Window time.Duration
	Limit  float64
}

type Event struct {
	Tenant string
	Weight float64
	Ts     time.Time
}

// Events live in a Redis sorted set per tenant, scored by timestamp. The
// member encodes a unique nanosecond ID plus the weight so that identical
// weights don't collapse into a single set member.
func AddEvent(ctx context.Context, rdb *redis.Client, e Event) error {
	key := fmt.Sprintf("risk:%s", e.Tenant)
	member := fmt.Sprintf("%d:%f", e.Ts.UnixNano(), e.Weight)
	return rdb.ZAdd(ctx, key, redis.Z{Score: float64(e.Ts.Unix()), Member: member}).Err()
}

func GetBurn(ctx context.Context, rdb *redis.Client, tenant string, window time.Duration) (float64, error) {
	now := time.Now()
	key := fmt.Sprintf("risk:%s", tenant)
	// Trim events older than the rolling window.
	if err := rdb.ZRemRangeByScore(ctx, key, "-inf", fmt.Sprint(now.Add(-window).Unix())).Err(); err != nil {
		return 0, err
	}
	// Sum the weights encoded in the remaining members.
	members, err := rdb.ZRange(ctx, key, 0, -1).Result()
	if err != nil {
		return 0, err
	}
	burn := 0.0
	for _, m := range members {
		parts := strings.SplitN(m, ":", 2)
		if w, err := strconv.ParseFloat(parts[len(parts)-1], 64); err == nil {
			burn += w
		}
	}
	return burn, nil
}

func Enforce(burn, limit float64) string {
	switch pct := burn / limit; {
	case pct >= 1.0:
		return "block"
	case pct >= 0.8:
		return "degrade"
	case pct >= 0.5:
		return "warn"
	default:
		return "ok"
	}
}
```
Integration with the switcher
- ok: normal profile
- warn: raise telemetry level; keep profile
- degrade: switch to safe agent class (no downloads, stricter cookies, hardened pool)
- block: fail session creation for sensitive targets; require manual override
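Wiring those verdicts into session creation is then a small dispatch. A Python sketch — the spec fields and `-safe` naming convention are illustrative, mirroring the safe agent class example below:

```python
from typing import Optional

def apply_enforcement(action: str, session_spec: dict,
                      target_is_sensitive: bool) -> Optional[dict]:
    """Translate a risk verdict into a concrete session spec.

    Returns the (possibly modified) spec, or None when creation is blocked.
    """
    if action == "block" and target_is_sensitive:
        return None  # require a manual override out of band
    if action == "degrade":
        spec = dict(session_spec)
        spec["agentClass"] = spec["agentClass"] + "-safe"  # safe variant
        spec["hardened"] = True
        return spec
    if action == "warn":
        spec = dict(session_spec)
        spec["telemetryLevel"] = "verbose"  # raise telemetry, keep profile
        return spec
    return session_spec
```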
Prometheus counters for risk categories
```yaml
- record: tenant:risk_burn:sum_rate5m
  expr: sum by (tenant) (rate(agent_risk_weight_sum[5m]))
```
Burn‑rate alerts tied to budgets:
```yaml
- alert: RiskBudgetFastBurn
  expr: tenant:risk_burn:sum_rate5m > (tenant_risk_budget_limit * 0.2)
  for: 10m
  labels: {severity: critical}
  annotations:
    summary: Tenant is burning risk budget too fast
```
Safe agent class example
```yaml
apiVersion: agent.example.com/v1
kind: BrowserAgentClass
metadata:
  name: chrome-stable-desktop-122-safe
spec:
  userAgent: "Mozilla/5.0 ... Chrome/122.0.0.0 Safari/537.36"
  uaClientHints:
    mobile: false
  features:
    downloads: false
    webusb: false
    clipboardWrite: false
  thirdPartyCookies: "block"
  runtimeClassName: kata
```
Observability and control
Metrics
- agent_sessions_started_total{tenant,agentClass,geo}
- agent_identity_confirmed_total{tenant,agentClass}
- agent_target_reach_success_total{tenant,domain}
- agent_risk_weight_sum{tenant,category}
- browser_crash_total{tenant,agentClass}
Traces
- One trace per session: start → switcher decision → echo check → target navigation → actions → teardown
Logs
- Structured JSON; redact PII; correlate with trace IDs (W3C TraceContext)
Dashboards
- Identity SLO per tenant and per agent class
- Reachability by domain with quick regression detection
- Risk burn and enforcement events over time
Security posture
Network and egress
- Default‑deny NetworkPolicies; explicit allow for DNS and proxy
- Egress IP ranges per geo pool; static ASN allowlist if required
- TLS: disable MITM for session content unless explicit DLP mode is enabled with tenant consent
Container supply chain
- Minimal base images; pinned, verified digests; SBOMs
- Image signing (Sigstore/cosign) and admission checks
- Regular patching; CI scanning for CVEs
Secrets and identity
- SPIFFE IDs per pod; mTLS between pods and control plane
- Kubernetes projected service account tokens with short TTL
- Per‑tenant KMS keys; AES‑GCM for at‑rest storage
Browser safety defaults
- Disable dangerous APIs unless explicitly allowed
- No persistent cookies across sessions unless tenant‑scoped and encrypted
- Clear storage and cache on teardown
Operational playbooks
Rollouts and canaries
- Introduce new BrowserAgentClass as canary; route 5% traffic for low‑risk tenants
- Monitor identity SLO and WAF block rate deltas; progressive rollout with auto‑rollback
Chaos tests
- Randomly kill browser pods; ensure rescheduling and idempotent workflows
- Inject UA‑CH mismatch to verify echo SLO detection and automatic switcher correction
Incident response
- Sudden increase in WAF blocks for domain X: freeze high‑risk actions, switch to safe agent class, open investigation with captured traces
- Identity SLO burn: pause new sessions; roll back to last good agent class
Cost controls
- Pre‑warm small pools; scale to zero idle nodes per region off‑peak
- Use VPA or autoscaling per agent class density
Example custom resources and operator sketch
CRD definitions (abbreviated)
```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: browsersessions.agent.example.com
spec:
  group: agent.example.com
  names:
    kind: BrowserSession
    plural: browsersessions
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                targetDomain: { type: string }
                agentClass: { type: string }
                geoPool: { type: string }
                runtimeClassName: { type: string }
                resources: { type: object }
```
Operator responsibilities
- Validate against TenantPolicy with an admission webhook (or Gatekeeper)
- Resolve geoPool → nodeSelector and egress
- Resolve agentClass → container env/flags/launch‑args
- Inject sidecar for telemetry (optional)
- Emit Kubernetes Events for session lifecycle
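The resolution steps above reduce to lookups over the declared custom resources. A reconcile sketch in Python — resource shapes and env var names are illustrative:

```python
def resolve_session(session: dict, geo_pools: dict, agent_classes: dict) -> dict:
    """Expand a BrowserSession spec into concrete pod settings."""
    pool = geo_pools[session["geoPool"]]          # GeoPool CR by name
    agent = agent_classes[session["agentClass"]]  # BrowserAgentClass CR by name
    return {
        "nodeSelector": pool["nodeSelector"],
        "runtimeClassName": session.get("runtimeClassName", "gvisor"),
        "env": {
            "AGENT_UA": agent["userAgent"],
            "AGENT_LOCALE": agent["locale"],
            "AGENT_TZ": agent["timezoneId"],
        },
    }
```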
Practical nuances of UA and Client Hints
- Chromium UA Reduction: many sites rely on UA‑CH; test if target expects Accept‑CH and full hints; the switcher can do a priming request
- Some CDNs vary responses on high‑entropy hints such as Sec‑CH‑UA‑Full‑Version‑List; ensure consistency with userAgentMetadata
- Accept‑Language drives server content; align it with locale
- Timezone and geo: for some sites, mismatch between IP geo and JS timezone is a risk signal; your echo service should catch it
- Viewport and DPR: use sensible defaults (1366×768, DPR 1 or 2) and stay consistent per agent class
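A priming request is simple in outline: fetch the target once, read its Accept‑CH response header, then send only the requested hints on the real visit. A parser sketch — the always‑sent set reflects the low‑entropy hints browsers emit regardless of opt‑in:

```python
def hints_to_send(accept_ch_header: str, available: dict) -> dict:
    """Given an Accept-CH header value and the hints our agent class can
    provide, return the hint headers for the follow-up request."""
    # Low-entropy hints are sent by real browsers without server opt-in.
    always = {"Sec-CH-UA", "Sec-CH-UA-Mobile", "Sec-CH-UA-Platform"}
    requested = {h.strip() for h in accept_ch_header.split(",") if h.strip()}
    return {name: value for name, value in available.items()
            if name in always or name in requested}
```

Keeping the sent-hint set minimal and consistent per agent class avoids the cross-header contradictions the echo service is designed to catch.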
Putting it together: end‑to‑end flow
- Tenant A requests 100 sessions to crawl docs.example.org in US‑East, English.
- Operator selects chrome‑stable‑desktop‑122, geo pool us‑east, hardened=false.
- Pods launch in us‑east nodes, gVisor runtime, Playwright context configured.
- Each session hits the echo service; 99 confirm in < 30s; one mismatch triggers switcher fallback and succeeds.
- Risk engine notes two download attempts; budget at 5%—no action.
- SLOs stay green; rollout continues; traces captured for audit.
Future work
- Dynamic agent generation with differential privacy for large fleets while staying standards‑compliant
- WASM sandboxed extensions for specialized site automation, verified by policy
- Hardware‑backed attestation of sandbox runtime (e.g., confidential computing) for highly regulated tenants
Conclusion
Agentic browsers are first‑class distributed systems with identity, policy, and security as core runtime concerns. A multi‑tenant control plane on Kubernetes with a policy‑driven browser agent switcher, a "what is my browser agent" SLO, geo/locale pools, and real‑time risk budgets provides the operational substrate you need. The patterns above emphasize verifiable identity, explicit policy, controlled risk, and strong isolation. With these in place, your AI browser fleet can scale reliably—and predictably—across tenants, regions, and use cases.