DOM‑Vision Fusion for Agentic Browsers: Making Auto‑Agent AI Work on Shadow DOM, Virtualized Lists, and Canvas/WebGL UIs
Everyone wants their agent to use the web like a real person: click the right button, scroll the right list, fill the right form, and confirm completion with high confidence. That looks easy on standard HTML, but the modern web is not standard HTML. You need to handle:
- Shadow DOM (including closed roots) in design systems and web components
- Virtualized lists and lazy loads in React/Angular/Vue apps
- Canvas and WebGL UIs (dashboards, terminals, whiteboards, editors)
- Hybrid accessibility overlays with partial ARIA coverage
The point of this guide is to give you a buildable plan. We will fuse three observation planes — DOM, Accessibility (AX), and Raster — and drive the page with layout probes via the DevTools protocol. With this, your agent learns to see and act on everything the user sees, including things that do not exist in the DOM. We will also cover pragmatic network headers and user agent hints that often unlock SSR or “simplified” fallbacks.
My view: DOM‑only automation is a dead‑end on modern apps. The right foundation is DOM‑Vision fusion with AX and GPU OCR, synchronized by paint order and bounding boxes. If you get those primitives right, the rest — robust action planning — becomes straightforward.
Design goals and success metrics
- Universal observability: Every user‑perceived affordance appears in at least one plane (DOM/AX/raster) with coordinates and role.
- Cross‑plane alignment: You can match DOM/AX nodes to pixels with stable bounding boxes, z‑order, and paint order.
- Reliable action mapping: If the agent chooses “click the button labeled Submit,” it finds a target; if DOM/AX miss it, OCR finds it, and vice versa.
- Virtualization invariance: Lists and grids that recycle DOM nodes yield a stable logical list to the agent via iterative expansion.
- Canvas/WebGL compatibility: You either intercept text draw calls or read pixels with OCR and map them to click positions.
- Ethical fallback: Prefer standards‑compliant simplifications first (Save‑Data, mobile UA, reduced motion). Avoid spoofing restricted UAs.
Measurable metrics:
- Target acquisition rate: percent of intended targets that become actionable nodes
- Action success rate: percent of actions that complete the intended page transition
- Latency budget: seconds per high‑level step (aim for <1–2 s with GPU OCR)
- Robustness: zero stale clicks across Shadow DOM/virtualized/canvas test suites
Architecture: DOM‑Vision fusion
Think in three planes and one control loop:
- DOM plane: CDP DOMSnapshot, CSS, Layout, input dispatch
- AX plane: full accessibility tree for semantic roles and names
- Raster plane: GPU screenshots + OCR + icon classifiers
- Control loop: stability checks, paint order, and z‑index gating; action emitter with retries
The canonical fused object is a scene graph of interactable and non‑interactable regions with attributes: role, name/label, bounding rect, z‑order, stable selector, backing plane(s), and confidence scores.
Key DevTools domains to use:
- DOMSnapshot: captureSnapshot (with includeDOMRects, includePaintOrder, text boxes)
- Accessibility: getFullAXTree
- CSS: getComputedStyleForNode
- Page: captureScreenshot
- Input: dispatchMouseEvent, dispatchKeyEvent
- Runtime: evaluate (script injection)
- Network & Emulation: to set UA, client hints, headers
Probing the DOM reliably
Do not walk the live DOM piecemeal; it changes under your feet. Use snapshot APIs to get an atomic view you can align to pixels.
Example with Playwright and a CDP session to pull DOM and AX snapshots, including shadow roots and bounding boxes:
```ts
import { chromium, type Page } from 'playwright';

async function capturePageSnapshot(page: Page) {
  const client = await page.context().newCDPSession(page);

  // DOMSnapshot with rects and paint order
  const { documents, strings } = await client.send('DOMSnapshot.captureSnapshot', {
    computedStyles: ['display', 'visibility', 'opacity', 'z-index', 'pointer-events'],
    includeDOMRects: true,
    includePaintOrder: true,
    // Some Chromium versions support this:
    // includeUserAgentShadowTree: true
  });

  // Full AX tree binds to backend DOM node ids
  const axTree = await client.send('Accessibility.getFullAXTree');

  // Screenshot for OCR alignment (captureScreenshot returns { data: base64 })
  const { data: screenshotBase64 } = await client.send('Page.captureScreenshot', { format: 'png' });

  return { documents, strings, axTree, screenshotBase64 };
}
```
Details that matter:
- Prefer DOMSnapshot.captureSnapshot over walking DOM.getDocument; it includes layout rects and paint order, which align with pixels and z‑index stacking contexts.
- Collect CSS computed styles for visibility, opacity, pointer‑events — these determine if an element is actually clickable.
- Accessibility.getFullAXTree returns nodes keyed to backendDOMNodeId that you can join back to DOMSnapshot nodes. AX nodes carry role, name, focusability, and hidden flags.
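As a rough sketch of that join, assuming you have already flattened the DOMSnapshot into per-node records keyed by backendNodeId (the field names here are illustrative, not the raw protocol shapes):
```ts
// Minimal sketch of the DOM/AX join. Assumes the snapshot has been parsed into
// per-node records keyed by backendNodeId; field names are illustrative.
type DOMRectLike = { x: number; y: number; w: number; h: number };
type DomNode = { backendNodeId: number; bounds?: DOMRectLike; paintIndex?: number };
type AxNode = { backendDOMNodeId?: number; role?: { value: string }; name?: { value: string } };

function joinDomAndAx(domByBackendId: Map<number, DomNode>, axNodes: AxNode[]) {
  const fused = new Map<number, DomNode & { role?: string; name?: string }>();
  for (const [id, node] of domByBackendId) fused.set(id, { ...node });
  for (const ax of axNodes) {
    if (ax.backendDOMNodeId == null) continue;
    const target = fused.get(ax.backendDOMNodeId);
    if (!target) continue;            // AX node without a layout entry (e.g. ignored subtree)
    target.role = ax.role?.value;     // carry semantic role
    target.name = ax.name?.value;     // carry accessible name
  }
  return fused;
}
```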
Handling shadow DOM
Open shadow roots are just nodes with a shadowRoot; closed roots are not directly enumerable. You have three strategies:
- Snapshot path: DOMSnapshot tends to include elements rendered via shadow roots in the layout tree and text boxes, even if you cannot reach the closed shadow root DOM nodes directly. That is enough for alignment and clicking via coordinates.
- Piercing traversal for open shadows: most automation frameworks can pierce open shadow roots. Playwright's CSS and text selectors pierce them by default; Puppeteer offers the pierce/ query handler.
- Pre‑inject attachShadow hooks to capture closed roots: add a script before any page JS runs that wraps Element.prototype.attachShadow to record shadow roots regardless of mode.
Example hook via CDP script injection:
```ts
await client.send('Page.addScriptToEvaluateOnNewDocument', {
  source: `
    (function () {
      const _attach = Element.prototype.attachShadow;
      const registry = new WeakMap();
      Object.defineProperty(Element.prototype, 'attachShadow', {
        value: function (init) {
          const root = _attach.call(this, init);
          try { registry.set(this, root); } catch {}
          return root;
        }
      });
      Object.defineProperty(window, '__shadowRegistry', { value: registry, configurable: false });
    })();
  `
});
```
You can then expose functions to traverse registry entries when needed. Even if you cannot inspect closed roots fully, you can often interact via bounding boxes derived from layout snapshots.
Iframes, portals, and multiple targets
Use Target.setDiscoverTargets and auto‑attach to child targets so you can snapshot and OCR each browsing context (top frame, iframes, fenced frames where permitted). Draw an overlay graph that includes frame bounds so clicks map into the correct coordinate space.
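A hedged sketch of the attach flow with raw CDP calls on the page's session (frame-to-page coordinate translation is only indicated in comments):
```ts
// Sketch: discover and auto-attach to child targets (OOPIFs, workers) so each
// browsing context can be snapshotted and OCR'd in its own coordinate space.
await client.send('Target.setDiscoverTargets', { discover: true });
await client.send('Target.setAutoAttach', {
  autoAttach: true,
  waitForDebuggerOnStart: false,
  flatten: true, // child sessions share this connection, routed by sessionId
});

const frameSessions = new Map<string, string>(); // sessionId -> targetId (illustrative bookkeeping)
client.on('Target.attachedToTarget', ({ sessionId, targetInfo }) => {
  if (targetInfo.type === 'iframe' || targetInfo.type === 'page') {
    frameSessions.set(sessionId, targetInfo.targetId);
    // Later: snapshot each session, then use Page.getFrameTree plus DOM.getBoxModel
    // on the frame-owner element to translate child-frame coordinates to the top frame.
  }
});
```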
Building the fused scene graph
Represent every visible entity as a node:
- id: stable id you control
- planes: set of { dom, ax, raster }
- bounds: client rect in CSS pixels
- paintIndex: integer from DOMSnapshot for z‑order disambiguation
- role: semantic role (AX‑derived or vision‑derived)
- name: accessible name or OCR text
- interactable: boolean + reason
- selector: CSS/XPath/AX path when available
- actions: click, type, select, drag, hover
- confidence: 0..1 per attribute and overall
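One possible TypeScript shape for these nodes (illustrative, not a fixed schema):
```ts
// Illustrative shape for a fused scene-graph node; adapt field names to your pipeline.
type Plane = 'dom' | 'ax' | 'raster';
type Rect = { x: number; y: number; w: number; h: number };

interface FusedNode {
  id: string;                        // stable id you assign
  planes: Set<Plane>;                // which observation planes back this node
  bounds: Rect;                      // client rect in CSS pixels
  paintIndex?: number;               // from DOMSnapshot, for z-order disambiguation
  role?: string;                     // AX- or vision-derived role
  name?: string;                     // accessible name or OCR text
  interactable: { value: boolean; reason: string };
  selector?: string;                 // CSS/XPath/AX path when available
  actions: Array<'click' | 'type' | 'select' | 'drag' | 'hover'>;
  confidence: { overall: number; [attribute: string]: number }; // 0..1
}
```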
Alignment steps:
- From DOMSnapshot, create base nodes with bounds, paintIndex, visibility.
- Join AX by backendDOMNodeId; carry role and name.
- Run OCR over screenshot to get text boxes; merge with DOM/AX by IoU and paint order.
- Resolve conflicts: if OCR finds text on a canvas but no AX match, keep a raster‑only node; if both exist, prefer AX name and DOM interactability.
- Filter non‑interactable nodes by visibility, pointer‑events, opacity, and occlusion.
Heuristics that work:
- IoU threshold: 0.5 is a good starting point to merge OCR text into a DOM/AX node.
- Stacking context: prefer nodes with higher paintIndex when multiple overlaps claim the same text.
- Interactability: clickable if role in { button, link, checkbox, radio, tab, menuitem } or tag in { button, a[href], input } with visible pointer events.
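A minimal merge sketch applying these heuristics, reusing the FusedNode shape from the scene-graph sketch above (the 0.5 threshold is a starting point, not a tuned value):
```ts
// Sketch: merge an OCR box into the best-matching DOM/AX node by IoU,
// preferring the node painted later (higher paintIndex) on near-ties.
type Box = { x: number; y: number; w: number; h: number };

function iou(a: Box, b: Box): number {
  const ix = Math.max(0, Math.min(a.x + a.w, b.x + b.w) - Math.max(a.x, b.x));
  const iy = Math.max(0, Math.min(a.y + a.h, b.y + b.h) - Math.max(a.y, b.y));
  const inter = ix * iy;
  const union = a.w * a.h + b.w * b.h - inter;
  return union > 0 ? inter / union : 0;
}

function mergeOcrBox(ocr: { box: Box; text: string }, nodes: FusedNode[], threshold = 0.5): FusedNode | null {
  let best: FusedNode | null = null;
  let bestScore = 0;
  for (const node of nodes) {
    const score = iou(ocr.box, node.bounds);
    if (score < threshold) continue;
    const tie = best && Math.abs(score - bestScore) < 1e-6;
    if (score > bestScore || (tie && (node.paintIndex ?? 0) > (best!.paintIndex ?? 0))) {
      best = node;
      bestScore = score;
    }
  }
  if (best) {
    best.planes.add('raster');
    best.name = best.name ?? ocr.text; // prefer the AX/DOM name when present
  }
  return best; // null means the caller should create a raster-only node
}
```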
Virtualized lists and lazy loads
Your agent must treat lists as logical sequences, not static DOM collections. Libraries such as react-window and react-virtualized recycle nodes; only a slice of the list is mounted at any time.
Detection signals:
- Container with scrollable overflow, yet few children relative to claimed rowcount (aria‑rowcount, aria‑setsize) or visible item indexes in data attributes.
- Scroller height vs. content height mismatch; items jump in and out when scrolling.
- IntersectionObserver callbacks doing content swaps.
Expansion strategy:
- Build a logical list cache keyed by a stable feature: text contents + structural hints + approximate index.
- Scroll deterministically in steps equal to viewport height or item height when detectable.
- After each scroll, wait for network idle and DOM stabilization; snapshot and harvest items into cache.
- De‑dup with fuzzy matching (token Jaccard or hash of normalized text) to avoid counting recycled nodes twice.
Example implementation sketch:
```ts
import type { CDPSession, Page } from 'playwright';

type ListItem = {
  key: string;
  bounds: { x: number; y: number; w: number; h: number };
  text: string;
  indexApprox?: number;
};

async function harvestVirtualList(page: Page, scrollerSelector: string, maxItems = 500) {
  const seen = new Map<string, ListItem>();
  const client = await page.context().newCDPSession(page);
  const scroller = page.locator(scrollerSelector);

  async function capture(): Promise<ListItem[]> {
    const snap = await client.send('DOMSnapshot.captureSnapshot', {
      computedStyles: ['display', 'visibility', 'opacity'],
      includeDOMRects: true,
      includePaintOrder: true
    });
    // Parse snap to extract items under the scroller bounds (left as an exercise)
    return [];
  }

  let stall = 0; // consecutive scroll steps that produced no new items
  while (seen.size < maxItems) {
    const items = await capture();
    for (const it of items) {
      const key = it.text.toLowerCase().replace(/\s+/g, ' ').slice(0, 200);
      if (!seen.has(key)) seen.set(key, it);
    }
    const before = seen.size;
    await scroller.evaluate((el, y) => el.scrollBy(0, y), page.viewportSize()?.height ?? 600);
    // Wait for quiet
    await page.waitForTimeout(200);
    await waitForNetworkIdle(client, 300);
    stall = seen.size === before ? stall + 1 : 0;
    if (stall > 5) break; // heuristically stop if nothing new after several steps
  }
  return [...seen.values()];
}

async function waitForNetworkIdle(client: CDPSession, idleMs: number) {
  let pending = 0;
  await client.send('Network.enable');
  client.on('Network.requestWillBeSent', () => pending++);
  client.on('Network.loadingFinished', () => pending--);
  client.on('Network.loadingFailed', () => pending--);
  return new Promise<void>((res) => {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const tick = () => {
      clearTimeout(timer);
      if (pending === 0) timer = setTimeout(() => res(), idleMs);
      else timer = setTimeout(tick, 50);
    };
    tick();
  });
}
```
Add a guard against infinite scroll that loads endless pages; cap total items and stop after N idle cycles without new content.
Lazy loads:
- Trigger visibility by scrolling just enough to reveal target regions; use IntersectionObserver in a small injected helper to notify when target selectors enter the viewport (see the sketch after this list).
- Simulate user input (wheel, pointermove) to wake ad hoc lazy logic.
- For images, set Save‑Data and Width/DPR client hints to encourage smaller assets and quicker load.
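A small injected helper along those lines might look like this (a sketch; __lazyHits and __watchLazy are arbitrary names, not a library API):
```ts
// Sketch: inject an IntersectionObserver helper that records when elements matching
// a selector enter the viewport; `__lazyHits` and `__watchLazy` are illustrative names.
await client.send('Page.addScriptToEvaluateOnNewDocument', {
  source: `
    (function () {
      window.__lazyHits = [];
      window.__watchLazy = (selector) => {
        const io = new IntersectionObserver((entries) => {
          for (const e of entries) {
            if (e.isIntersecting) window.__lazyHits.push({ selector, time: Date.now() });
          }
        }, { threshold: 0.1 });
        document.querySelectorAll(selector).forEach((el) => io.observe(el));
      };
    })();
  `
});
// Later, from the agent: page.evaluate(() => (window as any).__watchLazy('[data-lazy]'))
// and poll (window as any).__lazyHits before snapshotting.
```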
Canvas and WebGL UIs
Canvas UIs split into two categories: those that draw text with 2D APIs, and those that render glyph atlases or SDF in WebGL. Treat them differently.
Instrumenting 2D canvas text
Patch fillText and strokeText to collect text, coordinates, and font info. Do this before page scripts run.
```ts
await client.send('Page.addScriptToEvaluateOnNewDocument', {
  source: `
    (function () {
      const ctx2d = CanvasRenderingContext2D.prototype;
      function record(call, args, ctx) {
        try { window.__canvasText = window.__canvasText || []; } catch {}
        const entry = {
          call,
          text: String(args[0]),
          x: args[1],
          y: args[2],
          font: ctx.font,
          fillStyle: ctx.fillStyle
        };
        try { window.__canvasText.push(entry); } catch {}
      }
      const f = ctx2d.fillText;
      const s = ctx2d.strokeText;
      ctx2d.fillText = function (text, x, y, maxWidth) {
        record('fillText', arguments, this);
        return f.apply(this, arguments);
      };
      ctx2d.strokeText = function (text, x, y, maxWidth) {
        record('strokeText', arguments, this);
        return s.apply(this, arguments);
      };
    })();
  `
});
```
Periodically pull window.__canvasText and map entries to pixel coordinates via the canvas DOMRect. You then create raster‑backed nodes with role label and clickability inferred by hover or pointer event handlers attached to the canvas region.
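One way to harvest those entries (a sketch that assumes a single canvas, no ctx transforms, and the __canvasText array from the hook above):
```ts
// Sketch: map recorded 2D-canvas text entries to client coordinates via the canvas
// DOMRect. Assumes one canvas, no ctx.scale/translate, and no CSS transforms;
// __canvasText is the array populated by the fillText/strokeText hook above.
const canvasTextNodes = await page.evaluate(() => {
  const entries = (window as any).__canvasText ?? [];
  const canvas = document.querySelector('canvas');
  if (!canvas) return [];
  const rect = canvas.getBoundingClientRect();
  const scaleX = rect.width / canvas.width;   // backing-store pixels -> CSS pixels
  const scaleY = rect.height / canvas.height;
  return entries.map((e: any) => ({
    text: e.text,
    x: rect.x + e.x * scaleX,
    y: rect.y + e.y * scaleY,
  }));
});
```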
Limitations: it does not catch WebGL text or text drawn via images, and cross‑origin images can taint the canvas and block pixel readback; it is still valuable for terminal‑like apps and 2D dashboards.
Sampling WebGL
Generic WebGL interception is hard. Practically:
- Read pixels and run OCR on the canvas region. Use CDP Page.captureScreenshot to crop the canvas rect and feed the OCR pipeline.
- If the canvas is same‑origin and untainted, call gl.readPixels after frames render by wrapping requestAnimationFrame with a small WebGL helper; otherwise, fall back to raster screenshots.
- Create actionable click targets by watching pointer event handlers on the canvas element and sampling hover effects.
For many WebGL UIs, text is high‑contrast and OCR‑friendly. Treat recognized tokens as nodes with bounds and actions defined by pointer event deltas you learn from observing the app (hover color change, tooltip appearance).
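A hedged sketch of cropping just the canvas region for OCR using Page.captureScreenshot's clip parameter (device-scale handling is simplified):
```ts
import type { Page } from 'playwright';

// Sketch: screenshot only the canvas region and hand it to the OCR pipeline.
// The clip is given in CSS pixels; scale controls the output resolution.
async function captureCanvasRegion(page: Page, canvasSelector: string): Promise<Buffer | null> {
  const client = await page.context().newCDPSession(page);
  const box = await page.locator(canvasSelector).boundingBox();
  if (!box) return null;
  const { data } = await client.send('Page.captureScreenshot', {
    format: 'png',
    clip: { x: box.x, y: box.y, width: box.width, height: box.height, scale: 1 },
  });
  return Buffer.from(data, 'base64'); // feed to the OCR worker
}
```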
GPU OCR pipeline
You need fast, accurate OCR with multi‑language support. An effective stack:
- Text detection: DBNet or Differentiable Binarization family
- Text recognition: CRNN, SAR, or transformer‑based recognizers
- Framework: ONNX Runtime with CUDA (Linux/Windows) or DirectML (Windows) or OpenVINO (CPU) when GPU is absent
Key performance tricks:
- Tile large screenshots into overlapping patches to keep GPU memory reasonable.
- Batch patches; keep models warm in a worker process.
- Merge overlapping boxes with NMS; order text blocks by reading order (top‑to‑bottom, left‑to‑right, small hysteresis for columns).
Minimal Python example using ONNX Runtime for PP‑OCRv3‑like models:
```python
import onnxruntime as ort
import numpy as np
from PIL import Image


class Ocr:
    def __init__(self, det_path, rec_path,
                 providers=('CUDAExecutionProvider', 'CPUExecutionProvider')):
        self.det_sess = ort.InferenceSession(det_path, providers=list(providers))
        self.rec_sess = ort.InferenceSession(rec_path, providers=list(providers))

    def detect(self, img):
        # Preprocess to model input (normalize, resize); omitted for brevity
        inp = np.expand_dims(img, 0).astype(np.float32)
        out = self.det_sess.run(None, {'images': inp})
        # Postprocess detection map to boxes; omitted
        boxes = []  # list of (x1, y1, x2, y2)
        return boxes

    def recognize(self, img, boxes):
        texts = []
        for box in boxes:
            x1, y1, x2, y2 = map(int, box)
            crop = img[y1:y2, x1:x2]
            # Preprocess crop; omitted
            out = self.rec_sess.run(None, {'images': np.expand_dims(crop, 0)})
            text = self.decode(out)
            texts.append((box, text))
        return texts

    def decode(self, out):
        # Greedy/beam decode of recognizer output; omitted
        return 'text'


if __name__ == '__main__':
    img = np.array(Image.open('frame.png').convert('RGB'))
    ocr = Ocr('dbnet.onnx', 'crnn.onnx')
    boxes = ocr.detect(img)
    results = ocr.recognize(img, boxes)
    for box, text in results:
        print(box, text)
```
Integrate with the fused scene graph by converting OCR boxes to CSS pixel bounds and merging via IoU with DOM/AX nodes. Unmatched boxes become raster‑only nodes with role inferred from context (e.g., text near a rounded rectangle or icon).
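Screenshots usually come back at device resolution, so a small scaling step (a sketch, assuming no page zoom) brings OCR boxes into CSS pixels before the IoU merge:
```ts
// Sketch: convert OCR boxes from screenshot (device) pixels to CSS pixels.
// Assumes the screenshot was taken at the page's devicePixelRatio with no zoom.
function ocrBoxToCssPixels(box: { x: number; y: number; w: number; h: number }, dpr: number) {
  return { x: box.x / dpr, y: box.y / dpr, w: box.w / dpr, h: box.h / dpr };
}

// dpr can be read once per snapshot:
// const dpr = await page.evaluate(() => window.devicePixelRatio);
```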
Layout probes: visibility, occlusion, and input mapping
Visibility and z‑order determine whether clicks land. Pull enough style to decide:
- display not none, visibility not hidden, opacity > 0.01
- pointer‑events not none
- topmost at the click point by paint order
Use CSS.getComputedStyleForNode for exact values and DOMSnapshot paint order to resolve overlap. Probe occlusion by injecting a temporary overlay and using document.elementFromPoint.
Example click mapping with CDP input dispatch:
```ts
async function clickAt(page, x, y) {
  const client = await page.context().newCDPSession(page);
  await client.send('Input.dispatchMouseEvent', { type: 'mouseMoved', x, y, modifiers: 0, buttons: 0 });
  await client.send('Input.dispatchMouseEvent', { type: 'mousePressed', x, y, button: 'left', clickCount: 1 });
  await client.send('Input.dispatchMouseEvent', { type: 'mouseReleased', x, y, button: 'left', clickCount: 1 });
}

async function elementTopmostAt(page, x, y) {
  return await page.evaluate(({ x, y }) => {
    const el = document.elementFromPoint(x, y);
    if (!el) return null;
    const r = el.getBoundingClientRect();
    return {
      tag: el.tagName,
      id: el.id,
      className: el.className,
      rect: { x: r.x, y: r.y, w: r.width, h: r.height }
    };
  }, { x, y });
}
```
Before dispatching, your agent can assert that the intended node is topmost within epsilon; if not, scroll or adjust the click.
Paint order sanity check: highlight candidates using Overlay.highlightNode to debug why the wrong element takes the click.
Putting it together: the control loop
A minimal high‑level step to “click the Submit button”:
- Snapshot DOM/AX + screenshot.
- Build fused graph; find candidate nodes with role button and name matching Submit.
- If none, use OCR to find a text box reading Submit; look nearby for a rounded rectangle or other known button pattern, or treat the raster node itself as the target.
- For each candidate, verify interactability; compute a click point (center or saved hotspot).
- Dispatch click; wait for network idle + layout stabilization.
- Re‑snapshot; confirm state change: new URL, dialog closed, toast appeared, etc.
- If not confirmed, try next candidate or escalate with a different fallback (scroll into view, mobile UA, Save‑Data).
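A compressed sketch of that loop; buildFusedGraph, ocrFallback, and confirmTransition are placeholders for the pieces described above, and clickAt is the helper from the input-mapping example:
```ts
import type { Page } from 'playwright';

// Placeholders for the pieces described in earlier sections (assumed, not real APIs):
declare function buildFusedGraph(page: Page): Promise<{ nodes: FusedNode[] }>;
declare function ocrFallback(graph: { nodes: FusedNode[] }, label: string): Promise<FusedNode[]>;
declare function clickAt(page: Page, x: number, y: number): Promise<void>;
declare function confirmTransition(page: Page): Promise<boolean>;

async function clickByLabel(page: Page, label: string): Promise<boolean> {
  const graph = await buildFusedGraph(page);            // DOM + AX snapshot + screenshot
  let candidates = graph.nodes.filter(
    (n) => n.role === 'button' && n.name?.toLowerCase().includes(label.toLowerCase())
  );
  if (candidates.length === 0) {
    candidates = await ocrFallback(graph, label);        // raster-only nodes
  }
  for (const node of candidates) {
    if (!node.interactable.value) continue;
    const cx = node.bounds.x + node.bounds.w / 2;
    const cy = node.bounds.y + node.bounds.h / 2;
    await clickAt(page, cx, cy);
    await page.waitForLoadState('networkidle').catch(() => {});
    if (await confirmTransition(page)) return true;      // URL change, dialog, toast...
  }
  return false;                                          // escalate: scroll, mobile UA, Save-Data
}
```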
Triggering simpler fallbacks with UA and client hints
When the fused graph cannot surface reliable affordances quickly (heavy canvas, deep virtualization), prompt the server or client to deliver a simpler version. Prefer opt‑in standards first.
Safe, effective headers and overrides:
- Save‑Data: on — many sites serve lighter DOM and fewer lazy assets
- Reduced motion: prefers‑reduced‑motion via client hints or CSS media emulation
- Mobile UA with smaller viewport — responsive sites often collapse into simpler, more semantic markup
- Downlink and RTT hints — sometimes influence content selection
CDP examples:
```ts
await client.send('Emulation.setUserAgentOverride', {
  userAgent: 'Mozilla/5.0 (Linux; Android 12; Pixel 5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36',
  acceptLanguage: 'en-US,en;q=0.9',
  platform: 'Android'
});

await client.send('Network.enable');
await client.send('Network.setExtraHTTPHeaders', {
  headers: {
    'Save-Data': 'on',
    'Sec-CH-UA-Mobile': '?1',
    'Sec-CH-UA-Platform': '"Android"',
    // Width/DPR hints encourage responsive images and sometimes simpler layouts
    'DPR': '1.0',
    'Viewport-Width': '375'
  }
});

// Prefer reduced motion in rendering
await client.send('Emulation.setEmulatedMedia', {
  features: [{ name: 'prefers-reduced-motion', value: 'reduce' }]
});
```
Caveats and ethics:
- Do not spoof restricted bots (e.g., Googlebot) to bypass policies or paywalls.
- Respect robots.txt and site TOS; throttle your agent and avoid hammering endpoints.
- Many sites use UA sniffing poorly; test fallbacks and bail out if they break essential flows.
Other hints that sometimes help:
- Disable WebGL via command line flags if you control the browser instance; sites may fall back to DOM. Example Chromium flags: --disable-accelerated-2d-canvas, --disable-gpu (use judiciously).
- Force color scheme to light or high contrast to improve OCR accuracy.
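For example, via CDP media emulation (a sketch; verify which features your Chromium build supports):
```ts
// Sketch: force a light scheme and high-contrast rendering before OCR-heavy steps.
await client.send('Emulation.setEmulatedMedia', {
  features: [
    { name: 'prefers-color-scheme', value: 'light' },
    { name: 'prefers-contrast', value: 'more' },
    { name: 'forced-colors', value: 'active' },
  ],
});
```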
Example: making a Shadow DOM search box and a virtualized results grid usable
Walkthrough outline:
- Navigate and snapshot; DOM has a custom element my-search with a closed shadow root; AX shows a textbox role with name Search.
- Build fused node: role textbox, name Search, bounds from layout.
- Focus the field by clicking the center of its bounds, then type with Input.dispatchKeyEvent; verify the input by reading back the AX value.
- Results appear in a virtualized grid; aria‑rowcount is 10,000 with only ~30 children.
- Enter expansion mode: scroll the grid container, snapshot after each scroll, add unique rows to logical cache until at least 200 items.
- OCR over screenshot to read off‑DOM canvas‑rendered price badges; merge those into each row’s node as attributes.
- Choose target row by predicate (price < X), click the row; confirm navigation by URL change and page title update.
The sequence uses DOM, AX, and OCR together; each fills a gap the others leave.
Performance and stability tips
- Throttle snapshots: take them only when layout settles (debounce via requestAnimationFrame sampling or Performance timeline markers).
- Use paint order to resolve click targets deterministically; don’t rely on z-index textual values alone.
- Keep OCR tiles small (e.g., 1024×1024) and batch 4–8 per GPU; cache results while the viewport doesn’t change.
- Pre‑warm models at agent start and reuse sessions.
- Use overlay debugging liberally during development: draw boxes for DOM nodes, AX nodes, and OCR boxes with different colors to see merge alignment issues.
Overlay debug example:
```ts
await client.send('Overlay.enable');
await client.send('Overlay.setShowPaintRects', { result: true });

// Or highlight a specific nodeId
await client.send('Overlay.highlightNode', {
  highlightConfig: { showInfo: true, contentColor: { r: 111, g: 168, b: 220, a: 0.3 } },
  nodeId
});
```
Evaluation plan
Build a small benchmark suite with pages that deliberately exercise hard cases:
- Web components: open and closed shadow roots with inputs and buttons
- Virtualized list: 50k rows with infinite scroll and sticky headers
- Canvas dashboard: 2D gauges and labels; WebGL chart with SDF text
- Lazy assets: images and content behind IntersectionObserver
Metrics to collect:
- Time to find and click a labeled button in Shadow DOM
- Items harvested per second from virtualized list; duplication rate
- OCR precision/recall on canvas labels; end‑to‑end action success
- Degradation improvements when Save‑Data is on and mobile UA is used
Run with and without fusion to quantify the lift.
Security, ethics, and site respect
- Limit input rates and concurrency; treat your agent like a real user with human‑like pacing.
- Honor robots.txt where applicable; avoid scraping prohibited paths.
- Don’t exploit UA spoofing. Prefer standards‑based hints that communicate your intent (reduced data/motion) and responsive UAs.
- Protect user data; do not hoard screenshots beyond immediate OCR needs; blur PII in debug logs.
Conclusion
Agentic browsing on today’s web needs a fused approach. DOM and AX get you semantics and cheap selectors. OCR on raster fills in the gaps for shadowed, virtualized, and canvas‑heavy UIs. Layout probes — bounding boxes, paint order, and visibility — make the action space stable. Finally, pragmatic UA and client hints can simplify pages before you resort to heavy vision.
Build your agent around a scene graph that merges these planes, and you’ll see the reliability jump from brittle demos to production‑grade autonomy.
References and pointers
- Chrome DevTools Protocol: DOMSnapshot, Accessibility, Page, Input, CSS, Overlay
- W3C ARIA and Accessibility Object Model docs
- PP‑OCR and DBNet papers and open implementations
- React Virtualized, React Window source for virtualization patterns
- Playwright and Puppeteer docs on shadow‑piercing selectors and CDP sessions