surfs.dev

Label. Train. Deploy.
Browser agents.

Your agent executes 50 actions. One fails. Surfs tells you exactly which one and why.

Without Surfs

  • "Test failed: Selector not found"
  • Which of the 50 actions failed?
  • Spend hours debugging blind

With Surfs

  • Action #37 failed: click("button.add-to-cart")
  • Element moved during scroll
  • Know exactly what broke
  • Fix in seconds, not hours
Request Early Access
The Problem

Browser agents fail. You have no idea why.

Training LLMs is solved. Training browser agents? That's the frontier.

[Visualization] Action sequence 1-50: Which step failed? Traditional testing can't tell you.

No training data infrastructure

  • Training data for text and images is solved. Action sequences? Not even close.
  • No way to label which actions succeeded vs failed in a 50-step sequence
  • Can't build datasets without capture + annotation tooling
  • You're stuck with whatever the LLM gives you

No baseline for improvement

  • GPT-4 gets a 60% success rate. Is Claude better? You have no idea.
  • Changed the prompt - did it help or hurt? No way to measure.
  • Can't A/B test models without deterministic replay
  • Every 'improvement' is a guess

Missing training & eval layer

  • Playwright/Puppeteer say 'test failed' - but which of 50 actions broke?
  • No action-level pass/fail. Just end-to-end success or failure.
  • Can't run systematic evals without sandboxed environments
  • Traditional testing tools weren't built for agents

Built by the team behind Debugg.ai (800+ users)

The missing layer between automation and production

Works with Playwright, Puppeteer, Selenium, browser-use, or any browser automation framework

Your Framework (Playwright, Puppeteer, etc.) → Surfs (label, train, evaluate) → Production (ship with confidence)

Example: E2E Test Workflow

E2E Run Requested (trigger) → Setup Browser Session (browser) → Execute Test via Surfer (surfer) → Teardown Browser Session (browser)

Framework-agnostic. Works with your existing automation stack.
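As a rough illustration of that workflow, here is a minimal sketch assuming a Playwright browser session wrapped around the Surfer.capture call shown in the capture example further down. How the SDK attaches to an externally launched browser is an assumption here, and yourAgent stands in for your existing agent.

e2e-playwright.ts (illustrative sketch)
// Sketch only: the E2E workflow above, with Playwright as the framework.
// Assumes Surfer.capture (shown later) can trace an agent driving an
// externally managed browser; check the SDK docs for the exact wiring.
import { chromium } from 'playwright'
import { Surfer } from '@surfs-dev/sdk'
import { yourAgent } from './your-agent' // hypothetical: your existing agent

async function runE2E() {
  // Setup Browser Session
  const browser = await chromium.launch({ headless: true })

  try {
    // Execute Test via Surfer: every action the agent takes is traced
    const session = await Surfer.capture({
      task: 'Complete checkout flow',
      agent: yourAgent,
      captureScreenshots: true,
      captureDOMSnapshots: true
    })
    console.log('Traced session:', session.id)
  } finally {
    // Teardown Browser Session
    await browser.close()
  }
}

runE2E()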

Five core capabilities

  • Label action sequences
  • Build training datasets
  • Systematic evaluation
  • Version control for agent behavior
  • Safe deployment

Like session replay (LogRocket/FullStory) but for agents

Capture what your agent actually did

Action #37 failed. But what was the DOM state? Which element did it click? What was visible on screen?

1. Capture Session

capture.ts
// Capture agent session with full tracing
import { Surfer } from '@surfs-dev/sdk'

const session = await Surfer.capture({
  task: "Complete checkout flow",
  agent: yourAgent,
  captureScreenshots: true,
  captureDOMSnapshots: true
})

// Session automatically traces every action
console.log(session.id) // "sess_abc123"

2. Annotate Actions

annotate.ts
// Review and annotate actions
const trace = await Surfer.getTrace(session.id)

// Mark specific actions as correct/incorrect
await trace.actions[37].annotate({
  label: "incorrect",
  reason: "Clicked 'Delete' instead of 'Cancel'",
  correctAction: {
    type: "click",
    selector: "button[data-action='cancel']"
  }
})

// Export annotated session
const labeled = await session.export()
// Returns action sequence with human feedback

Session Replay UI

Action Timeline

50 actions total

  • #1  navigate (10:30:15): https://example.com
  • #2  click (10:30:16): button.accept-cookies
  • #3  click (10:30:17): a[href='/products']
  • #4  wait (10:30:18): .product-grid
  • #5  click (10:30:19): button.add-to-cart
  • #37 click (10:30:45): button.delete-account (Element moved during scroll: expected Cancel button, found Delete button)
  • ... 44 more actions

Action #37 Details

  • Selector: button.delete-account
  • Error: Element moved during scroll
  • Expected: Cancel button

Trace Data Format

trace-action.json (~2KB per action)
{
  "session_id": "sess_abc123",
  "task": "Complete checkout flow",
  "timestamp": "2026-02-11T10:30:00Z",
  "actions": [
    {
      "action_id": "act_37",
      "step": 37,
      "type": "click",
      "selector": "button.delete-account",
      "timestamp": "2026-02-11T10:30:45.234Z",
      "screenshot_url": "s3://...",
      "dom_snapshot": "...",
      "success": false,
      "error": {
        "type": "wrong_element",
        "message": "User intended to click Cancel"
      },
      "label": "incorrect",
      "human_feedback": {
        "annotated_by": "user_123",
        "reason": "Wrong button - should cancel not delete",
        "correct_selector": "button[data-action='cancel']"
      }
    }
  ],
  "outcome": "failed",
  "total_actions": 50,
  "labeled_actions": 50
}

Turn failed agent runs into training data. Every mistake becomes a labeled example of what NOT to do.
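As a minimal sketch of that idea (not an SDK API), the annotation fields from the trace format above are enough to derive a corrective pair: the agent's wrong action as a negative example and the human-supplied fix as the positive one. In practice, the exported JSONL in the next section serves this purpose.

to-training-pair.ts (illustrative sketch)
// Sketch: turning one annotated mistake into a corrective training pair.
// Field names follow the trace JSON above; the output shape is illustrative.
interface TraceAction {
  action_id: string
  type: string
  selector: string
  label?: 'correct' | 'incorrect'
  human_feedback?: { reason: string; correct_selector: string }
}

function toTrainingPair(task: string, action: TraceAction) {
  if (action.label !== 'incorrect' || !action.human_feedback) return null
  return {
    task,
    // What the agent actually did (the mistake)
    negative: { action: action.type, selector: action.selector },
    // What the annotator said it should have done
    positive: { action: action.type, selector: action.human_feedback.correct_selector },
    reason: action.human_feedback.reason
  }
}

const pair = toTrainingPair('Complete checkout flow', {
  action_id: 'act_37',
  type: 'click',
  selector: 'button.delete-account',
  label: 'incorrect',
  human_feedback: {
    reason: 'Wrong button - should cancel not delete',
    correct_selector: "button[data-action='cancel']"
  }
})
console.log(pair)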

OpenAI fine-tuning format out of the box

Export labeled data. Train your models.

147 labeled sessions. 3,421 annotated actions. How do you turn that into training data?

Export API

export-dataset.ts
// Export labeled sessions for training
import { Surfer } from '@surfs-dev/sdk'

const dataset = await Surfer.exportDataset({
  sessionIds: ["sess_abc123", "sess_def456"],
  format: "openai-jsonl",  // Compatible with OpenAI fine-tuning
  includeCorrectActions: true,
  includeIncorrectActions: true,
  includeScreenshots: false
})

// Download JSONL file
await dataset.download("training-data.jsonl")

// Or stream to your training pipeline
for await (const batch of dataset.stream()) {
  await yourTrainingPipeline.ingest(batch)
}

Dataset Configuration

dataset-config.ts
// Configure dataset creation
const config = {
  filter: {
    outcome: ["failed", "partial_success"],
    labeledOnly: true,
    dateRange: {
      start: "2026-01-01",
      end: "2026-02-11"
    }
  },
  transform: {
    // Map to your model's format
    systemPrompt: "You are a browser automation agent",
    includeContext: {
      previousActions: 3,  // Include N previous actions
      domContext: true     // Include DOM state
    }
  },
  split: {
    train: 0.8,
    validation: 0.1,
    test: 0.1
  }
}

Agent Dashboard

  • Total Agents: 3 (3 active)
  • Total Runs: 294 (all time)
  • Avg Success Rate: N/A (across all agents)

All Agents (3 agents)

  • Browser Use Test Creation Agent (project: debugg-ai-mcp, active): 0 runs, success rate N/A, last modified 3 months ago
  • E2E Test Runner (project: production, active): 147 runs, 92% success rate, last modified 2 days ago
  • Checkout Flow Validator (project: e-commerce, active): 147 runs, 87% success rate, last modified 1 week ago

OpenAI JSONL Format

training-data.jsonl (compatible with the OpenAI fine-tuning API)
{"messages": [{"role": "system", "content": "You are a browser agent"}, {"role": "user", "content": "Click the Cancel button in the modal"}, {"role": "assistant", "content": "{\"action\": \"click\", \"selector\": \"button[data-action='cancel']\"}"}], "metadata": {"session_id": "sess_abc123", "action_id": "act_37", "label": "correct", "task": "Checkout flow"}}
{"messages": [{"role": "system", "content": "You are a browser agent"}, {"role": "user", "content": "Click the Cancel button in the modal"}, {"role": "assistant", "content": "{\"action\": \"click\", \"selector\": \"button.delete-account\"}"}], "metadata": {"session_id": "sess_abc123", "action_id": "act_36", "label": "incorrect", "reason": "Wrong button - user wanted cancel not delete"}}
{"messages": [{"role": "system", "content": "You are a browser agent"}, {"role": "user", "content": "Type email address into login form"}, {"role": "assistant", "content": "{\"action\": \"type\", \"selector\": \"input[type='email']\", \"value\": \"user@example.com\"}"}], "metadata": {"session_id": "sess_def456", "action_id": "act_12", "label": "correct"}}

OpenAI Fine-Tuning Integration

• Drop-in compatible with openai.File.create() (see the upload sketch after this list)

• Supports both chat completion and completion formats

• Automatic train/validation/test splits

• Metadata preserved for experiment tracking (Weights & Biases, MLflow, etc.)
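To make that drop-in compatibility concrete, here is a minimal upload sketch using the official openai Node SDK (the v4 files.create and fineTuning.jobs.create calls). The file name matches the export example above, and the base model is just an example.

upload-to-openai.ts (illustrative sketch)
// Sketch: push the exported JSONL into an OpenAI fine-tuning job.
import fs from 'fs'
import OpenAI from 'openai'

const openai = new OpenAI() // reads OPENAI_API_KEY from the environment

async function fineTune() {
  // Upload the dataset exported by Surfer.exportDataset above
  const file = await openai.files.create({
    file: fs.createReadStream('training-data.jsonl'),
    purpose: 'fine-tune'
  })

  // Start a fine-tuning job on an example base model
  const job = await openai.fineTuning.jobs.create({
    training_file: file.id,
    model: 'gpt-4o-mini-2024-07-18'
  })

  console.log('Fine-tuning job:', job.id, job.status)
}

fineTune()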

Scale AI for action sequences. Turn agent failures into training data with one API call.

Deterministic replay like Playwright Test but for AI agents

Action-level pass/fail. Sandboxed environments.

Your agent passes tests on Friday. Fails on Monday. Which action regressed?

Evaluation Harness

eval-harness.ts
// Define evaluation harness
import { Surfer, EvalHarness } from '@surfs-dev/sdk'

const harness = new EvalHarness({
  name: "Checkout Flow Eval",
  sandboxed: true,  // Isolated test environment
  deterministic: true  // Same DOM state every run
})

// Add test cases
harness.addTests([
  {
    name: "successful_checkout",
    task: "Complete checkout with test credit card",
    assertions: [
      { action: "click", selector: "button.checkout" },
      { action: "type", selector: "input[name='card']", value: "4242..." },
      { action: "click", selector: "button.submit" },
      { finalState: "url", contains: "/confirmation" }
    ]
  }
])

// Run evaluation
const results = await harness.run(yourAgent)

Action-Level Assertions

assertions.ts
// Action-level assertions
const results = await Surfer.eval({
  agent: yourAgent,
  task: "Add item to cart",
  assertions: {
    // Assert specific actions were taken
    actions: [
      {
        type: "click",
        selector: "button.add-to-cart",
        mustOccur: true,
        beforeAction: 10  // Must happen in first 10 actions
      }
    ],
    // Assert final state
    finalState: {
      url: { contains: "/cart" },
      dom: { selector: ".cart-item", count: 1 },
      localStorage: { key: "cartItems", contains: "product-123" }
    },
    // Assert no unwanted actions
    forbidden: [
      { type: "click", selector: "button.delete-account" },
      { type: "navigate", url: { contains: "/admin" } }
    ]
  }
})

// Results include action-level pass/fail
console.log(results.actions[5].passed) // true/false
console.log(results.actions[5].assertion) // Which assertion failed

Eval Results Dashboard

Evaluation Results: Checkout Flow Eval (v2.1.0)

  • Total Tests: 12
  • Passed: 10
  • Failed: 2
  • Success Rate: 83.3%
  • 2 regressions detected: agent behavior changed from baseline v1.2

successful_checkout (failed at action #5)

  • Actions: 8, Baseline: v1.2, Current: v2.1.0
  • Action #5 failed: expected click("button.submit"), actual click("button.cancel")
  • Regression from v1.2: action #5 changed behavior

add_to_cart

  • Actions: 6, Baseline: v1.2, Current: v2.1.0

product_search

  • Actions: 4, Baseline: v1.2, Current: v2.1.0

+ 9 more tests

Eval Results Format

eval-results.json (includes regression detection)
{
  "eval_id": "eval_xyz789",
  "name": "Checkout Flow Eval",
  "agent": "gpt-4-1106-preview",
  "timestamp": "2026-02-11T15:45:00Z",
  "overall": {
    "passed": false,
    "total_tests": 12,
    "passed_tests": 10,
    "failed_tests": 2,
    "success_rate": 0.833
  },
  "tests": [
    {
      "test_id": "test_001",
      "name": "successful_checkout",
      "passed": false,
      "actions_taken": 8,
      "failed_action": {
        "action_id": "act_5",
        "step": 5,
        "type": "click",
        "selector": "button.submit",
        "expected": "click on submit button",
        "actual": "clicked button.cancel",
        "error": "Wrong element clicked - regression from v1.2"
      },
      "assertions": {
        "passed": 4,
        "failed": 1,
        "failures": [
          {
            "type": "finalState",
            "expected": "url contains /confirmation",
            "actual": "url is /checkout (stuck on checkout page)"
          }
        ]
      }
    }
  ],
  "regressions": [
    {
      "test_id": "test_001",
      "baseline_version": "v1.2",
      "regression_type": "action_mismatch",
      "details": "Agent now clicks Cancel instead of Submit at step 5"
    }
  ]
}

Deterministic Replay

• Sandboxed environment: Same DOM state, same network responses, every run

• Action-level diffs: See exactly which action changed between versions (a minimal diff sketch follows below)

• Baseline comparison: Compare against previous successful runs (v1.2, v1.1, etc.)

• CI/CD integration: Block deploys on regressions, auto-rollback on failure

Know exactly which action regressed. No more "it worked yesterday" mysteries.
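For intuition, a positional action-level diff between a baseline run and the current run is enough to surface the regression described above. This helper is a naive illustration, not the SDK's comparison logic; Surfs reports these diffs through eval results.

diff-runs.ts (illustrative sketch)
// Sketch: naive step-by-step diff of two recorded action sequences.
interface RecordedAction {
  type: string
  selector: string
}

function diffRuns(baseline: RecordedAction[], current: RecordedAction[]) {
  const format = (a?: RecordedAction) => (a ? `${a.type}("${a.selector}")` : '(no action)')
  const regressions: { step: number; expected: string; actual: string }[] = []

  for (let i = 0; i < Math.max(baseline.length, current.length); i++) {
    if (format(baseline[i]) !== format(current[i])) {
      regressions.push({ step: i + 1, expected: format(baseline[i]), actual: format(current[i]) })
    }
  }
  return regressions
}

// Friday's run vs Monday's run (toy data)
const friday = [{ type: 'click', selector: 'button.checkout' }, { type: 'click', selector: 'button.submit' }]
const monday = [{ type: 'click', selector: 'button.checkout' }, { type: 'click', selector: 'button.cancel' }]
console.log(diffRuns(friday, monday))
// [ { step: 2, expected: 'click("button.submit")', actual: 'click("button.cancel")' } ]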

Like git but for agent behavior

A/B test models. Track regressions.

Is GPT-4 better than Claude? V2 better than V1? Without data, you're guessing.

A/B Testing

ab-test.ts
// A/B test different models
import { Surfer, ABTest } from '@surfs-dev/sdk'
// `Agent` below stands in for however you construct your agent

const test = new ABTest({
  name: "GPT-4 vs Claude Sonnet",
  variants: [
    {
      name: "gpt-4",
      agent: new Agent({ model: "gpt-4-1106-preview" }),
      traffic: 0.5  // 50% of tests
    },
    {
      name: "claude-sonnet",
      agent: new Agent({ model: "claude-sonnet-4" }),
      traffic: 0.5
    }
  ],
  evalHarness: checkoutFlowTests,
  minSampleSize: 100  // Run 100 tests per variant
})

// Run A/B test
const results = await test.run()

// Statistical comparison
console.log(results.winner) // "claude-sonnet"
console.log(results.confidence) // 0.95 (95% confident)
console.log(results.improvement) // "+12% success rate"

Version Control

versioning.ts
// Version your agent like code
const agent = await Surfer.deploy({
  agent: yourAgent,
  version: "v2.1.0",
  commit: "abc123def",
  changelog: "Fixed cart button selector regression",
  compareWith: "v2.0.0"  // Auto-compare with previous version
})

// Get version history
const history = await Surfer.getVersions()
// [
//   { version: "v2.1.0", success_rate: 0.92, deployed: "2026-02-11" },
//   { version: "v2.0.0", success_rate: 0.87, deployed: "2026-02-10" },
//   { version: "v1.9.0", success_rate: 0.91, deployed: "2026-02-09" }
// ]

// Rollback to previous version
await Surfer.rollback("v2.0.0")

Model Comparison Dashboard

Model A/B Test Results (100 runs per variant, 95% confidence)

GPT-4 (gpt-4-1106-preview, v2.0.0)

  • Success Rate: 87%
  • Avg Actions: 12.3
  • Cost/Run: $0.042
  • Avg Duration: 8.4s

Claude Sonnet (claude-sonnet-4, v2.1.0)

  • Success Rate: 92% (+5.7%)
  • Avg Actions: 11.8 (-4.1%)
  • Cost/Run: $0.038 (-9.5%)
  • Avg Duration: 7.7s (-9.1%)

Winner: Claude Sonnet (+5.7% success rate, faster, and cheaper)

[Chart] Success rate over time: GPT-4 vs Claude Sonnet

Comparison Results Format

comparison.json (statistical analysis included)
{
  "comparison_id": "cmp_abc123",
  "variant_a": {
    "name": "gpt-4",
    "model": "gpt-4-1106-preview",
    "version": "v2.0.0",
    "tests_run": 100,
    "success_rate": 0.87,
    "avg_actions": 12.3,
    "avg_duration_ms": 8420,
    "cost_per_run": 0.042
  },
  "variant_b": {
    "name": "claude-sonnet",
    "model": "claude-sonnet-4",
    "version": "v2.1.0",
    "tests_run": 100,
    "success_rate": 0.92,
    "avg_actions": 11.8,
    "avg_duration_ms": 7650,
    "cost_per_run": 0.038
  },
  "statistical_analysis": {
    "winner": "claude-sonnet",
    "confidence": 0.95,
    "p_value": 0.023,
    "effect_size": 0.05,
    "improvement": {
      "success_rate": "+5.7%",
      "actions": "-4.1%",
      "duration": "-9.1%",
      "cost": "-9.5%"
    }
  },
  "action_level_diff": [
    {
      "action_step": 5,
      "gpt4_action": "click button.add-to-cart",
      "claude_action": "click button.add-to-cart",
      "difference": "none"
    },
    {
      "action_step": 7,
      "gpt4_action": "type input#quantity value='1'",
      "claude_action": "skip (quantity already 1)",
      "difference": "claude more efficient",
      "impact": "saved 1 action"
    }
  ],
  "regressions": [],
  "recommendation": "Deploy claude-sonnet to production"
}

Statistical Rigor

• Proper A/B testing: Statistical significance (p-value, confidence intervals; a worked sketch follows below)

• Action-level diffs: See exactly which actions differ between models

• Performance metrics: Success rate, speed, cost per run

• Version history: Track improvements (or regressions) over time

• One-click rollback: Revert to any previous version instantly

No more "V2 feels worse." Get data. Make informed decisions. Deploy with confidence.
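For readers who want to see what those significance numbers mean, here is a textbook two-proportion z-test in TypeScript. It is for illustration only: the test Surfs actually runs behind the dashboard's p-values (for example, a paired comparison over the same test cases) is not specified here, and the counts in the usage example are placeholders rather than the dashboard's data.

z-test.ts (illustrative sketch)
// Sketch: standard two-proportion z-test for comparing success rates.
function twoProportionZTest(successA: number, nA: number, successB: number, nB: number) {
  const pooled = (successA + successB) / (nA + nB)
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB))
  const z = (successB / nB - successA / nA) / se

  // Two-sided p-value from the standard normal CDF (erf-based approximation)
  const phi = (x: number) => 0.5 * (1 + erf(x / Math.SQRT2))
  const pValue = 2 * (1 - phi(Math.abs(z)))
  return { z, pValue }
}

// Abramowitz-Stegun approximation of the error function
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1
  const ax = Math.abs(x)
  const t = 1 / (1 + 0.3275911 * ax)
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t
  return sign * (1 - poly * Math.exp(-ax * ax))
}

// Placeholder counts: 870/1000 vs 920/1000 successful runs
console.log(twoProportionZTest(870, 1000, 920, 1000))
// => z ≈ 3.65, pValue ≈ 0.0003 (a statistically significant difference)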

Like Vercel/Railway deployment but with eval gates

Test in staging. Deploy with confidence.

Ship agent changes to production. Success rate tanks. Customers complain. You scramble to roll back.

Safe Deployment

deploy.ts
// Deploy with safety checks
import { Surfer } from '@surfs-dev/sdk'

const deployment = await Surfer.deploy({
  agent: yourAgent,
  version: "v2.1.0",

  // Staging tests (run before production)
  staging: {
    evalHarness: checkoutFlowTests,
    minSuccessRate: 0.90,  // Block deploy if < 90%
    minSampleSize: 50      // Run 50 tests in staging
  },

  // Canary deployment (gradual rollout)
  canary: {
    enabled: true,
    initialTraffic: 0.05,  // Start with 5% traffic
    rampUpDuration: "2h",  // Gradually increase over 2 hours
    monitorMetrics: ["success_rate", "error_rate"]
  },

  // Auto-rollback triggers
  rollback: {
    onSuccessRateBelow: 0.85,
    onErrorRateAbove: 0.10,
    compareWith: "v2.0.0"  // Rollback to this version
  }
})

// Monitor deployment
console.log(deployment.status) // "canary", "rolling_out", "complete"
console.log(deployment.currentTraffic) // 0.05 -> 0.25 -> 0.50 -> 1.0

CI/CD Integration

.github/workflows/deploy.yml
# GitHub Actions / CI/CD integration
name: Deploy Agent

on:
  push:
    branches: [main]

jobs:
  test-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Run Surfs Eval
        run: |
          npx surfs eval \
            --harness ./tests/checkout-flow.ts \
            --min-success-rate 0.90 \
            --output results.json

      - name: Block if tests fail
        run: |
          SUCCESS_RATE=$(jq '.success_rate' results.json)
          if (( $(echo "$SUCCESS_RATE < 0.90" | bc -l) )); then
            echo "Tests failed: $SUCCESS_RATE < 0.90"
            exit 1
          fi

      - name: Deploy to production
        if: success()
        run: |
          npx surfs deploy \
            --version ${{ github.sha }} \
            --canary \
            --auto-rollback

Deployment Pipeline

v2.1.0 → Production (deploying)

  • Staging Tests (2m 15s): PASSED (success rate 92% vs threshold 90%, error rate 3% vs threshold 10%)
  • Canary 5% (30m): PASSED (success rate 91% vs threshold 90%, error rate 3% vs threshold 10%)
  • Canary 25% (12m / 30m): IN PROGRESS, 40% complete (success rate 91% vs threshold 90%, error rate 4% vs threshold 10%)
  • Canary 50%
  • Production 100%

Real-time metrics (last 30 minutes). Auto-rollback enabled if success rate drops below 85%.

Deployment Configuration

surfs.yml (deployment as code)
# surfs.yml - Deployment configuration
version: 2

# Staging environment
staging:
  url: https://staging.example.com
  eval_harness: ./tests/all-flows.ts
  success_threshold: 0.90
  sample_size: 100

# Production deployment strategy
production:
  strategy: canary
  canary:
    initial_traffic: 0.05
    increment: 0.15
    interval: 30m
    max_traffic: 1.0

  # Health checks
  health_checks:
    - metric: success_rate
      min: 0.85
      window: 5m
    - metric: error_rate
      max: 0.10
      window: 5m
    - metric: avg_duration_ms
      max: 10000
      window: 5m

  # Rollback rules
  rollback:
    auto: true
    on_failure: true
    baseline_version: latest_stable

# Notifications
notifications:
  slack:
    webhook: ${SLACK_WEBHOOK}
    on_events: [deploy_start, deploy_complete, rollback]

  pagerduty:
    integration_key: ${PAGERDUTY_KEY}
    on_events: [rollback, eval_failure]

Safety Layers

Pre-deployment

  • Staging environment with full eval suite
  • Success rate threshold gates
  • Block deploy if tests fail

During deployment

  • Gradual canary rollout (5% → 100%)
  • Real-time health monitoring
  • Auto-rollback on regression

Catch failures before customers do. Deploy with eval gates, canary rollouts, and auto-rollback.

debugg.ai

10,000+ agent tests/week on Surfs

Start building browser agents with confidence

Label, train, and deploy production-ready browser agents.

Read Docs • Watch Demo

No credit card required • Open source SDKs • Deploy anywhere
