The Scale.ai for Browser Agents

Training data that makes agents smarter

Turn browser agent sessions into high-quality training datasets. Label, verify, and export data for fine-tuning and RLHF.

50M+ Sessions Processed
99.2% Label Accuracy
< 1hr Time to Dataset

From raw sessions to training-ready data

An automated pipeline transforms agent sessions into clean, labeled datasets.

Capture

The SDK automatically captures sessions with full context (see the sketch after these steps)

Auto-Label

ML models label success, failure, and edge cases

Verify

Human annotators review uncertain labels

Export

Download in your preferred training format
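
For a sense of what the Capture step looks like in code, here is a minimal sketch. SessionRecorder and its methods are illustrative, not a documented API; only the surfs package name comes from the export example further down.

capture.py
from surfs import SessionRecorder  # hypothetical class, shown for illustration

recorder = SessionRecorder(api_key="your-api-key")

# Wrap an agent run so every action, DOM state, and decision
# context is captured alongside the session.
with recorder.start(task_type="checkout") as session:
    # run_agent is a stand-in for however you drive your browser agent
    run_agent(goal="Complete checkout", browser=session.browser)

print(f"Captured session {session.id}")  # queued for auto-labeling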

Dataset Types

Data for every training approach

Whether you're doing imitation learning, running RLHF, or building custom models, there's a dataset type to match.

Behavioral Cloning

Learn from successful agent trajectories. Capture action sequences, DOM states, and decision contexts for imitation learning.

Format: Trajectory pairs
Train agents to replicate expert behavior
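
For concreteness, a single trajectory step might look like this; the field names are illustrative, not a fixed schema:

# One step of a trajectory pair: the observed state alongside the
# expert action taken in that state. Field names are examples only.
step = {
    "observation": {
        "url": "https://shop.example.com/cart",
        "dom_snapshot": "<html>...</html>",  # truncated for readability
        "goal": "Complete checkout",
    },
    "action": {
        "type": "click",
        "selector": "#checkout-button",
    },
}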

RLHF Data

Preference pairs for reinforcement learning from human feedback. Compare agent behaviors and rank outcomes.

Format: Preference pairs
Align agent behavior with human preferences
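
A preference pair might be structured like this (illustrative schema, not the product's documented format):

# Two runs of the same task, ranked by a human reviewer.
preference = {
    "prompt": "Apply coupon SAVE10 during checkout",
    "chosen": {"trajectory_id": "traj_123", "outcome": "success"},
    "rejected": {"trajectory_id": "traj_456", "outcome": "failure"},
    "rationale": "Chosen run applied the coupon before payment",
}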

Failure Analysis

Labeled failure modes with root cause annotations. Build datasets for failure detection and recovery.

Format: Error + context pairs
Train robust error handling
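
An error + context pair might look like this (illustrative schema):

# A labeled failure mode with its root-cause annotation.
# Field names are examples, not a documented format.
failure = {
    "error": "ElementNotFound",
    "failed_action": {"type": "click", "selector": "#submit"},
    "context": {"url": "https://shop.example.com/checkout"},
    "root_cause": "Selector changed after an A/B test variant loaded",
    "recovery_hint": "Fall back to text-based element lookup",
}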

Task Completion

End-to-end task demonstrations with step-by-step breakdowns. Full context from goal to completion.

Format: Task trajectories
Multi-step task learning
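
A task trajectory might be laid out like this (illustrative schema):

# An end-to-end demonstration: the goal plus the ordered steps
# taken to reach it. Field names are examples only.
task = {
    "goal": "Book the cheapest flight from SFO to JFK",
    "steps": [
        {"action": "navigate", "url": "https://flights.example.com"},
        {"action": "fill", "selector": "#origin", "value": "SFO"},
        {"action": "fill", "selector": "#destination", "value": "JFK"},
        {"action": "click", "selector": "#search"},
    ],
    "outcome": "success",
}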

Platform Features

Enterprise-grade data infrastructure

Automatic Labeling

Sessions are automatically labeled based on outcomes, DOM state changes, and action sequences. Define custom rules for your specific success criteria.
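
A custom rule might be expressed along these lines; add_label_rule is a hypothetical method, sketched only to show the idea:

from surfs import TrainingData

client = TrainingData(api_key="your-api-key")

# Hypothetical: mark a session as a success when the order
# confirmation page renders. The real rule API may differ.
client.add_label_rule(
    name="checkout-success",
    label="success",
    when={
        "url_matches": "*/order/confirmation",
        "selector_present": ".order-confirmed",
    },
)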

Human-in-the-Loop

Route uncertain labels to human annotators. Built-in quality control with agreement tracking, conflict resolution, and annotator performance metrics.

Flexible Export

Export to OpenAI, Anthropic, or custom formats. JSONL, Parquet, or direct integration with your training pipeline via API.

Active Learning

Identify which examples will most improve your model. Prioritize labeling for high-impact, uncertain, or edge-case sessions.
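
A sketch of what that could look like, with hypothetical method names:

from surfs import TrainingData

client = TrainingData(api_key="your-api-key")

# Hypothetical: surface the 100 sessions the label model is least
# sure about, then route them to human review.
candidates = client.prioritize_sessions(strategy="uncertainty", limit=100)
client.request_review(candidates)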

Data Quality

Automatic PII detection and redaction. Deduplication, outlier detection, and consistency checks ensure clean training data.

Version Control

Track dataset versions with full lineage. Compare model performance across dataset iterations. Roll back to any previous version.
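
In code, that might look like the following (method names are hypothetical):

from surfs import TrainingData

client = TrainingData(api_key="your-api-key")

# Hypothetical: compare two dataset versions, then roll back.
diff = client.compare_versions("checkout-flow-v2", base=1, head=2)
print(diff.added, diff.removed)
client.rollback("checkout-flow-v2", to_version=1)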

Export with a single API call

export.py
from surfs import TrainingData

# Initialize client
client = TrainingData(api_key="your-api-key")

# Build dataset with filters
dataset = client.create_dataset(
    name="checkout-flow-v2",
    filters={
        "task_type": "checkout",
        "outcome": "success",
        "min_steps": 5,
        "date_range": "last_30_days"
    },
    labeling={
        "auto_label": True,
        "human_review": "uncertain",  # Review uncertain cases
        "quality_threshold": 0.95
    }
)

# Export to OpenAI fine-tuning format
dataset.export(
    format="openai_jsonl",
    output="training_data.jsonl",
    include_context=True  # Include DOM snapshots
)

print(f"Exported {dataset.size} examples")
# Output: Exported 12,847 examples

Frequently asked questions

What formats can I export training data to?

Export to JSON, JSONL, Parquet, or custom formats. We support OpenAI fine-tuning format, Anthropic's format, and custom schemas for your training pipeline.

How does automatic session labeling work?

Our system analyzes session outcomes, DOM changes, and action sequences to automatically label success/failure. You can define custom labeling rules based on selectors, URLs, or API responses.

Can I use human annotators to verify labels?

Yes. Our human-in-the-loop workflow lets you send uncertain labels to annotators for verification. Built-in quality control tracks annotator agreement and flags inconsistencies.

How much training data do I need for fine-tuning?

It depends on your task complexity. Most teams see improvements with 1,000-10,000 high-quality examples. Our platform helps you identify which examples add the most value to your dataset.

Can I filter training data by success rate or task type?

Absolutely. Filter by outcome, duration, cost, error type, or any custom metadata. Build focused datasets for specific behaviors or edge cases.

Start building training datasets

Join teams using surfs.dev to create high-quality training data for their browser agents.