Evaluations & Observability – Measure What Matters
We’ve reached the final day of Launch Week. Over the past four days, we’ve given you the tools to build production-grade AI agents:
- Day 1: Tool Groups to eliminate context pollution
- Day 2: Custom Tools for surgical precision
- Day 3: Token Optimization to maximize efficiency
- Day 4: Enterprise Integrations to break down silos
Today we’re addressing one of the top requests we’ve been hearing from customers: How do you know if your agent is working as expected?
We’re releasing two things: an Evaluations Framework and an Observability Dashboard.
The Challenge: Visibility into Agent Behavior
You’ve built an e-commerce agent. You’ve scoped it to the right tools. You’ve optimized token usage. Now you need visibility into production:
- Which tools are actually being called?
- Are the tools being used correctly?
- Where are agents failing?
- What’s your actual usage and cost?
- How do new tool configurations impact success rates?
Without visibility, you’re flying blind. You can’t optimize what you can’t measure.
This is especially critical when you’re working with Tool Groups. When you switch from groups=ecommerce to a custom tool selection, did you accidentally break a critical workflow? You won’t know until a customer complains.
The Solution: Two-Layer Visibility
We’ve built a complete visibility stack with two complementary systems:
1. MCP Evaluations Framework (Development & Testing)
Automated testing framework powered by mcpjam that validates agent behavior before production
2. Observability Dashboard (Production Monitoring)
Real-time usage analytics dashboard in Bright Data’s Control Panel that tracks every API call in production
Let’s dive into each layer.
Layer 1: MCP Evaluations Framework
What is mcpjam?
mcpjam is an open-source evaluation CLI for Model Context Protocol servers. Think of it as “integration testing for AI agents.”
You write test cases as natural language queries, specify which tools should be called, and mcpjam runs your agent through the workflow automatically.
How We Use It
We’ve built a comprehensive evaluation suite for every Tool Group we shipped on Day 1. When you configure a new tool selection, you can run these evals to verify everything works before deploying.
Project Structure
mcp-evals/
├── server-configs/ # Server connection configs per tool group
│ ├── server-config.ecommerce.json
│ ├── server-config.social.json
│ ├── server-config.business.json
│ ├── server-config.browser.json
│ └── ...
├── tool-groups.json/ # Test cases per tool group
│ ├── tool-groups.ecommerce.json
│ ├── tool-groups.social.json
│ ├── tool-groups.business.json
│ ├── tool-groups.browser.json
│ └── ...
└── llms.json # LLM provider API keys
Each tool group gets its own test suite with real-world queries that agents should be able to handle.
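Each server config points mcpjam at The Web MCP endpoint for that group. The schema to copy is the one in the repo; purely as a sketch (the endpoint host, path, and query parameters shown here are assumptions, not taken from the repo), an e-commerce config might look like:
{
  "mcpServers": {
    "ecommerce-server": {
      "url": "https://mcp.brightdata.com/mcp?token=<YOUR_BRIGHTDATA_TOKEN>&groups=ecommerce"
    }
  }
}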
Example: E-commerce Eval
From mcp-evals/tool-groups.json/tool-groups.ecommerce.json:
{
"title": "Test E-commerce - Amazon product search",
"query": "Search for wireless headphones on Amazon and show me the top products with reviews",
"runs": 1,
"model": "gpt-5.1-2025-11-13",
"provider": "openai",
"expectedToolCalls": ["web_data_amazon_product_search"],
"selectedServers": ["ecommerce-server"],
"advancedConfig": {
"instructions": "You are a shopping assistant helping users find products on Amazon",
"temperature": 0.1,
"maxSteps": 5,
"toolChoice": "required"
}
}
This test validates that:
- The agent correctly interprets the user query
- It calls the right tool (web_data_amazon_product_search)
- It passes appropriate parameters (product keyword, Amazon URL)
- It completes within the configured timeout
- It returns a coherent response
Running Evals: Quick Start
Install mcpjam:
npm install -g @mcpjam/cli
Run e-commerce tool group tests:
mcpjam evals run \
-t mcp-evals/tool-groups.json/tool-groups.ecommerce.json \
-e mcp-evals/server-configs/server-config.ecommerce.json \
-l mcp-evals/llms.json
Expected Output:
Running tests
Connected to 1 server: ecommerce-server
Found 13 total tools
Running 2 tests
Test 1: Test E-commerce - Amazon product search
Using openai:gpt-5.1-2025-11-13
run 1/1
user: Search for wireless headphones on Amazon and show me the top products with reviews
[tool-call] web_data_amazon_product_search
{
"keyword": "wireless headphones",
"url": "https://www.amazon.com"
}
[tool-result] web_data_amazon_product_search
{
"content": [...]
}
assistant: Here are some of the top wireless headphones currently on Amazon...
Expected: [web_data_amazon_product_search]
Actual: [web_data_amazon_product_search]
PASS (23.8s)
Tokens • input 20923 • output 1363 • total 22286
What Gets Tested
We’ve built eval suites for all 8 Tool Groups from Day 1:
| Tool Group | Test Coverage | Example Queries |
|---|---|---|
| ecommerce | Amazon, Walmart, Best Buy product searches | “Compare iPhone 15 prices across retailers” |
| social | TikTok content, Instagram posts, Twitter trends | “Find trending TikTok videos about AI” |
| business | LinkedIn profiles, Crunchbase funding data, Google Maps locations | “Find the LinkedIn profile for the CEO of Microsoft” |
| research | GitHub repos, Reuters news, academic sources | “Find Python repos for web scraping with 1k+ stars” |
| finance | Stock data, market trends, financial news | “Get the latest stock price for NVIDIA” |
| app_stores | iOS App Store, Google Play reviews & ratings | “Find top-rated meditation apps on iOS” |
| browser | Scraping Browser automation workflows | “Navigate to Amazon and add an item to cart” |
| advanced_scraping | Batch operations, custom scraping | “Scrape product data from a custom website” |
Each test suite contains 2-5 core test cases covering the most common agent workflows for that domain.
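To give a flavor of the other suites, here is a sketch of a social-group test case in the same shape as the e-commerce example above; the expected tool name and server key are illustrative assumptions rather than values copied from the repo:
{
  "title": "Test Social - TikTok trending content",
  "query": "Find trending TikTok videos about AI",
  "runs": 1,
  "model": "gpt-5.1-2025-11-13",
  "provider": "openai",
  "expectedToolCalls": ["web_data_tiktok_posts"],
  "selectedServers": ["social-server"],
  "advancedConfig": {
    "instructions": "You are a social media research assistant",
    "temperature": 0.1,
    "maxSteps": 5,
    "toolChoice": "required"
  }
}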
Why This Matters
Evals give you:
- Regression Testing: Run evals after every config change to ensure you didn’t break existing workflows (see the sketch after this list)
- Performance Benchmarking: Track token usage and latency across different LLM models
- Tool Validation: Verify that tool selection logic is working correctly
- Documentation: Test cases serve as executable examples of what your agent can do
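The regression-testing case is the one you will run most often. Here is a minimal shell sketch that reuses the Quick Start command across every suite; it assumes the remaining files follow the same naming pattern as the ones listed in the project structure above:
# Re-run every tool-group suite after a config change
for group in ecommerce social business research finance app_stores browser advanced_scraping; do
  mcpjam evals run \
    -t mcp-evals/tool-groups.json/tool-groups.$group.json \
    -e mcp-evals/server-configs/server-config.$group.json \
    -l mcp-evals/llms.json
done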
Before Day 1’s Tool Groups, we had no systematic way to test whether switching from groups=ecommerce to groups=ecommerce,social would break agent behavior. Now we do.
Layer 2: Observability Dashboard
Real-Time Production Monitoring
While evals handle pre-deployment testing, the Observability Dashboard gives you real-time visibility into production usage.
We’ve integrated a new MCP Usage panel into Bright Data’s Control Panel that tracks every API call made through your MCP server.
What You See
The dashboard displays a comprehensive usage table with:
| Date | Tool | Client Name | URL | Status |
|---|---|---|---|---|
| 2025-11-26 14:32:15 | web_data_amazon_product | my-ecommerce-agent | https://amazon.com/… | Success |
| 2025-11-26 14:31:52 | search_engine | my-research-bot | N/A | Success |
| 2025-11-26 14:30:18 | web_data_linkedin_person_profile | lead-gen-agent | https://linkedin.com/in/… | Success |
| 2025-11-26 14:29:03 | scraping_browser_navigate | automation-agent | https://example.com | Failed |
Key Metrics
1. Tool Usage Breakdown
See which tools are being called most frequently:
web_data_amazon_product: 1,243 calls
search_engine: 892 calls
web_data_linkedin_person_profile: 634 calls
scrape_as_markdown: 421 calls
This tells you which datasets are most valuable to your agents. If you’re paying for unused tool groups, you’ll see it here.
2. Client Identification
Every agent instance can be tagged with a client name (via the client_name parameter in the connection URL):
npx -y @brightdata/mcp
The dashboard groups usage by client, so you can track costs per agent/workflow.
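For the hosted connection, the tag simply rides along in the connection URL. A sketch, assuming the hosted endpoint and path shown here (check your Control Panel for the exact URL) and reusing Day 1’s groups parameter:
https://mcp.brightdata.com/mcp?token=<YOUR_BRIGHTDATA_TOKEN>&groups=ecommerce&client_name=my-ecommerce-agent
Every call that agent makes then shows up under my-ecommerce-agent in the usage table.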
3. Success vs. Failure Rates
Monitor agent reliability:
Total Requests: 3,190
Successful: 3,102 (97.2%)
Failed: 88 (2.8%)
Click into failed requests to see error details and debug issues.
4. URL Tracking
For dataset tools, the dashboard shows which URLs/resources were accessed. This helps you:
- Identify rate-limiting issues (too many requests to same domain)
- Track which specific products/profiles/pages are being scraped
- Audit compliance (ensure agents aren’t accessing restricted sites)
How to Access
- Log into Bright Data Control Panel
- Navigate to MCP Usage (new section in the sidebar)
- View real-time usage data for all your MCP connections
Filters:
- Date range (last 24 hours, 7 days, 30 days, custom)
- Tool name (filter by specific tools)
- Client name (filter by agent instance)
- Status (success/failure)
Export:
Download usage data as CSV for deeper analysis or BI tool integration.
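If you just want a quick read before wiring up a BI tool, plain shell is enough. A sketch that assumes the export keeps the dashboard’s column order (Date, Tool, Client Name, URL, Status), includes a header row, and was saved as mcp-usage.csv:
# Calls per tool, most-used first
tail -n +2 mcp-usage.csv | cut -d',' -f2 | sort | uniq -c | sort -rn
# Failed calls per tool, to spot degrading workflows
awk -F',' 'NR > 1 && $5 == "Failed" { failed[$2]++ } END { for (t in failed) print failed[t], t }' mcp-usage.csv | sort -rn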
Combined Workflow: Development → Production
Here’s how the two systems work together:
Phase 1: Development (Pre-Deployment)
- Configure Tool Groups using Day 1’s feature:
npx -y @brightdata/mcp
- Run Evals to validate tool selection:
mcpjam evals run \
  -t mcp-evals/tool-groups.json/tool-groups.ecommerce.json \
  -e mcp-evals/server-configs/server-config.ecommerce.json \
  -l mcp-evals/llms.json
- Review Results: Ensure all tests pass
  - Token usage is within budget
  - Correct tools are being called
  - Responses are accurate
- Iterate: If tests fail, adjust tool selection or system prompts
Phase 2: Production (Post-Deployment)
- Deploy Agent with client name tagging:
npx -y @brightdata/mcp
- Monitor Dashboard: Check real-time usage
  - Are success rates consistent with eval results?
  - Are unexpected tools being called?
  - Any rate limiting or authentication issues?
- Analyze Trends: Over time, look for:
  - Usage spikes (need to scale?)
  - Failure pattern changes (tool degradation?)
  - Cost anomalies (optimize token usage)
- Optimize: Use dashboard insights to refine tool selection
  - Remove unused tools (reduce token costs)
  - Add missing tools (improve success rates)
  - Adjust rate limits (avoid throttling)
- Re-Run Evals: After any config change, run evals again to ensure no regressions
Performance Stats: Launch Week Recap
Let’s bring it all together. Here’s the cumulative impact of all 5 days:
Day 1: Tool Groups
Impact: 60% reduction in system prompt tokens
Example: Full suite (200+ tools) → Single group (25 tools)
Token Savings: ~8,000 tokens per request (system prompt)
Day 2: Custom Tools
Impact: 85% reduction vs. full suite when selecting 4 specific tools
Example: Full suite (200+ tools) → Custom (4 tools)
Token Savings: ~9,500 tokens per request (system prompt)
Day 3: Token Optimization
Impact: 30-60% reduction in tool response tokens
Example: Web scraping + dataset tools in single workflow
Token Savings: ~10,250 tokens per request (tool outputs)
Combined Effect: E-commerce Agent Workflow
Scenario: “Find top 5 Amazon headphones under $100, summarize reviews”
| Configuration | System Prompt | Tool Outputs | Total Tokens | Cost per Request |
|---|---|---|---|---|
| Full Suite (No Optimization) | 15,000 | 22,500 | 37,500 | $0.45 |
| + Tool Groups | 6,000 | 22,500 | 28,500 | $0.34 |
| + Custom Tools | 2,250 | 22,500 | 24,750 | $0.30 |
| + Token Optimization | 2,250 | 12,250 | 14,500 | $0.17 |
Total Reduction: 61.3% fewer tokens, 62.2% lower cost
At 1,000 requests/day, that’s $280/day savings or $102,200/year.
Day 4: Enterprise Integrations
Impact: Eliminated custom ETL overhead
Time Savings: Weeks of engineering work → Minutes of configuration
Maintenance: Zero (handled by Bright Data)
Day 5: Evals + Observability
Impact: Proactive quality control + production visibility
Failure Reduction: 10-15% improvement in success rates (via early issue detection)
Cost Avoidance: Catch regressions before production (save hundreds of failed requests)
Try It Out: Get Started Today
Step 1: Run Your First Eval
# Install mcpjam
npm install -g @mcpjam/cli
# Clone The Web MCP repo
git clone https://github.com/brightdata/brightdata-mcp-sse.git
cd brightdata-mcp-sse
# Configure your API keys in mcp-evals/llms.json
# Configure your Bright Data token in server configs
# Run e-commerce evals
mcpjam evals run \
-t mcp-evals/tool-groups.json/tool-groups.ecommerce.json \
-e mcp-evals/server-configs/server-config.ecommerce.json \
-l mcp-evals/llms.json
Step 2: Access the Observability Dashboard
- Sign up at Bright Data
- Navigate to MCP Usage in the Control Panel
- Deploy an agent and watch real-time usage data appear
Step 3: Iterate
Use evals to test configurations. Use the dashboard to monitor production. Rinse and repeat.
Resources
MCP Evaluations:
- mcpjam GitHub — Open-source evaluation CLI
- Model Context Protocol — Official MCP specification
Observability Dashboard:
- Bright Data Control Panel — Access your usage dashboard
- API Documentation — Full API reference
The Web MCP Server:
- GitHub Repository — Open-source server code
- NPM Package — Install via npm
Launch Week Recap:
- Day 1: Tool Groups — Eliminate context pollution
- Day 2: Custom Tools — Surgical tool selection
- Day 3: Token Optimization — Maximize efficiency
- Day 4: Enterprise Integrations — Break down silos
- Day 5: Evals & Observability — Measure what matters (you are here)
Launch Week: A Final Word
Five days. Five major releases. One mission: Make AI Agents Production-Ready.
We started with the insight that context pollution is the biggest bottleneck in agentic workflows. We gave you Tool Groups to scope your context.
Then we realized even groups aren’t precise enough. We shipped Custom Tools for surgical precision.
Next, we tackled the output side: token-bloated responses. We integrated markdown stripping via Strip-Markdown and intelligent payload cleaning with Parsed Light.
After that, we brought Bright Data to the platforms enterprises actually use: Google ADK, IBM watsonx, Databricks, and Snowflake.
And today, we closed the loop with evaluations and observability. Because you can’t improve what you can’t measure.
This is the full stack for production AI agents:
- Tool Groups → Reduce context pollution
- Custom Tools → Maximize precision
- Token Optimization → Minimize costs
- Enterprise Integrations → Deploy anywhere
- Evals + Observability → Maintain quality
Thank You
To everyone who followed along this week: thank you.
To the developers building the next generation of AI agents: we can’t wait to see what you build.
To the enterprises deploying AI at scale: we’re here to make it reliable.
And to the open-source community that made MCP possible: this is just the beginning.
Let’s build the future of AI together.