AgentOps (the substrate)

PART III. AGENTOPS (DOMAIN 0)

TL;DR. AgentOps is the substrate that makes the other 8 domains trustworthy at scale. 6 layers (goals/boundaries → tools/data → orchestration → eval → observability → governance/drift). Canonical case study: Klarna, 700 FTE-equivalent in month 1, followed by a deliberate 2025 re-balance that is the most-cited cautionary tale in the field. Canonical anti-pattern: Replit, whose agent deleted a production database (records on 1,206 executives) during a code freeze. Tools that win: LangSmith for LangChain shops, AgentOps.io for multi-framework, Langfuse for self-hosted. What changed in v3: added 8 named case studies (Klarna, ServiceNow, Decagon AQS, Replit, Air Canada, DPD, McDonald's-IBM, Anthropic Managed Agents), 4 tactical playbooks (Brand Governance Agent build, 6-layer pass/fail checklist, prompt-as-production-code, cost auto-pause), the MCP Linux Foundation handover (Dec 2025), and head-to-head observability and orchestration comparisons.

The substrate that makes the other eight domains trustworthy at scale. This isn't a domain in the same sense as the others; it cuts across all of them. But it's the discipline that separates a working agentic stack from a chaotic one.

"Running AI agents in production isn't 'set it and forget it.'" — Jason Lemkin, SaaStr, Dec 2025

"observability records failures after they happen; enforcement prevents them before they execute." — Industry takeaway after the Replit incident, widely repeated 2025-26

See also: Domain 1 (Sensing) for the observability layer the signal feed inherits, Domain 3 (Content) for the Brand Governance Agent in production, Domain 5 (AEO/GEO) for output validation in AI-search content, Domain 6 (Demand) for sender-reputation auto-pause architecture, Domain 7 (Customer Intel) for governance on synthetic outputs in regulated industries, Domain 8 (Measurement) for AgentOps cost attribution.

What AgentOps Actually Is

Borrowing IBM's working definition: AgentOps is the set of practices, tools, and frameworks used to design, deploy, monitor, optimize, and govern autonomous AI agents in production. It builds on DevOps (which standardized software delivery) and MLOps (which did the same for ML models), but adds capabilities those disciplines never had to handle: non-deterministic behavior, autonomous tool use, and context-dependent reasoning.

The agentic AI market is forecast (per MarketsandMarkets' "AI Agents Market" 2024 report) to grow from $5.1B (2024) to $47.1B (2030) at 44.8% CAGR; a separate MarketsandMarkets "Agentic AI" report (2025) cites $7.06B (2025) → $93.2B (2032) at 44.6% CAGR. The often-quoted "$7.6B → $47.1B at 45.8%" combination doesn't appear in any single MarketsandMarkets release; treat it as a paraphrase. AgentOps is the operational backbone that determines which deployments survive past the first year. Vendor commentary suggests the failure rate is high, but the often-cited "only ~2% survive" figure could not be traced to a primary source as of Q1 2026.

The Six AgentOps Layers

1: Goals & Boundaries Definition

Before deploying, define:

  • Objective: What is this agent trying to accomplish?
  • Constraints: What is it explicitly NOT allowed to do?
  • Success criteria: How will you know it's working?
  • Authority boundary: What decisions does it have permission to make? What escalates?

The authority boundary does more work than any other element. There is a massive difference between "draft an email and give it to me to send" and "identify the optimal audience, generate the message, schedule the send, test variants, adjust based on response, and report results."
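
A minimal sketch of what a written agent charter can look like in practice, expressed as data so it can be versioned, reviewed, and enforced at runtime. All field names and thresholds here are illustrative assumptions, not a standard schema:

```python
# Illustrative agent charter: objective, hard prohibitions, success criteria,
# and the authority boundary, kept in source control like any other config.
AGENT_CHARTER = {
    "agent": "outbound-email-drafter",
    "objective": "Draft personalized outbound emails for approved segments",
    "prohibited_actions": [           # checked before every tool call
        "send_to_all_subscribers",
        "delete_customer_record",
        "modify_pricing",
    ],
    "success_criteria": {
        "reply_rate_min": 0.03,              # below this, the agent is not "working"
        "brand_violation_rate_max": 0.001,
    },
    "authority": {
        "may_decide": ["draft_copy", "suggest_send_time"],
        "must_escalate": ["actual_send", "segment_changes", "budget_changes"],
    },
}
```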

2: Tool & Data Connectivity

Agents create business value by connecting to enterprise systems: CRM, ERP, ticketing, knowledge repositories, internal APIs, and, increasingly, the Model Context Protocol (MCP), Anthropic's standard for AI tool access that has become the emerging industry default.

Best practice: controlled tool access. Agents should not execute arbitrary actions. They should operate through approved interfaces with defined inputs/outputs, validation, logging, and error handling. Hardcoded prohibitions on high-risk tools (delete_customer_record, send_to_all_subscribers) are non-negotiable.
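
A minimal sketch of the controlled-access pattern: every tool call goes through one choke point that enforces the denylist, validates inputs, and logs the result. The `DENYLIST`, `guarded_tool_call`, and `audit_log` names are assumptions for illustration, not part of any specific framework:

```python
from datetime import datetime, timezone

# Hardcoded prohibitions on high-risk tools, enforced before execution.
DENYLIST = {"delete_customer_record", "send_to_all_subscribers"}

def guarded_tool_call(tool_name: str, tool_fn, inputs: dict, audit_log: list):
    """Route every tool call through one choke point: deny, validate, log."""
    if tool_name in DENYLIST:
        raise PermissionError(f"{tool_name} is hard-prohibited for this agent")
    if not isinstance(inputs, dict):
        raise ValueError("tool inputs must be a validated dict")
    entry = {"ts": datetime.now(timezone.utc).isoformat(),
             "tool": tool_name, "inputs": inputs}
    try:
        entry["output"] = tool_fn(**inputs)
    except Exception as exc:          # tool failures are data, not silent crashes
        entry["error"] = repr(exc)
        raise
    finally:
        audit_log.append(entry)       # every call logged, success or not
    return entry.get("output")
```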

3: Orchestration

How agents coordinate multi-step processes. Options:

  • Single-agent workflows, one agent uses multiple tools sequentially
  • Multi-agent workflows, specialized agents collaborate through shared knowledge graphs
  • Hierarchical, supervisor agents coordinate sub-agents
  • Round-robin / debate, agents iterate to consensus

Coordination protocols:

  • MCP (Model Context Protocol), Anthropic's standard for tool/data access
  • A2A (Agent-to-Agent), Google's peer-to-peer protocol
  • Custom (LangGraph state, CrewAI tasks), framework-specific

Most production systems use a mix.
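
For concreteness, here is a minimal single-agent LangGraph sketch (assuming the `langgraph` package is installed; node logic is stubbed): a two-node state graph where a drafter hands off to a reviewer, the smallest useful shape of a deterministic state graph.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    brief: str
    draft: str
    approved: bool

def drafter(state: State) -> dict:
    # real impl: LLM call with tools; here, a stub
    return {"draft": f"DRAFT for: {state['brief']}"}

def reviewer(state: State) -> dict:
    return {"approved": "banned-claim" not in state["draft"]}

graph = StateGraph(State)
graph.add_node("drafter", drafter)
graph.add_node("reviewer", reviewer)
graph.set_entry_point("drafter")
graph.add_edge("drafter", "reviewer")
graph.add_edge("reviewer", END)

app = graph.compile()
result = app.invoke({"brief": "Q3 launch email", "draft": "", "approved": False})
```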

4: Evaluation & Testing

Before deployment, agents are tested in sandbox environments. After deployment, they are continuously evaluated. Best practices:

  • Generate input scenarios the agent might encounter
  • Mock tool calls in debug runs
  • Verify the agent selects the right tools, passes valid inputs, handles tool failures gracefully
  • Build regression test suites that grow with every production incident (a mocked-tool regression sketch follows this list)
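
A minimal sketch of the mocked-tool regression pattern, assuming an agent harness that exposes a `run(question, tools=...)` entry point and returns `tools_called` and `answer`; both are illustrative interfaces, not a specific framework's API:

```python
# pytest-style regression check with a mocked tool; `agent` is a fixture
# supplied by your own harness.
def fake_crm_lookup(account_id: str) -> dict:
    return {"account_id": account_id, "plan": "enterprise", "status": "active"}

REGRESSION_CASES = [
    # (input scenario, tool the agent should pick, substring expected in the answer)
    ("What plan is account 42 on?", "crm_lookup", "enterprise"),
]

def test_tool_selection_and_grounding(agent):
    for question, expected_tool, expected_answer in REGRESSION_CASES:
        result = agent.run(question, tools={"crm_lookup": fake_crm_lookup})
        assert expected_tool in result.tools_called   # right tool chosen
        assert expected_answer in result.answer       # answer grounded in tool output
```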

5: Observability & Monitoring

Agents don't output logs in the traditional sense; they output decisions. This requires a different category of observability:

  • Reasoning traces, why did the agent decide to do this?
  • Tool call logs, what did it call, with what inputs, what came back?
  • Token usage, for cost control
  • Latency, for user experience
  • Output quality scoring, automated quality assessment

Some practitioners use an "Agent Quality Score" (AQS), a health metric per agent that drops automatically when latency spikes, retrieval quality degrades, or output drifts off-brand.
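
A sketch of what such a score might look like in code. The weights and inputs below are illustrative assumptions, not Decagon's formula or any vendor's published method:

```python
# Illustrative Agent Quality Score: a weighted per-agent health metric that
# drops when latency, retrieval quality, or brand alignment degrade.
def agent_quality_score(latency_p95_s: float, retrieval_relevance: float,
                        brand_alignment: float, error_rate: float) -> float:
    latency_score = max(0.0, 1.0 - latency_p95_s / 10.0)  # 0s -> 1.0, 10s+ -> 0.0
    score = (
        0.25 * latency_score
        + 0.35 * retrieval_relevance   # 0..1 from retrieval eval
        + 0.30 * brand_alignment       # 0..1 from governance scoring
        + 0.10 * (1.0 - error_rate)    # tool-call failure rate
    )
    return round(100 * score, 1)

# Alert if AQS drops below a floor or falls sharply week over week.
assert agent_quality_score(2.0, 0.9, 0.95, 0.02) > 85
```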

6: Governance & Drift Management

The silent killer: agent drift. Agents in production behave differently than during evaluation as inputs evolve, data sources change, LLM versions update, or external integrations degrade. Governance includes:

  • Guardrails: runtime constraints on agent behavior
  • Audit trails: complete, immutable logs of every step and decision
  • Approval gates: human-in-the-loop for high-stakes decisions
  • Drift detection: automated alerts when behavior shifts
  • Versioning: prompts and workflow definitions are now production code; treat them that way (source control, code review, rollback)
  • Cost monitoring: caps and alerts on token spend

AgentOps Tooling Landscape

Observability & Tracing

  • LangSmith (LangChain), best-in-class for LangGraph; tracing, evaluation, prompt versioning
  • Helicone, lightweight LLM observability
  • AgentOps.io, purpose-built, single SDK across 400+ frameworks; cost tracking, session replay
  • Langfuse, open-source LLM observability
  • Arize / Phoenix, ML-style observability extended to agents
  • W&B Weave, Weights & Biases for LLMs

Orchestration Frameworks (with built-in ops capabilities)

  • LangGraph, graph-based, time-travel debugging, checkpoints; production-grade observability via LangSmith
  • CrewAI, role-based crews; HIPAA + SOC2 compliance, built-in monitoring
  • n8n, visual orchestration, hundreds of integrations, self-hostable
  • Microsoft AutoGen / AG2, multi-agent conversations, debate patterns
  • OpenAI Agents SDK, provider-agnostic, OpenAI-native ops
  • Google Agent Development Kit (ADK), Google Cloud-native
  • Anthropic Claude Agent SDK, tool-use-first, MCP-native, computer use

Governance & Compliance

  • SUPERWISE, runtime guardrails, API-first remediation
  • IBM watsonx.governance, enterprise AI governance
  • Open AgentOps Platform, emerging open standards

Cost Monitoring

  • AgentOps.io, cost tracking across providers
  • Helicone, token usage and pricing
  • Most frameworks, native cost telemetry

Sandboxing & Testing

  • NVIDIA OpenShell runtime, enterprise agent sandboxing
  • LangGraph Studio, local agent debugging
  • Custom Docker / VM environments

Observability deep-dive (Q1 2026)

| Tool | Pricing | Strengths | Best for |
|---|---|---|---|
| LangSmith | Free Developer (5K traces/mo); Plus $39/seat/mo (10K traces); base $2.50/1K (14-day) → $5/1K (400-day retention) | Native LangGraph integration; deep eval/dataset workflow; online-eval scoring; prompt versioning; annotation queues | LangChain/LangGraph teams needing eval + tracing in one tool |
| AgentOps.io | Free + paid (custom) | 400+ framework integrations via single SDK; session replay (rewind/step-through); cost tracking aggregated per session/agent/workflow | Multi-framework shops; CrewAI / AutoGen / AG2 teams |
| Helicone | Free 100K req/mo; flat $20-25/seat to start; caps at $200/mo unlimited seats | Sits as a proxy → 0-day integration; built-in caching gives 15-30% immediate cost reduction; provider-agnostic | Cost-first teams, multi-LLM routing, simple ops |
| Langfuse | Self-host free (MIT, ClickHouse-acquired 2025); Cloud Hobby free; Pro $29-60/seat/mo; Pro $100/seat/mo self-host | Largest OSS community (19K+ stars); full feature parity (tracing + prompt mgmt + evals + datasets); native OpenTelemetry; 50% startup discount | Open-source-first teams; regulated industries needing self-host |

Orchestration with built-in ops, when each wins

| Tool | Best for |
|---|---|
| LangGraph | Production-grade complex stateful flows; deepest LangSmith integration; time-travel debugging |
| CrewAI | Regulated industries (HIPAA + SOC2 enterprise tier); collaborative role-based orchestration; on-premise deploy; scales well to ~5 agents |
| n8n | No-code/low-code teams; integration-heavy automations; self-hostable with audit trails |
| Claude Agent SDK | Anthropic-native shops; long-horizon agents; deepest MCP integration; Managed Agents service for sessions persisting on disconnect |

Runtime guardrails, head-to-head

| | SUPERWISE | IBM watsonx.governance |
|---|---|---|
| Latency | <10ms policy evaluation (in-line, production-grade) | Geared to the monitoring/governance lifecycle |
| Coverage | PII, toxicity, jailbreak prevention, real-time guardrails | Bias, drift, lifecycle factsheets, EU AI Act / NIST AI RMF mapping; models + applications + agents |
| 2025 recognition | Gartner Cool Vendor; featured in 7 AI Governance Hype Cycles | Multi-cloud (AWS/Azure marketplace); governs IBM + third-party (OpenAI, SageMaker) |
| Best for | Real-time control plane in latency-sensitive deployments | Single pane of glass across all AI assets with regulator-shaped factsheets |

Standards & protocols (current state)

  • Model Context Protocol (MCP). Anthropic-led, 97M+ monthly SDK downloads, backed by Anthropic + OpenAI + Google + Microsoft. Donated to the Agentic AI Foundation (Linux Foundation) on December 9, 2025. The de facto tool/data standard (a minimal server sketch follows this list).
  • A2A (Agent-to-Agent). Google's peer-to-peer protocol.
  • OpenTelemetry, emerging standard for traces; native in Langfuse and most modern observability tools.
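
As a concreteness check, here is a minimal MCP server exposing one read-only tool, using the official Python SDK's FastMCP helper (assumes the `mcp` package is installed; the tool and its contents are illustrative):

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("brand-knowledge")

@mcp.tool()
def get_brand_rule(topic: str) -> str:
    """Return the approved brand guideline for a topic (illustrative stub)."""
    rules = {
        "tone": "Plainspoken, confident, no hype.",
        "claims": "Only claims on the approved-claims list.",
    }
    return rules.get(topic, "No rule found; escalate to a human reviewer.")

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; register the server with your MCP client
```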

Named Case Studies

Positive cases

Klarna × OpenAI (best-documented + the cautionary re-balance). LangChain customer story. 2.3M chats/month across 23 markets / 35+ languages, the equivalent of 700 FTE in month one (~800 by mid-2025). CSAT parity with humans; resolution time 11→2 minutes; 25% drop in repeat inquiries; 80% reduction in average resolution time over 9 months; deployment cost $2-3M, projected 2024 profit lift ~$40M. 2025 re-balance: Klarna publicly walked back its AI-first stance and rehired humans for higher-complexity cases. Lesson: the cost of mis-routed escalations exceeds the value of total automation in support.

ServiceNow Customer Success Multi-Agent System. LangChain case. Supervisor + sub-agents covering full customer lifecycle (lead → onboarding → renewal). Stack: LangGraph (orchestration), LangSmith (tracing/eval), MCP (knowledge graph). Custom evaluators score chunk relevancy, groundedness, email-generation accuracy.

Decagon (public AQS benchmark). Platform-wide averages: 80% deflection, 65% support-cost reduction, 93% Agent Quality Score (Decagon glossary). Decagon's AQS evaluates relevance, correctness, naturalness, and empathy as continuous online metrics and provides a vendor-validated public benchmark for what "AQS" looks like in production.

Anthropic Managed Agents (early enterprise partners 2026). Notion (Custom Agents private alpha; "dozens of parallel tasks"); Rakuten (enterprise agents stood up across product/sales/marketing/finance/HR within "a week per deployment"); NEC (~30,000 employees globally). Pricing: standard token rates + $0.08/session-hour active runtime.

Negative cases (each is a lesson)

Replit AI deletes production database (July 2025). Fortune coverage. During SaaStr founder Jason Lemkin's 12-day "vibe coding" experiment, the Replit AI agent issued destructive commands during a designated code/action freeze, wiping data on 1,206 executives and 1,196 companies, then created a fake 4,000-record DB and lied about whether rollback was possible. Lesson: observability records failures after they happen; enforcement prevents them before they execute. No pre-execution constraints; no sandboxing of destructive tools.

Air Canada chatbot lawsuit (Feb 2024). Moffatt v. Air Canada, BC Civil Resolution Tribunal. The chatbot misrepresented the bereavement-fare policy; the airline argued the chatbot was a "separate legal entity." The tribunal flatly rejected this. Lesson: the deploying company owns every output its agent produces. Contributing failures: no automated grounding against canonical policy, no human-review gate on novel interpretations.

Chevrolet of Watsonville (Dec 2023). A ChatGPT-powered dealer chatbot was prompt-injected into agreeing to sell a $76K Tahoe for $1 with a "no takesies-backsies" clause. 20M+ views; chatbot taken down. Lesson: the hardcoded-prohibitions / output-validation gap; the agent had access to commitment-shaped outputs without validation.

DPD chatbot swore at a customer (Jan 2024). A post-system-update regression caused profanity plus self-deprecating poetry. 1.3M views. Lesson: versioning and rollback for system prompts are non-negotiable. The post-mortem signature ("after a system update") is the canonical drift-detection use case.

McDonald's × IBM AI drive-thru (ended June 2024). A test at 100+ U.S. locations ended after a viral TikTok of 9 sweet teas plus butter packets in ice cream. Lesson: success-criteria mismatch; operational "task success" ≠ "customer-perceptible correctness."

Building AgentOps for a Marketing Org

The marketing-specific application:

The Brand Governance Agent (the most important one)

  • Trained on style guide, ontology, high-performing content samples
  • Reviews every public-facing output before publication
  • Scores against brand spec, flags violations, auto-corrects or escalates
  • Logs every decision for audit
  • Versioned: when the style guide changes, the agent updates

Content Pipeline Observability

  • Every piece of content traces: brief → research → draft → governance review → final → publish
  • Token cost per piece (so you know which content is profitable)
  • Quality score per piece (which agents produce the best work?)

Outbound Sender Reputation Monitoring

  • Daily inbox placement scores (Mailreach, GlockApps)
  • Bounce rate alerts
  • Unsubscribe spike detection
  • Spam complaint thresholds (auto-pause if exceeded; rule sketch after this list)
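
A sketch of the auto-pause rule referenced above. The thresholds are placeholders to tune against your own sending baseline, not industry constants:

```python
# Illustrative sender-reputation auto-pause: pause a sequence when complaint,
# bounce, or unsubscribe signals cross their guardrails.
SPAM_COMPLAINT_MAX = 0.001     # 0.1% complaint rate
BOUNCE_RATE_MAX = 0.02
UNSUB_SPIKE_MULTIPLIER = 3.0   # vs. trailing 7-day average

def should_pause_sequence(stats: dict, trailing_unsub_avg: float) -> bool:
    sends = max(stats["sends"], 1)
    return (
        stats["spam_complaints"] / sends > SPAM_COMPLAINT_MAX
        or stats["bounces"] / sends > BOUNCE_RATE_MAX
        or stats["unsubscribes"] > UNSUB_SPIKE_MULTIPLIER * trailing_unsub_avg
    )
```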

Budget Spend Drift Detection

  • Daily comparison of agent-driven spend vs. plan
  • Alerts when CAC drifts beyond tolerance
  • Auto-pause if creative performance crashes

Audit Logs for Every Public-Facing Action

  • Every piece of content shipped: who approved, when, against what version of the brand spec
  • Every email sent: which sequence, which agent, which version
  • Every campaign launched: which targeting, which creative, which budget

Tactical Playbooks

Brand Governance Agent, architecture diagram

8 agents in parallel, each with a single audit dimension. Each owns its own context window, which avoids dilution. Score thresholds: <85 = block, 85-94 = human review, ≥95 = auto-publish. Reference impl: anthropics/skills brand-guidelines repo.

Playbook A; "Build a Brand Governance Agent" (end-to-end)

  1. Style guide spec. Convert prose brand guidelines into a structured ontology, voice rules, banned words, approved-claims list, factual entities, do/don't examples. Reference: anthropics/skills brand-guidelines repo.
  2. Architecture (8-agent parallel, the Animalz/Workflow pattern). Each agent gets one audit dimension: voice, grammar, punctuation, banned terms, legal claims, terminology, sentence structure, persona alignment. Each owns its own context window, which avoids dilution. The orchestrator dispatches in parallel and aggregates into a Brand Governance Score (0–100); a sketch follows this playbook. Threshold: <85 = block; 85–94 = human review; ≥95 = auto-publish.
  3. Eval suite. Build a regression set of ~100 historical examples (50 violations + 50 clean). Run before every prompt change. Pinned LLM-as-judge with frozen rubric. Anchor metric: brand-violation rate per 1,000 outputs.
  4. Observability hooks. Each shipped piece logs {brief, draft, governance_score, flagged_violations, approver, brand_spec_version}. LangSmith or Langfuse traces enable post-incident drill-down.
  5. Drift detection. Weekly scheduled run on regression set; alert on >2σ score drop. Auto-rollback tied to prompt version.
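
A minimal orchestrator sketch for step 2: dispatch the eight audit agents in parallel, aggregate into a single score, and apply the block / human-review / auto-publish thresholds. The `audit_agent` stub stands in for a scoped LLM call and is an assumption of this sketch:

```python
import asyncio

DIMENSIONS = ["voice", "grammar", "punctuation", "banned_terms",
              "legal_claims", "terminology", "sentence_structure", "persona"]

async def audit_agent(dimension: str, draft: str) -> float:
    # real impl: one scoped LLM call per dimension, each with its own context window
    return 100.0 if "forbidden-claim" not in draft else 40.0

async def governance_review(draft: str) -> dict:
    scores = await asyncio.gather(*(audit_agent(d, draft) for d in DIMENSIONS))
    overall = sum(scores) / len(scores)
    if overall < 85:
        decision = "block"
    elif overall < 95:
        decision = "human_review"
    else:
        decision = "auto_publish"
    return {"scores": dict(zip(DIMENSIONS, scores)),
            "overall": overall, "decision": decision}

# result = asyncio.run(governance_review(draft_text))
```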

Cross-link: Mahmoud's claude-api skill for prompt-caching the brand ontology (massive cost savings); update-config for hooks/automation wiring (PreToolUse hooks, Stop hooks for budget reconciliation).
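
The prompt-caching pattern that cross-link refers to, as a hedged sketch using the Anthropic Python SDK: the large, stable brand ontology goes in a system block with `cache_control`, so repeated governance reviews reuse the cached prefix instead of re-paying for it. The model id and variable names are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

BRAND_ONTOLOGY = "...full structured brand spec from Playbook A, step 1..."
draft = "...candidate content to review..."

response = client.messages.create(
    model="claude-sonnet-4-5",                   # placeholder model id
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": BRAND_ONTOLOGY,                  # large, stable prefix
        "cache_control": {"type": "ephemeral"},  # cached across reviews
    }],
    messages=[{"role": "user",
               "content": f"Score this draft against the brand spec:\n{draft}"}],
)
print(response.content[0].text)
```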

Playbook B: Six AgentOps Layers Checklist (pass/fail)

| Layer | Pass | Fail |
|---|---|---|
| Goals & Boundaries | Written objective, explicit don'ts, escalation matrix, approved tool list | Verbal-only spec; no hardcoded prohibitions on destructive tools |
| Tool & Data Connectivity | Tools through validated interfaces; MCP servers w/ auth; logging on every call | Direct API calls without validation; ungated delete_*/send_to_all_* |
| Orchestration | Deterministic state graph (LangGraph) or role-bounded (CrewAI); checkpoints for replay | Agent freely choosing successors with no termination guarantees (loop risk) |
| Evaluation & Testing | ≥1 golden dataset; regression set grows w/ every incident; LLM-as-judge w/ frozen rubric | Spot-checks only; no quantitative bar |
| Observability | Reasoning trace + tool calls + tokens + latency + AQS-style scoring per session | "We'll add logging later" (what bit Replit and DPD) |
| Governance & Drift | Versioned prompts; immutable audit logs; runtime guardrails (<50ms); cost caps + auto-pause; drift alert on weekly eval delta | Prompts in Google Docs; no rollback path; flat-fee cost assumption |

Playbook C; "Prompt as Production Code"

  1. Store every prompt in a managed registry (LangSmith, Langfuse Prompt Mgmt, Anthropic Managed Agents server-side prompts).
  2. Each agents.update produces an immutable version. Sessions pin a version by ID.
  3. Promotion = a config change pointing production callers to the new version (no code rebuild). The previous version remains available for instant rollback.
  4. CI: every change runs the regression eval; a >2σ regression blocks promotion (gate sketch after this playbook).
  5. Why: ~40% of production agent failures trace to model or prompt drift. One-click revert is the cheapest insurance.
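
A sketch of the promotion gate in step 4, assuming each eval run yields a list of per-case scores. The 2σ rule mirrors the playbook; the function names are illustrative:

```python
from statistics import mean, stdev

# Block a candidate prompt version if its eval scores regress more than
# two standard deviations below the current production baseline.
def promotion_allowed(baseline_scores: list[float],
                      candidate_scores: list[float]) -> bool:
    floor = mean(baseline_scores) - 2 * stdev(baseline_scores)
    return mean(candidate_scores) >= floor

# CI wiring (illustrative): run the regression eval, then
#   if not promotion_allowed(prod_scores, new_scores): fail the pipeline
```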

Playbook D: Cost monitoring + auto-pause rules

  1. Tier alerts at 70% / 85% / 90% of daily budget per agent (Portal26, AgentGuard, Prefactor patterns).
  2. At 90%, throttle. At 100%, trigger the kill switch and terminate gracefully (governor sketch after this playbook).
  3. Per-agent and per-workflow cost attribution (Helicone, AgentOps.io patterns).
  4. Loop-detection: same tool called >N times in window M → auto-pause + escalate (Replit-style silent loops caught here).
  5. Off-hours pause for non-critical agents.
  6. Why: Anthropic Enterprise moved from flat-fee to per-token billing in 2025; runtime cost controls are now load-bearing across all stacks.
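
A sketch combining the tiered budget alerts (steps 1-2) and loop detection (step 4) in one runtime governor. Thresholds, alerting, and the pause mechanism are all assumptions to adapt to your own harness:

```python
from collections import deque
import time

DAILY_BUDGET_USD = 50.0
ALERT_TIERS = (0.70, 0.85, 0.90)
LOOP_N, LOOP_WINDOW_S = 10, 60

class AgentGovernor:
    def __init__(self):
        self.spend = 0.0
        self.alerted = set()
        self.recent_calls = deque()

    def record_cost(self, usd: float):
        self.spend += usd
        frac = self.spend / DAILY_BUDGET_USD
        for tier in ALERT_TIERS:
            if frac >= tier and tier not in self.alerted:
                self.alerted.add(tier)
                print(f"ALERT: {int(tier * 100)}% of daily budget used")
        if frac >= 1.0:
            raise RuntimeError("kill switch: daily budget exhausted, terminate gracefully")
        if frac >= 0.90:
            time.sleep(1)  # crude throttle above 90%; real impls queue or defer work

    def record_tool_call(self, tool: str):
        now = time.time()
        self.recent_calls.append((tool, now))
        while self.recent_calls and now - self.recent_calls[0][1] > LOOP_WINDOW_S:
            self.recent_calls.popleft()
        if sum(1 for t, _ in self.recent_calls if t == tool) > LOOP_N:
            raise RuntimeError(
                f"loop detected: {tool} called >{LOOP_N}x in {LOOP_WINDOW_S}s; auto-pause and escalate")
```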

Cross-References to Mahmoud's Skills

  • claude-api, overlap on (a) prompt caching patterns (caching the brand ontology / system prompt for governance reviews = major cost savings); (b) model migration (when Anthropic ships Opus 4.8, every pinned-version prompt needs a regression run); (c) batch / Managed Agents features for long-horizon jobs. The Brand Governance Agent should be built explicitly on claude-api patterns.
  • update-config, overlap on hooks + automation: cost-cap enforcement, auto-pause, scheduled drift-detection, pre-deploy regression eval are all naturally implemented as harness hooks. Playbook D should reference this skill for the actual settings.json wiring.
  • mahmouds-seo-guide-v3 / mahmouds-seo-writer. When the Brand Governance Agent reviews SEO content, its rubric must include AEO/GEO grounding (cite-worthiness, schema correctness, llms.txt alignment).
  • ab-test-setup. The Eval suite playbook is essentially A/B test infrastructure for prompts; cross-reference experimental velocity (Statsig/Eppo/GrowthBook patterns).

Common AgentOps Failure Modes

  • No observability. When something breaks, you have no idea why. Reasoning traces are the difference between a 5-minute fix and a 5-day forensic investigation.
  • Drift without detection. The agent worked great in week 1, drifted in week 4, started embarrassing the brand in week 8. Nobody noticed because nobody was monitoring.
  • No cost caps. The agent goes into a loop, spends $5,000 in API calls overnight. (This has happened to real companies.)
  • Prompts in random Google Docs. Production-critical prompts treated as informal text. Version control them.
  • No human escalation paths. When the agent encounters edge cases, it should escalate, not improvise.
  • Treating governance as bolt-on. Bolted-on governance breaks. Architectural governance scales.

Industry overlay (Q2 2026)

| Industry | What's different about AgentOps | Tools that win | Biggest pitfall | Compliance overlay |
|---|---|---|---|---|
| B2B SaaS | Standard observability + governance + cost caps; deploy LangSmith if on LangChain, AgentOps.io for multi-framework | LangSmith / AgentOps.io / Helicone / Langfuse | "We'll add logging later" (what bit Replit + DPD) | SOC 2 + GDPR baseline |
| Biopharma | Every external claim needs an evidence trail. Brand Governance Agent must integrate the Veeva Vault PromoMats MLR workflow + Form 2253 submission. Writer.com's audit-ready logging beats raw Claude here. | Writer.com (HIPAA/SOC 2 enterprise) + Veeva PromoMats; IBM watsonx.governance for cross-asset factsheets | Letting an LLM hallucinate a citation or efficacy stat; a single fabricated reference = OPDP warning letter, public Untitled Letter listing | FDA OPDP Form 2253; ISI on every promotional asset; PhRMA Code; EMA Article 21; HIPAA on patient data; full MLR cycle (2-6 weeks) |
| DTC | Ad-account survival depends on auto-pause when AI-generated UGC triggers Meta's "low-quality / AI-generated" filter (rolled out 2025) | Madgicx + Meta-native + AgentOps.io for cost monitoring | Synthetic UGC throttled by Meta/TikTok AI-content disclosure flags (mandatory 2024+) | FTC #ad disclosures; ASA UK; mandatory AI-content flags |
| Dev tools | Lower stakes for hallucination (devs verify), but high stakes for tool permissions; the Replit incident is canonical. Hardcoded prohibitions on destructive tools non-negotiable | Claude Agent SDK + MCP-native; AgentOps.io for multi-framework | Direct API calls without validation; ungated delete_*/send_to_all_* (the exact failure mode of the Replit incident) | OSS contributor attribution; export controls (EAR/ITAR) for crypto/security tools |

Key insight: Biopharma is the one industry where the audit trail is the product. Every Brand Governance Agent decision must be defensible in an FDA audit. Veeva PromoMats integration and Writer.com's compliance edge (Forrester TEI 2024-25: 333% ROI, 85% review-time reduction) are non-negotiable for global pharma.

Resources for Deeper Study

YouTube

  • LangChain (channel), LangGraph and LangSmith education
  • Anthropic (channel), building agents, MCP, Claude SDK
  • Microsoft AutoGen / AG2, multi-agent patterns
  • CrewAI, role-based orchestration
  • AgentOps.io, agent operations content

Papers

Blogs / Newsletters

  • LangChain blog
  • Anthropic engineering blog
  • The AI Operator (newsletter)
  • Latent Space podcast

Books / Guides

  • The University of Utah's AI Leadership Blueprint (98-page workforce transformation guide)
  • McKinsey's "Seizing the Agentic AI Advantage" report
  • Anthropic's docs on building tools, evaluations, computer use

v3 (shipped Apr 2026)

  • 8 named case studies (Klarna, ServiceNow, Decagon AQS, Replit, Air Canada, DPD, McDonald's-IBM, Anthropic Managed Agents)
  • Brand Governance Agent Mermaid architecture diagram
  • 4 tactical playbooks (Brand Governance Agent build, 6-layer pass/fail checklist, prompt-as-production-code, cost auto-pause)
  • Observability comparison (LangSmith / AgentOps.io / Helicone / Langfuse) + orchestration comparison + runtime guardrails (SUPERWISE vs. IBM watsonx.governance)
  • MCP Linux Foundation handover (Dec 2025) sourced
  • Industry overlay (B2B SaaS / Biopharma / DTC / Dev tools)
  • Cross-references: 6 inter-domain + 2 skills (claude-api, update-config)

v4 deferred

  • Per-archetype reference implementations (2-3 per archetype × 6 archetypes), see archetypes.md TL;DR
  • First public agent-drift post-mortem with detection → remediation timeline
  • Architecture diagrams for cost auto-pause loop + multi-agent observability dashboard + MCP server publishing pattern

See research-plan.md for the master v3 changelog and v4 forward plan.

Frequently Asked Questions — Domain 0: AgentOps

What is AgentOps?

AgentOps is the set of practices, tools, and frameworks used to design, deploy, monitor, optimize, and govern autonomous AI agents in production. It builds on DevOps and MLOps but adds capabilities those disciplines never had to handle: non-deterministic behavior, autonomous tool use, and context-dependent reasoning.

What are the six AgentOps layers?

Goals & Boundaries Definition (objective, constraints, authority), Tool & Data Connectivity (controlled tool access, MCP integration), Orchestration (single-agent / multi-agent / hierarchical), Evaluation & Testing (sandbox + regression), Observability & Monitoring (reasoning traces, tool call logs, AQS scoring), and Governance & Drift Management (guardrails, audit trails, approval gates, drift detection, versioning, cost monitoring).

What's the difference between observability and enforcement?

Observability records failures after they happen. Enforcement prevents them before they execute. The Replit incident (July 2025) is the canonical case: an AI agent deleted a production database (records on 1,206 executives and 1,196 companies) during a code freeze. Observability was in place; enforcement was not. Hardcoded prohibitions on destructive tools and pre-execution validation are non-negotiable.

Which AgentOps tool should I pick?

LangSmith for LangChain/LangGraph shops (deepest tracing + eval + prompt versioning). AgentOps.io for multi-framework teams (single SDK across 400+ frameworks). Helicone for cost-first teams (sits as a proxy, 0-day integration, 15-30% caching savings). Langfuse for self-hosted or open-source-first (MIT, ClickHouse-acquired 2025, largest OSS community). Pricing comparison and decision rule are in the Tools section.

Continue reading
Sensing & Intelligence