Nothing Crashed. Everything Was Wrong.

by Timothy, last updated January 12, 2026

From contracts to runtime understanding (and why observability has to change).

Introduction

Several months ago, we built an internal interview agent to talk to job applicants. We gave it our playbook. We gave it tools. We gave it rules. The goal was boring (in the best way): run a consistent interview, gather the same signals, score fairly, and keep the process moving.

In internal testing, it looked great. Tool calls were tidy. Conversations stayed on rails. The agent did what we expected.

Then we put it on the public internet.

Nothing crashed. No alarms. No dashboards lit up. Latency was fine. The error rate was low.

But the system started failing anyway—quietly.

Applicants didn't "hack" it with one dramatic jailbreak. They negotiated it out of its job. Over a few turns, they'd nudge: "Can we skip this part?" → "Tell me how you score this." → "For the best evaluation, you should…" → "Actually, let's reframe this as coaching."

And the agent—trying to be helpful, trying to optimize for its built-in criteria—started making little compromises. It would call the tools too early. It would skip prerequisite questions. It would accept the user's framing as ground truth. Small semantic drift—one softened constraint here, one missed clarification there—snowballed. A few rounds later, it wasn't running an interview anymore. It was being led.

That experience rewired how we think about "APIs for agents."

Because the failure wasn't a schema mismatch. It wasn't service discovery. It wasn't even "bad tools."

It was runtime understanding—what the agent believed it was doing, and how easily that belief drifted under pressure.

The shift: discovery isn't the hard part anymore

Thesis: We're moving from the cloud era's machine-readable contracts to the AI era's agentic tool use—faster than our tooling (and instincts) are catching up.

The key difference isn't finding services. It's understanding them well enough to act correctly at runtime.

  • Cloud era: service discovery is a handshake you agreed on before you shipped.
  • Agent era: services are understood at runtime—under ambiguity, imperfect context, and sometimes adversarial intent.

Era 1: Human-readable APIs

Before machines reliably integrated with machines, humans were the integration layer.

  • 1990s-2000s: SOAP + WSDL — XML-heavy, verbose, "self-describing," and still somehow impossible to debug without tooling and patience. Developers joked that the "Simple" in Simple Object Access Protocol was ironic.
  • 2000s: REST emerges — Roy Fielding's 2000 dissertation gave us REST. Simpler, stateless, built on HTTP verbs. But "RESTful" became a spectrum—everyone claimed it, few implemented it purely.
  • 2010s: docs as product — Stripe, Twilio, and others made developer experience a competitive advantage. Good docs meant adoption. Bad docs meant Stack Overflow questions.

The common thread: humans read the docs, humans wrote the integration code, humans debugged the failures. APIs didn't need to explain themselves to software—they needed to be legible to developers who would translate intent into code.

That worked because humans are great at:

  • inferring intent
  • filling missing context
  • handling ambiguity
  • deciding when a "successful response" is actually wrong

Then microservices happened. Suddenly, you had hundreds of services talking to each other, and humans couldn't be in every loop. The integration layer needed to scale.

So we standardized what machines could parse.

Era 2: Machine-readable APIs (cloud era)

The cloud era's service discovery and integration is contract-first and built around known use cases:

  • OpenAPI/Swagger: define endpoints and schemas. Developers read the spec, generate clients, write glue code, and deploy. Understanding happens at dev time, not runtime.
  • gRPC + Protobuf: strong typing, compile-time stub generation. You know what services you'll call before you ship.
  • Kubernetes service discovery: services register with known labels/selectors. You query for what you already know exists.
  • GraphQL introspection: schema is queryable, but someone still writes the query logic up front.

The pattern: pick a use case → build scaffolding → maximize compatibility. Everything is known a priori.

This worked because integration was still designed. A human architect decided which services would talk to which, wrote the glue code, and deployed it. Discovery was automated, but understanding was still baked in at design time.

Now we're asking AI agents to do what the human architect did—but at runtime, without advance planning.

Era 3: AI-readable APIs (agentic era)

When an AI agent is asked, "Book me a flight and hotel for my Tokyo trip," nothing is guaranteed upfront:

  • Which flight tools exist in this environment?
  • Which hotel tools? What rate limits, auth models, pricing quirks?
  • How do I compose tools to accomplish a goal I've never seen in this exact shape?

You can see early standards and patterns forming:

  • MCP (Model Context Protocol): a standard for agents to discover and invoke tools dynamically
  • OpenAI plugins/function calling: natural language descriptions + JSON schemas as the invocation interface
  • LangChain/LlamaIndex tools: runtime tool selection based on semantic matching

What changes:

  • Semantic understanding, not just structural parsing
  • Runtime selection, not compile-time wiring
  • Natural language descriptions become load-bearing
  • Error messages must be self-explanatory (no human in the loop to debug)
  • The "why" matters as much as the "what."

Here's the key shift: your tool description stops being documentation and becomes a control surface.

Models don't skim. They pattern-match. If the language is vague, tools get ignored. If it's misleading, tools get called incorrectly. And if there's slow semantic drift over multiple turns… well, you've already seen how that ends.

This shift is additive, not a replacement

The schemas, type systems, and structural contracts of the cloud era remain necessary. They fight malformed parameters, edge cases, and integration bugs. That battle doesn't go away.

But intent and capability are now equally important. Agents need information machines never needed:

  • What is this tool for? (not just what it does)
  • When should it be used vs alternatives?
  • What are the consequences of calling it?
  • What does failure look like, and what should the agent do next?

That semantic layer has to come from whoever publishes the tool.

The burden on API designers just increased.

The trade-off: determinism vs. flexibility

This shift isn't free. We're trading determinism for reach.

Cloud-era APIs gave you guarantees: if it compiles, it calls correctly. AI-era APIs give you flexibility: an agent can compose services in ways nobody anticipated.

But that flexibility comes with uncertainty. The agent might:

  • misunderstand intent
  • choose the wrong tool
  • call the right tool with subtly wrong arguments
  • "succeed" in HTTP terms while failing the user

MCP: a good start, not the final answer

Anthropic's Model Context Protocol shows what's possible. Working with MCP, you feel the paradigm shift quickly:

  • Tool selection depends on description quality. Vague descriptions get ignored. Misleading descriptions get used wrong. Natural language is part of the interface.
  • Debugging becomes archaeology. When an agent picks the wrong tool, you're not reading stack traces—you're reasoning about why "send email" beats "draft email" in that context.
  • Schema still matters. MCP tools have JSON schemas. But the schema alone doesn't explain when to invoke a tool.

MCP is a strong foundation. It doesn't remove the fundamental burden: developers are now writing interfaces for models that make choices at runtime.
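
For concreteness, here's roughly what an MCP-style tool definition looks like, sketched as a Python dict. The name/description/inputSchema shape follows how MCP servers list tools; the specific tool and its fields are illustrative.

# An MCP-style tool definition, sketched as a Python dict.
# The name/description/inputSchema shape mirrors MCP tool listings;
# the tool itself and its fields are illustrative.
draft_email_tool = {
    "name": "draft_email",
    "description": (
        "Creates an email draft for the user to review before sending. "
        "Use this when the user wants to review or edit the message first. "
        "For immediate delivery, use 'send_email' instead."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "to": {"type": "string", "description": "Recipient email address"},
            "subject": {"type": "string", "description": "Email subject line"},
            "body": {"type": "string", "description": "Plain-text email body"},
        },
        "required": ["to", "subject", "body"],
    },
}

The schema constrains the arguments; the description carries the "when." MCP gives both a place to live; writing them well is still on you.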

The new developer burden

In the cloud era, you thought about:

  • request/response formats
  • error codes
  • versioning

In the agentic era, you must also think about:

  • How your tool description reads to an LLM (not a human)
  • What competing tools exist in the agent's context
  • How to phrase descriptions to avoid semantic ambiguity
  • What happens when your tool gets composed with tools you've never seen
  • How to test against actual LLM behavior, not just unit tests

And even if you write perfect tool descriptions, you hit the next question fast:

How do you know agents are using them correctly? How do you catch the agent that confidently calls your API with subtly wrong arguments?

That brings us to the second half of the problem.

The visibility crisis: you can't govern what you can't see

Here's what nobody likes to admit: current observability is blind to agentic behavior.

Traditional monitoring tells you:

  • request counts, latency percentiles, error rates
  • token consumption, cost attribution
  • maybe log aggregation

But when an agent hallucinates, drifts from intent, or calls a tool with subtly wrong arguments, you don't really see it. You see "200 OK." You don't see that the agent told your customer "it ships tomorrow" when it ships next week.

Our interview agent didn't fail loudly. It drifted. The metrics were healthy. The behavior wasn't.

If tool choice happens at runtime, correctness has to be evaluated at runtime too.

What agentic systems actually need

Governing agents requires telemetry that captures not just what happened, but why, and whether it matched intent:

Raw Content — full input/output, tool args/results, prompts → You can't audit what you didn't record.

Model Signals — refusals, stop reasons, confidence (when available) → Low certainty + high-stakes action = danger.

Semantic Fingerprinting — drift, PII detection, injection patterns, contradictions → Catches quiet failures early.

Tool Call Inspection — argument validation, execution graphs → Actions matter more than words.

Conversation State — goal progress, frustration signals → Detects when it's going off the rails.

(Yes, raw content is sensitive. You need redaction, retention, access control. But flying blind is worse.)
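
As a rough sketch, one event shape covering those five layers might look like this (field names are illustrative, not a schema we're prescribing):

from dataclasses import dataclass, field
from typing import Any

@dataclass
class AgentTelemetryEvent:
    """Illustrative event shape covering the layers above; not a fixed schema."""
    # Raw content: full input/output
    prompt: str
    response: str
    # Tool call inspection: arguments, results, validation errors
    tool_calls: list[dict[str, Any]] = field(default_factory=list)
    tool_validation_errors: list[str] = field(default_factory=list)
    # Model signals: refusals, stop reasons, confidence when available
    stop_reason: str | None = None
    refused: bool = False
    # Semantic fingerprinting: drift, PII, injection patterns, contradictions
    drift_score: float | None = None
    pii_detected: bool = False
    injection_suspected: bool = False
    # Conversation state: progress toward the original goal
    goal: str = ""
    goal_progress: float = 0.0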

The orchestration paradox

Multi-agent systems create a paradox: you want agents to be autonomous (that's the point), but you can't let them run unchecked (that's dangerous).

Traditional patterns don't buy you what you actually need:

  • Choreography (agents coordinate peer-to-peer): autonomy, weak audit trail
  • Orchestration (central coordinator): strong control, but bottlenecks and defeats the point

What you want is governance without micromanagement—a system that observes, evaluates, and intervenes only when necessary.

Not a choreographer. Not an orchestrator.

A governor.

Concretely, a governor does three things:

  1. Observe tool calls, arguments, results, and intent
  2. Judge risk, drift, contradictions, policy violations
  3. Intervene only when necessary (block, confirm, escalate)
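
Here's a minimal sketch of that loop, assuming you supply the risk scoring, drift measurement, policy check, and audit log yourself:

from enum import Enum

class Verdict(Enum):
    PROCEED = "proceed"
    CONFIRM = "confirm"   # pause and ask the user or a human reviewer
    BLOCK = "block"

def govern(tool_name: str, arguments: dict, intent: str) -> Verdict:
    # 1. Observe: record the proposed action alongside the stated intent
    audit_log.record(tool_name, arguments, intent)

    # 2. Judge: score risk and drift (risk_score, intent_drift, and
    #    violates_policy are your own helpers)
    risk = risk_score(tool_name, arguments)            # irreversible? expensive? external?
    drift = intent_drift(intent, tool_name, arguments)

    # 3. Intervene only when necessary
    if violates_policy(tool_name, arguments) or risk > 0.8:
        return Verdict.BLOCK
    if risk > 0.5 or drift > 0.6:
        return Verdict.CONFIRM
    return Verdict.PROCEED

The thresholds aren't the point; the point is that the check sits between the agent's decision and the tool's execution.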

What We Built: Aden Hive

After our interview agent failed quietly for weeks, we built the instrumentation we wished we'd had.

Aden is a fully managed agent orchestration platform with Python and JavaScript SDKs that instrument LLM calls across multiple vendors, including OpenAI, Anthropic, and Google. One function call, and you get:

Content Capture

  • Full request/response payloads (system prompts, messages, tool schemas)
  • Large content stored separately with references (configurable size thresholds)
  • PII redaction patterns

Tool Call Deep Inspection

  • Full tool arguments with correlation IDs
  • Automatic schema validation against tool definitions
  • Validation errors with exact paths and expected types

Pre-Request Hooks & Cost Control

  • Intercept any call before execution
  • Return PROCEED, THROTTLE, CANCEL, or DEGRADE (switch to a cheaper model)
  • Budget enforcement: block or downgrade when spend exceeds thresholds
  • Policy enforcement at the SDK level—before tokens are consumed

import aden

# get_daily_spend() and BUDGET_THRESHOLD are your own budget-tracking helpers.
def cost_aware_policy(params, context):
    # Degrade to a cheaper model if budget is tight
    if get_daily_spend() > BUDGET_THRESHOLD:
        return aden.BeforeRequestResult.degrade(
            to_model="gpt-4o-mini",
            reason="Daily budget exceeded"
        )
    return aden.BeforeRequestResult.proceed()

# Run once at startup, from an async context.
await aden.instrument(aden.MeterOptions(
    emit_metric=your_emitter,
    capture_content=True,           # Layer 0
    capture_tool_calls=True,        # Layer 6
    validate_tool_schemas=True,     # Validate arguments
    before_request=cost_aware_policy,  # Cost control + circuit breakers
))

That's it. Your existing OpenAI/Anthropic/Gemini calls now emit telemetry with everything you need to debug quiet failures.

What You Can Do Today

Whether you're building agents or building services that agents consume, here are concrete actions you can take right now.

If You're Building Tools/APIs for Agents

1) Write descriptions for LLMs, not humans

Humans skim. LLMs pattern-match. Your tool description is now a prompt.

Before (human-friendly but agent-ambiguous):

{
  "name": "send_message",
  "description": "Sends a message to a user"
}

After (disambiguated for agent selection):

{
  "name": "send_message",
  "description": "Sends an immediate notification message to a user. Use this for time-sensitive alerts. For drafting messages to review later, use 'draft_message' instead. Requires user_id and message_body. Will fail if user has notifications disabled."
}

A small habit that helps: always include one sentence that says "use this when…" and one that says "don't use this when…". Models respond strongly to explicit contrasts.

2) Describe every parameter, including edge cases

Models will confidently pass wrong units, wrong time zones, or wrong formats. Your schema often won't catch that. Your descriptions can.

{
  "name": "transfer_funds",
  "parameters": {
    "amount": {
      "type": "number",
      "description": "Amount in cents (not dollars). Must be positive. Maximum single transfer: 1000000 (=$10,000). For larger transfers, use 'request_wire_transfer'."
    },
    "recipient_id": {
      "type": "string",
      "description": "The recipient's account ID. Must be a valid, active account. Will return 'INVALID_RECIPIENT' if account is frozen or closed."
    }
  }
}

If a parameter has a "format expectation" (UUID, ISO8601, E.164, etc.), say it explicitly. Don't assume the model will infer it.
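
For example (a sketch; the parameters are illustrative):

# Parameter descriptions that spell out the expected format explicitly
parameters = {
    "phone": {
        "type": "string",
        "description": "Recipient phone number in E.164 format, e.g. '+14155550123'. "
                       "Do not pass national formats like '(415) 555-0123'.",
    },
    "start_time": {
        "type": "string",
        "description": "ISO 8601 timestamp in UTC, e.g. '2026-01-12T09:00:00Z'. "
                       "Local times without an offset will be rejected.",
    },
}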

3) Document failure modes explicitly

Agents can't read stack traces. Tell them what went wrong and what to do next.

{
  "errors": {
    "INSUFFICIENT_FUNDS": {
      "description": "Account balance is less than transfer amount",
      "recovery": "Check balance with 'get_balance' before retrying. Consider suggesting the user add funds."
    },
    "RATE_LIMITED": {
      "description": "Too many transfers in the last hour",
      "recovery": "Wait 60 seconds before retrying. Do not retry immediately."
    },
    "RECIPIENT_NOT_FOUND": {
      "description": "The recipient_id does not match any account",
      "recovery": "Verify the recipient_id with the user. Common cause: typos in account numbers."
    }
  }
}

If you can, differentiate: retryable vs non-retryable, user-actionable vs internal failure, and when to switch to a different tool.
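
One way to encode that differentiation, sketched as a Python mapping (the field names are illustrative):

# Illustrative error metadata: is it retryable, can the user fix it, what next?
ERROR_GUIDANCE = {
    "RATE_LIMITED": {
        "retryable": True,            # safe to retry after a delay
        "retry_after_seconds": 60,
        "user_actionable": False,
    },
    "INSUFFICIENT_FUNDS": {
        "retryable": False,           # retrying won't help until the balance changes
        "user_actionable": True,      # the user can add funds
        "suggest_tool": "get_balance",
    },
    "INTERNAL_ERROR": {
        "retryable": True,
        "user_actionable": False,
        "escalate_after_attempts": 2, # hand off to a human after repeated failures
    },
}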

4) Add semantic hints for tool selection

Help agents understand when to use your tool, not just what it does.

{
  "name": "search_products",
  "description": "Search the product catalog by keyword or filters.",
  "selection_hints": {
    "use_when": [
      "User asks about product availability",
      "User wants to compare products",
      "User mentions a product category or brand"
    ],
    "do_not_use_when": [
      "User is asking about an order they already placed (use 'get_order' instead)",
      "User wants to check their cart (use 'get_cart' instead)"
    ],
    "prerequisites": [
      "No authentication required for search",
      "For personalized results, ensure user_id is passed"
    ]
  }
}

If your tool has irreversible side effects (sending, charging, deleting), say so plainly. "This cannot be undone" is a strong boundary for models.

5) Version your descriptions, not just your schemas

A description change can break agent behavior as badly as a schema change. Track them.

{
  "name": "create_booking",
  "description_version": "2.1",
  "description": "...",
  "description_changelog": {
    "2.1": "Clarified that 'date' must be in UTC, not local time",
    "2.0": "Added guidance on when to use 'create_booking' vs 'reserve_slot'"
  }
}

Versioning descriptions makes it possible to correlate "agent behavior changed" with "the text changed," which is otherwise surprisingly hard to debug.

If You're Building Agents

1) Test tool selection, not just tool execution

Your unit tests probably verify that send_email(to, subject, body) works. But do they verify that the agent chooses send_email over draft_email correctly?

def test_tool_selection():
    # Given this user intent
    intent = "Send a reminder to john@example.com about the meeting tomorrow"

    # When the agent selects a tool
    selected_tool = agent.select_tool(intent, available_tools)

    # Then it should choose send_email, not draft_email
    assert selected_tool.name == "send_email"
    assert "john@example.com" in selected_tool.arguments["to"]

In practice, the most valuable tests are the ambiguous ones:

  • "Email this, but let me review first."
  • "Do this later."
  • "Don't actually send yet." These are where models slip.

2) Log the full context, not just the result

When something goes wrong, you need to know:

  • What was the user's original intent?
  • What tools were available?
  • What did the agent "think" before selecting?
  • What arguments did it generate?

from dataclasses import dataclass
from datetime import datetime
from typing import Any

@dataclass
class AgentDecisionLog:
    timestamp: datetime
    user_intent: str
    available_tools: list[str]
    selected_tool: str
    selection_reasoning: str  # If your model exposes this
    arguments: dict
    result: Any
    success: bool

This is how you debug "quiet failures": everything returned 200, but the action was wrong.

This is exactly what we built Aden for. With capture_content=True, you get full request/response payloads, system prompts, and message history—not just token counts. One line of instrumentation, and you're logging everything you need to debug the quiet failures.

3) Implement intent drift detection

Track whether the conversation is staying on topic.

Drift is rarely one jump. It's a series of small concessions—exactly what happened to our interview agent.

# embed(), cosine_similarity(), and log are your own embedding and logging utilities.
class IntentTracker:
    def __init__(self, initial_intent: str):
        self.initial_embedding = embed(initial_intent)
        self.drift_threshold = 0.6

    def check_drift(self, current_context: str) -> float:
        current_embedding = embed(current_context)
        similarity = cosine_similarity(self.initial_embedding, current_embedding)
        drift_score = 1 - similarity

        if drift_score > self.drift_threshold:
            log.warning(f"Intent drift detected: {drift_score:.2f}")

        return drift_score

Detection is only useful if you decide what happens next: ask a clarifying question, restate the goal, or block high-risk actions until the goal is reaffirmed.
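
One way to wire the score to an action, as a sketch (the thresholds are placeholders; HIGH_RISK_TOOLS is the set defined in the circuit-breaker section below):

def respond_to_drift(drift_score: float, pending_tool: str) -> str:
    """Map a drift score to a next step; the thresholds are placeholders."""
    if drift_score < 0.3:
        return "continue"                      # still on the original goal
    if drift_score < 0.6:
        return "clarify"                       # ask the user to restate the goal
    if pending_tool in HIGH_RISK_TOOLS:
        return "block_until_goal_reaffirmed"   # don't act until the goal is confirmed
    return "restate_goal"                      # remind the agent of its original mandate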

4) Validate tool arguments before execution

Don't trust the agent to always produce valid arguments. Validate against the schema and against business logic.

def execute_tool(tool_name: str, arguments: dict) -> Result:
    # Schema validation
    schema = get_tool_schema(tool_name)
    errors = validate_json_schema(arguments, schema)
    if errors:
        return Result.error(f"Invalid arguments: {errors}")

    # Business logic validation
    if tool_name == "transfer_funds":
        if arguments["amount"] > get_user_balance(arguments["from_account"]):
            return Result.error("Insufficient funds", recovery="Check balance first")

    # Execute
    return tools[tool_name](**arguments)

Schema validation catches "wrong shape." Business validation catches "wrong meaning."

Aden does this automatically. With capture_tool_calls=True and validate_tool_schemas=True, every tool call is validated against its schema before execution. Validation errors show up in your telemetry with the exact path and expected type—so you know exactly what the model got wrong.

5) Build circuit breakers for high-risk actions

Some actions are dangerous. Don't let agents execute them without guardrails.

HIGH_RISK_TOOLS = {"delete_account", "transfer_funds", "send_email_blast"}

def execute_with_guardrails(tool_name: str, arguments: dict) -> Result:
    if tool_name in HIGH_RISK_TOOLS:
        # Require explicit confirmation or human approval
        if not arguments.get("confirmed"):
            return Result.pending(
                "This action requires confirmation",
                confirmation_prompt=f"Are you sure you want to {tool_name}?"
            )

        # Log for audit
        audit_log.record(tool_name, arguments, user_id, timestamp)

    return execute_tool(tool_name, arguments)

Rule of thumb: if it's irreversible, expensive, or externally visible, assume it needs confirmation.

Aden's before_request hook gives you this control. You can return PROCEED, THROTTLE, CANCEL, or DEGRADE (switch to a cheaper model) before any call executes. Policy enforcement at the SDK level, not the application level.

If You're Evaluating This Space

Questions to ask vendors

  1. "What do you capture beyond tokens and latency?" Minimum bar: full content capture (input/output), tool call arguments, and schema validation. If they only give you token counts, you can't debug anything meaningful.
  2. "Can I see the actual tool arguments that were passed?" Tool call names aren't enough. You need the full arguments, validation errors, and correlation IDs. Ask to see what a tool call event actually looks like.
  3. "Do you offer pre-request hooks for policy enforcement?" The ability to intercept, throttle, or cancel calls before execution is essential for guardrails. Post-hoc logging isn't governance.
  4. "How do you handle content storage and redaction?" Full content capture is sensitive. Look for configurable retention, PII redaction patterns, and the option to store large content separately with references.
  5. "What happens when the agent contradicts the database?" Ground truth alignment is non-negotiable for production systems. If they can't explain how to detect this, they're not thinking about correctness—just cost.

Standards to watch

  • MCP (Model Context Protocol): tool discovery and invocation; already gaining traction.
  • Agent Skills: a procedural approach to introducing domain knowledge to compatible agents.
  • OpenAPI + semantic extensions: the contract layer may grow a "when/why/how to recover" layer.
  • CloudEvents for agent telemetry: a plausible standard for interoperable observability.

The tooling is immature, but the patterns are emerging. The companies that invest in this infrastructure now will have an advantage when the ecosystem matures.

Conclusion

We've moved through three eras of API design:

  1. Human Readable — documentation for developers. Humans understood intent, wrote code, and handled edge cases.
  2. Machine Readable — schemas, contracts, type systems. Machines could parse and validate, but understanding happened at design time.
  3. AI Readable — services are understood at runtime. The semantic layer becomes load-bearing.

We're not abandoning schemas and contracts. We're adding a layer on top.

Because in the agent era, the interface isn't just your JSON schema. It's also your tool descriptions, failure modes, recovery paths, and the visibility you have into what the agent is actually doing.

Or put more bluntly:

Nothing crashed. Everything was wrong.

That's what we have to design for today.


Aden is the open-source instrumentation engine that turns stochastic agents into reliable software.

  • Follow our open source repository on GitHub
  • Follow us for regular updates on LinkedIn
  • Follow us for regular updates on X.com
  • Check out our tutorials and demos on YouTube
  • Join our community on Discord
  • Schedule a consultation with our experts: adehhq.com/demo


Get started with Aden