How are people debugging multi-agent AI workflows in production?

1 point, posted 11 hours ago
by skhatter

3 Comments

skhatter

11 hours ago

I've been experimenting with AI agents and multi-step workflows recently and ran into a problem that reminded me a lot of early distributed systems.

Once agents start calling tools, APIs, and other agents in a chain, debugging failures becomes surprisingly hard. A single task can involve multiple steps—LLM calls, tool invocations, retries—and when something breaks it's often difficult to understand exactly what happened or where the failure originated.

In traditional distributed systems we eventually built things like tracing, circuit breakers, retry policies, SLOs, and other reliability primitives to operate systems safely in production.
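As a concrete example, a retry policy with exponential backoff is one of the simpler primitives to port over to agent tool calls. Here's a stdlib-only Python sketch (the `flaky_tool` function and its failure mode are hypothetical; a real agent framework would also record each attempt for later debugging):

```python
import random
import time


def with_retries(fn, max_attempts=3, base_delay=0.1):
    """Retry a flaky call with exponential backoff and jitter.

    A minimal sketch of the kind of retry policy distributed
    systems rely on, applied to an agent's tool invocation.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure
            # back off exponentially, with jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** (attempt - 1)) * random.random())


# Hypothetical flaky tool call: fails twice, then succeeds.
calls = {"n": 0}

def flaky_tool():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("tool timeout")
    return "ok"

print(with_retries(flaky_tool))  # succeeds on the third attempt
```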

I'm curious how people building agent systems today are handling this.

Some questions I'm particularly interested in:

- How do you debug agent failures?
- Do you have visibility into multi-agent workflows?
- How do you replay or reproduce failures?

I've been exploring this problem space and built a small prototype to experiment with reliability tooling for agent systems. The link above shows the demo, but I'm mainly interested in learning how others are approaching this problem.

verdverm

11 hours ago

OTEL and LGTM, the same open source o11y stack I use for everything

skhatter

11 hours ago

Interesting — are you instrumenting the agent workflows themselves with OpenTelemetry spans?

I was wondering how well the standard o11y stack works once agents start running multi-step workflows (agent → tools → other agents → APIs). Tracing probably helps visualize the steps, but I'm curious how people handle operational things like retries, replaying failed workflows, or containing cascading failures across agents.
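For what it's worth, here's roughly the shape I have in mind: a toy, stdlib-only Python sketch of nested spans over a multi-step workflow. The step names and the workflow itself are hypothetical; a real setup would use the OpenTelemetry SDK and export to a backend like Tempo rather than a list.

```python
import contextlib
import time

TRACE = []   # collected spans: (depth, name, duration_s)
_DEPTH = [0]  # current nesting depth


@contextlib.contextmanager
def span(name):
    """Record a named span; nesting mimics what OTel spans give you."""
    start = time.monotonic()
    _DEPTH[0] += 1
    try:
        yield
    finally:
        _DEPTH[0] -= 1
        TRACE.append((_DEPTH[0], name, time.monotonic() - start))


# Hypothetical multi-step agent workflow: agent -> LLM -> tool -> sub-agent.
with span("agent.run"):
    with span("llm.call"):
        pass  # plan the next action
    with span("tool.search"):
        pass  # invoke an external API
    with span("agent.summarizer"):
        with span("llm.call"):
            pass  # delegate to another agent

# Spans are appended as they close (innermost first), carrying enough
# structure to reassemble the tree a tracing UI would display.
for depth, name, dur in TRACE:
    print(f"{'  ' * depth}{name} ({dur * 1000:.2f} ms)")
```

Even this toy version makes the failure-location question tractable: whichever span is open when an exception propagates tells you which step broke. What it doesn't give you is the operational side, which is why I'm asking about retries, replay, and containment separately.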

Those reliability aspects are what I've been exploring.