Hacker News | oren1531's comments

Try npx agent-triage demo — runs on sample data locally. Would love to hear what you find when you point it at your own traces.

I built agent-triage, a CLI that automates diagnosing AI agent failures in production.

I was spending way too much time staring at logs and web dashboards trying to figure out why my multi-agent setups kept failing.

You just point it at your traces (LangSmith, Langfuse, OpenTelemetry, or a JSON file). It pulls the system prompts directly from the logs, extracts the behavioral rules, and uses an LLM-as-a-judge to replay each conversation step-by-step.

It flags exactly which turn broke things, which agent caused it, and traces cascading failures across routing, handoffs, and retrieval.

It then aggregates root causes across all the failed conversations: "24 out of 51 failures are missing escalations." You know exactly what to fix first.
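To make the aggregation idea concrete, here is a minimal sketch of ranking root causes by frequency. It is not agent-triage's actual code or output schema: the `failures` records, their field names, and `rank_root_causes` are all hypothetical, made up for illustration.

```python
from collections import Counter

# Hypothetical per-conversation verdicts. The field names are illustrative
# and are NOT agent-triage's real output format.
failures = [
    {"conversation": "c1", "root_cause": "missing escalation"},
    {"conversation": "c2", "root_cause": "wrong routing"},
    {"conversation": "c3", "root_cause": "missing escalation"},
]


def rank_root_causes(failures):
    """Return (root_cause, count) pairs, most frequent first."""
    counts = Counter(f["root_cause"] for f in failures)
    return counts.most_common()


ranked = rank_root_causes(failures)
for cause, count in ranked:
    print(f"{count} out of {len(failures)} failures: {cause}")
```

The most frequent cause comes out on top, which is the "what to fix first" signal.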

Runs locally. Only LLM API calls leave your machine. You can try it without installing anything.

https://github.com/converra/agent-triage


Grok's value was never really about model quality - it was the only model with real-time access to what's actually being said on X. And it's less filtered than the others, which matters for certain topics where ChatGPT/Claude will just refuse or hedge. Those are two genuinely unique things. But they're sitting on both advantages without shipping anything meaningful, and competitors will find workarounds eventually. The SpaceX merger adding organizational noise right now seems like the worst possible timing.

Good point on the green/red dashboard. The opportunity cost angle is worth adding though. A failed run isn't just the wasted tokens and retry cost - it's also the task that didn't get done and the engineering required to diagnose why. On anything time-sensitive, that compounds fast.

Exactly. At the moment it's close enough to be a wash for some cases, and tilts seriously in one direction or the other for others. I expect improved harnesses mean that more and more we'll just be able to re-run a couple of times and fall back to "escalating" to Sonnet or even Opus, but whenever it involves engineering time, that's a big deal.
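A rough way to reason about that tradeoff is an expected-cost calculation. This is only a sketch under simplifying assumptions: attempts are treated as independent, and every cost and success probability is a number you would have to estimate yourself. The function name and parameters are hypothetical.

```python
def expected_cost(attempts, diagnosis_cost):
    """Expected cost of trying models in order until one succeeds.

    attempts: list of (cost_per_run, success_probability), tried in order,
              e.g. a cheap model twice, then a more expensive one.
    diagnosis_cost: engineering cost paid only if every attempt fails.
    Assumes attempts succeed or fail independently of each other.
    """
    total = 0.0
    p_reach = 1.0  # probability that all earlier attempts failed
    for cost, p_success in attempts:
        total += p_reach * cost        # pay for this run only if we reach it
        p_reach *= 1.0 - p_success     # chance that this one fails as well
    return total + p_reach * diagnosis_cost


# Made-up numbers: one run costing 1 unit with a 50% success rate,
# and 10 units of engineering time if it fails.
print(expected_cost([(1.0, 0.5)], diagnosis_cost=10.0))
```

With token costs in cents and engineering time in dollars, the last term tends to dominate unless the chance of exhausting every attempt is very small.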
