Date: 2026-01-20

New benchmark shows top LLMs achieve only a 29% pass rate on OpenTelemetry instrumentation, exposing the gap between coding ability and real-world SRE work.

Quesma, Inc. announced the release of OTelBench, the first comprehensive benchmark for evaluating LLMs on OpenTelemetry instrumentation tasks, revealing significant gaps in AI's ability to handle production-grade Site Reliability Engineering (SRE) work.

While frontier LLMs have demonstrated impressive coding capabilities, the best-performing model, Claude Opus 4.5, achieved only a 29% pass rate on OTelBench, compared with its 80.9% pass rate on SWE-Bench, highlighting a critical gap in production engineering skills.

Enterprise outages cost an average of $1.4 million per hour, making production visibility mission-critical. Yet 39% of organizations cite complexity as their top observability obstacle. The benchmark exposed context propagation as an insurmountable barrier for most models, a particularly concerning finding given that context propagation is fundamental to distributed tracing.
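
For readers unfamiliar with the term, context propagation means carrying a trace identifier across service boundaries so that spans emitted by different services can be joined into one trace. The sketch below is illustrative only and is not taken from OTelBench or from the OpenTelemetry SDK. It uses the Python standard library and the W3C Trace Context `traceparent` header layout (`version-traceid-parentid-flags`); the helper names `inject`, `extract`, and `handle_request` are hypothetical.

```python
import re
import secrets

# W3C Trace Context "traceparent" layout: version-traceid-parentid-flags,
# e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the current trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict):
    """Read trace context from incoming headers; return None if absent or invalid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid under the W3C specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, flags


def handle_request(incoming_headers: dict) -> dict:
    """Hypothetical handler: continue the caller's trace, or start a new one.

    Returns the headers this service would attach to its own downstream call.
    """
    ctx = extract(incoming_headers)
    if ctx is not None:
        trace_id, _parent_id, flags = ctx
        sampled = bool(int(flags, 16) & 1)
    else:
        trace_id = secrets.token_hex(16)  # 32 hex characters
        sampled = True
    span_id = secrets.token_hex(8)  # 16 hex characters, new for this hop
    outgoing: dict = {}
    inject(outgoing, trace_id, span_id, sampled)
    return outgoing
```

The trace ID stays constant across hops while each service contributes a fresh span ID. If any service drops or regenerates the trace ID, the trace is split into disconnected fragments.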

"The backbone of the software industry consists of complex, high-scale production systems with mission-critical reliability," said Jacek Migdal, founder of Quesma. "OTelBench shows that while LLMs are impressive at generating code, they're not yet capable of fundamental instrumentation tasks, even at a small scale, or of the end-to-end problem-solving required for production engineering. Many vendors are marketing AI SRE solutions with bold claims but no independent verification."

Models had some moderate success with Go and, quite surprisingly, C++. A few tasks were completed for JavaScript, PHP, .NET, and Python. Just a single model solved a single task in Rust. None of the models solved a single task in Swift, Ruby, or Java.

"AI SRE in 2026 is what DevOps Anomaly Detection was in 2016; lots of marketing but lacking independent benchmarks," Migdal added. "That's why we're releasing OTelBench as open-source: to create a North Star for navigating the AI hype and enable the community to track real progress."

OTelBench is available today at https://quesma.com/benchmarks/otel/.

