

























To build LLM-powered apps, developers need to know how users are using their app.
LLM observability tools help them do this by capturing LLM provider requests and generations, then visualizing and aggregating them. This helps developers monitor, debug, and improve their apps.
To help you pick the best of these tools, we put together this list. All of the following products:
PostHog is an all-in-one developer platform that combines LLM observability with several other developer-focused tools, such as product and web analytics, session replay, feature flags, experiments, error tracking, and surveys.
Its LLM observability product (known as AI Observability) integrates with popular LLM providers, captures details of generations, provides an aggregated metrics dashboard, and more.


PostHog’s AI Observability app works with the rest of our dev tool suite. This means you can visualize LLM-related data alongside product and business data, create custom queries using SQL, view session replays of AI interactions, A/B test prompts, and more.
Two features worth highlighting for teams iterating on LLM apps:
Prompt management (beta): Create and version prompts directly in PostHog. Prompts are fetched at runtime via the SDK with caching and fallback support, so you can update them without code deploys. Non-engineers can iterate on prompts from the UI, and every change creates an immutable version you can compare, restore, or link to traces to see which prompt versions drive which outputs.
Evaluations (beta): Score LLM outputs automatically or with human review to track quality over time – not just whether API calls succeed, but whether they're actually good.
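To make the "caching and fallback" idea behind runtime prompt fetching concrete, here is a minimal sketch. It is not PostHog's SDK: `PromptCache`, its `fetch` callable, and the TTL value are all hypothetical names chosen for illustration. It only shows the general pattern of serving a cached prompt, refreshing it when it expires, and falling back when the fetch fails.

```python
import time
from typing import Callable, Dict, Tuple


class PromptCache:
    """Hypothetical runtime prompt loader (illustrative, not a real SDK)."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # assumed: returns prompt text for a name
        self._ttl = ttl_seconds      # assumed refresh interval
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}

    def get(self, name: str, fallback: str) -> str:
        now = self._clock()
        cached = self._cache.get(name)
        # Serve from cache while the entry is still fresh.
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]
        try:
            text = self._fetch(name)
        except Exception:
            # Fetch failed: prefer a stale cached copy, else the hardcoded fallback.
            return cached[0] if cached is not None else fallback
        self._cache[name] = (text, now)
        return text
```

With this shape, a prompt edited in a UI would reach the app once the cached entry expires, and an outage of the prompt service degrades to the last known prompt or the fallback rather than an error.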
PostHog’s hosted Cloud version and all of its AI Observability features are free to use. It comes with 100k LLM observability events for free every month with 30-day retention. Beyond this, pricing is usage-based and totally transparent.
Use the setup wizard to get started in minutes – no sales call or elaborate configuration needed.
Langfuse (recently acquired by ClickHouse) is an open source LLM engineering platform. It provides LLM call tracking and tracing, prompt management, evaluation, datasets, and more. These give LLM app developers tools they need for their entire workflow.
Langfuse can be self-hosted for free. If you prefer a managed service, Langfuse Cloud is free for up to 50k events per month and 2 users, but this only includes 30-day data access. Pricing beyond this starts at $29/m for 100k events with additional events at $8/m more.

Langfuse is one of the original tools in the LLM observability space. This means it has a wide range of tools for LLM app developers to use and has been instrumental in defining what those tools look like.
It also claims to be the most used open LLMOps platform. Beyond its early entry, this is thanks to its integrations with most LLM providers and agent frameworks, native SDKs for Python and JavaScript, and its ability to act as an OpenTelemetry backend.
Langfuse is also the most fully featured LLM observability tool. Its pricing page lists a huge 78 features, from session tracking to batch exports to SOC2 compliance.
Opik is an open source platform for evaluating, testing, and monitoring LLM apps. It provides tracing, annotations, a prompt and model playground, evaluation, and more. It’s built by Comet, an end-to-end model evaluation platform for developers.
Opik’s free hosted plan provides 25k spans per month with unlimited team members and 60-day data retention. Beyond this, its Pro plan is $19 per month for 100k spans per month, with every extra 100k spans costing $5.
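As a rough worked example of how that Pro pricing adds up, the sketch below computes a monthly bill from a span count. It uses only the figures quoted above and assumes overage is billed in whole 100k-span blocks, which is an assumption and not something confirmed here. The function name is hypothetical.

```python
def opik_pro_monthly_cost(spans: int) -> int:
    """Estimate the Pro plan bill in USD (assumes whole 100k overage blocks)."""
    base_price = 19          # USD per month, from the figures above
    included = 100_000       # spans included in the base price
    block_price = 5          # USD per extra 100k spans
    extra = max(0, spans - included)
    extra_blocks = -(-extra // 100_000)  # ceiling division
    return base_price + block_price * extra_blocks
```

Under these assumptions, 350k spans in a month would come to $19 plus three extra blocks at $5 each, or $34.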

Thanks to Opik’s integration with Comet, it’s the only tool on this list that appeals to LLM developers, not just LLM app developers. This means it is ideal for teams training and hosting models of their own, not just using the LLM providers.
OpenLLMetry is an open-source observability product for LLM applications based on OpenTelemetry. It was built by Traceloop and recommends using its SDK to capture data.
Traceloop is free up to 50k spans per month and 5 seats, but this only provides 24-hour data retention. Beyond this, you’ll need to talk to sales.
OpenLLMetry can capture data from a range of LLM providers, vector DBs, and LLM frameworks. It can then send this data to a range of supported destinations from Traceloop to Datadog to Honeycomb.

With its range of extensions and destinations, OpenLLMetry is very likely to integrate with the observability tools you already use.
It integrates with the broader OpenTelemetry ecosystem, meaning it can instrument things like your database, API calls, and more. Its semantic conventions for LLMs were also adopted by the OpenTelemetry project.
Phoenix is an open source AI observability platform. It provides tracing, evaluation, experiments, prompt management, and more. It works out-of-the-box with frameworks like LlamaIndex and LangChain as well as LLM providers like OpenAI, Bedrock, and more. It’s built by Arize AI, a unified AI observability and evaluation platform.
Arize doesn’t provide a free hosted version of Phoenix. Their product, AX Pro, starts at $50 per month for 10k spans and up to 3 users.

Similar to OpenLLMetry, Phoenix works well with OpenTelemetry thanks to a set of conventions and plugins that are complementary to OpenTelemetry. This means Phoenix can more easily integrate into your existing telemetry stack.
Like Opik, Phoenix is connected to a broader AI development platform. Unique to Arize’s platform are its observability tools for ML and computer vision, which help developers debug and improve these systems.
Helicone is an open source platform for monitoring, debugging, and improving LLM applications. Beyond integrations with popular LLM providers, tracing, and an aggregate analytics dashboard, Helicone provides more tools like prompt management and evals.
Recently acquired by Mintlify, it will continue operating in maintenance mode.
Its hosted version is free up to 10,000 requests with some features limited to the $79/m pro and $799/m team plans. The cost for requests beyond the first 10,000 is unknown, though.
Helicone provides purpose-built tools for improving LLM applications, like its prompt playground, prompt management, evaluation scoring, and feedback.
For developers focused on performance and reliability concerns, Helicone also contains both proxy and async interfaces for integrating with LLM providers. This ensures Helicone is only on your critical path if you want it to be.
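The difference between the two integration styles can be sketched in a few lines. This is an in-process illustration and not Helicone's API: `via_proxy`, `AsyncLogger`, and the `call_llm` callable are hypothetical stand-ins. In the proxy style, logging sits between your app and the provider, so it is on the critical path. In the async style, the app calls the provider directly and hands the record to a background worker.

```python
import queue
import threading
from typing import Callable, Dict, List


def via_proxy(call_llm: Callable[[str], str], prompt: str, log: List[Dict[str, str]]) -> str:
    """Proxy style: the request passes through the logging layer before returning."""
    output = call_llm(prompt)
    log.append({"prompt": prompt, "output": output})  # happens before the caller gets a reply
    return output


class AsyncLogger:
    """Async style: records are queued and written off the critical path."""

    def __init__(self, log: List[Dict[str, str]]) -> None:
        self._log = log
        self._queue: "queue.Queue[Dict[str, str]]" = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def _drain(self) -> None:
        while True:
            record = self._queue.get()
            self._log.append(record)
            self._queue.task_done()

    def record(self, prompt: str, output: str) -> None:
        self._queue.put({"prompt": prompt, "output": output})  # returns immediately

    def flush(self) -> None:
        self._queue.join()  # block until queued records are written
```

The trade-off is the usual one: a proxy can add features like caching or rate limiting because it sees every request, while async logging keeps a logging outage from affecting user-facing calls.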
Here's the (short) sales pitch.
We're biased, obviously, but we think you'll love PostHog if:
It's completely free to get started – no credit card required. Our setup wizard handles configuration in minutes, or you can check out our docs to do it yourself.
LLM observability is the practice of monitoring and understanding how your LLM-powered application behaves in production. It typically includes capturing individual LLM calls (inputs, outputs, latency, token usage), aggregating metrics across requests, and providing tools to debug issues and improve model performance.
It's similar to traditional application observability, but focused on the unique characteristics of LLM systems – non-deterministic outputs, high token costs, prompt sensitivity, and the challenge of evaluating quality.
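As a rough illustration of what "capturing individual LLM calls" and "aggregating metrics" mean in practice, here is a minimal sketch. It is not any vendor's SDK. `Generation`, `capture`, `summarize`, and `fake_llm` are hypothetical names, and the provider call is assumed to return the output text along with input and output token counts.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Generation:
    prompt: str
    output: str
    latency_ms: float
    input_tokens: int
    output_tokens: int


def capture(
    call_llm: Callable[[str], Tuple[str, int, int]],
    prompt: str,
    sink: List[Generation],
) -> str:
    """Wrap one LLM call and record its input, output, latency, and token usage."""
    start = time.perf_counter()
    output, input_tokens, output_tokens = call_llm(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    sink.append(Generation(prompt, output, latency_ms, input_tokens, output_tokens))
    return output


def summarize(events: List[Generation]) -> Dict[str, float]:
    """Aggregate captured generations into dashboard-style metrics."""
    count = len(events)
    total_tokens = sum(e.input_tokens + e.output_tokens for e in events)
    avg_latency = sum(e.latency_ms for e in events) / count if count else 0.0
    return {"generations": count, "total_tokens": total_tokens, "avg_latency_ms": avg_latency}


def fake_llm(prompt: str) -> Tuple[str, int, int]:
    """Stand-in provider: returns (text, input_tokens, output_tokens)."""
    return prompt.upper(), len(prompt.split()), 3
```

Calling `capture(fake_llm, "hello world", events)` appends one record to `events`, and `summarize(events)` then reports the count, total tokens, and average latency. Real tools add tracing, cost calculation, and quality evaluation on top of this basic shape.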
A good LLM observability tool gives you visibility into how your AI-powered app is performing in production. Most solid tools include:
More advanced tools go further with:
If you're building an LLM-powered app and have shipped to real users, you need one. Common signals that you're ready:
Most tools on this list are free to start, so there's no reason to wait.
No. PostHog's LLM observability product is built into the platform, so if you're already using PostHog for product analytics or session replay, you can add LLM observability without any additional setup or contract. You get 100k LLM events free per month.
Getting started is easy; once you install the SDK, it will handle all the heavy lifting. Use your LLM provider as normal and we'll capture everything automatically.
Traditional application monitoring focuses on things like error rates, latency, and uptime – binary metrics where something either works or doesn't.
LLM observability adds a quality dimension: you need to evaluate whether model outputs are actually good, not just whether the API call succeeded. This is why tools like Langfuse and Opik invest heavily in evals, human review, and prompt management – capabilities that don't exist in traditional APM tools.
Most tools on this list support the major providers – OpenAI, Anthropic, Google Gemini, and AWS Bedrock – as well as popular frameworks like LangChain, LlamaIndex, and Vercel AI SDK. Coverage varies by tool. Langfuse and PostHog have the broadest integration coverage.
For specific provider support, check each tool's documentation.