With a suite observability tools, you can confidently evaluate, test and ship LLM apps across your development and production lifecycle. Log traces and spans. Define and compute evaluation metrics. Score LLM outputs. Compare performance between app versions. Record, sort, find, and understand every step that your LLM app makes to generate a result. You can manually annotate and compare LLM results in a table. Log traces in development and production. Run experiments using different prompts, and evaluate them against a test collection. You can choose and run preconfigured evaluation metrics, or create your own using our SDK library. Consult the built-in LLM judges to help you with complex issues such as hallucination detection, factuality and moderation. Opik LLM unit tests built on PyTest provide reliable performance baselines. Build comprehensive test suites for every deployment to evaluate your entire LLM pipe-line.