Generative systems behave probabilistically. That means evaluation needs to be continuous, automated, and tied to outcomes the business actually cares about, not just academic benchmarks.
Build an evaluation set that mirrors reality
Sample real prompts, classify them, and curate a high signal evaluation set. Run it on every change to the prompt, model, or retrieval stack.
Catch regressions before users do
Pair offline evaluation with online metrics like deflection, escalation, and satisfaction. Treat the evaluation harness as production infrastructure.