I keep seeing AI teams obsess over model choice, prompts, and infrastructure, but very few invest in structured evals early. Without them, you are basically shipping blind. In my experience, good eval workflows catch issues before they hit production, shorten iteration cycles, and prevent those “works in testing, fails in prod” disasters.

At Maxim AI we’ve seen teams slash AI feature rollout time just by setting up continuous eval loops with both human and automated tests. If your AI product handles real user-facing tasks, you cannot rely on spot checks. You need evals that mimic the exact scenarios your users will throw at the system.
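For concreteness, here is a minimal sketch of what a scenario-based eval loop can look like. The `scenarios` list and `call_model` are hypothetical placeholders, not any particular framework's API; a real setup would pull scenarios from logged user traffic and use richer checks than substring matching.

```python
# Minimal scenario-based eval loop (sketch only; call_model is a placeholder).
scenarios = [
    {"input": "Cancel my subscription", "must_contain": "cancel"},
    {"input": "Where is my refund?", "must_contain": "refund"},
]

def call_model(prompt: str) -> str:
    # Placeholder: swap in a real model or agent call here.
    return f"Echo: {prompt}"

def run_evals() -> float:
    passed = 0
    for case in scenarios:
        output = call_model(case["input"]).lower()
        if case["must_contain"] in output:
            passed += 1
        else:
            print(f"FAIL: {case['input']!r} -> {output!r}")
    return passed / len(scenarios)

if __name__ == "__main__":
    print(f"pass rate: {run_evals():.0%}")
```

Running a loop like this on every change, and gating releases on the pass rate, is the core of the continuous eval idea.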

What’s your take: are evals an engineering must-have or just a nice-to-have?


Comments

HiddenoO · 8 points


This post was mass deleted and anonymized with Redact


SkyFeistyLlama8 · 1 point

How about running evals in production and stopping catastrophically bad generations? I’ve had better luck with running multiple simultaneous prompts to check if some agentic behavior has been fulfilled, and then doing a voting mechanism to proceed to the next step.

It’s kind of crazy to be thinking of software outputs as probabilistic.
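A rough sketch of that voting pattern, assuming a hypothetical `ask_checker` that wraps whatever model call you use and answers yes/no:

```python
# Run several independent checker prompts in parallel and require a majority
# "yes" before letting the agent proceed to the next step (sketch only).
from concurrent.futures import ThreadPoolExecutor

CHECK_PROMPTS = [
    "Did the agent actually complete the requested action? Answer yes or no.",
    "Does the output contain a concrete confirmation? Answer yes or no.",
    "Would a user consider this step done? Answer yes or no.",
]

def ask_checker(prompt: str, agent_output: str) -> bool:
    # Placeholder: replace with a real model call and parse its yes/no answer.
    return "confirmed" in agent_output.lower()

def step_fulfilled(agent_output: str, quorum: float = 0.5) -> bool:
    with ThreadPoolExecutor() as pool:
        votes = list(pool.map(lambda p: ask_checker(p, agent_output), CHECK_PROMPTS))
    return sum(votes) / len(votes) > quorum

if __name__ == "__main__":
    print(step_fulfilled("Meeting confirmed for Tuesday at 10:00."))
```

The quorum is the knob: a stricter threshold blocks more bad generations at the cost of more false stops.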

HiddenoO · 1 point


This post was mass deleted and anonymized with Redact

No_Efficiency_1144 · 1 point

Black Swan theory is huge for evals yes.

the__storm · 1 point

I agree. I think having some metric to iterate and validate against is absolutely essential, but it’s an endless battle with management to get time to work on evals (and, if SMEs are needed, to get data from them in a usable format).

No_Efficiency_1144 · 1 point

Reward hacking is the biggest issue but I don’t really have a solution LOL.

Truly, exactly verifiable rewards are clearly best, which mostly applies to math. With Lean4 it sometimes applies to code as well.

Outside of that it gets trickier, but ensembles of benchmarks or verifiers can help. LLM-as-judge is probably underrated because people mostly expect it to be horrible, but I think it could be a bit less prone to reward hacking, especially in an ensemble.
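One way to picture the ensemble idea, with toy stand-ins for the individual verifiers (a real setup would mix exact checks like unit tests or Lean4 proofs with one or more LLM judges):

```python
# Combine several weak verifiers so no single hackable signal dominates
# the reward (sketch only; all three verifiers are toy placeholders).
from statistics import mean

def passes_unit_tests(candidate: str) -> float:
    # Stand-in for an exactly verifiable reward (tests, proof checkers, ...).
    return 1.0 if "return" in candidate else 0.0

def llm_judge_score(candidate: str) -> float:
    # Stand-in for an LLM-as-judge call returning a 0..1 quality score.
    return 0.7

def length_sanity(candidate: str) -> float:
    # Cheap heuristic that penalises degenerate or runaway outputs.
    return 1.0 if 10 < len(candidate) < 4000 else 0.0

VERIFIERS = [passes_unit_tests, llm_judge_score, length_sanity]

def ensemble_reward(candidate: str) -> float:
    return mean(v(candidate) for v in VERIFIERS)

if __name__ == "__main__":
    print(ensemble_reward("def add(a, b):\n    return a + b"))
```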

paradite · 1 point

Yeah, evals are very important for ALL AI products.

However, engineers are too busy building, and product managers / founders lack the skills to set up their own evals. So they resort to vibe-testing the prompts, which is actually more accurate than LLM-as-judge (which can hallucinate), but time-consuming.

I am building a simple tool that helps non-technical people set up their evals quickly on their local computers. It seems like a worthy problem to solve.

nore_se_kra · 1 point

I think often the actual issue is not missing evals but missing eval data.

Depending on the use case, it’s very expensive to get proper data, since all your domain experts (who might already be overloaded) have to be involved. This is definitely not what management has in mind when they want to speed things up with AI. So people vibe-prompt themselves through PoCs into some kind of bad production that fails fast.

From my point of view, the eval data is the actual value and how a company can set itself apart from the competition, not the platforms and frameworks and whatnot that everyone wants to sell.

ManInTheMoon__48 · 1 point

How do you even decide what scenarios to include in evals? Feels like you could miss edge cases anyway.

recursive_dev · 1 point

With judge LLMs, you don’t need to enumerate scenarios to catch edge cases. You can basically go with “Find if this text has NSFW content” and it will work most of the time. E.g. u/heljakka from Root Signals has written at length about LLM-as-judge.
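For illustration, a minimal version of that judge pattern; `judge` is a placeholder for a real LLM call, and the rubric wording is made up:

```python
# Single rubric prompt instead of enumerated scenarios (sketch only).
JUDGE_PROMPT = (
    "You are reviewing a chatbot reply. Answer only PASS or FAIL.\n"
    "FAIL if the reply contains NSFW content or leaks personal data.\n\n"
    "Reply to review:\n{output}"
)

def judge(prompt: str) -> str:
    # Placeholder: swap in a real LLM call and return its raw text answer.
    return "PASS"

def output_is_safe(output: str) -> bool:
    verdict = judge(JUDGE_PROMPT.format(output=output))
    return verdict.strip().upper().startswith("PASS")

if __name__ == "__main__":
    print(output_is_safe("Sure, here is your order summary."))
```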