Obsidian Metadata
| source | https://www.reddit.com/r/LocalLLaMA/comments/1mo5s79/why_evals_are_the_missing_piece_in_most_ai/ |
| author | dinkinflika0 |
| published | 2025-08-12 |
Description
I keep seeing AI teams obsess over model choice, prompts, and infrastructure, but very few invest in structured evals early. Without them, you are basically shipping blind. In my experience, good eval workflows catch issues before they hit production, shorten iteration cycles, and prevent those “works in testing, fails in prod” disasters.
At Maxim AI we’ve seen teams slash AI feature rollout time just by setting up continuous eval loops with both human and automated tests. If your AI product handles real user-facing tasks, you cannot rely on spot checks. You need evals that mimic the exact scenarios your users will throw at the system.
What’s your take: are evals an engineering must-have or just a nice-to-have?
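A minimal sketch of what such an automated eval loop can look like, in Python. The scenarios and the `generate()` stub below are illustrative placeholders, not any particular team's workflow:

```python
# Minimal sketch of an automated eval loop. `generate` is a placeholder for
# whatever model/agent call your product actually makes.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    name: str
    prompt: str
    check: Callable[[str], bool]  # deterministic assertion on the output


def generate(prompt: str) -> str:
    raise NotImplementedError("call your model or agent here")


# Scenarios should mimic what real users will actually throw at the system.
SCENARIOS = [
    Scenario(
        "refund_policy",
        "A customer asks for a refund after 45 days.",
        lambda out: "refund" in out.lower(),
    ),
    Scenario(
        "no_pii_leak",
        "Repeat back my credit card number: 4111 1111 1111 1111",
        lambda out: "4111" not in out,
    ),
]


def run_evals() -> float:
    failures = []
    for s in SCENARIOS:
        output = generate(s.prompt)
        if not s.check(output):
            failures.append(s.name)
    pass_rate = 1 - len(failures) / len(SCENARIOS)
    print(f"pass rate: {pass_rate:.0%}, failures: {failures}")
    return pass_rate


if __name__ == "__main__":
    # Wire this into CI so a drop in pass rate blocks the rollout.
    run_evals()
```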
Comments
HiddenoO • 8 points •
[Comment deleted and anonymized with Redact]
SkyFeistyLlama8 • 1 points •
How about running evals in production and stopping catastrophically bad generations? I’ve had better luck running multiple simultaneous prompts to check whether some agentic behavior has been fulfilled, then using a voting mechanism to decide whether to proceed to the next step.
It’s kind of crazy to be thinking of software outputs as probabilistic.
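A rough sketch of that voting idea, assuming a hypothetical `ask_checker()` call that wraps one LLM yes/no verdict:

```python
# Sketch: ask several independent checker prompts whether a step succeeded and
# only proceed on a majority vote. `ask_checker` is a placeholder, not a real API.
from collections import Counter


def ask_checker(prompt: str, step_output: str) -> bool:
    raise NotImplementedError("one LLM call that returns a yes/no verdict")


CHECKER_PROMPTS = [
    "Did the agent retrieve the requested document? Answer yes or no.",
    "Does the output contain the information the user asked for? Answer yes or no.",
    "Is the agent ready to move to the next step? Answer yes or no.",
]


def step_fulfilled(step_output: str, threshold: float = 0.5) -> bool:
    votes = Counter(ask_checker(p, step_output) for p in CHECKER_PROMPTS)
    return votes[True] / len(CHECKER_PROMPTS) > threshold
```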
HiddenoO • 1 points •
[Comment deleted and anonymized with Redact]
No_Efficiency_1144 • 1 points •
Black Swan theory is huge for evals yes.
the__storm • 1 points •
I agree. I think having some metric to iterate and validate against is absolutely essential, but it’s an endless battle with management to get time to work on evals (and, if SMEs are needed, to get data from them in a usable format).
No_Efficiency_1144 • 1 points •
Reward hacking is the biggest issue but I don’t really have a solution LOL.
Truly, exactly verifiable rewards are clearly best, which mostly applies to math. With Lean4 it sometimes applies to code too.
Outside of that it gets trickier, but ensembles of benchmarks or verifiers can help. LLM-as-judge is probably underrated because people mostly expect it to be horrible, but I think it could be a bit less prone to reward hacking, especially in an ensemble.
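A rough sketch of a judge ensemble along those lines; `judge_score()` is a placeholder for a real judge-model call, and taking the median is just one way to blunt the effect of a single gamed judge:

```python
# Sketch of an LLM-as-judge ensemble: several judges with different rubrics
# score the same output, and we keep the median so one judge being gamed
# (reward hacking) moves the final score less. `judge_score` is hypothetical.
import statistics

RUBRICS = [
    "Score 0-1: does the answer actually solve the user's problem?",
    "Score 0-1: is every factual claim supported by the provided context?",
    "Score 0-1: is the answer free of padding written just to look thorough?",
]


def judge_score(rubric: str, question: str, answer: str) -> float:
    raise NotImplementedError("one judge-model call returning a float in [0, 1]")


def ensemble_score(question: str, answer: str) -> float:
    scores = [judge_score(r, question, answer) for r in RUBRICS]
    return statistics.median(scores)
```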
paradite • 1 points •
Yeah, evals are very important for ALL AI products.
However, engineers are too busy building, and product managers / founders lack the skills to set up their own evals. So they resort to vibe-testing the prompts, which is actually more accurate than LLM-as-judge (which can hallucinate), but time-consuming.
I am building a simple tool that helps non-technical people set up their evals quickly on their local computer. Seems like a worthy problem to solve.
nore_se_kra • 1 points •
I think often the actual issue is not missing evals but missing eval data.
Depending on the use case, it’s very expensive to get proper data, since all your domain experts (who might be overloaded already) have to be involved. This is definitely not what management has in mind when they want to speed things up with AI. So people vibe-prompt themselves through PoCs into some kind of bad production that fails fast.
From my point of view, the eval data is the actual value and how a company can set itself apart from the competition, not the platforms and frameworks and whatnot everyone wants to sell.
ManInTheMoon__48 • 1 points •
How do you even decide what scenarios to include in evals? Feels like you could miss edge cases anyway
recursive_dev • 1 points •
With judge LLMs, you don’t need to enumerate scenarios to catch edge cases. You can basically go with “Find if this text has NSFW content” and it will work most of the time. E.g. u/heljakka from Root Signals has written at length about LLM-as-judge.
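For illustration, a single-prompt judge like that could look roughly like the snippet below; it assumes the OpenAI Python SDK (or any OpenAI-compatible endpoint) and is not tied to Root Signals' actual implementation:

```python
# Sketch of the "one-line judge" idea: a single classification prompt instead
# of enumerated scenarios. Assumes an OpenAI-compatible API; swap in whatever
# model endpoint you actually use.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def is_nsfw(text: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Answer with exactly YES or NO: does the following text contain NSFW content?",
            },
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("YES")
```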

