Evaluation Basics
A pragmatic approach to measuring AI value - economic outcomes over accuracy scores.

Author
Joe Draper
Founder, Arkwright
Most AI evaluation advice comes from research labs optimising for benchmark leaderboards. That's not what you need.
You need to know: Is this AI tool saving us money? Is it producing work we can actually use? Is it going to embarrass us in front of a client?
This guide covers practical evaluation for business and technical AI use cases, with a focus on economic value rather than academic metrics. We'll also cover when you do need rigorous evaluation - particularly for autonomous agents interacting with the outside world.
The Economic Lens
What You're Actually Measuring
Forget accuracy percentages in isolation. The question that matters is:
Does this AI tool produce more value than it costs - including the cost of fixing its mistakes?
That calculation has four components:
| Component | Question |
|---|---|
| Direct cost | What does the tool/API cost to run? |
| Time saved | How much human time does it replace or reduce? |
| Quality delta | Is the output better, worse, or equivalent to human work? |
| Error cost | What happens when it gets something wrong? |
A tool with 80% accuracy might be brilliant if errors are cheap to catch and fix. A tool with 95% accuracy might be worthless if the 5% failures are catastrophic.
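As a rough illustration, the calculation can be sketched in a few lines of Python. This is a minimal sketch, not a definitive model: the class name, field names, and every figure below are hypothetical, and quality delta is left out because it rarely reduces to a single number. It assumes time is only saved on outputs that are usable, and that each unusable output costs a fixed amount to catch and fix.

```python
from dataclasses import dataclass


@dataclass
class ToolEconomics:
    """Hypothetical monthly inputs for one AI tool (all values are assumptions)."""
    tasks_per_month: int
    direct_cost_per_task: float    # tool/API cost per task
    minutes_saved_per_task: float  # human time saved when the output is usable
    hourly_rate: float             # cost of the human time being saved
    accuracy: float                # fraction of outputs that are usable (0.0 to 1.0)
    cost_per_error: float          # cost to catch and fix one bad output, or the damage it causes

    def net_value_per_month(self) -> float:
        usable = self.tasks_per_month * self.accuracy
        failed = self.tasks_per_month * (1.0 - self.accuracy)
        time_value = usable * (self.minutes_saved_per_task / 60.0) * self.hourly_rate
        direct_cost = self.tasks_per_month * self.direct_cost_per_task
        error_cost = failed * self.cost_per_error
        return time_value - direct_cost - error_cost


# 80% accuracy, but each error costs little to catch and fix.
cheap_errors = ToolEconomics(1000, 0.05, 10, 60.0, 0.80, 5.0)

# 95% accuracy, but each error is expensive.
costly_errors = ToolEconomics(1000, 0.05, 10, 60.0, 0.95, 500.0)

print(f"80% accurate, cheap errors:  {cheap_errors.net_value_per_month():,.0f}")
print(f"95% accurate, costly errors: {costly_errors.net_value_per_month():,.0f}")
```

With these made-up inputs, the less accurate tool comes out ahead because its failures are cheap, while the more accurate tool loses money because each failure is expensive. Swap in your own figures before drawing any conclusion.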
The "Good Enough" Threshold
Most business tasks don't need perfection. They need "good enough to ship after a quick review."
Consider a tool that drafts customer emails:
- 90% usable as-is = excellent
- 8% need light editing = fine
- 2% need rewriting = acceptable if you're reviewing anyway
Compare that to what you'd accept from a new hire. You wouldn't expect 100% perfection from day one. You'd expect competent drafts that improve over time, with oversight proportional to risk.
Apply the same standard to AI tools.
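To put a number on "good enough", you can estimate the average handling time per email from the breakdown in the email example. The sketch below is illustrative only: the function name and the minutes per outcome (1 to skim, 3 to edit lightly, 10 to rewrite, 10 to write from scratch) are assumptions, not measurements.

```python
import math


def expected_handling_minutes(mix: dict[str, tuple[float, float]]) -> float:
    """Average human minutes per output, given {outcome: (share, minutes)}."""
    total_share = sum(share for share, _ in mix.values())
    if not math.isclose(total_share, 1.0):
        raise ValueError(f"shares must sum to 1.0, got {total_share}")
    return sum(share * minutes for share, minutes in mix.values())


# Shares come from the email example; the minutes are hypothetical.
email_drafts = {
    "usable as-is": (0.90, 1.0),       # quick skim before sending
    "light editing": (0.08, 3.0),
    "needs rewriting": (0.02, 10.0),
}

MANUAL_MINUTES = 10.0  # assumed time to write one email from scratch

with_tool = expected_handling_minutes(email_drafts)
print(f"With tool: {with_tool:.2f} min per email")
print(f"Manual:    {MANUAL_MINUTES:.2f} min per email")
```

Under these assumptions the 2% of drafts that need rewriting barely move the average, which is why a small failure rate is tolerable when review is already part of the workflow.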
When "Good Enough" Isn't
Some outputs genuinely need to be right:
- Legal documents
- Financial calculations
- Medical advice
- Public statements attributed to your company
- Anything that creates liability
For these, the error cost is high enough that you need either rigorous evaluation or mandatory human review. Usually both.
Evaluating Business Tasks
The Simple Framework
For most business tasks, you need three things:
1. A sample set of real examples
Pull 20-50 actual tasks from your workflow. Not synthetic examples - real inputs you've already processed manually. Include: