Evaluation Basics

A pragmatic approach to measuring AI value - economic outcomes over accuracy scores.

Joe Draper

Founder, Arkwright

Most AI evaluation advice comes from research labs optimising for benchmark leaderboards. That's not what you need.

You need to know: Is this AI tool saving us money? Is it producing work we can actually use? Is it going to embarrass us in front of a client?

This guide covers practical evaluation for business and technical AI use cases, with a focus on economic value rather than academic metrics. We'll also cover when you do need rigorous evaluation - particularly for autonomous agents interacting with the outside world.

The Economic Lens

What You're Actually Measuring

Forget accuracy percentages in isolation. The question that matters is:

Does this AI tool produce more value than it costs - including the cost of fixing its mistakes?

That calculation has four components:

Component        Question
Direct cost      What does the tool/API cost to run?
Time saved       How much human time does it replace or reduce?
Quality delta    Is the output better, worse, or equivalent to human work?
Error cost       What happens when it gets something wrong?

A tool with 80% accuracy might be brilliant if errors are cheap to catch and fix. A tool with 95% accuracy might be worthless if the 5% failures are catastrophic.
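
To make the comparison concrete, here is a minimal sketch of that calculation in Python. Every field name and every number is an illustrative assumption, not a figure from a real deployment; the quality delta is modelled crudely as a per-task value that can be positive or negative.

```python
from dataclasses import dataclass


@dataclass
class ToolEconomics:
    """Monthly figures for one AI tool. All names and units are assumptions."""
    tasks_per_month: int
    direct_cost_per_task: float          # tool/API cost per task
    minutes_saved_per_task: float        # human time replaced or reduced
    hourly_rate: float                   # loaded cost of the person doing the task
    error_rate: float                    # fraction of outputs that are wrong
    cost_per_error: float                # average cost to catch, fix, or absorb one error
    quality_delta_per_task: float = 0.0  # value of better (+) or worse (-) output

    def net_value(self) -> float:
        """Value produced minus what the tool costs, including error cost."""
        time_value = (
            self.tasks_per_month * self.minutes_saved_per_task * self.hourly_rate / 60
        )
        quality_value = self.tasks_per_month * self.quality_delta_per_task
        direct_cost = self.tasks_per_month * self.direct_cost_per_task
        error_cost = self.tasks_per_month * self.error_rate * self.cost_per_error
        return time_value + quality_value - direct_cost - error_cost


# 80% accuracy, but each error is cheap to catch and fix (assumed 2.00 per error).
cheap_errors = ToolEconomics(1000, 0.05, 10, 60.0, 0.20, 2.0)

# 95% accuracy, but each failure is expensive (assumed 500.00 per error).
costly_errors = ToolEconomics(1000, 0.05, 10, 60.0, 0.05, 500.0)

print(cheap_errors.net_value())   # comes out positive under these assumptions
print(costly_errors.net_value())  # comes out negative under these assumptions
```

With these made-up inputs, the less accurate tool is the one worth keeping, because error cost rather than accuracy dominates the result.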

The "Good Enough" Threshold

Most business tasks don't need perfection. They need "good enough to ship after a quick review."

Consider a tool that drafts customer emails:

  • 90% usable as-is = excellent
  • 8% need light editing = fine
  • 2% need rewriting = acceptable if you're reviewing anyway

Compare that to what you'd accept from a new hire. You wouldn't expect perfection from day one. You'd expect competent drafts that improve over time, with oversight proportional to risk.

Apply the same standard to AI tools.
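
As a rough illustration, the email-drafting breakdown above can be turned into an expected review time per draft. The percentages come from the example; the minutes per bucket and the time to write an email by hand are assumptions you would replace with your own measurements.

```python
# Share of drafts per bucket (from the email example) paired with an
# ASSUMED number of minutes of human effort for that bucket.
BUCKETS = {
    "usable as-is": (0.90, 1.0),    # quick read-through only
    "light editing": (0.08, 3.0),
    "rewriting": (0.02, 10.0),
}

MANUAL_MINUTES = 10.0  # ASSUMED time to write one email from scratch


def expected_minutes_per_draft(buckets: dict[str, tuple[float, float]]) -> float:
    """Weighted average of human minutes across the outcome buckets."""
    total_share = sum(share for share, _ in buckets.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("bucket shares must sum to 1")
    return sum(share * minutes for share, minutes in buckets.values())


review_minutes = expected_minutes_per_draft(BUCKETS)
minutes_saved = MANUAL_MINUTES - review_minutes
print(f"Expected review time per draft: {review_minutes:.2f} min")
print(f"Expected saving per email: {minutes_saved:.2f} min")
```

Under these assumptions, the 2% of drafts that need rewriting contribute only a small share of the total review time, which is why the tool still clears the "good enough" bar.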

When "Good Enough" Isn't

Some outputs genuinely need to be right:

  • Legal documents
  • Financial calculations
  • Medical advice
  • Public statements attributed to your company
  • Anything that creates liability

For these, the error cost is high enough that you need either rigorous evaluation or mandatory human review. Usually both.
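
One way to enforce this in a workflow is to route outputs by category before anything ships. The sketch below is hypothetical: the category labels, the `Review` enum, and the `review_level` function are invented for illustration and would need mapping onto your own task types.

```python
from enum import Enum


class Review(Enum):
    SPOT_CHECK = "spot_check"  # quick review proportional to risk
    MANDATORY = "mandatory"    # a human must approve before release


# Hypothetical labels for the high-stakes categories listed above.
HIGH_STAKES = {"legal", "financial", "medical", "public_statement"}


def review_level(category: str, creates_liability: bool = False) -> Review:
    """Return the minimum human review an AI output should receive."""
    if creates_liability or category in HIGH_STAKES:
        return Review.MANDATORY
    return Review.SPOT_CHECK


print(review_level("customer_email"))                          # Review.SPOT_CHECK
print(review_level("legal"))                                   # Review.MANDATORY
print(review_level("customer_email", creates_liability=True))  # Review.MANDATORY
```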

Evaluating Business Tasks

The Simple Framework

For most business tasks, you need three things:

1. A sample set of real examples

Pull 20-50 actual tasks from your workflow. Not synthetic examples - real inputs you've already processed manually. Include:
