The missing link between LLM cost and quality. Stop choosing between "cheap but wrong" and "accurate but expensive" — get both. Contracts, model escalation, and data-driven recommendations for ruby_llm.
| You write | The gem handles | You get |
|---|---|---|
| `validate { \|o\| ... }` combined with `retry_policy` | catch bad answers, auto-retry in production | Zero garbage |
| `retry_policy models: %w[nano mini full]` | start cheap, escalate only when validation fails | Pay for the cheapest model that works |
| `max_cost 0.01` | estimate tokens, check price, refuse before calling the LLM | No surprise bills |
| `output_schema { ... }` | send JSON schema to provider, validate response client-side | Zero parsing code |
| `define_eval { ... }` | test cases + baselines, run in CI with a real LLM | Regressions caught before deploy |
| `recommend(candidates: [...])` | evaluate all configs, pick the cheapest that passes | Optimal model + retry chain |
┌─────────────────────────────────────────────────────────────────┐
│ BEFORE: pick one model, hope for the best │
│ │
│ expensive model → accurate, but you overpay on every call │
│ cheap model → fast, but wrong answers slip to production │
│ prompt change → "looks good to me" → deploy → users suffer │
└─────────────────────────────────────────────────────────────────┘
⬇ add ruby_llm-contract
┌─────────────────────────────────────────────────────────────────┐
│ YOU DEFINE A CONTRACT │
│ │
│ output_schema { string :priority } ← valid structure │
│ validate("valid priority") { |o| ... } ← business rules │
│ retry_policy models: %w[nano mini full] ← escalation chain │
│ max_cost 0.01 ← budget cap │
└───────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ THE GEM HANDLES THE REST │
│ │
│ request ──→ ┌──────┐ ┌──────────┐ │
│ │ nano │─→ │ contract │──→ ✓ pass → done │
│ └──────┘ └────┬─────┘ │
│ │ ✗ fail │
│ ▼ │
│ ┌──────┐ ┌──────────┐ │
│ │ mini │─→ │ contract │──→ ✓ pass → done │
│ └──────┘ └────┬─────┘ │
│ │ ✗ fail │
│ ▼ │
│ ┌──────┐ ┌──────────┐ │
│ │ full │─→ │ contract │──→ ✓ pass → done │
│ └──────┘ └──────────┘ │
└───────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ YOU GET │
│ │
│ ✓ Valid output guaranteed — schema + business rules enforced │
│ ✓ Cheapest model that works — most requests stay on nano │
│ ✓ Cost, latency, tokens — tracked on every call │
│ ✓ Eval scores per model — data instead of gut feeling │
│ ✓ Regressions caught — before deploy, not after │
│ ✓ Recommendation — "use nano+mini, drop full, save $X/mo" │
└─────────────────────────────────────────────────────────────────┘
```ruby
class ClassifyTicket < RubyLLM::Contract::Step::Base
  prompt "Classify this support ticket by priority and category.\n\n{input}"

  output_schema do
    string :priority, enum: %w[low medium high urgent]
    string :category
  end

  validate("urgent needs justification") { |o, input| o[:priority] != "urgent" || input.length > 20 }

  retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
end
```
```ruby
result = ClassifyTicket.run("I was charged twice")
result.parsed_output  # => {priority: "high", category: "billing"}
result.trace[:model]  # => "gpt-4.1-nano" (first model that passed)
result.trace[:cost]   # => 0.000032
```

Bad JSON? Retried automatically. Wrong answer? Escalated to a smarter model. Schema violated? Caught client-side. The contract guarantees every response meets your rules — you pay for the cheapest model that passes.
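The validate-then-escalate loop can be sketched in a few lines of plain Ruby. This is an illustration, not the gem's internals: `call_model`, `run_with_escalation`, and the model names are stand-ins.

```ruby
require "json"

# Hypothetical stand-in for a provider call; the real gem would hit the API.
def call_model(model, prompt)
  { priority: "high", category: "billing" }.to_json
end

# Try each model in order; return the first output that parses as JSON
# and passes every validation rule. Returns nil if the whole chain fails.
def run_with_escalation(models, prompt, rules)
  models.each do |model|
    output = begin
      JSON.parse(call_model(model, prompt), symbolize_names: true)
    rescue JSON::ParserError
      next  # bad JSON counts as a failed attempt
    end
    return { model: model, output: output } if rules.all? { |rule| rule.call(output) }
  end
  nil
end

rules = [->(o) { %w[low medium high urgent].include?(o[:priority]) }]
result = run_with_escalation(%w[nano mini full], "I was charged twice", rules)
result[:model]  # => "nano" (cheapest model passed, no escalation needed)
```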
```ruby
gem "ruby_llm-contract"
```

```ruby
RubyLLM.configure { |c| c.openai_api_key = ENV["OPENAI_API_KEY"] }
RubyLLM::Contract.configure { |c| c.default_model = "gpt-4.1-mini" }
```

Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
Without a contract, you use gpt-4.1 for everything because you can't tell when a cheaper model gets it wrong. With a contract, you start on nano and only escalate when the answer fails the contract:
```ruby
retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
```

```
Attempt 1: gpt-4.1-nano → contract failed  ($0.0001)
Attempt 2: gpt-4.1-mini → contract passed  ($0.0004)
           gpt-4.1      → never called     ($0.00)
```
Most requests succeed on the cheapest model. You pay full price only for the ones that need it. How many? Run `compare_models` and find out.
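The arithmetic behind that claim is easy to sketch. With hypothetical numbers, say nano passes 80% of requests: a failed nano attempt still costs its price, so the blended cost is the nano price plus the escalation share of the mini price.

```ruby
# Illustrative numbers, not measurements: per-call prices and the share of
# requests the cheapest model handles on its own.
nano_price     = 0.0001
mini_price     = 0.0004
full_price     = 0.0021
nano_pass_rate = 0.8

# Every request pays for the nano attempt; only failures also pay for mini.
blended = nano_price + (1 - nano_pass_rate) * mini_price

savings_pct = (1 - blended / full_price) * 100
blended.round(5)   # => 0.00018 per request
savings_pct.round  # => 91 (vs. sending every request to the full model)
```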
Don't guess. Define test cases, compare models, get numbers:
```ruby
ClassifyTicket.define_eval("regression") do
  add_case "billing", input: "I was charged twice",  expected: { priority: "high" }
  add_case "feature", input: "Add dark mode please", expected: { priority: "low" }
  add_case "outage",  input: "Database is down",     expected: { priority: "urgent" }
end
```
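The pass/fail check behind these cases is essentially expected-subset matching: a case passes when every `expected` key matches the model's output, and the score is the fraction of cases that pass. A plain-Ruby sketch with made-up model outputs (not the gem's internals):

```ruby
# A case passes when every expected key/value appears in the output.
def case_passed?(expected, output)
  expected.all? { |key, value| output[key] == value }
end

# Score = fraction of cases that pass.
def score(cases, outputs_by_name)
  passed = cases.count { |name, expected| case_passed?(expected, outputs_by_name[name]) }
  passed.to_f / cases.size
end

cases = {
  "billing" => { priority: "high" },
  "feature" => { priority: "low" },
  "outage"  => { priority: "urgent" },
}
# Pretend model outputs: this model got the outage case wrong.
outputs = {
  "billing" => { priority: "high", category: "billing" },
  "feature" => { priority: "low",  category: "feature" },
  "outage"  => { priority: "high", category: "infra" },
}
score(cases, outputs).round(2)  # => 0.67
```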
```ruby
comparison = ClassifyTicket.compare_models("regression",
  models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1])
```

```
Candidate       Score   Cost      Avg Latency
---------------------------------------------
gpt-4.1-nano    0.67    $0.0001   48ms
gpt-4.1-mini    1.00    $0.0004   92ms
gpt-4.1         1.00    $0.0021   210ms

Cheapest at 100%: gpt-4.1-mini
```
Nano fails on edge cases. Mini and full both score 100% — but mini is 5x cheaper. Now you know.
Don't read tables — get a recommendation. Supports model + `reasoning_effort` combinations:
```ruby
rec = ClassifyTicket.recommend("regression",
  candidates: [
    { model: "gpt-4.1-nano" },
    { model: "gpt-4.1-mini" },
    { model: "gpt-5-mini", reasoning_effort: "low" },
    { model: "gpt-5-mini", reasoning_effort: "high" },
  ],
  min_score: 0.95
)

rec.best         # => { model: "gpt-4.1-mini" }
rec.retry_chain  # => [{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini" }]
rec.to_dsl       # => "retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]"
rec.savings      # => savings vs your current model (if configured)
```

Copy `rec.to_dsl` into your step. Done.
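Conceptually, a recommendation like this reduces to three steps: drop candidates below `min_score`, take the cheapest survivor as the best pick, and prepend the cheaper candidates as the retry chain. A plain-Ruby sketch over illustrative eval numbers (not the gem's internals):

```ruby
# Hypothetical eval results: score and cost per candidate, cheapest first.
candidates = [
  { model: "gpt-4.1-nano", score: 0.67, cost: 0.0001 },
  { model: "gpt-4.1-mini", score: 1.00, cost: 0.0004 },
  { model: "gpt-4.1",      score: 1.00, cost: 0.0021 },
]

min_score = 0.95
passing = candidates.select { |c| c[:score] >= min_score }
best    = passing.min_by { |c| c[:cost] }

# Retry chain: every cheaper candidate is worth trying first, then the best.
chain = candidates.take_while { |c| c[:cost] < best[:cost] } + [best]

best[:model]                 # => "gpt-4.1-mini"
chain.map { |c| c[:model] }  # => ["gpt-4.1-nano", "gpt-4.1-mini"]
```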
A model update silently dropped your accuracy? A prompt tweak broke an edge case? You'll know before deploying:
```ruby
# Save a baseline once:
report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
report.save_baseline!(model: "gpt-4.1-nano")

# In CI — block merge if anything regressed:
expect(ClassifyTicket).to pass_eval("regression")
  .with_context(model: "gpt-4.1-nano")
  .without_regressions
```

```ruby
diff = report.compare_with_baseline(model: "gpt-4.1-nano")
diff.regressed?   # => true
diff.regressions  # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
diff.score_delta  # => -0.33
```

No more "it worked in the playground". Regressions are caught in CI, not production.
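Under the hood, regression detection is a per-case diff of pass/fail results against the stored baseline. A minimal plain-Ruby sketch of that comparison (hypothetical data shapes, not the gem's report format):

```ruby
# Per-case pass/fail: the saved baseline vs. the current run.
baseline = { "billing" => true, "feature" => true, "outage" => true }
current  = { "billing" => true, "feature" => true, "outage" => false }

# A regression is a case that passed before and fails now.
regressions = baseline.keys.select { |name| baseline[name] && !current[name] }

# Score = fraction of passing cases; delta is current minus baseline.
score = ->(runs) { runs.values.count(true).to_f / runs.size }
score_delta = (score[current] - score[baseline]).round(2)

regressions  # => ["outage"]
score_delta  # => -0.33
```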
Changed a prompt? Compare old vs new on the same dataset with regression safety:
```ruby
diff = ClassifyTicketV2.compare_with(ClassifyTicketV1,
  eval: "regression", model: "gpt-4.1-mini")
diff.safe_to_switch?  # => true (no regressions)
diff.improvements     # => [{case: "outage", ...}]
diff.score_delta      # => +0.33
```

```ruby
# CI gate:
expect(ClassifyTicketV2).to pass_eval("regression")
  .compared_with(ClassifyTicketV1)
  .with_minimum_score(0.8)
```

Pipeline stops at the first contract failure — no wasted tokens on downstream steps:
```ruby
class TicketPipeline < RubyLLM::Contract::Pipeline::Base
  step ClassifyTicket, as: :classify
  step RouteToTeam,    as: :route
  step DraftResponse,  as: :draft
end

result = TicketPipeline.run("I was charged twice")
result.outputs_by_step[:classify]  # => {priority: "high", category: "billing"}
result.trace.total_cost            # => $0.000128
```

```ruby
# RSpec — block merge if accuracy drops or cost spikes
expect(ClassifyTicket).to pass_eval("regression")
  .with_minimum_score(0.8)
  .with_maximum_cost(0.01)

# Rake — run all evals across all steps
RubyLLM::Contract::RakeTask.new do |t|
  t.minimum_score = 0.8
  t.maximum_cost = 0.05
end
# bundle exec rake ruby_llm_contract:eval
```

| Guide | |
|---|---|
|---|---|
| Getting Started | Features walkthrough, model escalation, eval |
| Eval-First | Practical workflow for prompt engineering with datasets, baselines, and A/B gates |
| Best Practices | 6 patterns for bulletproof validates |
| Output Schema | Full schema reference + constraints |
| Pipeline | Multi-step composition, timeout, fail-fast |
| Testing | Test adapter, RSpec matchers |
| Migration | Adopting the gem in existing Rails apps |
v0.6 (current): "What should I do?" — `Step.recommend` returns optimal model, reasoning effort, and retry chain. Per-attempt `reasoning_effort` in retry policies.
v0.5: Prompt A/B testing with `compare_with`. Soft observations with `observe`.
v0.4: Eval history, batch concurrency, pipeline per-step eval, Minitest, structured logging.
v0.3: Baseline regression detection, migration guide.
MIT