justi/ruby_llm-contract

ruby_llm-contract

The missing link between LLM cost and quality. Stop choosing between "cheap but wrong" and "accurate but expensive" — get both. Contracts, model escalation, and data-driven recommendations for ruby_llm.

  YOU WRITE                       THE GEM HANDLES                 YOU GET
  ─────────                       ───────────────                 ───────

  validate { |o| ... }            catch bad answers — combined     Zero garbage
                                  with retry_policy, auto-retry   in production

  retry_policy                    start cheap, escalate only      Pay for the cheapest
  models: %w[nano mini full]      when validation fails           model that works

  max_cost 0.01                   estimate tokens, check price,   No surprise bills
                                  refuse before calling LLM

  output_schema { ... }           send JSON schema to provider,   Zero parsing code
                                  validate response client-side

  define_eval { ... }             test cases + baselines,          Regressions caught
                                  run in CI with real LLM          before deploy

  recommend(candidates: [...])    evaluate all configs, pick      Optimal model +
                                  cheapest that passes            retry chain

Before and after

  ┌─────────────────────────────────────────────────────────────────┐
  │ BEFORE: pick one model, hope for the best                      │
  │                                                                 │
  │   expensive model → accurate, but you overpay on every call     │
  │   cheap model     → fast, but wrong answers slip to production  │
  │   prompt change   → "looks good to me" → deploy → users suffer │
  └─────────────────────────────────────────────────────────────────┘

                         ⬇  add ruby_llm-contract

  ┌─────────────────────────────────────────────────────────────────┐
  │ YOU DEFINE A CONTRACT                                            │
  │                                                                 │
  │   output_schema { string :priority }       ← valid structure   │
  │   validate("valid priority") { |o| ... }   ← business rules    │
  │   retry_policy models: %w[nano mini full]  ← escalation chain  │
  │   max_cost 0.01                            ← budget cap         │
  └───────────────────────────┬─────────────────────────────────────┘
                              │
                              ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │ THE GEM HANDLES THE REST                                        │
  │                                                                 │
  │   request ──→ ┌──────┐   ┌──────────┐                           │
  │               │ nano │─→ │ contract │──→ ✓ pass → done         │
  │               └──────┘   └────┬─────┘                           │
  │                               │ ✗ fail                          │
  │                               ▼                                 │
  │               ┌──────┐   ┌──────────┐                           │
  │               │ mini │─→ │ contract │──→ ✓ pass → done         │
  │               └──────┘   └────┬─────┘                           │
  │                               │ ✗ fail                          │
  │                               ▼                                 │
  │               ┌──────┐   ┌──────────┐                           │
  │               │ full │─→ │ contract │──→ ✓ pass → done         │
  │               └──────┘   └──────────┘                           │
  └───────────────────────────┬─────────────────────────────────────┘
                              │
                              ▼
  ┌─────────────────────────────────────────────────────────────────┐
  │ YOU GET                                                         │
  │                                                                 │
  │   ✓ Valid output guaranteed — schema + business rules enforced  │
  │   ✓ Cheapest model that works — most requests stay on nano     │
  │   ✓ Cost, latency, tokens — tracked on every call              │
  │   ✓ Eval scores per model — data instead of gut feeling        │
  │   ✓ Regressions caught — before deploy, not after              │
  │   ✓ Recommendation — "use nano+mini, drop full, save $X/mo"   │
  └─────────────────────────────────────────────────────────────────┘

30-second version

class ClassifyTicket < RubyLLM::Contract::Step::Base
  prompt "Classify this support ticket by priority and category.\n\n{input}"

  output_schema do
    string :priority, enum: %w[low medium high urgent]
    string :category
  end

  validate("urgent needs justification") { |o, input| o[:priority] != "urgent" || input.length > 20 }
  retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
end

result = ClassifyTicket.run("I was charged twice")
result.parsed_output  # => {priority: "high", category: "billing"}
result.trace[:model]  # => "gpt-4.1-nano" (first model that passed)
result.trace[:cost]   # => 0.000032

Bad JSON? Retried automatically. Wrong answer? Escalated to a smarter model. Schema violated? Caught client-side. The contract guarantees every response meets your rules — you pay for the cheapest model that passes.
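The "caught client-side" part can be sketched in plain Ruby. This is an illustration of the idea, not the gem's implementation; `SCHEMA` and `schema_valid?` are hypothetical stand-ins for what the `output_schema` block declares:

```ruby
# What the output_schema above boils down to on the client side:
# each field must have the right type, and enum fields must use an allowed value.
SCHEMA = { priority: %w[low medium high urgent], category: String }

def schema_valid?(output)
  return false unless output[:priority].is_a?(String) &&
                      SCHEMA[:priority].include?(output[:priority])
  output[:category].is_a?(SCHEMA[:category])
end

schema_valid?({ priority: "high", category: "billing" })  # => true
schema_valid?({ priority: "sev1", category: "billing" })  # => false, triggers a retry
```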

Install

gem "ruby_llm-contract"
RubyLLM.configure { |c| c.openai_api_key = ENV["OPENAI_API_KEY"] }
RubyLLM::Contract.configure { |c| c.default_model = "gpt-4.1-mini" }

Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc.).
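Switching providers is a matter of configuring the matching key in ruby_llm itself (a sketch; key names assumed to follow ruby_llm's configure API):

```ruby
# Configure additional providers alongside (or instead of) OpenAI.
RubyLLM.configure do |c|
  c.anthropic_api_key = ENV["ANTHROPIC_API_KEY"]
  c.gemini_api_key    = ENV["GEMINI_API_KEY"]
end
```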

Save money with model escalation

Without a contract, you use gpt-4.1 for everything because you can't tell when a cheaper model gets it wrong. With a contract, you start on nano and only escalate when the answer fails the contract:

retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
Attempt 1: gpt-4.1-nano  → contract failed  ($0.0001)
Attempt 2: gpt-4.1-mini  → contract passed  ($0.0004)
           gpt-4.1       → never called      ($0.00)

Most requests succeed on the cheapest model. You pay full price only for the ones that need it. How many? Run compare_models and find out.
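Conceptually, the escalation loop works like this. A plain-Ruby sketch of the idea, not the gem's code; `call_llm` and `contract_passes?` are hypothetical stand-ins for the provider call and the contract check:

```ruby
# The escalation chain: cheapest model first.
MODELS = %w[nano mini full]

# Stand-in for the provider call. Here nano gets the answer wrong;
# mini and full get it right.
def call_llm(model, _input)
  model == "nano" ? { priority: "low" } : { priority: "high" }
end

# Stand-in for the contract (schema + validate blocks).
def contract_passes?(output)
  output[:priority] == "high"
end

# Try each model in order; return on the first output that passes.
def run_with_escalation(input)
  MODELS.each do |model|
    output = call_llm(model, input)
    return { model: model, output: output } if contract_passes?(output)
  end
  raise "all models failed the contract"
end

result = run_with_escalation("I was charged twice")
result[:model]  # => "mini" -- nano failed, full was never called
```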

Know which model to use — with data

Don't guess. Define test cases, compare models, get numbers:

ClassifyTicket.define_eval("regression") do
  add_case "billing", input: "I was charged twice", expected: { priority: "high" }
  add_case "feature", input: "Add dark mode please", expected: { priority: "low" }
  add_case "outage",  input: "Database is down",    expected: { priority: "urgent" }
end

comparison = ClassifyTicket.compare_models("regression",
  models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1])
Candidate                  Score       Cost  Avg Latency
---------------------------------------------------------
gpt-4.1-nano                0.67    $0.0001         48ms
gpt-4.1-mini                1.00    $0.0004         92ms
gpt-4.1                     1.00    $0.0021        210ms

Cheapest at 100%: gpt-4.1-mini

Nano fails on edge cases. Mini and full both score 100% — but mini is 5x cheaper. Now you know.

Let the gem tell you what to do

Don't read tables — get a recommendation. Supports model + reasoning_effort combinations:

rec = ClassifyTicket.recommend("regression",
  candidates: [
    { model: "gpt-4.1-nano" },
    { model: "gpt-4.1-mini" },
    { model: "gpt-5-mini", reasoning_effort: "low" },
    { model: "gpt-5-mini", reasoning_effort: "high" },
  ],
  min_score: 0.95
)

rec.best           # => { model: "gpt-4.1-mini" }
rec.retry_chain    # => [{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini" }]
rec.to_dsl         # => "retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]"
rec.savings        # => savings vs your current model (if configured)

Copy rec.to_dsl into your step. Done.
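After pasting, the step might look like this (a sketch; the rest of the step body is abbreviated from the example above):

```ruby
class ClassifyTicket < RubyLLM::Contract::Step::Base
  # ...prompt, output_schema, validate as before...

  # Pasted from rec.to_dsl: the cheapest chain that clears min_score.
  retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]
end
```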

Catch regressions before users do

A model update silently dropped your accuracy? A prompt tweak broke an edge case? You'll know before deploying:

# Save a baseline once:
report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
report.save_baseline!(model: "gpt-4.1-nano")

# Later, compare a fresh run against it:
diff = report.compare_with_baseline(model: "gpt-4.1-nano")
diff.regressed?    # => true
diff.regressions   # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
diff.score_delta   # => -0.33

# In CI — block merge if anything regressed:
expect(ClassifyTicket).to pass_eval("regression")
  .with_context(model: "gpt-4.1-nano")
  .without_regressions

No more "it worked in the playground". Regressions are caught in CI, not production.

A/B test your prompts

Changed a prompt? Compare old vs new on the same dataset with regression safety:

diff = ClassifyTicketV2.compare_with(ClassifyTicketV1,
  eval: "regression", model: "gpt-4.1-mini")

diff.safe_to_switch?  # => true (no regressions)
diff.improvements     # => [{case: "outage", ...}]
diff.score_delta      # => +0.33
# CI gate:
expect(ClassifyTicketV2).to pass_eval("regression")
  .compared_with(ClassifyTicketV1)
  .with_minimum_score(0.8)

Chain steps with fail-fast

Pipeline stops at the first contract failure — no wasted tokens on downstream steps:

class TicketPipeline < RubyLLM::Contract::Pipeline::Base
  step ClassifyTicket,  as: :classify
  step RouteToTeam,     as: :route
  step DraftResponse,   as: :draft
end

result = TicketPipeline.run("I was charged twice")
result.outputs_by_step[:classify]   # => {priority: "high", category: "billing"}
result.trace.total_cost             # => $0.000128

Gate merges on quality and cost

# RSpec — block merge if accuracy drops or cost spikes
expect(ClassifyTicket).to pass_eval("regression")
  .with_minimum_score(0.8)
  .with_maximum_cost(0.01)

# Rake — run all evals across all steps
RubyLLM::Contract::RakeTask.new do |t|
  t.minimum_score = 0.8
  t.maximum_cost = 0.05
end
# bundle exec rake ruby_llm_contract:eval

Docs

Getting Started: features walkthrough, model escalation, eval
Eval-First: practical workflow for prompt engineering with datasets, baselines, and A/B gates
Best Practices: 6 patterns for bulletproof validates
Output Schema: full schema reference + constraints
Pipeline: multi-step composition, timeout, fail-fast
Testing: test adapter, RSpec matchers
Migration: adopting the gem in existing Rails apps

Roadmap

v0.6 (current): "What should I do?" — Step.recommend returns optimal model, reasoning effort, and retry chain. Per-attempt reasoning_effort in retry policies.

v0.5: Prompt A/B testing with compare_with. Soft observations with observe.

v0.4: Eval history, batch concurrency, pipeline per-step eval, Minitest, structured logging.

v0.3: Baseline regression detection, migration guide.

License

MIT
