The missing link between LLM cost and quality. Stop choosing between "cheap but wrong" and "accurate but expensive" — get both. Contracts, model escalation, and data-driven recommendations for ruby_llm.
| You write | The gem handles | You get |
|---|---|---|
| `validate { \|o\| ... }` combined with `retry_policy` | catch bad answers, auto-retry in production | Zero garbage |
| `retry_policy models: %w[nano mini full]` | start cheap, escalate only when validation fails | Pay for the cheapest model that works |
| `max_cost 0.01` | estimate tokens, check price, refuse before calling the LLM | No surprise bills |
| `output_schema { ... }` | send JSON schema to provider, validate response client-side | Zero parsing code |
| `define_eval { ... }` | test cases + baselines, run in CI with a real LLM | Regressions caught before deploy |
| `recommend(candidates: [...])` | evaluate all configs, pick the cheapest that passes | Optimal model + retry chain |
┌─────────────────────────────────────────────────────────────────┐
│ BEFORE: pick one model, hope for the best │
│ │
│ expensive model → accurate, but you overpay on every call │
│ cheap model → fast, but wrong answers slip to production │
│ prompt change → "looks good to me" → deploy → users suffer │
└─────────────────────────────────────────────────────────────────┘
⬇ add ruby_llm-contract
┌─────────────────────────────────────────────────────────────────┐
│ YOU DEFINE A CONTRACT │
│ │
│ output_schema { string :priority } ← valid structure │
│ validate("valid priority") { |o| ... } ← business rules │
│ retry_policy models: %w[nano mini full] ← escalation chain │
│ max_cost 0.01 ← budget cap │
└───────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ THE GEM HANDLES THE REST │
│ │
│ request ──→ ┌──────┐ ┌──────────┐ │
│ │ nano │─→ │ contract │──→ ✓ pass → done │
│ └──────┘ └────┬─────┘ │
│ │ ✗ fail │
│ ▼ │
│ ┌──────┐ ┌──────────┐ │
│ │ mini │─→ │ contract │──→ ✓ pass → done │
│ └──────┘ └────┬─────┘ │
│ │ ✗ fail │
│ ▼ │
│ ┌──────┐ ┌──────────┐ │
│ │ full │─→ │ contract │──→ ✓ pass → done │
│ └──────┘ └──────────┘ │
└───────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ YOU GET │
│ │
│ ✓ Valid output guaranteed — schema + business rules enforced │
│ ✓ Cheapest model that works — most requests stay on nano │
│ ✓ Cost, latency, tokens — tracked on every call │
│ ✓ Eval scores per model — data instead of gut feeling │
│ ✓ Regressions caught — before deploy, not after │
│ ✓ Recommendation — "use nano+mini, drop full, save $X/mo" │
└─────────────────────────────────────────────────────────────────┘
```ruby
class ClassifyTicket < RubyLLM::Contract::Step::Base
  prompt "Classify this support ticket by priority and category.\n\n{input}"

  output_schema do
    string :priority, enum: %w[low medium high urgent]
    string :category
  end

  validate("urgent needs justification") { |o, input| o[:priority] != "urgent" || input.length > 20 }

  retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
end
```
```ruby
result = ClassifyTicket.run("I was charged twice")
result.parsed_output  # => {priority: "high", category: "billing"}
result.trace[:model]  # => "gpt-4.1-nano" (first model that passed)
result.trace[:cost]   # => 0.000032
```

Bad JSON? Retried automatically. Wrong answer? Escalated to a smarter model. Schema violated? Caught client-side. The contract guarantees every response meets your rules — you pay for the cheapest model that passes.
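The validate-then-escalate loop can be sketched in a few lines of plain Ruby. This is an illustration, not the gem's internals: `call_model`, `run_with_escalation`, and the model names are stand-ins.

```ruby
require "json"

# Hypothetical stand-in for a provider call; the real gem would hit the API.
def call_model(model, prompt)
  { priority: "high", category: "billing" }.to_json
end

# Try each model in order; return the first output that parses as JSON
# and passes every validation rule. Returns nil if the whole chain fails.
def run_with_escalation(models, prompt, rules)
  models.each do |model|
    output = begin
      JSON.parse(call_model(model, prompt), symbolize_names: true)
    rescue JSON::ParserError
      next  # bad JSON counts as a failed attempt
    end
    return { model: model, output: output } if rules.all? { |rule| rule.call(output) }
  end
  nil
end

rules = [->(o) { %w[low medium high urgent].include?(o[:priority]) }]
result = run_with_escalation(%w[nano mini full], "I was charged twice", rules)
result[:model]  # => "nano" (cheapest model passed, no escalation needed)
```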
```ruby
gem "ruby_llm-contract"
```

```ruby
RubyLLM.configure { |c| c.openai_api_key = ENV["OPENAI_API_KEY"] }
RubyLLM::Contract.configure { |c| c.default_model = "gpt-4.1-mini" }
```

Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
Without a contract, you use gpt-4.1 for everything because you can't tell when a cheaper model gets it wrong. With a contract, you start on nano and only escalate when the answer fails the contract:
```ruby
retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
```

```
Attempt 1: gpt-4.1-nano → contract failed  ($0.0001)
Attempt 2: gpt-4.1-mini → contract passed  ($0.0004)
           gpt-4.1      → never called     ($0.00)
```
Most requests succeed on the cheapest model. You pay full price only for the ones that need it. How many? Run `compare_models` and find out.
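The arithmetic behind that claim is easy to sketch. With hypothetical numbers, say nano passes 80% of requests: a failed nano attempt still costs its price, so the blended cost is the nano price plus the escalation share of the mini price.

```ruby
# Illustrative numbers, not measurements: per-call prices and the share of
# requests the cheapest model handles on its own.
nano_price     = 0.0001
mini_price     = 0.0004
full_price     = 0.0021
nano_pass_rate = 0.8

# Every request pays for the nano attempt; only failures also pay for mini.
blended = nano_price + (1 - nano_pass_rate) * mini_price

savings_pct = (1 - blended / full_price) * 100
blended.round(5)   # => 0.00018 per request
savings_pct.round  # => 91 (vs. sending every request to the full model)
```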
Don't guess. Define test cases, compare models, get numbers:
```ruby
ClassifyTicket.define_eval("regression") do
  add_case "billing", input: "I was charged twice",  expected: { priority: "high" }
  add_case "feature", input: "Add dark mode please", expected: { priority: "low" }
  add_case "outage",  input: "Database is down",     expected: { priority: "urgent" }
end
```
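The pass/fail check behind these cases is essentially expected-subset matching: a case passes when every `expected` key matches the model's output, and the score is the fraction of cases that pass. A plain-Ruby sketch with made-up model outputs (not the gem's internals):

```ruby
# A case passes when every expected key/value appears in the output.
def case_passed?(expected, output)
  expected.all? { |key, value| output[key] == value }
end

# Score = fraction of cases that pass.
def score(cases, outputs_by_name)
  passed = cases.count { |name, expected| case_passed?(expected, outputs_by_name[name]) }
  passed.to_f / cases.size
end

cases = {
  "billing" => { priority: "high" },
  "feature" => { priority: "low" },
  "outage"  => { priority: "urgent" },
}
# Pretend model outputs: this model got the outage case wrong.
outputs = {
  "billing" => { priority: "high", category: "billing" },
  "feature" => { priority: "low",  category: "feature" },
  "outage"  => { priority: "high", category: "infra" },
}
score(cases, outputs).round(2)  # => 0.67
```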
```ruby
comparison = ClassifyTicket.compare_models("regression",
  models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1])
```

```
Candidate       Score   Cost      Avg Latency
---------------------------------------------
gpt-4.1-nano    0.67    $0.0001   48ms
gpt-4.1-mini    1.00    $0.0004   92ms
gpt-4.1         1.00    $0.0021   210ms

Cheapest at 100%: gpt-4.1-mini
```
Nano fails on edge cases. Mini and full both score 100% — but mini is 5x cheaper. Now you know.
Don't read tables — get a recommendation. Supports model + `reasoning_effort` combinations:
```ruby
rec = ClassifyTicket.recommend("regression",
  candidates: [
    { model: "gpt-4.1-nano" },
    { model: "gpt-4.1-mini" },
    { model: "gpt-5-mini", reasoning_effort: "low" },
    { model: "gpt-5-mini", reasoning_effort: "high" },
  ],
  min_score: 0.95
)

rec.best         # => { model: "gpt-4.1-mini" }
rec.retry_chain  # => [{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini" }]
rec.to_dsl       # => "retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]"
rec.savings      # => savings vs your current model (if configured)
```

Copy `rec.to_dsl` into your step. Done.
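Conceptually, a recommendation like this reduces to three steps: drop candidates below `min_score`, take the cheapest survivor as the best pick, and prepend the cheaper candidates as the retry chain. A plain-Ruby sketch over illustrative eval numbers (not the gem's internals):

```ruby
# Hypothetical eval results: score and cost per candidate, cheapest first.
candidates = [
  { model: "gpt-4.1-nano", score: 0.67, cost: 0.0001 },
  { model: "gpt-4.1-mini", score: 1.00, cost: 0.0004 },
  { model: "gpt-4.1",      score: 1.00, cost: 0.0021 },
]

min_score = 0.95
passing = candidates.select { |c| c[:score] >= min_score }
best    = passing.min_by { |c| c[:cost] }

# Retry chain: every cheaper candidate is worth trying first, then the best.
chain = candidates.take_while { |c| c[:cost] < best[:cost] } + [best]

best[:model]                 # => "gpt-4.1-mini"
chain.map { |c| c[:model] }  # => ["gpt-4.1-nano", "gpt-4.1-mini"]
```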
A model update silently dropped your accuracy? A prompt tweak broke an edge case? You'll know before deploying:
```ruby
# Save a baseline once:
report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
report.save_baseline!(model: "gpt-4.1-nano")

# In CI — block merge if anything regressed:
expect(ClassifyTicket).to pass_eval("regression")
  .with_context(model: "gpt-4.1-nano")
  .without_regressions
```

```ruby
diff = report.compare_with_baseline(model: "gpt-4.1-nano")
diff.regressed?   # => true
diff.regressions  # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
diff.score_delta  # => -0.33
```

No more "it worked in the playground". Regressions are caught in CI, not production.
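Under the hood, regression detection is a per-case diff of pass/fail results against the stored baseline. A minimal plain-Ruby sketch of that comparison (hypothetical data shapes, not the gem's report format):

```ruby
# Per-case pass/fail: the saved baseline vs. the current run.
baseline = { "billing" => true, "feature" => true, "outage" => true }
current  = { "billing" => true, "feature" => true, "outage" => false }

# A regression is a case that passed before and fails now.
regressions = baseline.keys.select { |name| baseline[name] && !current[name] }

# Score = fraction of passing cases; delta is current minus baseline.
score = ->(runs) { runs.values.count(true).to_f / runs.size }
score_delta = (score[current] - score[baseline]).round(2)

regressions  # => ["outage"]
score_delta  # => -0.33
```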
Changed a prompt? Compare old vs new on the same dataset with regression safety:
```ruby
diff = ClassifyTicketV2.compare_with(ClassifyTicketV1,
  eval: "regression", model: "gpt-4.1-mini")
diff.safe_to_switch?  # => true (no regressions)
diff.improvements     # => [{case: "outage", ...}]
diff.score_delta      # => +0.33
```

```ruby
# CI gate:
expect(ClassifyTicketV2).to pass_eval("regression")
  .compared_with(ClassifyTicketV1)
  .with_minimum_score(0.8)
```

Pipeline stops at the first contract failure — no wasted tokens on downstream steps:
```ruby
class TicketPipeline < RubyLLM::Contract::Pipeline::Base
  step ClassifyTicket, as: :classify
  step RouteToTeam,    as: :route
  step DraftResponse,  as: :draft
end

result = TicketPipeline.run("I was charged twice")
result.outputs_by_step[:classify]  # => {priority: "high", category: "billing"}
result.trace.total_cost            # => $0.000128
```

```ruby
# RSpec — block merge if accuracy drops or cost spikes
expect(ClassifyTicket).to pass_eval("regression")
  .with_minimum_score(0.8)
  .with_maximum_cost(0.01)

# Rake — run all evals across all steps
RubyLLM::Contract::RakeTask.new do |t|
  t.minimum_score = 0.8
  t.maximum_cost = 0.05
end
# bundle exec rake ruby_llm_contract:eval
```

| Guide | |
|---|---|
|---|---|
| Getting Started | Features walkthrough, model escalation, eval |
| Eval-First | Practical workflow for prompt engineering with datasets, baselines, and A/B gates |
| Best Practices | 6 patterns for bulletproof validates |
| Output Schema | Full schema reference + constraints |
| Pipeline | Multi-step composition, timeout, fail-fast |
| Testing | Test adapter, RSpec matchers |
| Migration | Adopting the gem in existing Rails apps |
v0.6 (current): "What should I do?" — `Step.recommend` returns optimal model, reasoning effort, and retry chain. Per-attempt `reasoning_effort` in retry policies.
v0.5: Prompt A/B testing with `compare_with`. Soft observations with `observe`.
v0.4: Eval history, batch concurrency, pipeline per-step eval, Minitest, structured logging.
v0.3: Baseline regression detection, migration guide.
MIT