
Granite Switch — Fine-tuning, finally composable


| Browse Adapters | Models on HF | Tutorials |

Task-specific fine-tuning delivers large accuracy gains on small models — but shipping a separate model per task is operationally painful. Granite Switch gives you the accuracy of many models with the footprint of one: compose a single checkpoint from our adapter library in minutes, then swap or upgrade individual capabilities as your needs change.

Browse the full set of ready-to-use adapters in the Granite Libraries collection on Hugging Face.

Key Features

  • Composable — Combine independently trained adapters into one checkpoint, whether IBM's or yours. Swap, upgrade, or customize without retraining.
  • Fast — Built on IBM's Activated LoRA technology for efficient KV cache reuse, low latency, and high inference throughput.
  • Accurate — Task-specific adapters can match or even surpass the accuracy of significantly larger generalist models, at a fraction of the serving cost. For a concrete benchmark example, see the Hallucination Detection adapter in the RAG adapter library.
  • Inference-ready — Support for Hugging Face and vLLM.

Quick Start

Install

git clone https://github.com/generative-computing/granite-switch.git
cd granite-switch
python -m venv venv && source venv/bin/activate

# Pick what you need:
pip install -e ".[compose]" # Compose modular models
pip install -e ".[hf]"      # HuggingFace inference
pip install -e ".[vllm]"    # vLLM production inference (0.19.x)
pip install -e ".[vllm20]"  # vLLM 0.20+ (requires CUDA 13+)
pip install -e ".[dev]"     # Everything (uses vLLM 0.19.x by default)
pip install -e ".[dev-vllm20]" # Dev environment with vLLM 0.20+

Requires Python 3.9+ and PyTorch 2.0+.

vLLM version note: This project currently defaults to vLLM 0.19.1 due to vLLM 0.20's dependency on CUDA 13.0+ (via PyTorch 2.11), which is incompatible with many existing environments running CUDA 12.x drivers. Use .[vllm20] if your environment supports CUDA 13+.
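If you are not sure which extra applies to your machine, a quick check (assuming PyTorch is already installed in the virtual environment) is to print the CUDA toolkit version your PyTorch build was compiled against; nvidia-smi separately reports the highest CUDA version your driver supports:

# Quick environment check before picking a vLLM extra.
# torch.version.cuda is the CUDA toolkit the installed PyTorch wheel was built with;
# compare it against the 12.x vs. 13+ split described above.
import torch
print(torch.version.cuda, torch.cuda.is_available())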

Compose a Model

Combine a base Granite model with adapters into a single deployable checkpoint:

python -m granite_switch.composer.compose_granite_switch \
  --base-model ibm-granite/granite-4.1-3b \
  --adapters ibm-granite/granitelib-core-r1.0 ibm-granite/granitelib-rag-r1.0 ibm-granite/granitelib-guardian-r1.0 \
  --output ./my-model

This downloads the base model, embeds compatible LoRA adapters (with a preference towards activated LoRA), adds control tokens and a chat template, and produces a model directory that works with both HuggingFace and vLLM.
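As a quick sanity check, the output directory can be inspected and loaded like any other Hugging Face checkpoint. A minimal sketch (the exact file layout may differ between releases; only the presence of a chat template is implied by the compose step above):

# Inspect the composed checkpoint produced by the compose step.
from pathlib import Path
from transformers import AutoTokenizer

out = Path("./my-model")
print(sorted(p.name for p in out.iterdir()))  # config, weights, tokenizer files, ...

tok = AutoTokenizer.from_pretrained(out)
print(tok.chat_template is not None)          # the composed chat template drives adapter selection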

For convenience, you can find already composed Granite Switch models for the Granite 4.1 model family here:

Run Inference

vLLM + Mellea (recommended):

pip install mellea
python -m vllm.entrypoints.openai.api_server --model ./my-model --port 8000

from mellea.backends.openai import OpenAIBackend
from mellea.stdlib.components.intrinsic import rag
from mellea.stdlib.context import ChatContext

# Point an OpenAI-compatible backend at the local vLLM server, then register
# the composed checkpoint so Mellea knows which adapters are embedded in it.
backend = OpenAIBackend(
    model_id="./my-model",
    base_url="http://localhost:8000/v1",
    api_key="unused",
)
backend.register_embedded_adapter_model("./my-model")

query = "I want to ask you something. what is...mmmm the the main city(capital you call it,right?) of France?"
ctx = ChatContext()

# Invoke the question-rewrite intrinsic from the RAG adapter library.
rewritten = rag.rewrite_question(query, ctx, backend)
print(f"original:  {query}")
print(f"rewritten: {rewritten}")
# => "What is the capital of France?"
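The same vLLM process also serves a standard OpenAI-compatible endpoint, so plain chat requests work with any OpenAI client. The adapter intrinsics above are routed through Mellea; the sketch below is just an ordinary chat completion against the composed model, using the openai package:

from openai import OpenAI

# Same local vLLM server as above; the api_key value is ignored by the local endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="./my-model",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(resp.choices[0].message.content)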

HuggingFace:

import granite_switch.hf  # Register HF backend

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./my-model", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("./my-model")

messages = [{"role": "user", "content": "What is the capital of France?"}]
documents = [{"doc_id": "1", "text": "Paris is the capital of France."}]

prompt = tokenizer.apply_chat_template(
    messages,
    documents=documents,
    adapter_name="answerability",  # activates the answerability adapter
    add_generation_prompt=True,
    tokenize=False,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=16)
# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(outputs[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
# => "answerable"
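To see which control tokens composition added, they should appear as extra special tokens on the tokenizer (an assumption based on the compose step's description, not a documented guarantee):

# Control tokens added during composition are expected to show up here.
print(tokenizer.additional_special_tokens)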

How It Works

Granite Switch uses a switch layer—a small attention-based mechanism that reads control tokens from the input and determines which adapter's LoRA weights to apply at each position.

What makes composition work:

  • KV cache normalization — Each adapter sees only the base model's KV cache, never another adapter's internal state
  • No joint training required — Adapters can be developed, tested, and published independently
  • Standard inference — The entire model loads in vLLM with zero code changes
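To make the per-position idea concrete, here is a deliberately simplified toy sketch. It is not the Granite Switch implementation: it replaces the attention-based reading of control tokens with an explicit per-token adapter index, and uses a single linear layer rather than a full transformer block. The point is only the mechanism of selecting one LoRA delta from a bank at every position while leaving the pure base path untouched elsewhere:

import torch
import torch.nn as nn

class ToySwitchLinear(nn.Module):
    """Toy illustration only: per-position choice of a LoRA delta from an adapter bank."""

    def __init__(self, d_model, rank, num_adapters):
        super().__init__()
        self.base = nn.Linear(d_model, d_model, bias=False)        # stands in for a frozen base projection
        self.lora_A = nn.Parameter(torch.randn(num_adapters, d_model, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(num_adapters, rank, d_model))

    def forward(self, x, adapter_ids):
        # x: (batch, seq, d_model); adapter_ids: (batch, seq), -1 means "base model only"
        out = self.base(x)
        A = self.lora_A[adapter_ids.clamp(min=0)]                  # (batch, seq, d_model, rank)
        B = self.lora_B[adapter_ids.clamp(min=0)]                  # (batch, seq, rank, d_model)
        delta = torch.einsum("bsd,bsdr,bsre->bse", x, A, B)        # per-position low-rank update
        mask = (adapter_ids >= 0).unsqueeze(-1).to(delta.dtype)    # zero the delta where no adapter applies
        return out + mask * delta

x = torch.randn(1, 6, 64)
adapter_ids = torch.tensor([[-1, -1, 0, 0, 1, 1]])                 # switch adapters mid-sequence
print(ToySwitchLinear(64, 8, num_adapters=2)(x, adapter_ids).shape)  # torch.Size([1, 6, 64])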

Documentation

For detailed tutorials and many working examples, see the Tutorials section.

Citation

@software{granite_switch,
  title  = {Granite Switch: Coarse-Grained Expert Switching for LLMs},
  author = {IBM Research},
  year   = {2025},
  url    = {https://github.com/ibm-granite/granite-switch}
}

IBM ❤️ Open Source AI

Granite Switch was started by IBM Research.

License

Granite Switch has an Apache-2.0 license, as found in the LICENSE file.
