
Better quantized support investigation #108

@polvalente

Description

PR #96 added initial support for quantization and quantized operations.
However, the example claims that >100 tok/s is achievable.

We need to investigate the example for any improvements we can make, and possibly import it as a test or benchmark.

Running 8B-parameter models requires 16 GB+ of memory at fp16. With 4-bit quantization, the same model fits in ~5 GB, enabling inference on consumer Macs (see the memory sketch after the table below). This work is part of a broader effort to bring production LLM inference to the Elixir ecosystem:

| Repository | Purpose |
| --- | --- |
| bobby_posts | Pure Elixir Qwen3-8B inference (135 tok/s) |
| bobby_posts_adapters | LoRA fine-tuning for personalized generation |
| bumblebee_quantized | Quantized model loading for Bumblebee |
| safetensors_ex | MLX 4-bit safetensors format support |
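
As a rough sanity check on the memory numbers above, here is a minimal back-of-the-envelope sketch in plain Elixir (no dependencies). The group size of 64 and the fp16 scale/zero-point per group are assumptions modeled loosely on MLX-style 4-bit layouts, not measured values; the exact overhead depends on the format, and the KV cache and activations add to the total at runtime.

```elixir
# Back-of-the-envelope weight-memory estimate for an 8B-parameter model.
# Assumptions (not measured): 4-bit weights grouped in blocks of 64, with
# one fp16 scale and one fp16 zero-point per group (MLX-style layout).
defmodule QuantMemorySketch do
  @params 8.0e9

  # Bytes needed to store the raw weights at a given bit width.
  def weight_bytes(bits), do: @params * bits / 8

  # Extra bytes for the per-group fp16 scale + zero-point.
  def group_overhead_bytes(group_size), do: @params / group_size * (2 + 2)

  def report do
    to_gb = fn bytes -> Float.round(bytes / 1.0e9, 1) end

    fp16 = weight_bytes(16)
    q4 = weight_bytes(4) + group_overhead_bytes(64)

    IO.puts("fp16 weights : ~#{to_gb.(fp16)} GB")
    IO.puts("4-bit weights: ~#{to_gb.(q4)} GB (plus KV cache and activations)")
  end
end

QuantMemorySketch.report()
# fp16 weights : ~16.0 GB
# 4-bit weights: ~4.5 GB (plus KV cache and activations)
```

The ~4.5 GB figure for the quantized weights plus the runtime KV cache and activations is roughly consistent with the ~5 GB footprint mentioned above.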
