PR #96 added initial support for quantization and quantized operations.
However, the example claims that >100 tok/s is achievable.
We need to investigate the example for improvements we can make, and possibly import it as a test or benchmark.
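If the example is imported as a benchmark, verifying the >100 tok/s claim only needs a small timing harness. Below is a minimal sketch, assuming a single-token generation step can be exposed as a zero-arity function; `generate_fn` and the stub in the usage line are hypothetical stand-ins, not anything from PR #96:

```elixir
# Minimal throughput-benchmark sketch. `generate_fn` is a hypothetical
# zero-arity callback standing in for one token-generation step of the
# example; the timing itself uses only standard Erlang/Elixir.
defmodule TokensPerSecond do
  @doc "Returns tokens/second for `n_tokens` sequential generation steps."
  def measure(generate_fn, n_tokens \\ 256) do
    {micros, _result} =
      :timer.tc(fn ->
        Enum.each(1..n_tokens, fn _ -> generate_fn.() end)
      end)

    n_tokens / (micros / 1_000_000)
  end
end

# Usage with a stub that sleeps ~9 ms per "token" (~110 tok/s):
tok_s = TokensPerSecond.measure(fn -> Process.sleep(9) end, 64)
IO.puts("#{Float.round(tok_s, 1)} tok/s")
```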
For context: running an 8B-parameter model requires 16 GB+ of memory at fp16. With 4-bit quantization, the same model fits in roughly 5 GB, enabling inference on consumer Macs. This work is part of a broader effort to bring production LLM inference to the Elixir ecosystem.
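The ~5 GB figure can be reproduced with quick arithmetic. A sketch assuming MLX-style group quantization defaults (group_size 64, an fp16 scale and fp16 bias stored per group; treat those specifics as assumptions rather than anything pinned down by this issue):

```elixir
# Back-of-the-envelope memory math for the figures above. Assumes a
# group-wise 4-bit layout: group_size 64, with a 2-byte scale and a
# 2-byte bias per group (assumed defaults, not confirmed here).
params = 8_000_000_000

fp16_bytes = params * 2                    # 2 bytes per weight          -> 16.0 GB

group_size = 64
q4_payload = div(params, 2)                # 4 bits = 0.5 byte per weight -> 4.0 GB
q4_overhead = div(params, group_size) * 4  # scale + bias per group       -> 0.5 GB
q4_bytes = q4_payload + q4_overhead        # -> 4.5 GB, ~5 GB with runtime overhead

IO.puts("fp16:  #{fp16_bytes / 1.0e9} GB")
IO.puts("4-bit: #{q4_bytes / 1.0e9} GB")
```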