PR #96 added initial support for quantization and quantized operations.
However, the example claims that >100 tok/s is achievable.
We need to investigate the example for improvements we can make, and possibly import it as a test or benchmark.
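If the example is imported as a benchmark, verifying the >100 tok/s claim only needs a small timing harness. Below is a minimal sketch, assuming a single-token generation step can be exposed as a zero-arity function; `generate_fn` and the stub in the usage line are hypothetical stand-ins, not anything from PR #96:

```elixir
# Minimal throughput-benchmark sketch. `generate_fn` is a hypothetical
# zero-arity callback standing in for one token-generation step of the
# example; the timing itself uses only standard Erlang/Elixir.
defmodule TokensPerSecond do
  @doc "Returns tokens/second for `n_tokens` sequential generation steps."
  def measure(generate_fn, n_tokens \\ 256) do
    {micros, _result} =
      :timer.tc(fn ->
        Enum.each(1..n_tokens, fn _ -> generate_fn.() end)
      end)

    n_tokens / (micros / 1_000_000)
  end
end

# Usage with a stub that sleeps ~9 ms per "token" (~110 tok/s):
tok_s = TokensPerSecond.measure(fn -> Process.sleep(9) end, 64)
IO.puts("#{Float.round(tok_s, 1)} tok/s")
```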
For context: running an 8B-parameter model requires 16 GB+ of memory at fp16. With 4-bit quantization, the same model fits in roughly 5 GB, enabling inference on consumer Macs. This work is part of a broader effort to bring production LLM inference to the Elixir ecosystem.
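The ~5 GB figure can be reproduced with quick arithmetic. A sketch assuming MLX-style group quantization defaults (group_size 64, an fp16 scale and fp16 bias stored per group; treat those specifics as assumptions rather than anything pinned down by this issue):

```elixir
# Back-of-the-envelope memory math for the figures above. Assumes a
# group-wise 4-bit layout: group_size 64, with a 2-byte scale and a
# 2-byte bias per group (assumed defaults, not confirmed here).
params = 8_000_000_000

fp16_bytes = params * 2                    # 2 bytes per weight          -> 16.0 GB

group_size = 64
q4_payload = div(params, 2)                # 4 bits = 0.5 byte per weight -> 4.0 GB
q4_overhead = div(params, group_size) * 4  # scale + bias per group       -> 0.5 GB
q4_bytes = q4_payload + q4_overhead        # -> 4.5 GB, ~5 GB with runtime overhead

IO.puts("fp16:  #{fp16_bytes / 1.0e9} GB")
IO.puts("4-bit: #{q4_bytes / 1.0e9} GB")
```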