Published in GoPenAI · 7 min read · 1453 words · English
oMLX makes long-context local agents practical on Macs. With a 35B model and an 8,700-token context, its tiered KV cache cuts prefill from 20.35 s to 2.95 s on an M1 Max and to 1.01 s on an M4 Max. Generation speed is largely unchanged but already fast, and oMLX also adds continuous batching, an OpenAI-compatible API, and multi-model serving.
- The author found raw Apple MLX too slow for running local AI agents on MacBooks with a 35B model, especially during long-context prefill.
- On an M1 Max, prefilling an 8,700-token context took about 20.35 seconds with raw MLX but only 2.95 seconds with oMLX; on an M4 Max, oMLX brought it down to about 1.01 seconds.
- oMLX did not significantly improve token generation speed, but generation was already fast enough at roughly 42 tok/s on M1 Max and 95 tok/s on M4 Max.
- The main performance gain from oMLX came from prefill speed, which the article reports as improving about 5.1x on M1 Max and 5.7x on M4 Max.
- oMLX's tiered KV cache keeps processed context in memory or on SSD, so later requests can reuse it instead of reprocessing the entire context from scratch.
- The article says oMLX adds continuous batching, an OpenAI-compatible API, and multi-model serving, making it practical for tools like Cursor or Claude Code.
- The author concludes that oMLX makes even an older M1 Max usable for real AI agent work, while more demanding dense models may still need newer Macs like an M4 Max or M5 Max.
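
To make the tiered-cache idea concrete, here is a minimal Python sketch of a two-tier prefix cache. It is not oMLX's implementation: the names `TieredPrefixCache` and `prefill`, the block size, and the dictionary standing in for cached state are all assumptions for illustration. A real KV cache stores attention key/value tensors, while this toy only counts how many tokens would still need processing.

```python
import hashlib
import pickle
import tempfile
from collections import OrderedDict
from pathlib import Path


class TieredPrefixCache:
    """Toy two-tier cache (hypothetical): an in-memory LRU backed by files on disk."""

    def __init__(self, cold_dir, hot_capacity=2):
        self.hot = OrderedDict()        # hot tier: key -> cached state, in RAM
        self.cold_dir = Path(cold_dir)  # cold tier: a directory standing in for the SSD
        self.hot_capacity = hot_capacity

    @staticmethod
    def key(tokens):
        # Identical token prefixes hash to the same cache entry.
        return hashlib.sha256(repr(tuple(tokens)).encode()).hexdigest()

    def put(self, tokens, state):
        k = self.key(tokens)
        self.hot[k] = state
        self.hot.move_to_end(k)
        while len(self.hot) > self.hot_capacity:
            # Spill the least recently used entry to disk instead of dropping it.
            old_key, old_state = self.hot.popitem(last=False)
            (self.cold_dir / old_key).write_bytes(pickle.dumps(old_state))

    def get(self, tokens):
        k = self.key(tokens)
        if k in self.hot:
            self.hot.move_to_end(k)
            return self.hot[k], "hot"
        path = self.cold_dir / k
        if path.exists():
            state = pickle.loads(path.read_bytes())
            self.put(tokens, state)  # promote back into the hot tier
            return state, "cold"
        return None, "miss"


def prefill(cache, tokens, block_size=4):
    """Return (tokens still to process, tier that served the reused prefix)."""
    reused, tier = 0, "miss"
    n_full = len(tokens) - len(tokens) % block_size
    # Look for the longest block-aligned prefix that is already cached.
    for end in range(n_full, 0, -block_size):
        state, where = cache.get(tokens[:end])
        if state is not None:
            reused, tier = end, where
            break
    # "Process" the rest and cache each new block-aligned prefix.
    for end in range(reused + block_size, n_full + 1, block_size):
        cache.put(tokens[:end], {"n_tokens": end})
    return len(tokens) - reused, tier


cache = TieredPrefixCache(tempfile.mkdtemp(), hot_capacity=2)
context = list(range(16))

first = prefill(cache, context)                       # nothing cached yet
second = prefill(cache, context + [99, 100])          # follow-up turn, same prefix
third = prefill(cache, context[:8] + [7, 7, 7, 7])    # older prefix, spilled to disk
print(first, second, third)
```

In this sketch the first request processes all 16 tokens, the follow-up processes only the 2 new ones, and the third reuses an 8-token prefix that had been evicted to disk. The same pattern, applied to real KV tensors, is what lets later agent turns skip most of the prefill work.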