I Tried Running AI Agents on My MacBook. MLX Was Too Slow. Then I Found oMLX | by Andrew Zhu | May, 2026 | GoPenAI

Raw MLX was too slow for local AI agents on MacBooks when using a 35B model with long contexts, especially during prefill.
On an M1 Max, an 8,700-token context took about 20.35 seconds with raw MLX but only 2.95 seconds with oMLX.
On an M4 Max, the same 8,700-token context dropped to about 1.01 second with oMLX.
oMLX did not materially improve generation speed, but generation was already fast at about 42 tok/s on M1 Max and 95 tok/s on M4 Max.
The biggest benefit of oMLX was prefill acceleration, improving about 5.1x on M1 Max and 5.7x on M4 Max.
oMLX uses a tiered KV cache that keeps context in memory or on SSD so later requests can reuse it instead of rereading the full context.
The server also adds continuous batching, an OpenAI-compatible API, and multi-model serving.
The author says these features make oMLX practical for tools like Cursor and Claude Code.
The conclusion is that oMLX makes an older M1 Max viable for real AI agent work, though denser models may still need newer Macs like an M4 Max or M5 Max.

Your notes

Save this item to your library to add private notes.