























I have an M4 Max with 128 GB of unified memory, and I thought it would be easy to reach decent inference speeds with it. After a few failed attempts to exceed about 150 t/s with completely custom Metal inference engines tailor-built by Claude, I'm stumped.
I'm not really sure how to make this hardware usable. I can only really afford DeepSeek levels of pricing right now, but DeepSeek is slow and I'm itching for something faster. Up until now I've had a $200 per month Claude subscription, and Claude has been great, but the recent revocation of Fable 5 has me worried about losing access to whatever hosted model I choose to rely on. I can't afford another month of Max 20x anyway, and the lower Claude plans aren't usable for me, so DeepSeek will be pretty much my only option once this subscription period lapses.
I want to figure out how to run something locally, but I don't want to end up with even slower speeds as a result. Here's what I've tried so far:
- Custom Qwen3-Coder-Next inference reaches about 120 t/s, outperforming llama.cpp Q4_0 (70.9 t/s) and MLX 4-bit (80.6 t/s), but that's still not really worth it
- Custom RWKV7-G1 inference reaches roughly 20,000 t/s prefill and 1000 t/s generation with the 0.1B model, then falls off hard with the larger models: 1.5B already drops all the way to 140 t/s generation, so I'm not even going to bother getting 13.3B numbers
- Custom Qwen3.6-35B inference reaches around 250 t/s prefill and 85 t/s generation at 4-bit quantization
Each one of these was aggressively optimized over many detailed profiling passes to maximize GPU usage, minimize latency, and eliminate dispatch overhead. (I started with Rust Burn, but eventually hit CubeCL's high latencies and moved to Swift + Metal.)
It feels like everything I try degrades to about the same level -- 80 to 120 t/s -- once I reach any usable number of active parameters. It feels like some sort of wall and it's really frustrating -- I don't have another $7000 to drop on a brand new M5 Max in order to get the performance I need, even assuming matrix multiplications are the bottleneck.
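One rough way to sanity-check whether this "wall" is a memory-bandwidth limit rather than a matmul limit is a back-of-envelope ceiling: if every active weight has to be streamed from memory once per generated token, decode speed cannot exceed bandwidth divided by bytes read per token. The sketch below is only an estimate under stated assumptions. The 546 GB/s figure is Apple's quoted peak for the top M4 Max configuration, not a measured value, and the active-parameter counts in the loop are illustrative placeholders, not numbers taken from the post. It also ignores KV cache reads, activations, unquantized layers, and dispatch overhead, all of which push real throughput below the ceiling.

```python
# Back-of-envelope decode ceiling for a memory-bandwidth-bound model.
# Assumption: each generated token streams every *active* weight from memory once.

def decode_ceiling_tps(bandwidth_gb_s: float, active_params_b: float,
                       bits_per_weight: float) -> float:
    """Upper bound on generated tokens/s given peak memory bandwidth."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


def fraction_of_ceiling(measured_tps: float, ceiling_tps: float) -> float:
    """How much of the theoretical ceiling a measured speed achieves."""
    return measured_tps / ceiling_tps


# Assumption: quoted peak bandwidth, not a measured sustained figure.
PEAK_BANDWIDTH_GB_S = 546.0

# Hypothetical active-parameter counts (in billions), for illustration only.
for label, active_b in [("3B active", 3.0), ("13.3B dense", 13.3), ("35B dense", 35.0)]:
    ceiling = decode_ceiling_tps(PEAK_BANDWIDTH_GB_S, active_b, bits_per_weight=4.0)
    print(f"{label:>12}: ceiling of about {ceiling:6.1f} t/s at 4-bit")
```

Comparing a measured number against this ceiling with `fraction_of_ceiling` gives a hint about where the time goes: a result far below the ceiling suggests overhead other than weight reads, while a result close to it suggests the only remaining levers are fewer active parameters or fewer bits per weight.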
Are there any competent models that could run at a usable speed on my hardware? I'm looking for at least 200 t/s while being able to reason and call tools. Cerebras offers gpt-oss-120b at over 1000 t/s, but it's so expensive and it also fails to call tools properly most of the time.