gguf · performance
Picking a GGUF quantization for mobile
Once you decide to ship a model on the device, the next decision lands in your lap fast: which file? Open up Hugging Face, search for a popular model like LLaMA 3 8B Instruct, and you’ll find ten or fifteen GGUF variants. Q2_K. Q3_K_S. Q4_0. Q4_K_M. Q5_K_M. Q6_K. Q8_0. F16. The naming is dense and the tradeoffs are not obvious until you’ve tried a few. This post is the short version of what those names mean, what they cost on a phone, and how llamafu fits into the picture.
What “quantization” actually does
Transformer weights are originally trained as 16-bit or 32-bit floating point numbers. That precision is necessary during training; it’s overkill during inference. Quantization is the process of compressing each weight to fewer bits — typically 4 or 5 or 8 — using clever schemes that preserve as much of the model’s behavior as possible.
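To make that concrete, here is a minimal Dart sketch of the simplest family of schemes: split the weights into small blocks, keep one floating-point scale per block, and store each weight as a small signed integer. It is in the spirit of Q8_0, but it is an illustration and not llama.cpp's actual code. The names QuantBlock, quantizeBlock, and dequantizeBlock are made up for this post.

```dart
import 'dart:math' as math;

/// One quantized block: a shared scale plus one small integer per weight.
class QuantBlock {
  final double scale;
  final List<int> values;
  QuantBlock(this.scale, this.values);
}

/// Quantizes a block of weights to the range -127..127 with one shared scale.
QuantBlock quantizeBlock(List<double> weights) {
  // The largest magnitude in the block decides how coarse the steps are.
  final maxAbs = weights.fold<double>(0.0, (m, w) => math.max(m, w.abs()));
  if (maxAbs == 0.0) {
    return QuantBlock(0.0, List<int>.filled(weights.length, 0));
  }
  final scale = maxAbs / 127.0;
  final values = [
    for (final w in weights) (w / scale).round().clamp(-127, 127).toInt(),
  ];
  return QuantBlock(scale, values);
}

/// Reconstructs approximate weights from a quantized block.
List<double> dequantizeBlock(QuantBlock block) =>
    [for (final q in block.values) q * block.scale];
```

Round-tripping a block through quantizeBlock and dequantizeBlock returns values within about half a scale step of the originals. Fewer bits per weight means a coarser step, which is where the quality loss comes from.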
The benefits are immediate and large:
- The model file shrinks dramatically. A 7B-parameter model in F16 is around 13 GB. The same model in Q4_K_M is around 4 GB.
- The RAM needed to hold the weights during inference drops by roughly the same factor.
- On many phones, lower-precision arithmetic is also faster.
The cost is some quality. A heavily quantized model will hallucinate more, follow instructions less precisely, and degrade on tasks that need fine-grained reasoning. The art is finding the right point on that curve for your device class and your feature.
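The size figures above are mostly arithmetic: parameter count times bits per weight, divided by eight. Here is a rough estimator, assuming commonly cited bits-per-weight values. Block formats carry a per-block scale, so Q8_0 works out to 8.5 bits per weight, not 8, and the Q4_K_M value is only approximate because that scheme mixes precisions across tensors. The names bitsPerWeight and estimateFileSizeGB are illustrative, not part of any library.

```dart
/// Approximate bits per weight for a few schemes (Q4_K_M is a rough average).
const Map<String, double> bitsPerWeight = {
  'F16': 16.0,
  'Q8_0': 8.5,
  'Q4_K_M': 4.85,
  'Q4_0': 4.5,
};

/// Estimates the file size in decimal gigabytes (1 GB = 1e9 bytes).
double estimateFileSizeGB(double parameterCount, String scheme) {
  final bits = bitsPerWeight[scheme];
  if (bits == null) {
    throw ArgumentError.value(scheme, 'scheme', 'unknown quantization scheme');
  }
  return parameterCount * bits / 8 / 1e9;
}
```

For 7e9 parameters this gives 14 GB for F16 (about 13 GiB) and roughly 4.2 GB for Q4_K_M. Real files differ a little because parameter counts are never exactly round and files carry some metadata.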
The GGUF naming convention, decoded
GGUF is the file format llama.cpp uses, and llamafu loads any GGUF file through the same mechanism. The quantization scheme is in the filename:
- F16 / F32 — full 16-bit or 32-bit precision. Reference quality. Generally too large for phones.
- Q8_0 — 8-bit, “0” variant. Nearly identical to F16 in behavior, but half the file size. The right choice when you have headroom and want maximum quality.
- Q6_K — 6-bit, “K” variant (a more advanced grouping). Very close to Q8 in quality at a noticeably smaller size.
- Q5_K_M — 5-bit, “K” grouping, “medium” mix. A good quality/size point on bigger devices.
- Q4_K_M — 4-bit, “K” grouping, “medium” mix. The mobile sweet spot.
- Q4_0 — 4-bit, simpler scheme. Slightly faster on some hardware, slightly worse quality than Q4_K_M.
- Q3_K_S / Q2_K — 3-bit and 2-bit. Smallest, fastest, lowest quality. Use only when the alternative is not shipping at all.
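Because the scheme sits in the filename, you can pull it out in code, for example to log which build a device ended up with. The sketch below assumes the file follows the usual community naming (a tag such as Q4_K_M immediately before the .gguf extension) and only covers the tags listed above. quantizationTag is a hypothetical helper, not a llamafu function.

```dart
/// Returns the quantization tag embedded in a GGUF filename, or null if
/// none of the patterns covered here is found.
String? quantizationTag(String filename) {
  final pattern = RegExp(
    // Q<digit>_K, Q<digit>_K_S/M/L, Q<digit>_<digit>, F16, or F32,
    // directly before the ".gguf" extension.
    r'(Q\d_(?:K(?:_[SML])?|\d)|F16|F32)\.gguf$',
    caseSensitive: false,
  );
  final match = pattern.firstMatch(filename);
  return match?.group(1)?.toUpperCase();
}
```

Filenames are only a convention, so treat a null result as "unknown" and do not fail on it.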
For mobile, the README is opinionated and you should listen to it:
Recommended quantizations for mobile: Q4_K_M (best quality/size), Q4_0 (fastest), Q8_0 (highest quality).
That’s the menu. Below is how to pick.
Pick by device class first
The single biggest constraint is RAM. The model has to fit, and the device needs enough additional RAM to run the rest of your app, the OS, and the KV-cache for your context window. As a rough guide for a 7B-class model:
- Q4_K_M (~4 GB) — fits comfortably on a modern Android flagship or a recent iPhone. The default recommendation.
- Q8_0 (~7 GB) — only on devices with 8GB+ of RAM, and you should still watch the headroom.
- Q4_0 (~3.8 GB) — similar footprint to Q4_K_M, slightly faster on some CPUs, slightly lower quality. Good for older mid-range Androids.
- Smaller models (Phi-3 Mini, TinyLlama, Gemma 2B) in Q4_K_M — roughly 2.5 GB or less. The right call when you need to support a wide range of devices.
The hard part is that you typically support a range of device tiers, not
one. That’s where llamafu makes life easier: the modelPath is just a
string. Nothing in your app code changes when you swap from a Q4_K_M build
to a Q8_0 build. You can ship multiple model files with the app, detect
device RAM at first launch, and pick.
final modelPath = await pickModelForDevice(); // your logic
final llamafu = await Llamafu.init(
  modelPath: modelPath,
  threads: 4,
  contextSize: 2048,
);
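What goes inside that selection logic is up to you. One possible shape, sketched under assumptions: list the bundled files from best quality to smallest, then take the first whose weights fit after setting aside RAM for the KV-cache and for the OS and the rest of the app. ModelCandidate and pickModelForRam are hypothetical names, the reserve is a value you would tune on real devices, and reading total device RAM is platform-specific and not shown here.

```dart
/// A bundled model file and its size on disk.
class ModelCandidate {
  final String path;
  final int fileBytes;
  const ModelCandidate(this.path, this.fileBytes);
}

/// Returns the first candidate that fits in the remaining RAM budget,
/// or null if none does. [candidates] must be ordered best quality first.
ModelCandidate? pickModelForRam({
  required List<ModelCandidate> candidates,
  required int totalRamBytes,
  required int kvCacheBytes,
  required int reservedBytes, // OS plus the rest of the app
}) {
  final budget = totalRamBytes - kvCacheBytes - reservedBytes;
  for (final candidate in candidates) {
    if (candidate.fileBytes <= budget) return candidate;
  }
  return null;
}
```

A null result is the signal to fall back to a smaller model or to turn the on-device feature off for that device.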
Pick by task second
Different tasks degrade differently under quantization. Some rules of thumb from the on-device community:
- Free-form chat and writing are surprisingly robust. Q4_K_M of a good 7B model is fine for casual chat features.
- Code generation is more sensitive. If you’re shipping a code feature, step up to Q5_K_M or Q8_0 on a smaller code model (Code LLaMA, DeepSeek Coder) rather than aggressively quantizing a larger general model.
- Structured output — the kind you’d produce with llamafu’s generateJson() or completeWithGrammar() — is also relatively robust, because the grammar constraints catch many quantization artifacts before they reach the user.
- Math and multi-step reasoning are the most sensitive. If your feature needs careful logic, either pick a model trained for it or accept that on-device may not be the right tier.
Pick by context window third
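The KV-cache scales linearly with the context size. With 16-bit cache entries, a 2048-token context on a grouped-query-attention model such as LLaMA 3 8B uses around 256 MB of additional RAM, and 8192 tokens uses about 1 GB. Older 7B models without grouped-query attention need several times more. On mobile you want to be conservative.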
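The KV-cache numbers can be estimated from the model's shape. The cache stores one key vector and one value vector per token, per layer, per key/value head. The sketch below assumes 16-bit cache entries by default. kvCacheBytes is an illustrative helper, not a llamafu or llama.cpp function, and the shape values for a given model come from its model card or GGUF metadata.

```dart
/// Estimates the KV-cache size in bytes for a given context length.
int kvCacheBytes({
  required int layers,
  required int kvHeads,
  required int headDim,
  required int contextTokens,
  int bytesPerElement = 2, // 16-bit cache entries
}) {
  // Factor of 2: one key vector and one value vector per token per layer.
  return 2 * layers * kvHeads * headDim * contextTokens * bytesPerElement;
}
```

With the LLaMA 3 8B shape (32 layers, 8 key/value heads, head dimension 128), 2048 tokens comes to 256 MiB and 8192 tokens to 1 GiB. A model with 32 key/value heads and otherwise the same shape needs four times as much.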
The llamafu performance guide recommends starting with contextSize: 2048
and increasing only if you need it. That’s good advice. If your feature is
“summarize this paragraph,” 1024 tokens is plenty. If your feature is “chat
with conversation history,” 4096 may be necessary. If your feature is “read
this 10-page document,” consider whether on-device is still the right tier
for that specific path.
A practical starting point
For most teams adding their first on-device LLM feature, here’s a reasonable starting configuration:
final llamafu = await Llamafu.init(
  modelPath: '/path/to/llama-3-8b-instruct.Q4_K_M.gguf',
  threads: 4,
  contextSize: 2048,
);
A 7B-class model in Q4_K_M, four threads, a 2048-token context. Measure first-token latency, tokens-per-second, memory usage, and battery on your device matrix. If you’re shipping to a wide range of hardware, layer in a fallback to a smaller model (Phi-3 Mini, Gemma 2B) for older devices.
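For the latency and throughput part of that measurement, a small helper keeps the numbers comparable across devices. This sketch assumes you can record a Stopwatch's elapsed time each time a token arrives. How you observe tokens depends on the generation API you use and is not shown here. GenerationStats and summarizeRun are hypothetical names.

```dart
/// First-token latency and decode throughput for one generation run.
class GenerationStats {
  final Duration firstTokenLatency;
  final double tokensPerSecond;
  const GenerationStats(this.firstTokenLatency, this.tokensPerSecond);
}

/// [arrivals] holds the elapsed time at which each token arrived, measured
/// from the moment the request was issued, in arrival order.
GenerationStats summarizeRun(List<Duration> arrivals) {
  if (arrivals.isEmpty) {
    throw ArgumentError('need at least one token arrival');
  }
  final first = arrivals.first;
  final decodeMicros = (arrivals.last - first).inMicroseconds;
  // Rate over the tokens after the first, so prompt processing time
  // does not dilute the decode speed.
  final rate = decodeMicros > 0
      ? (arrivals.length - 1) * 1e6 / decodeMicros
      : 0.0;
  return GenerationStats(first, rate);
}
```

Reporting the two numbers separately matters because first-token latency grows with prompt length, while tokens-per-second mostly reflects the quantization and the device.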
And remember: this is one of the few decisions in on-device inference that you can change without changing your app architecture. The GGUF file is a swap; the Dart code stays the same. Iterate.
For the full performance guide, see the docs.