llamafu / cognisoc

on-device · mobile

Why on-device LLMs are no longer a science project

There used to be a clean answer to “where does the model run?” The model ran in a data center. Your app made an HTTPS call, paid per token, and waited a few hundred milliseconds for an answer. That was the deal.

The deal has changed. Quantized open-weights models in the 1B–8B range now score well on the benchmarks that matter to product teams, the phones in people’s pockets have multi-core CPUs and competent GPUs, and llama.cpp has quietly become the most portable inference engine on the planet. The result is that running a real LLM directly on a phone is no longer a stunt. It is a shipping path.

llamafu is the Flutter side of that shipping path. It’s a Flutter FFI plugin, built on top of llama.cpp, that targets Android (API 21+) and iOS (12.0+), loads any GGUF-format model, and exposes a friendly Dart API for text generation, chat, embeddings, tool calling, structured JSON, multimodal input, and LoRA adapters. The whole thing runs in your app’s process. There is no server. There is no API key.

This post is the case for taking on-device inference seriously, framed around the four reasons mobile teams actually pick it.

1. Privacy is a feature, not a footnote

The most repeated request on the llamafu side is always the same: “we have data we can’t send anywhere.” Medical notes. Legal drafts. Personal journals. Financial transactions. Internal documents under an NDA. The list is long, and “we promise we won’t train on it” stopped being a sufficient answer some time ago.

On-device inference solves this at the architecture level. With llamafu, the user’s prompt never leaves the device. The model file lives in app storage, the tokenizer runs in your process, llama.cpp produces tokens locally, and the response comes back through a Dart Stream. There is no third-party inference provider to audit, no egress logs to scrub, no DPA to negotiate. That is a different conversation with security and legal than the cloud path, and it tends to be a much shorter one.

This matters even in apps that aren’t strictly regulated. “Your prompts stay on your phone” is a marketing line you can put on the App Store page and mean it. That kind of promise is rare and hard to fake.

2. Latency is not the same thing as throughput

A cloud LLM endpoint can be very fast on throughput, meaning tokens per second once the stream gets going. But the user doesn’t feel throughput; they feel the first response. And the first response on a cloud call includes DNS, TLS, queueing, prompt processing, and the round-trip from the device to wherever the nearest region is. On a flaky train ride or in a basement coffee shop, that budget evaporates.

A small quantized model running on the same phone has a different latency profile. There is no network hop. There is no queue. Once the model is loaded into RAM and the KV-cache is warm from the previous turn, the first token arrives as soon as llama.cpp has processed the new prompt tokens. For chat-style features, autocomplete, structured extraction, or any interaction where the user is staring at the screen, that difference is what separates “feels instant” from “feels broken.”

llamafu helps here in two ways. It exposes streaming, so you can render tokens as they arrive instead of waiting for the full completion. And it exposes the knobs — threads, contextSize, sampling parameters — that let you tune for first-token latency on your specific device target.
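Tuning for first-token latency is easier when you can measure it. The sketch below is plain Dart and makes no claims about llamafu's streaming API: `StreamTiming` and `measureStream` are hypothetical helpers, and `fakeTokens` stands in for whatever token stream your inference call returns.

```dart
import 'dart:async';

/// Timing for one streamed generation (hypothetical helper, not part of llamafu).
class StreamTiming {
  StreamTiming(this.firstToken, this.total, this.tokenCount);

  final Duration? firstToken; // null when the stream produced no tokens
  final Duration total;
  final int tokenCount;
}

/// Drains [tokens], recording when the first token arrived and how long
/// the whole stream took.
Future<StreamTiming> measureStream(Stream<String> tokens) async {
  final watch = Stopwatch()..start();
  Duration? first;
  var count = 0;
  await for (final _ in tokens) {
    first ??= watch.elapsed; // only set on the first token
    count++;
  }
  watch.stop();
  return StreamTiming(first, watch.elapsed, count);
}

/// Stand-in for a real token stream: three tokens, each after a short delay.
Stream<String> fakeTokens() async* {
  for (final t in ['On', '-device', ' inference']) {
    await Future<void>.delayed(const Duration(milliseconds: 20));
    yield t;
  }
}

Future<void> main() async {
  final timing = await measureStream(fakeTokens());
  print('first token: ${timing.firstToken?.inMilliseconds} ms, '
      'total: ${timing.total.inMilliseconds} ms, '
      'tokens: ${timing.tokenCount}');
}
```

Running the same measurement across a few `threads` and `contextSize` settings on your target devices gives you numbers to compare instead of impressions.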

3. Cost models that don’t punish success

Every cloud-based LLM feature has the same scaling problem. If the feature works, more users use it, every interaction costs you tokens, and your bill grows in lockstep with engagement. There are mitigations (caching, smaller models for cheap requests, rate limits), but none of them change the underlying shape of the curve.

On-device flips it. The expensive parts of inference (the GPU cycles, the RAM, the silicon) are paid for by the user, on hardware they already own, without any per-token markup. You pay for the bandwidth to ship the model file once, and that’s the bill.
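A little arithmetic makes the two cost shapes concrete. This is a rough sketch, not a pricing model: the function names are hypothetical, and every figure in `main` (requests per user, tokens per request, price per million tokens, model size, egress price) is a made-up placeholder to replace with your own numbers.

```dart
/// Recurring monthly cloud bill in dollars: users x requests x tokens.
double cloudMonthlyCost({
  required int activeUsers,
  required int requestsPerUserPerMonth,
  required int tokensPerRequest,
  required double dollarsPerMillionTokens,
}) {
  final tokens = activeUsers * requestsPerUserPerMonth * tokensPerRequest;
  return tokens / 1e6 * dollarsPerMillionTokens;
}

/// One-time cost in dollars of shipping the model file to each install.
double modelDeliveryCost({
  required int installs,
  required double modelSizeGb,
  required double dollarsPerGbEgress,
}) =>
    installs * modelSizeGb * dollarsPerGbEgress;

void main() {
  // All figures below are placeholders, not real prices.
  for (final users in [1000, 10000, 100000]) {
    final cloud = cloudMonthlyCost(
      activeUsers: users,
      requestsPerUserPerMonth: 300,
      tokensPerRequest: 500,
      dollarsPerMillionTokens: 1.0,
    );
    final delivery = modelDeliveryCost(
      installs: users,
      modelSizeGb: 2.0,
      dollarsPerGbEgress: 0.05,
    );
    print('$users users: cloud \$${cloud.toStringAsFixed(0)}/month, '
        'model delivery \$${delivery.toStringAsFixed(0)} once');
  }
}
```

The cloud figure recurs every month and rises with every extra request per user. The delivery figure is paid once per install and does not move when engagement goes up.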

That changes which features are even worth building. Always-on summarization of incoming notifications? Free with llamafu, prohibitively expensive in the cloud. A writing assistant that runs on every keystroke? Same answer. The set of viable product ideas grows substantially when the marginal token cost drops to zero.

4. Offline is the unsung killer feature

Cloud LLM features have a uniform failure mode: they don’t work when the user is offline. Phones go offline more than we like to admit — airplanes, subways, international travel without a roaming plan, rural areas, cafés with unreliable Wi-Fi, the back of warehouses, fieldwork.

llamafu doesn’t care. The model is on the device. The runtime is on the device. The inference is local. As long as the model file has already been downloaded and the app launches, the AI feature works. That is a meaningful UX improvement for the user, and it’s an operational improvement for you: there is no upstream incident that can take your feature down. llama.cpp doesn’t have a status page.

So what does this look like in code?

About as simple as you’d hope. After flutter pub add llamafu:

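// Assumed import path, following the usual package:<name>/<name>.dart convention.
import 'package:llamafu/llamafu.dart';

Future<void> runCompletion() async {
  // Load a GGUF model from local storage; nothing leaves the device.
  final llamafu = await Llamafu.init(
    modelPath: '/path/to/model.gguf',
    threads: 4, // CPU threads used for inference
    contextSize: 2048, // context window, in tokens
  );

  try {
    final result = await llamafu.complete(
      prompt: 'Explain quantum computing in simple terms:',
      maxTokens: 256,
      temperature: 0.7,
    );
    print(result);
  } finally {
    // Free the native memory even if generation throws.
    llamafu.close();
  }
}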

That’s the whole loop: load a GGUF file, call complete(), clean up. From there you can layer on chat history, structured JSON output, tool calling, LoRA adapters for task-specific tuning, embeddings for on-device search, and multimodal input for vision models like LLaVA and Qwen2-VL.

The catch

There is no free lunch. A 4-bit quantized 7B model is meaningfully less capable than a frontier cloud model. Loading it takes a few seconds. It uses a few gigabytes of RAM. Battery drain is real if you’re generating for minutes at a stretch. These are engineering constraints to design around, not deal-breakers, and llamafu tries to make them explicit: you choose the context size, you choose the quantization, you choose when to call close() and free the native memory.
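To see where “a few gigabytes” comes from, a back-of-the-envelope estimate helps. The sketch below is plain Dart with hypothetical helper names, and it assumes a generic 7B transformer (32 layers, 32 KV heads, head dimension 128), roughly 4.5 bits per weight for a 4-bit quantization including its scale factors, and a 16-bit KV-cache. It ignores compute buffers and other runtime overhead, so treat the result as a floor rather than a measurement.

```dart
/// Approximate size of the quantized weights in bytes.
double weightBytes({
  required double parameters,
  required double bitsPerWeight,
}) =>
    parameters * bitsPerWeight / 8;

/// Approximate KV-cache size in bytes: one key and one value vector
/// per layer, per KV head, per token position in the context.
int kvCacheBytes({
  required int layers,
  required int kvHeads,
  required int headDim,
  required int contextSize,
  int bytesPerElement = 2, // 16-bit cache entries (assumption)
}) =>
    2 * layers * kvHeads * headDim * contextSize * bytesPerElement;

double toGiB(num bytes) => bytes / (1024 * 1024 * 1024);

void main() {
  // Assumed shape for a generic 7B model; check your model's metadata.
  final weights = weightBytes(parameters: 7e9, bitsPerWeight: 4.5);
  final kv = kvCacheBytes(
    layers: 32,
    kvHeads: 32,
    headDim: 128,
    contextSize: 2048,
  );
  print('weights  ~ ${toGiB(weights).toStringAsFixed(2)} GiB');
  print('KV cache ~ ${toGiB(kv).toStringAsFixed(2)} GiB');
  print('total    ~ ${toGiB(weights + kv).toStringAsFixed(2)} GiB');
}
```

Under these assumptions the weights dominate, and the KV-cache grows linearly with `contextSize`, which is why a smaller model or a shorter context is often the first lever to pull on memory-constrained phones.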

The teams who get the most out of llamafu treat on-device as a different design space, not as “cloud minus the network.” They pick smaller models deliberately. They use grammar-constrained generation or structured JSON to get reliable outputs from less capable models. They stream tokens to make the interaction feel fast. They reuse the KV-cache between turns so that earlier prompt tokens are not reprocessed. And they accept that some features still belong in the cloud, and design for both worlds.

If that sounds like the kind of work your team is doing, the docs are a good next step.

