Flutter FFI plugin · Android & iOS

On-device LLMs for mobile apps.

Name: llamafu
Author: Cognisoc

Run AI models directly on mobile devices. No cloud. No network latency. Complete privacy. llamafu is a Flutter plugin built on llama.cpp that runs GGUF models locally on Android (API 21+) and iOS (12.0+).

Get started → Get on pub.dev

$ flutter pub add llamafu

✓ 100% offline ✓ GGUF native ✓ MIT-licensed ✓ No per-token bill

main.dart

import 'dart:io'; // for stdout

import 'package:llamafu/llamafu.dart';

Future<void> main() async {
  // Load a GGUF model straight from disk, with no server involved.
  final llm = await Llamafu.init(
    modelPath: '/models/phi-3-mini-q4_k_m.gguf',
    threads: 4,
    contextSize: 2048,
  );

  // Stream tokens as they are generated.
  await for (final token in llm.stream(
    prompt: 'Explain on-device AI:',
    maxTokens: 256,
  )) {
    stdout.write(token);
  }

  // Release the model when done.
  llm.close();
}

What is llamafu?

A Flutter plugin that runs LLMs on the phone itself.

llamafu is a Flutter FFI plugin built on llama.cpp for running GGUF large language models fully on-device — on Android (API 21+) and iOS (12.0+). You add one package, point it at a .gguf file, and call a typed Dart API. There is no server, no API key, and no network round-trip: inference happens on the user's device, so data never leaves it.

Runtime

llama.cpp linked through a thin C++ layer, called from Dart via dart:ffi.

Format

GGUF — the same models the whole llama.cpp ecosystem uses. Bring your own from Hugging Face.

Surface

Dart 3.1+ / Flutter 3.10+. One typed API for text, chat, embeddings, vision, and tools.

Why on-device

The problems llamafu solves

Cloud LLM APIs are the default — and for many mobile features they are exactly the wrong default. Here is what changes when inference moves onto the device.

Cloud latency

The pain

Every prompt is a round-trip to a data centre. Users feel the wait, and it gets worse on flaky mobile networks.

With llamafu

Inference runs on the device. Generation starts locally, with no network hop and no cold start on someone else’s GPU.

Data leaves the device

The pain

Sending user text, photos, or documents to a third-party API is a privacy and compliance liability you have to defend.

With llamafu

Nothing leaves the phone. Prompts, context, and outputs all stay on-device.

Needs a connection

The pain

Cloud AI simply stops working on a plane, in the field, in a hospital basement, or anywhere the signal drops.

With llamafu

llamafu works fully offline. Once the GGUF file ships or downloads, the feature runs with the radio off.
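If the model file is downloaded rather than bundled, it can help to sanity-check it before handing the path to the runtime. The sketch below is not part of llamafu; `looksLikeGguf` and `isGgufMagic` are hypothetical helper names. It relies only on the fact that a GGUF file begins with the four ASCII bytes "GGUF", and it checks the header only, not the integrity of the whole file.

```dart
import 'dart:io';

/// Returns true if [path] exists and starts with the 4-byte GGUF magic
/// ("GGUF" in ASCII). Only the header is checked, not the full file.
Future<bool> looksLikeGguf(String path) async {
  final file = File(path);
  if (!await file.exists()) return false;

  final raf = await file.open(); // opens in read mode by default
  try {
    final header = await raf.read(4);
    return isGgufMagic(header);
  } finally {
    await raf.close();
  }
}

/// Pure helper: compares the first four bytes against 'G','G','U','F'.
bool isGgufMagic(List<int> bytes) {
  const magic = [0x47, 0x47, 0x55, 0x46];
  if (bytes.length < 4) return false;
  for (var i = 0; i < 4; i++) {
    if (bytes[i] != magic[i]) return false;
  }
  return true;
}
```

A partial download that stopped before the first four bytes, or a file in another format, fails this check early instead of failing inside native code.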

Per-token bills

The pain

Usage-based API pricing means your costs scale with success. A viral feature becomes a runaway invoice.

With llamafu

On-device inference has no per-token charge. The compute is the user’s device, so your marginal cost per request is effectively zero.

Capabilities

A full llama.cpp surface, in idiomatic Dart

Text, chat, embeddings, vision, tools, and grammar-constrained output — the things you expect from a llama.cpp runtime, exposed as a typed Flutter API.

Core inference

Everything you need to generate text from a GGUF model on-device.

Streaming text generation

Generate token-by-token with a Dart stream, so your UI can render output as it arrives.

Chat with history

Multi-turn chat completions with role-tagged messages and a managed conversation context.

Embeddings

Produce embedding vectors for semantic search, RAG, and clustering — all locally.
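Once you have embedding vectors, semantic search comes down to comparing them. The following is a minimal sketch in plain Dart, independent of llamafu: `cosineSimilarity` and `topK` are illustrative names, and the vectors are assumed to come from whatever embedding call the plugin exposes.

```dart
import 'dart:math' as math;

/// Cosine similarity between two equal-length vectors.
/// Returns 0.0 if either vector has zero magnitude.
double cosineSimilarity(List<double> a, List<double> b) {
  if (a.length != b.length) {
    throw ArgumentError('Vectors must have the same length.');
  }
  var dot = 0.0, normA = 0.0, normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA == 0 || normB == 0) return 0.0;
  return dot / (math.sqrt(normA) * math.sqrt(normB));
}

/// Returns the indices of the [k] documents most similar to [query],
/// best match first.
List<int> topK(List<double> query, List<List<double>> docs, int k) {
  final scored = [
    for (var i = 0; i < docs.length; i++)
      (index: i, score: cosineSimilarity(query, docs[i])),
  ];
  scored.sort((x, y) => y.score.compareTo(x.score));
  return scored.take(k).map((e) => e.index).toList();
}
```

For a few thousand documents a linear scan like this is usually adequate on a phone; larger corpora may call for an approximate index.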

Tokenize / detokenize

Direct access to the model’s tokenizer for counting, truncation, and prompt budgeting.
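Prompt budgeting can be sketched as follows: count the tokens in each message, reserve room for the reply, and drop the oldest messages until the rest fits. This is an illustration in plain Dart, not llamafu's API; `TokenCounter` is a hypothetical callback that would wrap the model's tokenizer in a real app.

```dart
/// Signature for a token counter. In a real app this would wrap the
/// model's tokenizer; here it is a hypothetical callback.
typedef TokenCounter = int Function(String text);

/// Keeps the most recent messages whose combined token count fits in
/// [contextSize] minus [reservedForOutput]. Original order is preserved.
List<String> fitToBudget(
  List<String> messages,
  TokenCounter countTokens, {
  required int contextSize,
  required int reservedForOutput,
}) {
  final budget = contextSize - reservedForOutput;
  final kept = <String>[];
  var used = 0;
  // Walk from newest to oldest so recent turns win.
  for (final message in messages.reversed) {
    final cost = countTokens(message);
    if (used + cost > budget) break;
    kept.add(message);
    used += cost;
  }
  return kept.reversed.toList();
}
```

With the values from the first sample (a 2048-token context and up to 256 generated tokens), the history would be trimmed to at most 1792 tokens.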

Advanced capabilities

Structured, multimodal, and tool-augmented generation — on the device.

Vision / multimodal

Run vision-language models such as LLaVA and Qwen2-VL to reason over images on-device.

Tool & function calling

Let the model request tool calls, then feed the results back for agentic mobile workflows.
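The request-then-feed-back loop needs a dispatcher on the app side. The sketch below assumes the model emits a tool call as JSON of the form `{"tool": "<name>", "args": {...}}`; that wire format, and the names `Tool` and `dispatchToolCall`, are assumptions for illustration rather than llamafu's actual contract.

```dart
import 'dart:convert';

/// A tool takes decoded JSON arguments and returns a string result.
typedef Tool = String Function(Map<String, dynamic> args);

/// Parses a model-emitted tool call of the assumed form
/// {"tool": "<name>", "args": {...}} and runs the matching tool.
/// Failures come back as error strings, not exceptions, so the result
/// can be fed back to the model either way.
String dispatchToolCall(String modelOutput, Map<String, Tool> tools) {
  Object? decoded;
  try {
    decoded = jsonDecode(modelOutput);
  } on FormatException {
    return 'error: tool call was not valid JSON';
  }
  if (decoded is! Map<String, dynamic>) {
    return 'error: tool call must be a JSON object';
  }
  final name = decoded['tool'];
  final tool = tools[name];
  if (tool == null) return 'error: unknown tool "$name"';
  final args = decoded['args'];
  return tool(args is Map<String, dynamic> ? args : <String, dynamic>{});
}
```

Keeping the registry as an explicit map means the model can only reach functions the app has chosen to expose.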

Structured JSON output

Constrain generation to a JSON schema so you get parseable objects, not free-form prose.

GBNF grammar constraints

Enforce a formal grammar during decoding to guarantee syntactically valid output.
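For a sense of what a grammar looks like, here is a small GBNF grammar held in a Dart string. It follows llama.cpp's GBNF syntax and restricts output to a one-field JSON object with three allowed values. The constant name `sentimentGrammar` is illustrative, and how the string is passed to the plugin is not shown here; check the llamafu docs for the actual parameter.

```dart
/// A GBNF grammar that only admits objects such as
/// {"sentiment": "positive"}. A raw string keeps the backslashes
/// exactly as GBNF expects them.
const sentimentGrammar = r'''
root      ::= "{" ws "\"sentiment\":" ws sentiment ws "}"
sentiment ::= "\"positive\"" | "\"negative\"" | "\"neutral\""
ws        ::= [ \t\n]*
''';
```

Because the constraint is applied during decoding, the model cannot emit a fourth label or trailing commentary.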

Customization & control

Tune the runtime and the model to the device and the task.

LoRA adapters

Load and hot-swap LoRA adapters at runtime to specialise a base model without re-shipping it.

Sampling controls

Full control over temperature, top-k, top-p, and repetition penalties per request.
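To make these knobs concrete, the sketch below shows one common way temperature, top-k, and top-p interact when choosing the next token from raw logits. It is a simplified illustration in plain Dart, not the code llama.cpp or llamafu runs; repetition penalties are left out, the default values are illustrative, and `temperature` is assumed to be greater than zero.

```dart
import 'dart:math' as math;

/// Applies temperature, top-k, then top-p (nucleus) filtering to raw
/// [logits] and returns a map of token index -> renormalised probability.
/// Assumes [temperature] > 0 and a non-empty [logits] list.
Map<int, double> filterDistribution(
  List<double> logits, {
  double temperature = 0.8,
  int topK = 40,
  double topP = 0.95,
}) {
  // Temperature: divide logits before softmax; lower means sharper.
  final scaled = [for (final l in logits) l / temperature];
  final maxLogit = scaled.reduce(math.max);
  final exps = [for (final s in scaled) math.exp(s - maxLogit)];
  final sum = exps.reduce((a, b) => a + b);

  // Sort token indices by probability, highest first.
  final order = List<int>.generate(logits.length, (i) => i)
    ..sort((a, b) => exps[b].compareTo(exps[a]));

  // Top-k: keep at most k candidates. Top-p: stop once the kept
  // candidates cover at least topP of the probability mass.
  final kept = <int, double>{};
  var mass = 0.0;
  for (final i in order.take(topK)) {
    final p = exps[i] / sum;
    kept[i] = p;
    mass += p;
    if (mass >= topP) break;
  }

  // Renormalise so the kept probabilities sum to 1.
  return kept.map((i, p) => MapEntry(i, p / mass));
}

/// Draws one token index from a filtered distribution.
int sample(Map<int, double> dist, math.Random rng) {
  var r = rng.nextDouble();
  for (final entry in dist.entries) {
    r -= entry.value;
    if (r <= 0) return entry.key;
  }
  return dist.keys.last; // guard against floating-point drift
}
```

Setting `topK: 1` makes the choice greedy, which is often what you want for extraction tasks; higher temperature and top-p values trade determinism for variety.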

Context & threads

Configure context size and CPU thread count to match each device class and memory budget.
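One way to match the thread count to the device is to derive it from the reported core count. The heuristic below is an assumption, not a llamafu recommendation: it leaves a couple of cores for the UI and OS and caps the result, on the theory that extra threads on a phone's efficiency cores rarely help. `pickThreadCount` and `defaultThreadCount` are hypothetical names.

```dart
import 'dart:io';
import 'dart:math' as math;

/// Picks a CPU thread count: leave [reserve] cores for the UI and OS,
/// never go below 1, and cap at [maxThreads].
int pickThreadCount(int cores, {int reserve = 2, int maxThreads = 4}) {
  return math.min(maxThreads, math.max(1, cores - reserve));
}

/// Convenience wrapper using the core count reported by the platform.
int defaultThreadCount() => pickThreadCount(Platform.numberOfProcessors);
```

The result could be passed as the `threads` value when the model is initialised; measuring tokens per second on a few target devices is the only reliable way to tune it.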

GPU acceleration

Offload compute to the GPU where the device and backend support it — CPU stays the reliable baseline.

Explore all features →

Code showcase

The API is small. The output is structured.

Chat, structured JSON, and vision — the same object, different methods. Sample code reflects the plugin’s API shape; see the docs for the full reference.

chat.dart — streaming chat

// Multi-turn chat with managed history.
final chat = llm.chat(system: 'You are a terse assistant.');

await for (final token in chat.send(
  'Summarise this note in one line.',
)) {
  setState(() => reply += token);
}

extract.dart — structured JSON

// Constrain output to a JSON schema — parseable, not prose.
final result = await llm.generateJson(
  prompt: 'Extract the invoice fields.',
  schema: {
    'type': 'object',
    'properties': {
      'total': {'type': 'number'},
      'currency': {'type': 'string'},
    },
  },
);

print(result['total']); // e.g. 42.0

vision.dart — multimodal

// Reason over an image with a vision model (LLaVA / Qwen2-VL).
final answer = await llm.describeImage(
  imagePath: receipt.path,
  prompt: 'What store is this from?',
);

Models

Bring your own GGUF

llamafu loads any model in the GGUF format used by llama.cpp — thousands of them are already on Hugging Face. Pick a quantization that fits your device class.

GENERAL

Llama 3, Mistral, Phi-3, Qwen2, Gemma 2

CODE

Code Llama, DeepSeek Coder, StarCoder2

VISION

LLaVA, Qwen2-VL, Moondream

SMALL / FAST

Phi-3 Mini, TinyLlama, Gemma 2B

Recommended quantizations for mobile: Q4_K_M for balanced quality, Q4_0 for speed, Q8_0 when you have memory headroom. See the quantization guide.
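A rough size estimate helps when picking a quantization: weight memory is approximately parameter count times bits per weight, divided by eight. The sketch below encodes that arithmetic. The Q4_0 and Q8_0 figures follow from their block layout (32 weights plus a 16-bit scale), while the Q4_K_M figure is a commonly quoted approximation; treat all results as estimates that exclude the KV cache and runtime overhead. The names are illustrative.

```dart
/// Approximate bits per weight for a few llama.cpp quantization types.
/// Q4_0 and Q8_0 follow from their block layout (32 weights plus one
/// 16-bit scale); the Q4_K_M figure is an approximation.
const bitsPerWeight = <String, double>{
  'Q4_0': 4.5,
  'Q4_K_M': 4.85,
  'Q8_0': 8.5,
};

/// Estimated weight size in GiB for a model with [params] parameters.
/// Ignores the KV cache, activations, and file metadata.
double estimateWeightGiB(int params, String quant) {
  final bpw = bitsPerWeight[quant];
  if (bpw == null) throw ArgumentError('Unknown quantization: $quant');
  return params * bpw / 8 / (1024 * 1024 * 1024);
}
```

For example, `estimateWeightGiB(3800000000, 'Q4_K_M')` gives a ballpark for a model of roughly 3.8 billion parameters, which you can compare against the memory an app can realistically claim on your lowest-tier target device.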

Android: API 21+ (NDK 21+)

iOS: 12.0+ (Xcode 14+)

Model format: GGUF (llama.cpp)

License: MIT (open source)

Dart 3.1+ · Flutter 3.10+ · runtime backend: llama.cpp via dart:ffi

Ship it this sprint

Add an offline AI feature to your Flutter app.

Add llamafu, load a GGUF model, call stream(). No keys, no servers, no per-token bill.

Getting started → Get on pub.dev

Part of the Cognisoc stack

llamafu is Cognisoc’s Flutter runtime for on-device inference. Explore the rest of the stack.

Visit Cognisoc →
