llamafu / cognisoc

Flutter FFI plugin · Android & iOS

On-device LLMs for mobile apps.

Run AI models directly on mobile devices. No cloud. No network round-trips. Complete privacy. llamafu is a Flutter plugin built on llama.cpp that runs GGUF models locally on Android (API 21+) and iOS (12.0+).

$ flutter pub add llamafu
main.dart
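// Load a GGUF model from local storage.
final llamafu = await Llamafu.init(
  modelPath: '/path/to/model.gguf',
  threads: 4,        // CPU threads used for inference
  contextSize: 2048, // context window, in tokens
);

// Generate up to 256 tokens for the prompt.
final result = await llamafu.complete(
  prompt: 'Explain quantum computing:',
  maxTokens: 256,
);

// Release the model when you are done with it.
llamafu.close();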
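
Because the plugin goes through FFI, the loaded model is a resource that should be released even when generation throws. The block below is a minimal sketch of that pattern in plain Dart, not part of llamafu: `withResource` and `FakeModel` are hypothetical names, and `FakeModel` only stands in for a loaded model so the example runs without the plugin.

```dart
// Generic helper: opens a resource, runs `body`, and always calls `release`,
// even if `body` throws. `withResource` is a hypothetical name for this sketch.
Future<R> withResource<T, R>(
  Future<T> Function() open,
  Future<R> Function(T resource) body,
  void Function(T resource) release,
) async {
  final resource = await open();
  try {
    return await body(resource);
  } finally {
    release(resource);
  }
}

// Stand-in for a loaded model so the sketch runs without the plugin.
class FakeModel {
  bool closed = false;
  Future<String> complete(String prompt) async => 'echo: $prompt';
  void close() => closed = true;
}

Future<void> main() async {
  final model = FakeModel();
  final text = await withResource<FakeModel, String>(
    () async => model,
    (m) => m.complete('Explain quantum computing:'),
    (m) => m.close(),
  );
  print(text);         // echo: Explain quantum computing:
  print(model.closed); // true
}
```

With llamafu, `open` would wrap the `Llamafu.init(...)` call, `body` would call `complete(...)`, and `release` would call `close()`.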

Why on-device?

Four reasons mobile teams ship llamafu instead of calling a cloud API.

100% On-Device

No API keys, no network calls, works offline.

Privacy First

User data never leaves the device.

Low Latency

No round-trip to a cloud server.

Cost Effective

No per-token charges as you scale.

Core

  • Text generation with streaming
  • Chat completions with history
  • Embeddings for semantic search
  • Tokenize / detokenize APIs

Advanced

  • Vision / multimodal (LLaVA, Qwen2-VL)
  • Tool / function calling
  • Structured JSON output with a schema
  • GBNF grammar-constrained output

Customization

  • LoRA adapter load & hot-swap
  • Temperature, top-k, top-p, penalties
  • Configurable context size + threads
  • GPU acceleration where available
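
For readers new to the sampling knobs listed above, here is a rough sketch of what temperature, top-k, and top-p do to a model's raw scores (logits). It is an illustration in plain Dart, not llamafu's or llama.cpp's implementation; `filterLogits` is a hypothetical name, and the order in which a real sampler applies these steps may differ.

```dart
import 'dart:math';

/// Returns the renormalized probabilities of the tokens that survive
/// temperature scaling, top-k, and top-p filtering, keyed by token index.
Map<int, double> filterLogits(
  List<double> logits, {
  double temperature = 1.0,
  int topK = 40,
  double topP = 0.95,
}) {
  // Temperature: divide logits before softmax; lower values sharpen
  // the distribution, higher values flatten it.
  final scaled = logits.map((l) => l / temperature).toList();
  final maxLogit = scaled.reduce(max);
  final exps = scaled.map((l) => exp(l - maxLogit)).toList();
  final sum = exps.reduce((a, b) => a + b);

  // Sort token indices by probability, highest first.
  final order = List<int>.generate(logits.length, (i) => i)
    ..sort((a, b) => exps[b].compareTo(exps[a]));

  // Top-k: consider at most k candidates.
  // Top-p: stop once the cumulative probability reaches p.
  final kept = <int, double>{};
  var cumulative = 0.0;
  for (final i in order.take(topK)) {
    final p = exps[i] / sum;
    kept[i] = p;
    cumulative += p;
    if (cumulative >= topP) break;
  }

  // Renormalize so the surviving probabilities sum to 1.
  return kept.map((i, p) => MapEntry(i, p / cumulative));
}

void main() {
  // Three candidate tokens; only the two most likely survive top-k = 2.
  print(filterLogits([2.0, 1.0, 0.0], topK: 2, topP: 1.0));
}
```

A sampler would then draw the next token at random from the returned distribution. Tighter settings (lower temperature, smaller k or p) make output more deterministic; looser settings make it more varied.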

Bring your own GGUF

llamafu loads any model in the GGUF format used by llama.cpp. Pick a quantization that fits your device class.

GENERAL

Llama 3, Mistral, Phi-3, Qwen2, Gemma 2

CODE

Code Llama, DeepSeek Coder, StarCoder2

VISION

LLaVA, Qwen2-VL, Moondream

SMALL / FAST

Phi-3 Mini, TinyLlama, Gemma 2B

Recommended quantizations for mobile: Q4_K_M for a balance of quality and size, Q4_0 for speed, Q8_0 when you have memory headroom.
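
To judge whether a quantization fits a device, a back-of-the-envelope estimate of the weight size is parameters × bits per weight ÷ 8. The sketch below is an assumption-laden illustration, not a llamafu API: `estimateGiB` is a hypothetical helper, the Q4_0 and Q8_0 figures follow from their block layouts, the Q4_K_M figure is a rough average, and real files differ somewhat because some tensors are stored at other precisions. It also ignores the KV cache and other runtime memory.

```dart
// Approximate bits per weight for each quantization. Q4_0 and Q8_0 follow
// from their block layouts; the Q4_K_M figure is a rough average.
const bitsPerWeight = <String, double>{
  'Q4_0': 4.5,
  'Q4_K_M': 4.85,
  'Q8_0': 8.5,
};

/// Estimated size in GiB of the weights for a model with [params] parameters.
double estimateGiB(int params, String quant) {
  final bpw = bitsPerWeight[quant];
  if (bpw == null) throw ArgumentError('Unknown quantization: $quant');
  return params * bpw / 8 / (1024 * 1024 * 1024);
}

void main() {
  const params = 3800000000; // hypothetical ~3.8B-parameter model
  for (final q in bitsPerWeight.keys) {
    print('$q: ${estimateGiB(params, q).toStringAsFixed(2)} GiB');
  }
}
```

Compare the estimate against the memory your target devices can spare for the app, leaving room for the context window you configure.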

Ship an offline AI feature this sprint.

Add llamafu to your Flutter app, load a GGUF model, call complete(). No keys, no servers, no per-token bill.