About

A mobile-first wrapper for llama.cpp.

Name: llamafu
Author: Cognisoc

llamafu is a Flutter FFI plugin that lets mobile app developers run large language models entirely on the device. It is published on pub.dev as llamafu. Under the hood it wraps llama.cpp in a thin C++ layer that is exposed to Dart through FFI bindings.

What it is, concretely

A Flutter plugin (Dart 3.1+, Flutter 3.10+) targeting Android API 21+ and iOS 12.0+.
A native C++ layer (llamafu.cpp) that wraps llama.cpp with RAII and validation.
A Dart API (Llamafu.init, complete, generateJson, etc.) that calls into the native layer through dart:ffi.
An inference engine that loads models in the GGUF format used by llama.cpp.

What it is not

It is not a hosted inference API. There is no server component.
It is not a model. You bring your own GGUF file (LLaMA 3, Mistral, Phi-3, Qwen2, Gemma, LLaVA, etc.).
It is not a desktop or web SDK. The supported platforms are Android and iOS.
It is not ONNX, CoreML, or TFLite based. The runtime backend is llama.cpp.

Audience

llamafu is built for mobile app developers who want to add an on-device LLM feature without standing up a cloud inference stack. That covers privacy-first apps, products in regulated industries, offline-capable products, edge tools, and anyone who would rather not pay per token forever. See the use cases for concrete examples.

Architecture, at a glance

Your Flutter App
   |
Llamafu Dart API   (high-level, typed, async)
   |
FFI bindings        (dart:ffi <-> C bridge)
   |
Native C++ layer    (RAII, memory safety, validation)
   |
llama.cpp engine    (GGUF loading, inference)

For a deeper walk-through of the load → prompt → decode path, see the architecture page.

Capabilities

The plugin exposes the surface area you would expect from a llama.cpp-based runtime: streaming text generation, chat completions with history, embeddings for semantic search, vision/multimodal input (LLaVA, Qwen2-VL), tool calling, structured JSON output with a schema, GBNF grammar-constrained generation, LoRA adapter loading and hot-swapping, and fine-grained sampling controls (temperature, top-k, top-p, penalties). GPU acceleration is used where available. See the full feature list.
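
For readers new to the sampling controls named above, the following is a minimal conceptual sketch of how temperature, top-k, and top-p are commonly combined when picking the next token. It is written in Python for brevity and is not llamafu's Dart API or llama.cpp's sampler code. The function names (filter_and_normalize, sample) are hypothetical, and the order in which the filters run is an assumption that differs between runtimes.

```python
import math
import random


def filter_and_normalize(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: probability} after temperature, top-k, and top-p.

    Hypothetical helper for illustration only; the filter order used here
    (temperature, then top-k, then top-p) is an assumption.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature: divide logits before softmax. Values below 1 sharpen the
    # distribution, values above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank token ids from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        order = order[:top_k]

    # Top-p (nucleus): keep the smallest prefix of the ranked tokens whose
    # cumulative probability, measured among the top-k survivors, reaches top_p.
    kept_mass = sum(probs[i] for i in order)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / kept_mass
        if cumulative >= top_p:
            break

    # Renormalize the survivors so their probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample(logits, rng, **controls):
    """Draw one token id from the filtered distribution."""
    dist = filter_and_normalize(logits, **controls)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs draw the same token
    logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a four-token vocabulary
    print(filter_and_normalize(logits, temperature=0.8, top_k=3, top_p=0.9))
    print(sample(logits, rng, temperature=0.8, top_k=3, top_p=0.9))
```

Lower temperature and tighter top-k or top-p values make output more predictable, while looser values allow more varied text. Penalties are left out of the sketch to keep it short.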

Part of Cognisoc

llamafu is built by Cognisoc, whose focus is LLM inference across runtimes and platforms. It sits alongside sibling projects for .NET, native runtimes, and learning-oriented codebases.

License

MIT-licensed. The source lives at https://github.com/cognisoc/llamafu, and reference documentation is hosted at https://docs.cognisoc.com/llamafu/.