llamafu / cognisoc


Structured output on-device with grammars and schemas

There is a particular failure mode that anyone who has tried to use a local LLM for a real product knows by heart: you ask the model for JSON, the model produces something that is almost JSON, and your parser falls over on a trailing comma or a smart quote or a missing brace. You try a better prompt. You ask the model “please respond with only valid JSON, do not include any markdown.” You get back markdown. You try few-shot examples. Sometimes it works. Sometimes the model hallucinates a sixth field that does not exist. Sometimes the smart quotes come back.

Cloud models with strict “JSON mode” features can solve this for you. But those features are server-side: they wrap the sampler with a grammar constraint, and the grammar refuses to emit tokens that would lead to invalid JSON. There’s no reason that has to be a server feature. The grammar constraint runs at the level of the sampler, and llamafu exposes the same machinery on-device.

This post is about how to use it.

Two layers: grammars and JSON schemas

llamafu exposes two related APIs for constrained generation, and it’s worth understanding which one you want.

The lower-level API is GBNF grammar-constrained generation. You give the sampler a context-free grammar in GBNF (the BNF variant used by llama.cpp). At every sampling step, the grammar tells the sampler which tokens are still legal, and tokens that would violate the grammar get their probability driven to zero. As long as generation runs to completion rather than being cut off by a token limit, the result is guaranteed to match the grammar.
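To make the masking step concrete, here is a toy sketch in plain Dart. It is not llamafu's internal code: the four-token vocabulary and the `maskedProbabilities` helper are made up for illustration. It only shows the idea of sending illegal tokens to probability zero before sampling.

```dart
import 'dart:math';

/// Toy illustration of sampler-level masking (hypothetical helper, not
/// part of llamafu). Tokens the grammar disallows get a logit of negative
/// infinity, so the softmax assigns them probability zero.
List<double> maskedProbabilities(List<double> logits, Set<int> legalTokenIds) {
  if (!legalTokenIds.any((id) => id >= 0 && id < logits.length)) {
    throw ArgumentError('the grammar must allow at least one token');
  }
  final masked = <double>[
    for (var i = 0; i < logits.length; i++)
      legalTokenIds.contains(i) ? logits[i] : double.negativeInfinity,
  ];
  // Subtract the max before exponentiating for numerical stability.
  final maxLogit = masked.reduce(max);
  final exps = [for (final l in masked) exp(l - maxLogit)];
  final total = exps.reduce((a, b) => a + b);
  return [for (final e in exps) e / total];
}

void main() {
  // Made-up vocabulary: 0 = '{', 1 = 'Sure', 2 = '"', 3 = a markdown fence.
  final logits = [1.0, 3.0, 0.5, 2.0];
  // Suppose the grammar only allows '{' as the first token.
  print(maskedProbabilities(logits, {0})); // [1.0, 0.0, 0.0, 0.0]
}
```

Note that the model's favorite token here ('Sure', with the highest logit) simply cannot be sampled. No amount of prompt pleading gives you that.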

The higher-level API is structured JSON output. You give llamafu a JSON schema. Internally, llamafu converts the schema to a grammar and runs the same sampler-level constraint. The result is guaranteed to be valid JSON that matches your schema. You don’t have to think about BNF.
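If you are curious what "converts the schema to a grammar" can look like, here is a deliberately tiny sketch. It is an assumption-laden toy, not llamafu's converter: `flatSchemaToGbnf` is a hypothetical name, it only handles flat objects with string and integer properties, and it treats every property as required and emits them in declaration order.

```dart
/// Toy schema-to-GBNF converter (hypothetical; llamafu's real conversion
/// is internal and covers more of JSON schema than this).
String flatSchemaToGbnf(Map<String, dynamic> schema) {
  final props = schema['properties'] as Map<String, dynamic>;
  final members = <String>[];
  for (final entry in props.entries) {
    final type = (entry.value as Map<String, dynamic>)['type'];
    if (type != 'string' && type != 'integer') {
      throw UnsupportedError('type "$type" is not handled in this sketch');
    }
    // Emits a fragment such as:  "\"name\":" ws string
    members.add('"\\"${entry.key}\\":" ws $type');
  }
  final body = members.join(' "," ws ');
  return [
    'root ::= "{" ws $body ws "}"',
    r'string ::= "\"" [^"\\]* "\""',
    'integer ::= "-"? [0-9]+',
    r'ws ::= [ \t\n]*',
  ].join('\n');
}

void main() {
  print(flatSchemaToGbnf({
    'type': 'object',
    'properties': {
      'name': {'type': 'string'},
      'age': {'type': 'integer'},
    },
    'required': ['name', 'age'],
  }));
}
```

The point is only that a schema is a compact way of writing a grammar. Once you have the grammar, the constraint is the same sampler-level mechanism as before.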

For most product code, you want the schema API. The grammar API is the escape hatch for everything else: custom DSLs, specific code shapes, constrained natural language (“answer with only one of A, B, or C”).

The schema API, in practice

Here’s the example from the README, lightly annotated:

final result = await llamafu.generateJson(
  prompt: 'Extract: John is 25 years old',
  schema: {
    'type': 'object',
    'properties': {
      'name': {'type': 'string'},
      'age': {'type': 'integer'},
    },
    'required': ['name', 'age'],
  },
);

print(result);  // {"name": "John", "age": 25}

A few things are happening here that are worth naming. The model is being asked to extract structured data from a natural language sentence — a classic LLM task. The schema tells llamafu what shape the answer must take: an object with two specific fields, both required, one a string and one an integer. The sampler is constrained to only produce tokens that keep the output a valid instance of that schema. The result is, by construction, parseable JSON that matches.

This works on a quantized 7B model. It works on a quantized 3B model. It works on the kinds of small models you can comfortably ship in a mobile app, because the grammar constraint is doing the work that the cloud model would otherwise be doing through brute statistical force.

Why this is bigger than it looks

Once you can rely on structured output, the design space for on-device LLM features changes substantially.

You can use the model as an extraction layer: feed in a user’s free-form text, get back typed data your app can act on. That’s calendar event extraction, address parsing, intent classification, form auto-fill, and about a dozen other features that previously required either a separate NLU pipeline or a cloud API.

You can use it as a router: ask the model which of several actions a user’s request maps to, constrain the output to one of those literal strings, and skip the prompt-fragility of “did the model say ‘send_email’ or ‘Send Email’ this time?”
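For the router case, the grammar is small enough to generate from your action list. The sketch below builds a GBNF alternation of literal strings; `choiceGrammar` and the action names are hypothetical, and the result would be passed to the grammar API as its grammar string with `root` as the root production.

```dart
/// Builds a GBNF grammar whose only legal outputs are the given literals.
/// Hypothetical helper for illustration; not part of llamafu.
String choiceGrammar(List<String> choices) {
  if (choices.isEmpty) {
    throw ArgumentError('need at least one choice');
  }
  final alternatives = choices.map((choice) {
    // Escape backslashes and double quotes so each choice is a valid
    // GBNF string literal.
    final escaped = choice.replaceAll('\\', '\\\\').replaceAll('"', '\\"');
    return '"$escaped"';
  }).join(' | ');
  return 'root ::= $alternatives';
}

void main() {
  print(choiceGrammar(['send_email', 'create_event', 'set_reminder']));
  // root ::= "send_email" | "create_event" | "set_reminder"
}
```

Because the output can only be one of those exact strings, the code that consumes it can be a plain `switch` with no normalization step.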

You can use it as a tool-call planner. That’s what the tool-calling API does — and it’s the next layer up.

Tool calling builds on the same foundation

llamafu’s generateToolCall() is structured output with the schema prefilled to “this is a function call.” You give it a list of tools, each with a name, description, and JSON-schema parameters:

final weatherTool = Tool(
  name: 'get_weather',
  description: 'Get weather for a location',
  parameters: {
    'type': 'object',
    'properties': {
      'location': {'type': 'string'},
    },
    'required': ['location'],
  },
);

final toolCall = await llamafu.generateToolCall(
  prompt: "What's the weather in Paris?",
  tools: [weatherTool],
);

print(toolCall.name);       // "get_weather"
print(toolCall.arguments);  // {"location": "Paris"}

The sampler is constrained to produce a JSON object that conforms to one of the tool schemas. The result is a typed object you can dispatch on. This is the same pattern that powers modern AI agents in the cloud, and it runs locally, on the same small models, with the same structural guarantee as the schema API: the call always parses and always names a tool you declared.
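Dispatching on the result can be as simple as a map from tool name to handler. The sketch below is independent of llamafu: it takes a name and an arguments map as plain values, on the assumption that `toolCall.arguments` is (or can be decoded into) a `Map<String, dynamic>`. The handler body is a placeholder.

```dart
typedef ToolHandler = String Function(Map<String, dynamic> args);

// Hypothetical handlers; real ones would call into your own app code.
final Map<String, ToolHandler> handlers = {
  'get_weather': (args) => 'Weather lookup for ${args['location']}',
};

/// Looks up the handler for a tool call and runs it.
String dispatch(String name, Map<String, dynamic> arguments) {
  final handler = handlers[name];
  if (handler == null) {
    // Should not happen when the name is grammar-constrained to the
    // declared tools, but failing loudly is cheaper than guessing.
    throw ArgumentError('No handler registered for tool "$name"');
  }
  return handler(arguments);
}

void main() {
  // In an app these values would come from toolCall.name and
  // toolCall.arguments.
  print(dispatch('get_weather', {'location': 'Paris'}));
  // Weather lookup for Paris
}
```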

When to drop down to a raw grammar

A few cases where the GBNF API is worth the trouble:

  • You’re generating a small DSL — a SQL fragment, a regex, a config snippet — and JSON schema isn’t expressive enough.
  • You’re generating constrained natural language (“a single sentence ending in a question mark”) and want to encode that directly.
  • You want to enforce ordering constraints that are awkward in JSON schema (“first the reasoning, then the answer”).

The shape of that code looks like the README example:

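// A deliberately tiny JSON-like grammar: one object with a single key/value pair.
// The doubled backslashes are Dart escapes; the grammar itself sees \" (a quote).
const jsonGrammar = '''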
root ::= object
object ::= "{" ws string ":" ws value "}" ws
string ::= "\\"" [a-zA-Z]+ "\\""
value ::= string | number
number ::= [0-9]+
ws ::= [ ]*
''';

final result = await llamafu.completeWithGrammar(
  prompt: 'Generate user data:',
  grammarStr: jsonGrammar,
  grammarRoot: 'root',
  maxTokens: 100,
);

You hand-write (or generate) a grammar, you point llamafu at the root production, and the sampler does the rest.
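As an example of the ordering case, here is a sketch of a "first the reasoning, then the answer" grammar, written as a Dart raw string so the GBNF escapes pass through untouched. The rule names and the `Reasoning:` / `Answer:` labels are my own choices for illustration. The regular expression next to it is a hand-written mirror of the grammar, handy for unit-testing whatever code consumes the output.

```dart
// Ordering constraint as GBNF: one line of reasoning, then one of A, B, C.
// The raw string (r'''...''') keeps \n as a GBNF escape rather than a
// Dart newline.
const reasoningThenAnswer = r'''
root ::= "Reasoning: " sentence "\n" "Answer: " choice
sentence ::= [^\n]+
choice ::= "A" | "B" | "C"
''';

// Hand-written mirror of the grammar above, for tests on the consuming side.
final outputShape = RegExp(r'^Reasoning: [^\n]+\nAnswer: [ABC]$');

void main() {
  print(outputShape.hasMatch('Reasoning: 2 + 2 is 4.\nAnswer: B')); // true
  print(outputShape.hasMatch('Answer: B'));                         // false
}
```

The grammar string would be passed the same way as `jsonGrammar` above, with `root` as the root production.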

A note on quality

Grammar constraints guarantee syntactic correctness. They do not guarantee semantic correctness. A model can still produce {"name": "John", "age": -7} — valid JSON, valid schema, nonsense answer. You still need to validate the values your app cares about, just like you would with input from any other source. What you no longer have to worry about is whether the model is going to wrap its response in a markdown fence today.
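A value check on top of the constrained output might look like the sketch below. It assumes the result is available as a JSON string; the `Person` class, the `parsePerson` helper, and the age bounds are hypothetical app-level rules, not anything llamafu provides.

```dart
import 'dart:convert';

class Person {
  final String name;
  final int age;
  Person(this.name, this.age);
}

/// Parses schema-constrained output and applies the value checks that a
/// grammar cannot express. The bounds are illustrative app rules.
Person parsePerson(String rawJson) {
  final data = jsonDecode(rawJson) as Map<String, dynamic>;
  final name = (data['name'] as String).trim();
  final age = data['age'] as int;
  if (name.isEmpty) {
    throw FormatException('name must not be empty');
  }
  if (age < 0 || age > 150) {
    throw FormatException('age out of range: $age');
  }
  return Person(name, age);
}

void main() {
  final ok = parsePerson('{"name": "John", "age": 25}');
  print('${ok.name}, ${ok.age}'); // John, 25

  try {
    parsePerson('{"name": "John", "age": -7}');
  } on FormatException catch (e) {
    print(e.message); // age out of range: -7
  }
}
```

The casts here are safe precisely because the schema constraint already guarantees the field names and types; only the ranges need checking.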

That alone is enough to make grammar-constrained generation one of the most underrated features in llamafu’s API. If you’re shipping an on-device LLM feature and your output schema matters, start there.

The docs have the full grammar guide and a list of supported JSON-schema features.

