Nous.Providers.LlamaCpp (nous v0.17.0)

LlamaCpp NIF-based provider for local LLM inference.

Runs GGUF models directly in-process via llama_cpp_ex NIF bindings. No HTTP server needed.

Requires the optional dependency {:llama_cpp_ex, "~> 0.6.5"}.
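As a rough illustration, the dependency could be declared in mix.exs as sketched here. The application name MyApp / :my_app and its version are hypothetical, and the nous requirement is simply taken from the version in this page's title; only the llama_cpp_ex requirement comes from the text above.

```elixir
# mix.exs (sketch; MyApp and :my_app are hypothetical names)
defmodule MyApp.MixProject do
  use Mix.Project

  def project do
    [
      app: :my_app,
      version: "0.1.0",
      deps: deps()
    ]
  end

  defp deps do
    [
      # Assumed from this page's title; adjust to the release you actually use.
      {:nous, "~> 0.17.0"},
      # Optional dependency required by Nous.Providers.LlamaCpp.
      {:llama_cpp_ex, "~> 0.6.5"}
    ]
  end
end
```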

Usage

# Load model once at app start
:ok = LlamaCppEx.init()
{:ok, llm} = LlamaCppEx.load_model("model.gguf", n_gpu_layers: -1)

# Use with Nous
agent = Nous.new("llamacpp:local",
  llamacpp_model: llm,
  instructions: "You are helpful."
)

{:ok, result} = Nous.run(agent, "What is Elixir?")

Configuration

The loaded model reference must be passed as the llamacpp_model option when creating the model or agent; it is then stored in default_settings.

No API key or base URL is needed since inference runs locally via NIFs.

Settings Mapping

Nous settings are mapped to LlamaCppEx options:

Nous Setting	LlamaCppEx Option	Description
`:temperature`	`:temp`	Sampling temperature
`:max_tokens`	`:max_tokens`	Maximum tokens to generate
`:top_p`	`:top_p`	Nucleus sampling
`:json_schema`	`:json_schema`	Constrained JSON output
`:enable_thinking`	`:enable_thinking`	Enable/disable thinking tokens
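
The table above amounts to a key translation in which only :temperature is renamed. The following is a minimal sketch of that translation, not the provider's actual code: the module SettingsMappingSketch and the function to_llama_opts/1 are hypothetical, and dropping unknown keys and nil values is an assumption made for the example.

```elixir
defmodule SettingsMappingSketch do
  # Nous setting => LlamaCppEx option, copied from the table above.
  @key_map %{
    temperature: :temp,
    max_tokens: :max_tokens,
    top_p: :top_p,
    json_schema: :json_schema,
    enable_thinking: :enable_thinking
  }

  # Translates a map of Nous settings into a keyword list of options.
  # Assumption: keys missing from the table and nil values are skipped.
  def to_llama_opts(settings) when is_map(settings) do
    settings
    |> Enum.flat_map(fn {key, value} ->
      case Map.fetch(@key_map, key) do
        {:ok, llama_key} when not is_nil(value) -> [{llama_key, value}]
        _ -> []
      end
    end)
    # Sorted only to make the result deterministic.
    |> Enum.sort()
  end
end
```

For example, SettingsMappingSketch.to_llama_opts(%{temperature: 0.7, max_tokens: 256}) returns [max_tokens: 256, temp: 0.7].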

Thinking Models

Models like Qwen3 emit <think>...</think> tags by default. To disable thinking, set enable_thinking to false in model_settings:

agent = Nous.new("llamacpp:local",
  llamacpp_model: llm,
  model_settings: %{enable_thinking: false}
)

Or via generate_text:

{:ok, text} = Nous.generate_text("llamacpp:local", "Hello",
  llamacpp_model: llm,
  enable_thinking: false
)
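
If thinking is left enabled, the reasoning can be separated from the answer after generation. The sketch below assumes the raw <think>...</think> tags appear verbatim in the returned text, which this page does not state; the module ThinkTagSketch and its split/1 function are hypothetical helpers, not part of Nous or LlamaCppEx.

```elixir
defmodule ThinkTagSketch do
  # Matches one <think>...</think> block; the `s` flag lets `.` span newlines.
  defp think_block, do: ~r/<think>(.*?)<\/think>/s

  # Returns {thoughts, answer}: the trimmed contents of every think block,
  # and the remaining text with those blocks removed.
  def split(text) when is_binary(text) do
    thoughts =
      think_block()
      |> Regex.scan(text, capture: :all_but_first)
      |> Enum.map(fn [inner] -> String.trim(inner) end)

    answer =
      think_block()
      |> Regex.replace(text, "")
      |> String.trim()

    {thoughts, answer}
  end
end
```

Text without any think block passes through unchanged apart from trimming, so the helper can be applied regardless of the enable_thinking setting.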

Summary

Functions

api_key(opts \\ [])

Get the API key from options, environment, or application config.

base_url(opts \\ [])

Get the base URL from options, application config, or default.

count_tokens(messages)

Count tokens in messages (rough estimate).

request(model, messages, settings)

High-level request with message conversion, telemetry, and error wrapping.

request_stream(model, messages, settings)

High-level streaming request with message conversion and telemetry.