LLM

This section describes how to use the LlmWrapper class to integrate a native LLM (Large Language Model) library into your Android project. LlmWrapper provides a coroutine-friendly, builder-pattern-based Kotlin API for interacting with the native LLM backend via JNI.
It supports text generation (sync and streaming), embedding, key-value cache, LoRA/model management, sampling configuration, chat templates, and robust error handling with idiomatic Kotlin Result wrappers.
All heavy operations are dispatched via the provided CoroutineDispatcher (defaults to Dispatchers.IO).

Initialize

Use LlmWrapper.builder() to configure model path, tokenizer, device, model config, and dispatcher.
Create the wrapper with build(), a suspend function that must be called from a coroutine.
kotlin
val wrapperResult = LlmWrapper.builder()
  .modelPath("/path/to/model.gguf")
  .tokenizerPath("/path/to/tokenizer.model")
  .device("cpu")
  .modelConfig(ModelConfig(/* custom config values */))
  .dispatcher(Dispatchers.IO)
  .build()  // suspend function, must be called in a coroutine

val wrapper = wrapperResult.getOrNull() ?: return
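
Because build() suspends, it must be called from a coroutine. For example, from an Android lifecycleScope (any CoroutineScope works; lifecycleScope here assumes androidx.lifecycle is on the classpath):
kotlin
lifecycleScope.launch {
  val wrapper = LlmWrapper.builder()
    .modelPath("/path/to/model.gguf")
    .build()
    .getOrNull() ?: return@launch
  // wrapper is ready to use here
}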

Text Generation

Synchronous text generation

kotlin
val result = wrapper.generate("Hello, Android!", GenerationConfig())
result.onSuccess { println(it) }
result.onFailure { e -> println("Error: $e") }

Streaming generation

kotlin
wrapper.generateStreamFlow("Stream prompt...", GenerationConfig())
  .collect { result ->
    result.onSuccess { token -> print(token) }
    result.onFailure { e -> println("Stream error: $e") }
  }
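
To accumulate the streamed tokens into a complete response, collect into a StringBuilder (standard Flow usage):
kotlin
val response = StringBuilder()
wrapper.generateStreamFlow("Stream prompt...", GenerationConfig())
  .collect { result ->
    result.onSuccess { token -> response.append(token) }
    result.onFailure { e -> println("Stream error: $e") }
  }
println("Full response: $response")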

Embedding

kotlin
val embeddingResult = wrapper.embed(arrayOf("text for embedding"))

KV Cache

kotlin
wrapper.saveKvCache("/tmp/cache.bin")
wrapper.loadKvCache("/tmp/cache.bin")
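
Saving and restoring the KV cache lets you resume a long conversation without re-processing the prompt. Assuming these calls return Result like the rest of the API (an assumption, not shown above), failures can be handled the same way:
kotlin
// Assumes saveKvCache/loadKvCache return Result like the other wrapper methods.
wrapper.saveKvCache("/tmp/cache.bin").onFailure { e -> println("Save failed: $e") }
// ...later, restore the cached context instead of re-processing the prompt:
wrapper.loadKvCache("/tmp/cache.bin").onFailure { e -> println("Load failed: $e") }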

LoRA Management

kotlin
wrapper.addLora("/path/to/lora.bin")  // register a LoRA adapter
wrapper.setLora(1)                    // activate the adapter with id 1
wrapper.removeLora(1)                 // remove the adapter with id 1

Sampler Settings

kotlin
wrapper.setSampler(SamplerConfig(...))
wrapper.resetSampler()
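
SamplerConfig's fields are not listed in this document; the following is a hypothetical example only, assuming common sampling knobs such as temperature and top-k exist under these names:
kotlin
// Hypothetical field names for illustration; check SamplerConfig for the real ones.
wrapper.setSampler(SamplerConfig(temperature = 0.7f, topK = 40))
wrapper.resetSampler()  // restore the default sampling settings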

Chat Template

kotlin
val result = wrapper.applyChatTemplate(arrayOf(ChatMessage(...)))
result.onSuccess { prompt ->
  println("Chat prompt: $prompt")
}
result.onFailure { error ->
  println("Failed to apply template: $error")
}
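
A typical flow is to render the prompt from chat messages, then generate from it. ChatMessage's constructor is not shown in this document; the sketch below assumes role/content parameters:
kotlin
val messages = arrayOf(
  ChatMessage(role = "system", content = "You are a helpful assistant."),  // assumed constructor
  ChatMessage(role = "user", content = "Hello, Android!")
)
wrapper.applyChatTemplate(messages).onSuccess { prompt ->
  wrapper.generate(prompt, GenerationConfig()).onSuccess { println(it) }
}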

Profiling Support

getProfilingData retrieves runtime profiling statistics for the current model/session.
It fills the provided ProfilingData object with metrics such as execution time, memory usage, and other performance statistics, depending on the native backend implementation.
kotlin
val profilingData = ProfilingData()
val status = wrapper.getProfilingData(handle, profilingData)
if (status == 0) {
  println("Profiling info: $profilingData")
} else {
  println("Failed to get profiling data")
}

Resource Cleanup

Always call wrapper.close(), or wrap usage in use { ... }, to release native resources:
kotlin
wrapper.close()
// or
wrapper.use { /* ... */ }

VLM

This section describes how to use the VlmWrapper class to integrate a native Vision-Language Model (VLM) into your Android project. VlmWrapper is a coroutine-friendly, builder-pattern-based Kotlin wrapper for multimodal (vision-language) models, accessed via JNI.
It supports text and multimodal generation (sync and streaming), embedding, tokenization, chat template, and sampler control, with robust error handling via Kotlin Result wrappers.
All operations are dispatched via the specified CoroutineDispatcher (default: Dispatchers.IO).

Initialize

Use VlmWrapper.builder() to configure model path, mmproj path, context length, device, and dispatcher.
Create the wrapper with build() (a suspend function).
kotlin
val vlmResult = VlmWrapper.builder()
  .modelPath("/path/to/vlm-model.gguf")
  .mmprojPath("/path/to/mmproj.bin")
  .ctxLen(4096)
  .device("cpu")
  .dispatcher(Dispatchers.IO)
  .build()  // suspend, call in coroutine

val vlm = vlmResult.getOrNull() ?: return

Text Generation

Synchronous text generation

kotlin
val result = vlm.generate("Describe this image:", GenerationConfig())
result.onSuccess { println(it) }
result.onFailure { e -> println("Error: $e") }
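
For multimodal generation, pass image paths through GenerationConfig (see the GenerationConfig reference below):
kotlin
val config = GenerationConfig(
  maxTokens = 128,
  imagePaths = arrayOf("/path/to/image.jpg"),
  imageCount = 1
)
val visionResult = vlm.generate("Describe this image:", config)
visionResult.onSuccess { println(it) }
visionResult.onFailure { e -> println("Error: $e") }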

Streaming generation

kotlin
vlm.generateStreamFlow("Stream prompt...", GenerationConfig())
  .collect { result ->
    result.onSuccess { token -> print(token) }
    result.onFailure { e -> println("Stream error: $e") }
  }

Embedding/Tokenization

Embedding

kotlin
val embeddingResult = vlm.embed(arrayOf("text for embedding"))

Encode/Decode

kotlin
val tokenBuffer = IntArray(512)                         // output buffer for encoded token ids
val encodeResult = vlm.encode("your text", tokenBuffer)
val decodeResult = vlm.decode(intArrayOf(1, 2, 3), 3)   // decode 3 token ids back to text

Chat Template

Get or apply a chat template:
kotlin
val templateResult = vlm.getChatTemplate("your_template_name")
val applyResult = vlm.applyChatTemplate(arrayOf(ChatMessage(...)))
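
Both calls return Result, so they compose with the usual onSuccess/onFailure handling; for example, feeding the rendered prompt back into generation:
kotlin
templateResult.onSuccess { template -> println("Template: $template") }
applyResult.onSuccess { prompt ->
  // generate from the rendered chat prompt
  vlm.generate(prompt, GenerationConfig())
}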

Sampler Control

Set or reset sampler:
kotlin
vlm.setSampler(SamplerConfig(...))
vlm.resetSampler()

Profiling Support

getProfilingData retrieves runtime profiling statistics for the current model/session.
It fills the provided ProfilingData object with metrics such as execution time, memory usage, and other performance statistics, depending on the native backend implementation.
kotlin
val profilingData = ProfilingData()
val status = vlm.getProfilingData(handle, profilingData)
if (status == 0) {
  println("Profiling info: $profilingData")
} else {
  println("Failed to get profiling data")
}

Resource Cleanup

Always call vlm.close(), or wrap usage in use { ... }, to release native resources:
kotlin
vlm.reset()  // optionally reset internal state before closing
vlm.close()
// or
vlm.use { /* ... */ }

Embedder

This section describes how to use the EmbedderWrapper class for embedding tasks in your Android project. EmbedderWrapper is a coroutine-friendly Kotlin wrapper around the JNI-based native Embedder library, providing fast, configurable vector embedding on Android.
It features a Builder pattern for easy configuration, robust error handling with Result types, LoRA management, and safe resource cleanup via Closeable.

Initialize

Use EmbedderWrapper.builder() to configure model path, tokenizer, device, and dispatcher.
Create the wrapper with build() (a suspend function).
kotlin
val embedderResult = EmbedderWrapper.builder()
  .modelPath("/path/to/embedding-model.gguf")
  .tokenizerPath("/path/to/tokenizer.model")
  .device("cpu")
  .dispatcher(Dispatchers.IO)
  .build()  // suspend function, call in coroutine

val embedder = embedderResult.getOrNull() ?: return

Embedding

Synchronous embedding

kotlin
val result = embedder.embed(arrayOf("your text here"), EmbeddingConfig())
result.onSuccess { embedding ->
  // use embedding vector
}
result.onFailure { e ->
  println("Embedding failed: $e")
}
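
A common use of embeddings is semantic similarity. This sketch assumes embed returns one FloatArray per input string (the exact return type depends on the binding):
kotlin
import kotlin.math.sqrt

// Cosine similarity between two embedding vectors.
fun cosine(a: FloatArray, b: FloatArray): Double {
  var dot = 0.0; var na = 0.0; var nb = 0.0
  for (i in a.indices) {
    dot += a[i] * b[i]
    na += a[i] * a[i]
    nb += b[i] * b[i]
  }
  return dot / (sqrt(na) * sqrt(nb))
}

embedder.embed(arrayOf("first text", "second text"), EmbeddingConfig()).onSuccess { vectors ->
  println("Similarity: ${cosine(vectors[0], vectors[1])}")  // assumes Array<FloatArray> result
}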

Get embedding dimension

kotlin
val dimResult = embedder.embeddingDim()
dimResult.onSuccess { dim -> println("Embedding dim: $dim") }

LoRA Management

Set, add, remove, or list LoRA modules:
kotlin
embedder.setLora(loraId)
embedder.addLora("/path/to/lora.bin")
embedder.removeLora(loraId)
embedder.listLoras().onSuccess { ids -> println("All LoRAs: ${ids.joinToString()}") }

Profiling Support

getProfilingData retrieves runtime profiling statistics for the current model/session.
It fills the provided ProfilingData object with metrics such as execution time, memory usage, and other performance statistics, depending on the native backend implementation.
kotlin
val profilingData = ProfilingData()
val status = embedder.getProfilingData(handle, profilingData)
if (status == 0) {
  println("Profiling info: $profilingData")
} else {
  println("Failed to get profiling data")
}

Resource Cleanup

Always call embedder.close(), or wrap usage in use { ... }, to release native resources:
kotlin
embedder.close()
// or
embedder.use { /* ... */ }

Reranker

This section describes how to use the RerankerWrapper class for running native reranking models via JNI in your Android project. RerankerWrapper is a coroutine-friendly, builder-pattern-based Kotlin wrapper for JNI-based reranker models.
It supports batch reranking of document arrays, robust error handling with Kotlin Result wrappers, and safe resource management via Closeable.
All compute-intensive operations are dispatched via the provided CoroutineDispatcher (default: Dispatchers.IO).

Initialize

Use RerankerWrapper.builder() to configure model path, tokenizer, device, and dispatcher.
Build the wrapper with build() (a suspend function).
kotlin
val rerankerResult = RerankerWrapper.builder()
  .modelPath("/path/to/rerank-model.gguf")
  .tokenizerPath("/path/to/tokenizer.model")
  .device("cpu")
  .dispatcher(Dispatchers.IO)
  .build()  // suspend function, call in coroutine

val reranker = rerankerResult.getOrNull() ?: return

Reranking

Synchronous reranking

kotlin
val rerankResult = reranker.rerank(
  query = "Which documents are most relevant?",
  documents = arrayOf("doc1", "doc2", "doc3"),
  config = RerankConfig()
)
rerankResult.onSuccess { scores ->
  // scores[i] is the score for documents[i]
  println("Rerank scores: ${scores.joinToString()}")
}
rerankResult.onFailure { e ->
  println("Rerank failed: $e")
}
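
Since scores[i] corresponds to documents[i], you can rank documents by zipping them with their scores and sorting best-first:
kotlin
val documents = arrayOf("doc1", "doc2", "doc3")
reranker.rerank(
  query = "Which documents are most relevant?",
  documents = documents,
  config = RerankConfig()
).onSuccess { scores ->
  // pair each document with its score and sort in descending score order
  val ranked = documents.zip(scores.toList()).sortedByDescending { it.second }
  ranked.forEach { (doc, score) -> println("$score  $doc") }
}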

Profiling Support

getProfilingData retrieves runtime profiling statistics for the current model/session.
It fills the provided ProfilingData object with metrics such as execution time, memory usage, and other performance statistics, depending on the native backend implementation.
kotlin
val profilingData = ProfilingData()
val status = reranker.getProfilingData(handle, profilingData)
if (status == 0) {
  println("Profiling info: $profilingData")
} else {
  println("Failed to get profiling data")
}

Resource Cleanup

Always call reranker.close(), or wrap usage in use { ... }, to release native resources:
kotlin
reranker.close()
// or
reranker.use { /* ... */ }

Configuration

ModelConfig Reference

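Based on the field descriptions below, ModelConfig is shaped roughly as follows (the default values shown are illustrative assumptions, not the library's actual defaults):
kotlin
data class ModelConfig(
  var nCtx: Int = 2048,        // context length in tokens (assumed default)
  var nThreads: Int = 4,       // threads for general inference (assumed default)
  var nThreadsBatch: Int = 4,  // threads for batch processing (assumed default)
  var nBatch: Int = 512,       // batch size for prompt tokens (assumed default)
  var nUBatch: Int = 128,      // micro-batch size for generation (assumed default)
  var nSeqMax: Int = 1         // max concurrent sequences (assumed default)
)
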
Field Descriptions

  • nCtx: Int
    Number of context tokens for the model (context length).
  • nThreads: Int
    Number of threads for general model inference.
  • nThreadsBatch: Int
    Number of threads used for batch processing (can be tuned separately).
  • nBatch: Int
    Batch size for prompt tokens.
  • nUBatch: Int
    Micro-batch size for generation. Used for efficiency and memory control.
  • nSeqMax: Int
    Maximum number of concurrent sequences for batch inference.

GenerationConfig Reference

kotlin
data class GenerationConfig(
  var maxTokens: Int = 32,
  var stopWords: Array<String>? = null,
  var stopCount: Int = 0,
  var nPast: Int = 0,
  var samplerConfig: SamplerConfig = SamplerConfig(),
  var imagePaths: Array<String>? = null,
  var imageCount: Int = 0,
  var audioPaths: Array<String>? = null,
  var audioCount: Int = 0
)

Field Descriptions

  • maxTokens: Int
    The maximum number of tokens to generate.
  • stopWords: Array<String>?
    List of stop words; generation stops if any is produced.
  • stopCount: Int
    Number of entries in stopWords; stop-word handling applies only if this is > 0.
  • nPast: Int
    Number of context tokens from past conversation/history.
  • samplerConfig: SamplerConfig
    Configuration for sampling strategies (temperature, top-k, etc).
  • imagePaths: Array<String>?
    Paths to image files for multimodal generation (for VLM).
  • imageCount: Int
    Number of images.
  • audioPaths: Array<String>?
    Paths to audio files (for multimodal audio input).
  • audioCount: Int
    Number of audio files.
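
For example, a config that generates up to 128 tokens and stops when the model emits "User:":
kotlin
val config = GenerationConfig(
  maxTokens = 128,
  stopWords = arrayOf("User:"),
  stopCount = 1
)
val result = wrapper.generate("Write a short dialogue.", config)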

ProfilingData Reference

Field Descriptions

  • ttftMs: Double
    Time to First Token in milliseconds. Time taken from request start to the first token generated.
  • totalTokens: Int
    Total number of tokens processed in this session (prompt + generated).
  • stopReason: String?
    Reason why generation stopped (e.g., “stopword”, “max_tokens”, etc.), if available.
  • tokensPerSecond: Double
    Token generation speed in tokens per second (throughput).
  • totalTimeMs: Double
    Total time taken for the entire request (milliseconds).
  • promptTimeMs: Double
    Time spent processing the prompt (milliseconds).
  • decodeTimeMs: Double
    Time spent in the decode (token generation) phase, in milliseconds.
  • promptTokens: Int
    Number of tokens in the prompt.
  • generatedTokens: Int
    Number of tokens generated by the model.
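
These counters relate to each other in the expected way; for example, decode throughput can be derived from the decode-phase numbers (whether tokensPerSecond is measured over the decode phase or the whole request depends on the backend):
kotlin
// Illustrative derived metric; exact semantics of tokensPerSecond depend on the backend.
val decodeTps = profilingData.generatedTokens / (profilingData.decodeTimeMs / 1000.0)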