LLM

This section describes how to use the LlmWrapper class to integrate a native LLM (Large Language Model) library into your Android project. LlmWrapper provides a coroutine-friendly, builder-pattern-based Kotlin API for interacting with the native LLM backend via JNI.
It supports text generation (sync and streaming), embedding, key-value cache, LoRA/model management, sampling configuration, chat templates, and robust error handling with idiomatic Kotlin Result wrappers.
All heavy operations are dispatched via the provided CoroutineDispatcher (defaults to Dispatchers.IO).

Initialize

Use LlmWrapper.builder() to configure model path, tokenizer, device, model config, and dispatcher.
Create the wrapper with build(), a suspend function that must be called from a coroutine.
kotlin
val wrapperResult = LlmWrapper.builder()
  .modelPath("/path/to/model.gguf")
  .tokenizerPath("/path/to/tokenizer.model")
  .device("cpu")
  .modelConfig(ModelConfig(/* custom config values */))
  .dispatcher(Dispatchers.IO)
  .build()  // suspend function, must be called in a coroutine

val wrapper = wrapperResult.getOrNull() ?: return
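
Because build() suspends, it must be called from a coroutine. For example, from an Android lifecycleScope (any CoroutineScope works; lifecycleScope here assumes androidx.lifecycle is on the classpath):
kotlin
lifecycleScope.launch {
  val wrapper = LlmWrapper.builder()
    .modelPath("/path/to/model.gguf")
    .build()
    .getOrNull() ?: return@launch
  // wrapper is ready to use here
}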

Text Generation

Synchronous text generation

kotlin
val result = wrapper.generate("Hello, Android!", GenerationConfig())
result.onSuccess { println(it) }
result.onFailure { e -> println("Error: $e") }

Streaming generation

kotlin
wrapper.generateStreamFlow("Stream prompt...", GenerationConfig())
  .collect { result ->
    result.onSuccess { token -> print(token) }
    result.onFailure { e -> println("Stream error: $e") }
  }
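
To accumulate the streamed tokens into a complete response, collect into a StringBuilder (standard Flow usage):
kotlin
val response = StringBuilder()
wrapper.generateStreamFlow("Stream prompt...", GenerationConfig())
  .collect { result ->
    result.onSuccess { token -> response.append(token) }
    result.onFailure { e -> println("Stream error: $e") }
  }
println("Full response: $response")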

Embedding

kotlin
val embeddingResult = wrapper.embed(arrayOf("text for embedding"))

KV Cache

kotlin
wrapper.saveKvCache("/tmp/cache.bin")
wrapper.loadKvCache("/tmp/cache.bin")
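
Saving and restoring the KV cache lets you resume a long conversation without re-processing the prompt. Assuming these calls return Result like the rest of the API (an assumption, not shown above), failures can be handled the same way:
kotlin
// Assumes saveKvCache/loadKvCache return Result like the other wrapper methods.
wrapper.saveKvCache("/tmp/cache.bin").onFailure { e -> println("Save failed: $e") }
// ...later, restore the cached context instead of re-processing the prompt:
wrapper.loadKvCache("/tmp/cache.bin").onFailure { e -> println("Load failed: $e") }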

LoRA Management

kotlin
wrapper.addLora("/path/to/lora.bin")  // register a LoRA adapter
wrapper.setLora(1)                    // activate the adapter with id 1
wrapper.removeLora(1)                 // remove the adapter with id 1

Sampler Settings

kotlin
wrapper.setSampler(SamplerConfig(...))
wrapper.resetSampler()
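
SamplerConfig's fields are not listed in this document; the following is a hypothetical example only, assuming common sampling knobs such as temperature and top-k exist under these names:
kotlin
// Hypothetical field names for illustration; check SamplerConfig for the real ones.
wrapper.setSampler(SamplerConfig(temperature = 0.7f, topK = 40))
wrapper.resetSampler()  // restore the default sampling settings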

Chat Template

kotlin
val result = wrapper.applyChatTemplate(arrayOf(ChatMessage(...)))
result.onSuccess { prompt ->
  println("Chat prompt: $prompt")
}
result.onFailure { error ->
  println("Failed to apply template: $error")
}
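
A typical flow is to render the prompt from chat messages, then generate from it. ChatMessage's constructor is not shown in this document; the sketch below assumes role/content parameters:
kotlin
val messages = arrayOf(
  ChatMessage(role = "system", content = "You are a helpful assistant."),  // assumed constructor
  ChatMessage(role = "user", content = "Hello, Android!")
)
wrapper.applyChatTemplate(messages).onSuccess { prompt ->
  wrapper.generate(prompt, GenerationConfig()).onSuccess { println(it) }
}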

Profiling Support

getProfilingData retrieves runtime profiling statistics for the current model/session.
It fills the provided ProfilingData object with metrics such as execution time, memory usage, and other performance statistics, depending on the native backend implementation.
kotlin
val profilingData = ProfilingData()
val status = wrapper.getProfilingData(handle, profilingData)
if (status == 0) {
  println("Profiling info: $profilingData")
} else {
  println("Failed to get profiling data")
}

Resource Cleanup

Always call wrapper.close(), or wrap usage in use { ... }, to release native resources:
kotlin
wrapper.close()
// or
wrapper.use { /* ... */ }

VLM

This section describes how to use the VlmWrapper class to integrate a native Vision-Language Model (VLM) into your Android project. VlmWrapper is a coroutine-friendly, builder-pattern-based Kotlin wrapper for multimodal (vision-language) models, accessed via JNI.
It supports text and multimodal generation (sync and streaming), embedding, tokenization, chat template, and sampler control, with robust error handling via Kotlin Result wrappers.
All operations are dispatched via the specified CoroutineDispatcher (default: Dispatchers.IO).

Initialize

Use VlmWrapper.builder() to configure model path, mmproj path, context length, device, and dispatcher.
Create the wrapper with build() (a suspend function).
kotlin
val vlmResult = VlmWrapper.builder()
  .modelPath("/path/to/vlm-model.gguf")
  .mmprojPath("/path/to/mmproj.bin")
  .ctxLen(4096)
  .device("cpu")
  .dispatcher(Dispatchers.IO)
  .build()  // suspend, call in coroutine

val vlm = vlmResult.getOrNull() ?: return

Text Generation

Synchronous text generation

kotlin
val result = vlm.generate("Describe this image:", GenerationConfig())
result.onSuccess { println(it) }
result.onFailure { e -> println("Error: $e") }
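
For multimodal generation, pass image paths through GenerationConfig (see the GenerationConfig reference below):
kotlin
val config = GenerationConfig(
  maxTokens = 128,
  imagePaths = arrayOf("/path/to/image.jpg"),
  imageCount = 1
)
val visionResult = vlm.generate("Describe this image:", config)
visionResult.onSuccess { println(it) }
visionResult.onFailure { e -> println("Error: $e") }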

Streaming generation

kotlin
vlm.generateStreamFlow("Stream prompt...", GenerationConfig())
  .collect { result ->
    result.onSuccess { token -> print(token) }
    result.onFailure { e -> println("Stream error: $e") }
  }

Embedding/Tokenization

Embedding

kotlin
val embeddingResult = vlm.embed(arrayOf("text for embedding"))

Encode/Decode

kotlin
val tokenBuffer = IntArray(512)                         // output buffer for encoded token ids
val encodeResult = vlm.encode("your text", tokenBuffer)
val decodeResult = vlm.decode(intArrayOf(1, 2, 3), 3)   // decode 3 token ids back to text

Chat Template

Get or apply a chat template:
kotlin
val templateResult = vlm.getChatTemplate("your_template_name")
val applyResult = vlm.applyChatTemplate(arrayOf(ChatMessage(...)))
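
Both calls return Result, so they compose with the usual onSuccess/onFailure handling; for example, feeding the rendered prompt back into generation:
kotlin
templateResult.onSuccess { template -> println("Template: $template") }
applyResult.onSuccess { prompt ->
  // generate from the rendered chat prompt
  vlm.generate(prompt, GenerationConfig())
}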

Sampler Control

Set or reset sampler:
kotlin
vlm.setSampler(SamplerConfig(...))
vlm.resetSampler()

Profiling Support

getProfilingData retrieves runtime profiling statistics for the current model/session.
It fills the provided ProfilingData object with metrics such as execution time, memory usage, and other performance statistics, depending on the native backend implementation.
kotlin
val profilingData = ProfilingData()
val status = vlm.getProfilingData(handle, profilingData)
if (status == 0) {
  println("Profiling info: $profilingData")
} else {
  println("Failed to get profiling data")
}

Resource Cleanup

Always call vlm.close(), or wrap usage in use { ... }, to release native resources:
kotlin
vlm.reset()  // optionally reset internal state before closing
vlm.close()
// or
vlm.use { /* ... */ }

Embedder

This section describes how to use the EmbedderWrapper class for embedding tasks in your Android project. EmbedderWrapper is a coroutine-friendly Kotlin wrapper around the JNI-based native Embedder library, providing fast, configurable vector embedding on Android.
It features a Builder pattern for easy configuration, robust error handling with Result types, LoRA management, and safe resource cleanup via Closeable.

Initialize

Use EmbedderWrapper.builder() to configure model path, tokenizer, device, and dispatcher.
Create the wrapper with build() (a suspend function).
kotlin
val embedderResult = EmbedderWrapper.builder()
  .modelPath("/path/to/embedding-model.gguf")
  .tokenizerPath("/path/to/tokenizer.model")
  .device("cpu")
  .dispatcher(Dispatchers.IO)
  .build()  // suspend function, call in coroutine

val embedder = embedderResult.getOrNull() ?: return

Embedding

Synchronous embedding

kotlin
val result = embedder.embed(arrayOf("your text here"), EmbeddingConfig())
result.onSuccess { embedding ->
  // use embedding vector
}
result.onFailure { e ->
  println("Embedding failed: $e")
}
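
A common use of embeddings is semantic similarity. This sketch assumes embed returns one FloatArray per input string (the exact return type depends on the binding):
kotlin
import kotlin.math.sqrt

// Cosine similarity between two embedding vectors.
fun cosine(a: FloatArray, b: FloatArray): Double {
  var dot = 0.0; var na = 0.0; var nb = 0.0
  for (i in a.indices) {
    dot += a[i] * b[i]
    na += a[i] * a[i]
    nb += b[i] * b[i]
  }
  return dot / (sqrt(na) * sqrt(nb))
}

embedder.embed(arrayOf("first text", "second text"), EmbeddingConfig()).onSuccess { vectors ->
  println("Similarity: ${cosine(vectors[0], vectors[1])}")  // assumes Array<FloatArray> result
}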

Get embedding dimension

kotlin
val dimResult = embedder.embeddingDim()
dimResult.onSuccess { dim -> println("Embedding dim: $dim") }

LoRA Management

Set, add, remove, or list LoRA modules:
kotlin
embedder.setLora(loraId)
embedder.addLora("/path/to/lora.bin")
embedder.removeLora(loraId)
embedder.listLoras().onSuccess { ids -> println("All LoRAs: ${ids.joinToString()}") }

Profiling Support

getProfilingData retrieves runtime profiling statistics for the current model/session.
It fills the provided ProfilingData object with metrics such as execution time, memory usage, and other performance statistics, depending on the native backend implementation.
kotlin
val profilingData = ProfilingData()
val status = embedder.getProfilingData(handle, profilingData)
if (status == 0) {
  println("Profiling info: $profilingData")
} else {
  println("Failed to get profiling data")
}

Resource Cleanup

Always call embedder.close(), or wrap usage in use { ... }, to release native resources:
kotlin
embedder.close()
// or
embedder.use { /* ... */ }

Reranker

This section describes how to use the RerankerWrapper class for running native reranking models via JNI in your Android project. RerankerWrapper is a coroutine-friendly, builder-pattern-based Kotlin wrapper for JNI-based reranker models.
It supports batch reranking of document arrays, robust error handling with Kotlin Result wrappers, and safe resource management via Closeable.
All compute-intensive operations are dispatched via the provided CoroutineDispatcher (default: Dispatchers.IO).

Initialize

Use RerankerWrapper.builder() to configure model path, tokenizer, device, and dispatcher.
Build the wrapper with build() (a suspend function).
kotlin
val rerankerResult = RerankerWrapper.builder()
  .modelPath("/path/to/rerank-model.gguf")
  .tokenizerPath("/path/to/tokenizer.model")
  .device("cpu")
  .dispatcher(Dispatchers.IO)
  .build()  // suspend function, call in coroutine

val reranker = rerankerResult.getOrNull() ?: return

Reranking

Synchronous reranking

kotlin
val rerankResult = reranker.rerank(
  query = "Which documents are most relevant?",
  documents = arrayOf("doc1", "doc2", "doc3"),
  config = RerankConfig()
)
rerankResult.onSuccess { scores ->
  // scores[i] is the score for documents[i]
  println("Rerank scores: ${scores.joinToString()}")
}
rerankResult.onFailure { e ->
  println("Rerank failed: $e")
}
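
Since scores[i] corresponds to documents[i], you can rank documents by zipping them with their scores and sorting best-first:
kotlin
val documents = arrayOf("doc1", "doc2", "doc3")
reranker.rerank(
  query = "Which documents are most relevant?",
  documents = documents,
  config = RerankConfig()
).onSuccess { scores ->
  // pair each document with its score and sort in descending score order
  val ranked = documents.zip(scores.toList()).sortedByDescending { it.second }
  ranked.forEach { (doc, score) -> println("$score  $doc") }
}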

Profiling Support

getProfilingData retrieves runtime profiling statistics for the current model/session.
It fills the provided ProfilingData object with metrics such as execution time, memory usage, and other performance statistics, depending on the native backend implementation.
kotlin
val profilingData = ProfilingData()
val status = reranker.getProfilingData(handle, profilingData)
if (status == 0) {
  println("Profiling info: $profilingData")
} else {
  println("Failed to get profiling data")
}

Resource Cleanup

Always call reranker.close(), or wrap usage in use { ... }, to release native resources:
kotlin
reranker.close()
// or
reranker.use { /* ... */ }

Configuration

ModelConfig Reference

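Based on the field descriptions below, ModelConfig is shaped roughly as follows (the default values shown are illustrative assumptions, not the library's actual defaults):
kotlin
data class ModelConfig(
  var nCtx: Int = 2048,        // context length in tokens (assumed default)
  var nThreads: Int = 4,       // threads for general inference (assumed default)
  var nThreadsBatch: Int = 4,  // threads for batch processing (assumed default)
  var nBatch: Int = 512,       // batch size for prompt tokens (assumed default)
  var nUBatch: Int = 128,      // micro-batch size for generation (assumed default)
  var nSeqMax: Int = 1         // max concurrent sequences (assumed default)
)
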
Field Descriptions

  • nCtx: Int
    Number of context tokens for the model (context length).
  • nThreads: Int
    Number of threads for general model inference.
  • nThreadsBatch: Int
    Number of threads used for batch processing (can be tuned separately).
  • nBatch: Int
    Batch size for prompt tokens.
  • nUBatch: Int
    Micro-batch size for generation. Used for efficiency and memory control.
  • nSeqMax: Int
    Maximum number of concurrent sequences for batch inference.

GenerationConfig Reference

kotlin
data class GenerationConfig(
  var maxTokens: Int = 32,
  var stopWords: Array<String>? = null,
  var stopCount: Int = 0,
  var nPast: Int = 0,
  var samplerConfig: SamplerConfig = SamplerConfig(),
  var imagePaths: Array<String>? = null,
  var imageCount: Int = 0,
  var audioPaths: Array<String>? = null,
  var audioCount: Int = 0
)

Field Descriptions

  • maxTokens: Int
    The maximum number of tokens to generate.
  • stopWords: Array<String>?
    List of stop words; generation stops if any is produced.
  • stopCount: Int
    Number of entries in stopWords; stop-word handling applies only if this is > 0.
  • nPast: Int
    Number of context tokens from past conversation/history.
  • samplerConfig: SamplerConfig
    Configuration for sampling strategies (temperature, top-k, etc).
  • imagePaths: Array<String>?
    Paths to image files for multimodal generation (for VLM).
  • imageCount: Int
    Number of images.
  • audioPaths: Array<String>?
    Paths to audio files (for multimodal audio input).
  • audioCount: Int
    Number of audio files.
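
For example, a config that generates up to 128 tokens and stops when the model emits "User:":
kotlin
val config = GenerationConfig(
  maxTokens = 128,
  stopWords = arrayOf("User:"),
  stopCount = 1
)
val result = wrapper.generate("Write a short dialogue.", config)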

ProfilingData Reference

Field Descriptions

  • ttftMs: Double
    Time to First Token in milliseconds. Time taken from request start to the first token generated.
  • totalTokens: Int
    Total number of tokens processed in this session (prompt + generated).
  • stopReason: String?
    Reason why generation stopped (e.g., “stopword”, “max_tokens”, etc.), if available.
  • tokensPerSecond: Double
    Token generation speed in tokens per second (throughput).
  • totalTimeMs: Double
    Total time taken for the entire request (milliseconds).
  • promptTimeMs: Double
    Time spent processing the prompt (milliseconds).
  • decodeTimeMs: Double
    Time spent in the decode (token generation) phase, in milliseconds.
  • promptTokens: Int
    Number of tokens in the prompt.
  • generatedTokens: Int
    Number of tokens generated by the model.
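
These counters relate to each other in the expected way; for example, decode throughput can be derived from the decode-phase numbers (whether tokensPerSecond is measured over the decode phase or the whole request depends on the backend):
kotlin
// Illustrative derived metric; exact semantics of tokensPerSecond depend on the backend.
val decodeTps = profilingData.generatedTokens / (profilingData.decodeTimeMs / 1000.0)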