LLM
This section describes how to use the LlmWrapper class to integrate a native LLM (Large Language Model) library via JNI in your Android project. The latest LlmWrapper provides a coroutine-friendly, builder-pattern-based Kotlin API for interacting with a native LLM backend over JNI. It supports text generation (synchronous and streaming), embedding, key-value cache management, LoRA/model management, sampling configuration, chat templates, and robust error handling with idiomatic Kotlin Result wrappers.
All heavy operations are dispatched via the provided CoroutineDispatcher (defaults to Dispatchers.IO).
Initialize
Use LlmWrapper.builder() to configure the model path, tokenizer, device, model config, and dispatcher. Create the wrapper with build(), a suspend function that must be called from a coroutine.
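A minimal initialization sketch; the builder setter names below (modelPath, tokenizerPath, device, modelConfig, dispatcher) are assumptions based on the options listed above:

```kotlin
// Sketch only: setter names are assumed from the options described above.
val wrapper = LlmWrapper.builder()
    .modelPath("/data/local/tmp/model.gguf")
    .tokenizerPath("/data/local/tmp/tokenizer.json")
    .device("cpu")
    .modelConfig(ModelConfig(nCtx = 2048, nThreads = 4))
    .dispatcher(Dispatchers.IO)
    .build() // suspend: must run inside a coroutine
```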
Text Generation
Synchronous text generation
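A hedged sketch of a synchronous call; generate() and its Result<String> return type are assumptions based on the Result-wrapper error handling described above:

```kotlin
val result: Result<String> = wrapper.generate(
    "Explain JNI in one sentence.",
    GenerationConfig(maxTokens = 128),
)
result
    .onSuccess { text -> println(text) }
    .onFailure { e -> Log.e("Llm", "generation failed", e) }
```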
Streaming generation
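Streaming might look like the following; a Flow-based generateStream() is an assumed name:

```kotlin
wrapper.generateStream("Tell me a short story.", GenerationConfig(maxTokens = 256))
    .collect { token -> print(token) } // tokens arrive incrementally
```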
Embedding
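A sketch of the embedding call; embed() returning Result<FloatArray> is an assumption:

```kotlin
val embedding: FloatArray = wrapper.embed("hello world").getOrThrow()
```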
KV Cache
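KV-cache persistence might look like this; saveKvCache() and loadKvCache() are assumed names:

```kotlin
wrapper.saveKvCache("/data/local/tmp/session.kv") // persist the current context
wrapper.loadKvCache("/data/local/tmp/session.kv") // restore it in a later session
```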
LoRA Management
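A sketch of LoRA adapter management; the method names are assumptions:

```kotlin
val loraId = wrapper.addLora("/data/local/tmp/adapter.gguf") // load an adapter
wrapper.setLora(loraId)    // activate it
wrapper.removeLora(loraId) // unload it
```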
Sampler Settings
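Sampler configuration might look like the following; the SamplerConfig field names shown are assumptions:

```kotlin
wrapper.setSampler(SamplerConfig(temperature = 0.7f, topK = 40, topP = 0.9f))
wrapper.resetSampler() // revert to default sampling settings
```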
Chat Template
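Applying a chat template might look like this; applyChatTemplate() and ChatMessage are assumed names:

```kotlin
val prompt = wrapper.applyChatTemplate(
    listOf(ChatMessage(role = "user", content = "Hi there!"))
)
```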
Profiling Support
Retrieves runtime profiling statistics for the current model/session. This method fills the provided ProfilingData object with metrics such as execution time, memory usage, and other performance statistics, depending on the native backend implementation.
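A sketch of reading profiling data; getProfilingData() is an assumed method name:

```kotlin
val stats = ProfilingData()
wrapper.getProfilingData(stats)
Log.d("Llm", "ttft=${stats.ttftMs} ms, throughput=${stats.tokensPerSecond} tok/s")
```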
Resource Cleanup
Always call wrapper.close(), or use use { ... }, for resource cleanup:
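For example, using Kotlin's standard use { } extension on the Closeable wrapper:

```kotlin
wrapper.use { llm ->
    llm.generate("Hello", GenerationConfig(maxTokens = 32))
} // close() is called automatically, even if an exception is thrown
```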
VLM
This section describes how to use the VlmWrapper class to integrate a native Vision-Language Model (VLM) via JNI in your Android project. The latest VlmWrapper is a coroutine-friendly, builder-pattern-based Kotlin wrapper for multimodal (vision-language) models. It supports text and multimodal generation (synchronous and streaming), embedding, tokenization, chat templates, and sampler control, with robust error handling via Kotlin Result wrappers.
All operations are dispatched via the specified CoroutineDispatcher (default: Dispatchers.IO).
Initialize
Use VlmWrapper.builder() to configure the model path, mmproj path, context length, device, and dispatcher. Create the wrapper with build() (a suspend function).
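An initialization sketch; the mmprojPath and contextLength setter names are assumptions based on the option list above:

```kotlin
val vlm = VlmWrapper.builder()
    .modelPath("/data/local/tmp/vlm.gguf")
    .mmprojPath("/data/local/tmp/mmproj.gguf") // vision projector weights
    .contextLength(4096)
    .device("cpu")
    .dispatcher(Dispatchers.IO)
    .build() // suspend: call inside a coroutine
```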
Text Generation
Synchronous text generation
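A hedged sketch of multimodal generation, passing an image via the imagePaths field of GenerationConfig; generate() is an assumed name:

```kotlin
val answer = vlm.generate(
    "What is in this picture?",
    GenerationConfig(
        maxTokens = 128,
        imagePaths = arrayOf("/sdcard/Pictures/cat.jpg"), // multimodal input
        imageCount = 1,
    ),
).getOrThrow()
```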
Streaming generation
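Streaming might mirror the text-only API; a Flow-based generateStream() is an assumption:

```kotlin
val config = GenerationConfig(
    maxTokens = 256,
    imagePaths = arrayOf("/sdcard/Pictures/cat.jpg"),
    imageCount = 1,
)
vlm.generateStream("Describe the image.", config)
    .collect { token -> print(token) }
```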
Embedding/Tokenization
Embedding
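A one-line sketch; embed() returning Result<FloatArray> is an assumption:

```kotlin
val vec: FloatArray = vlm.embed("a photo of a cat").getOrThrow()
```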
Encode/Decode
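A sketch of round-tripping text through the tokenizer; encode() and decode() are assumed names:

```kotlin
val tokens: IntArray = vlm.encode("hello").getOrThrow() // text -> token ids
val text: String = vlm.decode(tokens).getOrThrow()      // token ids -> text
```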
Chat Template
Get or apply the chat template:
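For example (getChatTemplate() and applyChatTemplate() are assumed names):

```kotlin
val template = vlm.getChatTemplate().getOrThrow()
val prompt = vlm.applyChatTemplate(
    listOf(ChatMessage(role = "user", content = "Hi!"))
)
```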
Sampler Control
Set or reset the sampler:
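For example (setSampler()/resetSampler() and the SamplerConfig fields are assumptions):

```kotlin
vlm.setSampler(SamplerConfig(temperature = 0.8f, topK = 50))
vlm.resetSampler() // revert to defaults
```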
Profiling
Retrieves runtime profiling statistics for the current model/session. This method fills the provided ProfilingData object with metrics such as execution time, memory usage, and other performance statistics, depending on the native backend implementation.
Resource Cleanup
Always call vlm.close(), or use use { ... }, for resource cleanup.
Embedder
This section describes how to use the EmbedderWrapper class for embedding tasks with a native Embedder JNI library in your Android project. EmbedderWrapper is a coroutine-friendly Kotlin wrapper for the JNI-based native Embedder library, used for fast and configurable vector embedding in Android. It features a Builder pattern for easy configuration, robust error handling with Result types, LoRA management, and safe resource cleanup via Closeable.
Initialize
Use EmbedderWrapper.builder() to configure the model path, tokenizer, device, and dispatcher. Create the wrapper with build() (a suspend function).
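An initialization sketch under the same assumptions as the other builders above (setter names are guesses from the option list):

```kotlin
val embedder = EmbedderWrapper.builder()
    .modelPath("/data/local/tmp/embedder.gguf")
    .tokenizerPath("/data/local/tmp/tokenizer.json")
    .device("cpu")
    .dispatcher(Dispatchers.IO)
    .build() // suspend
```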
Embedding
Synchronous embedding
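A sketch of a synchronous embedding call; embed() returning Result<FloatArray> is an assumption:

```kotlin
val vector: FloatArray = embedder.embed("some text to embed").getOrThrow()
```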
Get embedding dimension
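For example (embeddingDim() is an assumed accessor name):

```kotlin
val dim: Int = embedder.embeddingDim() // size of each output vector
```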
LoRA Management
Set, add, remove, or list LoRA modules:
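For example (all method names below are assumptions):

```kotlin
val id = embedder.addLora("/data/local/tmp/adapter.gguf")
embedder.setLora(id)
val active = embedder.listLoras() // enumerate loaded adapters
embedder.removeLora(id)
```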
Profiling Support
Retrieves runtime profiling statistics for the current model/session. This method fills the provided ProfilingData object with metrics such as execution time, memory usage, and other performance statistics, depending on the native backend implementation.
Resource Cleanup
Always call embedder.close(), or use use { ... }, for resource cleanup.
Reranker
This section describes how to use the RerankerWrapper class for running native reranking models via JNI in your Android project. RerankerWrapper is a coroutine-friendly, builder-pattern-based Kotlin wrapper for JNI-based reranker models. It supports batch reranking of document arrays, robust error handling with Kotlin Result wrappers, and safe resource management via Closeable.
All compute-intensive operations are dispatched via the provided CoroutineDispatcher (default: Dispatchers.IO).
Initialize
Use RerankerWrapper.builder() to configure the model path, tokenizer, device, and dispatcher. Build the wrapper with build() (a suspend function).
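An initialization sketch, with setter names assumed from the option list above:

```kotlin
val reranker = RerankerWrapper.builder()
    .modelPath("/data/local/tmp/reranker.gguf")
    .tokenizerPath("/data/local/tmp/tokenizer.json")
    .device("cpu")
    .dispatcher(Dispatchers.IO)
    .build() // suspend
```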
Reranking
Synchronous reranking
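A hedged sketch of batch reranking; rerank() and its Result<FloatArray> return type are assumptions:

```kotlin
val scores: FloatArray = reranker.rerank(
    query = "What is JNI?",
    documents = arrayOf(
        "JNI bridges Java/Kotlin and native code.",
        "Kotlin is a programming language.",
    ),
).getOrThrow() // one relevance score per document
```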
Profiling Support
Retrieves runtime profiling statistics for the current model/session. This method fills the provided ProfilingData object with metrics such as execution time, memory usage, and other performance statistics, depending on the native backend implementation.
Resource Cleanup
Always call reranker.close(), or use use { ... }, for resource cleanup.
Configuration
ModelConfig Reference
Field Descriptions
nCtx: Int
Number of context tokens for the model (context length).
nThreads: Int
Number of threads for general model inference.
nThreadsBatch: Int
Number of threads used for batch processing (can be tuned separately).
nBatch: Int
Batch size for prompt tokens.
nUBatch: Int
Micro-batch size for generation, used for efficiency and memory control.
nSeqMax: Int
Maximum number of concurrent sequences for batch inference.
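The fields above suggest a data class along these lines; the default values shown are assumptions:

```kotlin
data class ModelConfig(
    val nCtx: Int = 2048,        // context length in tokens
    val nThreads: Int = 4,       // general inference threads
    val nThreadsBatch: Int = 4,  // batch-processing threads
    val nBatch: Int = 512,       // prompt batch size
    val nUBatch: Int = 512,      // generation micro-batch size
    val nSeqMax: Int = 1,        // max concurrent sequences
)
```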
GenerationConfig Reference
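A sketch reconstructed from the field descriptions below; the default values are assumptions:

```kotlin
data class GenerationConfig(
    val maxTokens: Int = 128,
    val stopWords: Array<String>? = null,
    val stopCount: Int = 0,
    val nPast: Int = 0,
    val samplerConfig: SamplerConfig = SamplerConfig(),
    val imagePaths: Array<String> = emptyArray(),
    val imageCount: Int = 0,
    val audioPaths: Array<String> = emptyArray(),
    val audioCount: Int = 0,
)
```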
Field Descriptions
maxTokens: Int
The maximum number of tokens to generate.
stopWords: Array<String>? = null
List of stop words; generation stops if any of them is produced.
stopCount: Int
The number of stop words that triggers stopping (if > 0).
nPast: Int
Number of context tokens from past conversation/history.
samplerConfig: SamplerConfig
Configuration for sampling strategies (temperature, top-k, etc.).
imagePaths: Array<String>
Paths to image files for multimodal generation (for VLM).
imageCount: Int
Number of images.
audioPaths: Array<String>
Paths to audio files (for multimodal audio input).
audioCount: Int
Number of audio files.
ProfilingData Reference
Field Descriptions
ttftMs: Double
Time to first token, in milliseconds: the time from request start to the first generated token.
totalTokens: Int
Total number of tokens processed in this session (prompt + generated).
stopReason: String?
Reason why generation stopped (e.g., "stopword", "max_tokens"), if available.
tokensPerSecond: Double
Token generation speed in tokens per second (throughput).
totalTimeMs: Double
Total time taken for the entire request (milliseconds).
promptTimeMs: Double
Time spent processing the prompt (milliseconds).
decodeTimeMs: Double
Time spent in the decode/generation phase (milliseconds).
promptTokens: Int
Number of tokens in the prompt.
generatedTokens: Int
Number of tokens generated by the model.