Documentation Index
Fetch the complete documentation index at https://docs.nexa.ai/llms.txt and use it to discover all available pages before exploring further.
Model Name Mapping
All NPU models use an internal name mapping; set model_name and plugin_id according to the table below.
| Model Name | Plugin ID | Hugging Face Repository |
| --- | --- | --- |
| omni-neural | npu | NexaAI/OmniNeural-4B-mobile |
| phi3.5 | npu | NexaAI/phi3.5-mini-npu-mobile |
| phi4 | npu | NexaAI/phi4-mini-npu-mobile |
| granite4 | npu | NexaAI/Granite-4-Micro-NPU-mobile |
| embed-gemma | npu | NexaAI/embeddinggemma-300m-npu-mobile |
| qwen3-4b | npu | NexaAI/Qwen3-4B-Instruct-2507-npu-mobile |
| llama3-3b | npu | NexaAI/Llama3.2-3B-NPU-Turbo-NPU-mobile |
| liquid-v2 | npu | NexaAI/LFM2-1.2B-npu-mobile |
| paddleocr | npu | NexaAI/paddleocr-npu-mobile |
| parakeet | npu | NexaAI/parakeet-tdt-0.6b-v3-npu-mobile |
| yolo26x | npu | NexaAI/yolo26x-npu-mobile |
| yolo26l | npu | NexaAI/yolo26l-npu-mobile |
| yolo26m | npu | NexaAI/yolo26m-npu-mobile |
| yolo26s | npu | NexaAI/yolo26s-npu-mobile |
| yolo26n | npu | NexaAI/yolo26n-npu-mobile |
| depth-anything-v2 | npu | NexaAI/depth-anything-v2-npu-mobile |
Beyond the NEXA-optimized models listed above, any community GGUF model can also run on the Qualcomm Hexagon NPU via the GGML Hexagon backend: use the cpu_gpu plugin and set device_id = "dev0".
Two Ways to Run on NPU
You can run models on the Qualcomm Hexagon NPU in two ways (both create-time configurations are sketched side by side after this list):
1) NEXA models via the "npu" plugin
- Use the npu plugin.
- Pick a supported NEXA model from the table above and set model_name accordingly.
2) GGUF models via the GGML Hexagon backend
- Load a GGUF model.
- Use the cpu_gpu plugin.
- Set device_id to "dev0".
- Set nGpuLayers > 0 in ModelConfig.
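The two paths differ only in a few create-time fields. A minimal side-by-side sketch, reusing the LlmCreateInput and ModelConfig shapes from the full examples below (paths are placeholders):

// 1) NEXA model on NPU: a named model from the mapping table plus the "npu" plugin.
val nexaInput = LlmCreateInput(
    model_name = "liquid-v2",
    model_path = "<your-model-folder-path>",
    config = ModelConfig(max_tokens = 2048),
    plugin_id = "npu"
)

// 2) Community GGUF on NPU: empty model_name, "cpu_gpu" plugin, device "dev0",
//    and nGpuLayers > 0 to offload layers to the Hexagon backend.
val ggufInput = LlmCreateInput(
    model_name = "",
    model_path = "<your-gguf-model-path>",
    config = ModelConfig(nCtx = 4096, nGpuLayers = 999),
    plugin_id = "cpu_gpu",
    device_id = "dev0"
)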
LLM Usage
Large Language Models for text generation and chat applications.
1) NEXA Models ("npu" plugin)
We support NPU inference for NEXA format models.
LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = "liquid-v2",
            model_path = "<your-model-folder-path>",
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu"
        )
    )
    .build()
    .onSuccess { llmWrapper = it }
val chatList = arrayListOf(ChatMessage("user", "What is AI?"))
llmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    llmWrapper.generateStreamFlow(template.formattedText, GenerationConfig()).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> println(result.text)
            is LlmStreamResult.Completed -> println("Done!")
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}
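To collect the full response instead of printing tokens as they stream, accumulate them; a minimal sketch, assuming the same template and stream result types as above:

val response = StringBuilder()
llmWrapper.generateStreamFlow(template.formattedText, GenerationConfig()).collect { result ->
    when (result) {
        is LlmStreamResult.Token -> response.append(result.text)   // buffer each token
        is LlmStreamResult.Completed -> println("Full response: $response")
        is LlmStreamResult.Error -> println("Error: ${result.throwable}")
    }
}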
2) GGUF Models on Hexagon NPU (GGML Hexagon backend)
Run a GGUF model on Hexagon NPU by using the cpu_gpu plugin with device_id = "dev0" and setting nGpuLayers > 0.
LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = "",                       // GGUF: keep model_name empty
            model_path = "<your-gguf-model-path>", // e.g. /data/data/<com.your.app>/files/models/gpt-oss-GGUF/gpt-oss-Q4_0.gguf
            config = ModelConfig(
                nCtx = 4096,
                nGpuLayers = 999 // > 0 enables offloading; 999 attempts to offload all layers
            ),
            plugin_id = "cpu_gpu",
            device_id = "dev0" // Use the NPU device to run models like GPT-OSS 20B
        )
    )
    .build()
    .onSuccess { llmWrapper = it }
    .onFailure { error -> println("Error: ${error.message}") }
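Once loaded, generation is identical to the NEXA path; a minimal sketch reusing the chat-template flow shown above (the prompt text is illustrative):

val chatList = arrayListOf(ChatMessage("user", "What is AI?"))
llmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    llmWrapper.generateStreamFlow(template.formattedText, GenerationConfig()).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> println(result.text)
            is LlmStreamResult.Completed -> println("Done!")
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}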
Multimodal Usage
Vision-Language Models for image understanding and multimodal applications.
1) NEXA Models ("npu" plugin)
We support NPU inference for NEXA format models.
VlmWrapper.builder()
    .vlmCreateInput(
        VlmCreateInput(
            model_name = "omni-neural",              // Model name for NPU plugin
            model_path = "<your-model-folder-path>",
            config = ModelConfig(
                max_tokens = 2048,
                enable_thinking = false
            ),
            plugin_id = "npu" // Use NPU backend
        )
    )
    .build()
    .onSuccess { vlmWrapper = it }
    .onFailure { error ->
        println("Error: ${error.message}")
    }
// Use the loaded VLM with image and text
val contents = listOf(
    VlmContent("image", "<your-image-path>"),
    VlmContent("text", "<your-text>")
)
val chatList = arrayListOf(VlmChatMessage("user", contents))
vlmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    val config = vlmWrapper.injectMediaPathsToConfig(chatList.toTypedArray(), GenerationConfig())
    vlmWrapper.generateStreamFlow(template.formattedText, config).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> println(result.text)
            is LlmStreamResult.Completed -> println("Done!")
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}
2) GGUF Models on Hexagon NPU (GGML Hexagon backend)
Run a GGUF VLM on Hexagon NPU by using the cpu_gpu plugin with device_id = "dev0" and setting nGpuLayers > 0.
VlmWrapper.builder()
    .vlmCreateInput(
        VlmCreateInput(
            model_name = "",                        // GGUF: keep model_name empty
            model_path = "<your-gguf-model-path>",
            mmproj_path = "<your-mmproj-path>",     // vision projection weights
            config = ModelConfig(
                nCtx = 4096,
                nGpuLayers = 999 // > 0 enables offloading; 999 attempts to offload all layers
            ),
            plugin_id = "cpu_gpu",
            device_id = "dev0"
        )
    )
    .build()
    .onSuccess { vlmWrapper = it }
    .onFailure { error ->
        println("Error: ${error.message}")
    }
// Use the loaded VLM with image and text
val contents = listOf(
    VlmContent("image", "<your-image-path>"),
    VlmContent("text", "<your-text>")
)
val chatList = arrayListOf(VlmChatMessage("user", contents))
vlmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    val baseConfig = GenerationConfig(maxTokens = 2048)
    val configWithMedia = vlmWrapper.injectMediaPathsToConfig(
        chatList.toTypedArray(),
        baseConfig
    )
    vlmWrapper.generateStreamFlow(template.formattedText, configWithMedia).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> println(result.text)
            is LlmStreamResult.Completed -> println("Done!")
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}
Embeddings Usage
Generate vector embeddings for semantic search and RAG applications.
Basic Usage
// Load embedder for NPU inference
EmbedderWrapper.builder()
    .embedderCreateInput(
        EmbedderCreateInput(
            model_name = "embed-gemma",               // Model name for NPU plugin
            model_path = "<your-model-folder-path>",
            tokenizer_path = "<your-tokenizer-path>", // Optional
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu", // Use NPU backend
            device_id = null   // Optional device ID
        )
    )
    .build()
    .onSuccess { embedderWrapper = it }
    .onFailure { error ->
        println("Error: ${error.message}")
    }

// Generate embeddings for multiple texts
val texts = arrayOf("<your-text1>", "<your-text2>", ...)
embedderWrapper.embed(texts, EmbeddingConfig()).onSuccess { embeddings ->
    val dimension = embeddings.size / texts.size
    println("Dimension: $dimension")
    println("First 5 values: ${embeddings.take(5)}")
}
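For semantic search, slice the flat embedding array per text and compare vectors with cosine similarity; a minimal sketch, assuming embed() returns a flat FloatArray (the cosine helper is illustrative, not part of the SDK):

import kotlin.math.sqrt

// Illustrative helper: cosine similarity between two equal-length vectors.
fun cosine(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var na = 0f; var nb = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        na += a[i] * a[i]
        nb += b[i] * b[i]
    }
    return dot / (sqrt(na) * sqrt(nb))
}

embedderWrapper.embed(texts, EmbeddingConfig()).onSuccess { embeddings ->
    val dim = embeddings.size / texts.size
    val first = embeddings.copyOfRange(0, dim)        // embedding of texts[0]
    val second = embeddings.copyOfRange(dim, 2 * dim) // embedding of texts[1]
    println("Similarity(text1, text2): ${cosine(first, second)}")
}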
ASR Usage
Automatic Speech Recognition for audio transcription.
Basic Usage
// Load ASR model for NPU inference
AsrWrapper.builder()
    .asrCreateInput(
        AsrCreateInput(
            model_name = "parakeet",                 // Model name for NPU plugin
            model_path = "<your-model-folder-path>",
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu" // Use NPU backend
        )
    )
    .build()
    .onSuccess { asrWrapper = it }
    .onFailure { error ->
        println("Error: ${error.message}")
    }

// Transcribe audio file
asrWrapper.transcribe(
    AsrTranscribeInput(
        audioPath = "<your-audio-path>", // Path to .wav, .mp3, etc.
        language = "en",                 // Language code: "en", "zh", "es", etc.
        timestamps = null                // Optional timestamp format
    )
).onSuccess { result ->
    println("Transcription: ${result.result.transcript}")
}
Rerank Usage
Improve search relevance by reranking documents based on query relevance.
Basic Usage
// Load reranker model for NPU inference
RerankerWrapper.builder()
    .rerankerCreateInput(
        RerankerCreateInput(
            model_name = "jina-rerank",               // Model name for NPU plugin
            model_path = "<your-model-folder-path>",
            tokenizer_path = "<your-tokenizer-path>", // Optional
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu", // Use NPU backend
            device_id = null   // Optional device ID
        )
    )
    .build()
    .onSuccess { rerankerWrapper = it }
    .onFailure { error ->
        println("Error: ${error.message}")
    }

// Rerank documents based on query relevance
val query = "What is machine learning?"
val docs = arrayOf("ML is AI subset", "Weather forecast", "Deep learning tutorial")
rerankerWrapper.rerank(query, docs, RerankConfig()).onSuccess { result ->
    result.scores?.withIndex()?.sortedByDescending { it.value }?.forEach { (idx, score) ->
        println("Score: ${"%.4f".format(score)} - ${docs[idx]}")
    }
}
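In a RAG pipeline you typically keep only the top-scoring documents as context; a minimal sketch building on the result above (topK is an illustrative parameter):

val topK = 2
rerankerWrapper.rerank(query, docs, RerankConfig()).onSuccess { result ->
    val topDocs = result.scores
        ?.withIndex()
        ?.sortedByDescending { it.value }
        ?.take(topK)
        ?.map { docs[it.index] }
        .orEmpty()
    // Feed topDocs into your prompt as retrieved context.
    println("Context: ${topDocs.joinToString(" | ")}")
}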
CV Usage
Computer Vision models for OCR, object detection, and image classification.
Basic Usage
// Load PaddleOCR model for NPU inference
CvWrapper.builder()
    .createInput(
        CVCreateInput(
            model_name = "paddleocr", // Model name
            config = CVModelConfig(
                capabilities = CVCapability.OCR,
                det_model_path = "<your-det-model-folder-path>",
                rec_model_path = "<your-rec-model-path>",
                char_dict_path = "<your-char-dict-path>",
                qnn_model_folder_path = "<your-qnn-model-folder-path>", // For NPU
                qnn_lib_folder_path = "<your-qnn-lib-folder-path>"      // For NPU
            ),
            plugin_id = "npu" // Use NPU backend
        )
    )
    .build()
    .onSuccess { cvWrapper = it }
    .onFailure { error ->
        println("Error: ${error.message}")
    }

// Perform OCR on image
cvWrapper.infer("<your-image-path>").onSuccess { results ->
    results.forEach { result ->
        println("Text: ${result.text}, Confidence: ${result.confidence}")
    }
}
Need Help?
Join our community to get support, share your projects, and connect with other developers.
- Discord Community: Get real-time support and chat with the Nexa AI community.
- Slack Community: Collaborate with developers and access community resources.