Model Name Mapping
For all NPU models we use an internal naming mapping; fill in the plugin ID accordingly.
| Model Name | Plugin ID | Hugging Face Repository |
| --- | --- | --- |
| omni-neural | npu | NexaAI/OmniNeural-4B-mobile |
| phi3.5 | npu | NexaAI/phi3.5-mini-npu-mobile |
| phi4 | npu | NexaAI/phi4-mini-npu-mobile |
| granite4 | npu | NexaAI/Granite-4-Micro-NPU-mobile |
| embed-gemma | npu | NexaAI/embeddinggemma-300m-npu-mobile |
| qwen3-4b | npu | NexaAI/Qwen3-4B-Instruct-2507-npu-mobile |
| llama3-3b | npu | NexaAI/Llama3.2-3B-NPU-Turbo-NPU-mobile |
| liquid-v2 | npu | NexaAI/LFM2-1.2B-npu-mobile |
| paddleocr | npu | NexaAI/paddleocr-npu-mobile |
| parakeet | npu | NexaAI/parakeet-tdt-0.6b-v3-npu-mobile |
| yolo26x | npu | NexaAI/yolo26x-npu-mobile |
| yolo26l | npu | NexaAI/yolo26l-npu-mobile |
| yolo26m | npu | NexaAI/yolo26m-npu-mobile |
| yolo26s | npu | NexaAI/yolo26s-npu-mobile |
| yolo26n | npu | NexaAI/yolo26n-npu-mobile |
| depth-anything-v2 | npu | NexaAI/depth-anything-v2-npu-mobile |
Beyond the NEXA-optimized models listed above, any GGUF model from the community can also run on the Qualcomm Hexagon NPU: use the cpu_gpu plugin and set device_id = "dev0". This path is powered by the GGML Hexagon backend.
Two Ways to Run on NPU
You can run models on the Qualcomm Hexagon NPU in two ways:
1) NEXA models via “npu” plugin
Use the npu plugin
Pick a supported NEXA model from the table above and set model_name accordingly
2) GGUF models via GGML Hexagon backend
Load a GGUF model
Use the cpu_gpu plugin
Set device_id to dev0
Set nGpuLayers > 0 in ModelConfig (see the sketch after this list)
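A minimal sketch of the two configurations side by side, using the LlmCreateInput and ModelConfig shapes from the examples below (paths are placeholders):

// (1) NEXA model via the "npu" plugin
val nexaInput = LlmCreateInput(
    model_name = "liquid-v2",              // a name from the mapping table above
    model_path = "/path/to/model/folder",  // placeholder path
    config = ModelConfig(max_tokens = 2048),
    plugin_id = "npu"
)

// (2) Community GGUF model via the GGML Hexagon backend
val ggufInput = LlmCreateInput(
    model_name = "",                       // GGUF: keep model_name empty
    model_path = "/path/to/model.gguf",    // placeholder path
    config = ModelConfig(nCtx = 4096, nGpuLayers = 999),  // nGpuLayers > 0 offloads layers
    plugin_id = "cpu_gpu",
    device_id = "dev0"                     // selects the Hexagon NPU device
)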
LLM Usage
Large Language Models for text generation and chat applications.
1) NEXA Models (“npu” plugin)
We support NPU inference for NEXA-format models.
LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = "liquid-v2",
            model_path = <your-model-folder-path>,
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu"
        )
    )
    .build()
    .onSuccess { llmWrapper = it }
val chatList = arrayListOf(ChatMessage("user", "What is AI?"))
llmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    llmWrapper.generateStreamFlow(template.formattedText, GenerationConfig()).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> println(result.text)
            is LlmStreamResult.Completed -> println("Done!")
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}
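generateStreamFlow is collected like a Kotlin Flow, so the collect call must run inside a coroutine. A minimal sketch of the wiring (viewModelScope and Dispatchers.IO are illustrative choices, not SDK requirements):

// Illustrative: run generation off the main thread and accumulate tokens.
viewModelScope.launch(Dispatchers.IO) {
    val answer = StringBuilder()
    llmWrapper.generateStreamFlow(template.formattedText, GenerationConfig()).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> answer.append(result.text)
            is LlmStreamResult.Completed -> println("Full answer: $answer")
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}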
2) GGUF Models on Hexagon NPU (GGML Hexagon backend)
Run a GGUF model on Hexagon NPU by using the cpu_gpu plugin with device_id = "dev0" and setting nGpuLayers > 0.
LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = "",  // GGUF: keep model_name empty
            model_path = "<your-gguf-model-path>",  // e.g. /data/data/<com.your.app>/files/models/gpt-oss-GGUF/gpt-oss-Q4_0.gguf
            config = ModelConfig(
                nCtx = 4096,
                nGpuLayers = 999  // > 0 enables offloading; 999 attempts to offload all layers
            ),
            plugin_id = "cpu_gpu",
            device_id = "dev0"  // Use the NPU device to run models like GPT-OSS 20B
        )
    )
    .build()
    .onSuccess { llmWrapper = it }
    .onFailure { error -> println("Error: ${error.message}") }
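nGpuLayers = 999 asks the backend to offload every layer; if a model is too large for the device, a smaller value splits the stack between the Hexagon backend and the CPU. A sketch with an illustrative split point (the right value depends on the model and device):

// Partial offload: push the first 20 transformer layers to the Hexagon
// backend and keep the remainder on the CPU. Tune per model and device.
config = ModelConfig(
    nCtx = 4096,
    nGpuLayers = 20
)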
Multimodal Usage
Vision-Language Models for image understanding and multimodal applications.
1) NEXA Models (“npu” plugin)
We support NPU inference for NEXA-format models.
VlmWrapper.builder()
    .vlmCreateInput(
        VlmCreateInput(
            model_name = "omni-neural",  // Model name for NPU plugin
            model_path = <your-model-folder-path>,
            config = ModelConfig(
                max_tokens = 2048,
                enable_thinking = false
            ),
            plugin_id = "npu"  // Use NPU backend
        )
    )
    .build()
    .onSuccess { vlmWrapper = it }
    .onFailure { error ->
        println("Error: ${error.message}")
    }
// Use the loaded VLM with image and text
val contents = listOf(
    VlmContent("image", <your-image-path>),
    VlmContent("text", <your-text>)
)
val chatList = arrayListOf(VlmChatMessage("user", contents))
vlmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    val config = vlmWrapper.injectMediaPathsToConfig(chatList.toTypedArray(), GenerationConfig())
    vlmWrapper.generateStreamFlow(template.formattedText, config).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> println(result.text)
            is LlmStreamResult.Completed -> println("Done!")
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}
2) GGUF Models on Hexagon NPU (GGML Hexagon backend)
Run a GGUF VLM on Hexagon NPU by using the cpu_gpu plugin with device_id = "dev0" and setting nGpuLayers > 0.
VlmWrapper.builder()
    .vlmCreateInput(
        VlmCreateInput(
            model_name = "",  // GGUF: keep model_name empty
            model_path = <your-gguf-model-path>,
            mmproj_path = <your-mmproj-path>,  // vision projection weights
            config = ModelConfig(
                nCtx = 4096,
                nGpuLayers = 999  // > 0 enables offloading; 999 attempts to offload all layers
            ),
            plugin_id = "cpu_gpu",
            device_id = "dev0"
        )
    )
    .build()
    .onSuccess { vlmWrapper = it }
    .onFailure { error ->
        println("Error: ${error.message}")
    }
// Use the loaded VLM with image and text
val contents = listOf(
    VlmContent("image", <your-image-path>),
    VlmContent("text", "<your-text>")
)
val chatList = arrayListOf(VlmChatMessage("user", contents))
vlmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    val baseConfig = GenerationConfig(maxTokens = 2048)
    val configWithMedia = vlmWrapper.injectMediaPathsToConfig(
        chatList.toTypedArray(),
        baseConfig
    )
    vlmWrapper.generateStreamFlow(template.formattedText, configWithMedia).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> println(result.text)
            is LlmStreamResult.Completed -> println("Done!")
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}
Embeddings Usage
Generate vector embeddings for semantic search and RAG applications.
Basic Usage
// Load the embedder for NPU inference
EmbedderWrapper.builder()
    .embedderCreateInput(
        EmbedderCreateInput(
            model_name = "embed-gemma",  // Model name for NPU plugin
            model_path = <your-model-folder-path>,
            tokenizer_path = <your-tokenizer-path>,  // Optional
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu",  // Use NPU backend
            device_id = null  // Optional device ID
        )
    )
    .build()
    .onSuccess { embedderWrapper = it }
    .onFailure { error ->
        println("Error: ${error.message}")
    }
// Generate embeddings for multiple texts
val texts = arrayOf(<your-text1>, <your-text2>, ...)
embedderWrapper.embed(texts, EmbeddingConfig()).onSuccess { embeddings ->
    val dimension = embeddings.size / texts.size
    println("Dimension: $dimension")
    println("First 5 values: ${embeddings.take(5)}")
}
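To put these vectors to work for semantic search, the flat buffer can be sliced per text and compared with cosine similarity. A hedged sketch, assuming embed returns a flat FloatArray laid out text by text (as the dimension arithmetic above implies); the cosine helper is illustrative, not part of the SDK:

// Illustrative helper: cosine similarity between two embedding vectors.
fun cosine(a: FloatArray, b: FloatArray): Double {
    var dot = 0.0; var na = 0.0; var nb = 0.0
    for (i in a.indices) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i] }
    return dot / (kotlin.math.sqrt(na) * kotlin.math.sqrt(nb))
}

embedderWrapper.embed(texts, EmbeddingConfig()).onSuccess { embeddings ->
    val dim = embeddings.size / texts.size
    val vectors = List(texts.size) { i -> embeddings.copyOfRange(i * dim, (i + 1) * dim) }
    // Treat the first text as the query and rank the rest by similarity.
    vectors.drop(1)
        .mapIndexed { i, v -> texts[i + 1] to cosine(vectors[0], v) }
        .sortedByDescending { it.second }
        .forEach { (text, score) -> println("$score  $text") }
}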
ASR Usage
Automatic Speech Recognition for audio transcription.
Basic Usage
// Load the ASR model for NPU inference
AsrWrapper.builder()
    .asrCreateInput(
        AsrCreateInput(
            model_name = "parakeet",  // Model name for NPU plugin
            model_path = <your-model-folder-path>,
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu"  // Use NPU backend
        )
    )
    .build()
    .onSuccess { asrWrapper = it }
    .onFailure { error ->
        println("Error: ${error.message}")
    }
// Transcribe an audio file
asrWrapper.transcribe(
    AsrTranscribeInput(
        audioPath = <your-audio-path>,  // Path to .wav, .mp3, etc.
        language = "en",  // Language code: "en", "zh", "es", etc.
        timestamps = null  // Optional timestamp format
    )
).onSuccess { result ->
    println("Transcription: ${result.result.transcript}")
}
Rerank Usage
Improve search relevance by reranking documents based on query relevance.
Basic Usage
// Load the reranker model for NPU inference
RerankerWrapper.builder()
    .rerankerCreateInput(
        RerankerCreateInput(
            model_name = "jina-rerank",  // Model name for NPU plugin
            model_path = <your-model-folder-path>,
            tokenizer_path = <your-tokenizer-path>,  // Optional
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu",  // Use NPU backend
            device_id = null  // Optional device ID
        )
    )
    .build()
    .onSuccess { rerankerWrapper = it }
    .onFailure { error ->
        println("Error: ${error.message}")
    }
// Rerank documents by relevance to the query
val query = "What is machine learning?"
val docs = arrayOf("ML is AI subset", "Weather forecast", "Deep learning tutorial")
rerankerWrapper.rerank(query, docs, RerankConfig()).onSuccess { result ->
    result.scores?.withIndex()?.sortedByDescending { it.value }?.forEach { (idx, score) ->
        println("Score: ${"%.4f".format(score)} - ${docs[idx]}")
    }
}
CV Usage
Computer Vision models for OCR, object detection, and image classification.
Basic Usage
// Load the PaddleOCR model for NPU inference
CvWrapper.builder()
    .createInput(
        CVCreateInput(
            model_name = "paddleocr",  // Model name
            config = CVModelConfig(
                capabilities = CVCapability.OCR,
                det_model_path = <your-det-model-folder-path>,
                rec_model_path = <your-rec-model-path>,
                char_dict_path = <your-char-dict-path>,
                qnn_model_folder_path = <your-qnn-model-folder-path>,  // For NPU
                qnn_lib_folder_path = <your-qnn-lib-folder-path>  // For NPU
            ),
            plugin_id = "npu"  // Use NPU backend
        )
    )
    .build()
    .onSuccess { cvWrapper = it }
    .onFailure { error ->
        println("Error: ${error.message}")
    }
// Perform OCR on an image
cvWrapper.infer(<your-image-path>).onSuccess { results ->
    results.forEach { result ->
        println("Text: ${result.text}, Confidence: ${result.confidence}")
    }
}
Need Help?
Join our community to get support, share your projects, and connect with other developers.
Discord Community: get real-time support and chat with the Nexa AI community.
Slack Community: collaborate with developers and access community resources.