
Model Name Mapping

For all NPU models, we use an internal name mapping; fill in the plugin ID and model name accordingly.
Model Name    | Plugin ID | Hugging Face repository name
omni-neural   | npu       | NexaAI/OmniNeural-4B-mobile
phi3.5        | npu       | NexaAI/phi3.5-mini-npu-mobile
phi4          | npu       | NexaAI/phi4-mini-npu-mobile
granite4      | npu       | NexaAI/Granite-4-Micro-NPU-mobile
embed-gemma   | npu       | NexaAI/embeddinggemma-300m-npu-mobile
qwen3-4b      | npu       | NexaAI/Qwen3-4B-Instruct-2507-npu-mobile
llama3-3b     | npu       | NexaAI/Llama3.2-3B-NPU-Turbo-NPU-mobile
liquid-v2     | npu       | NexaAI/LFM2-1.2B-npu-mobile
paddleocr     | npu       | NexaAI/paddleocr-npu-mobile

Two Ways to Run on NPU

You can run models on Qualcomm Hexagon NPU in two different ways:

1) NEXA models via “npu” plugin

  • Use the npu plugin
  • Pick a supported NEXA model from the table above and set model_name accordingly

2) GGUF models via GGML Hexagon backend

  • Load a GGUF model
  • Use the cpu_gpu plugin
  • Set device_id to HTP0
  • Set nGpuLayers > 0 in ModelConfig
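
To make the difference concrete, here is a minimal sketch of the two create-input configurations, using the same fields as the full examples later in this page (paths are placeholders):

// Option 1: NEXA model via the "npu" plugin
val npuInput = LlmCreateInput(
    model_name = "liquid-v2",                 // any model name from the mapping table above
    model_path = "<your-model-folder-path>",
    config = ModelConfig(max_tokens = 2048),
    plugin_id = "npu"
)

// Option 2: GGUF model via the GGML Hexagon backend
val ggufInput = LlmCreateInput(
    model_name = "",                          // GGUF: keep model_name empty
    model_path = "<your-gguf-model-path>",
    config = ModelConfig(nCtx = 4096, nGpuLayers = 999),  // nGpuLayers > 0 enables NPU offloading
    plugin_id = "cpu_gpu",
    device_id = "HTP0"                        // or "HTP0,HTP1,HTP2,HTP3" for larger models
)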

LLM Usage

Large Language Models for text generation and chat applications.

1) NEXA Models (“npu” plugin)

We support NPU inference for NEXA format models.
LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = "liquid-v2",  // Model name for NPU plugin
            model_path = <your-model-folder-path>,
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu"  // Use NPU backend
        )
    )
    .build()
    .onSuccess { llmWrapper = it }
    .onFailure { error -> println("Error: ${error.message}") }

val chatList = arrayListOf(ChatMessage("user", "What is AI?"))

llmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    llmWrapper.generateStreamFlow(template.formattedText, GenerationConfig()).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> println(result.text)
            is LlmStreamResult.Completed -> println("Done!")
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}
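
For multi-turn chat, collect the streamed tokens into a reply string, append it to chatList, and re-apply the chat template before the next generation. A minimal sketch, assuming an "assistant" role name for model turns (mirroring the "user" role used above):

// reply: full text accumulated from the LlmStreamResult.Token events above,
// e.g. built up with a StringBuilder inside the collect block
val reply = "..."

chatList.add(ChatMessage("assistant", reply))  // assumed role name for model turns
chatList.add(ChatMessage("user", "Can you give a concrete example?"))

llmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    llmWrapper.generateStreamFlow(template.formattedText, GenerationConfig()).collect { result ->
        if (result is LlmStreamResult.Token) print(result.text)
    }
}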

2) GGUF Models on Hexagon NPU (GGML Hexagon backend)

Run a GGUF model on Hexagon NPU by using the cpu_gpu plugin with device_id = "HTP0,HTP1,HTP2,HTP3" and setting nGpuLayers > 0.
LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = "", // GGUF: keep model_name empty
            model_path = "<your-gguf-model-path>", // e.g. /data/data/<com.your.app>/files/models/gpt-oss-GGUF/gpt-oss-Q4_0.gguf
            config = ModelConfig(
                nCtx = 4096,
                nGpuLayers = 999 // > 0 enables offloading; 999 attempts to offload all layers
            ),
            plugin_id = "cpu_gpu",
            device_id = "HTP0,HTP1,HTP2,HTP3"  // Use 4 logical NPU devices to run larger models like GPT-OSS 20B, this is currently only supported for LLM
        )
    )
    .build()
    .onSuccess { llmWrapper = it }
    .onFailure { error -> println("Error: ${error.message}") }

Multimodal Usage

Vision-Language Models for image understanding and multimodal applications.

1) NEXA Models (“npu” plugin)

We support NPU inference for NEXA format models.
VlmWrapper.builder()
    .vlmCreateInput(
        VlmCreateInput(
            model_name = "omni-neural",  // Model name for NPU plugin
            model_path = <your-model-folder-path>,
            config = ModelConfig(
                max_tokens = 2048,
                enable_thinking = false
            ),
            plugin_id = "npu"  // Use NPU backend
        )
    )
    .build()
    .onSuccess { vlmWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Use the loaded VLM with image and text
val contents = listOf(
    VlmContent("image", <your-image-path>),
    VlmContent("text", <your-text>)
)

val chatList = arrayListOf(VlmChatMessage("user", contents))

vlmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    val config = vlmWrapper.injectMediaPathsToConfig(chatList.toTypedArray(), GenerationConfig())
    vlmWrapper.generateStreamFlow(template.formattedText, config).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> println(result.text)
            is LlmStreamResult.Completed -> println("Done!")
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}

2) GGUF Models on Hexagon NPU (GGML Hexagon backend)

Run a GGUF VLM on Hexagon NPU by using the cpu_gpu plugin with device_id = "HTP0" and setting nGpuLayers > 0.
VlmWrapper.builder()
    .vlmCreateInput(
        VlmCreateInput(
            model_name = "",  // GGUF: keep model_name empty
            model_path = <your-gguf-model-path>,
            mmproj_path = <your-mmproj-path>,  // vision projection weights
            config = ModelConfig(
                nCtx = 4096,
                nGpuLayers = 999 // > 0 enables offloading; 999 attempts to offload all layers
            ),
            plugin_id = "cpu_gpu",
            device_id = "HTP0"
        )
    )
    .build()
    .onSuccess { vlmWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Use the loaded VLM with image and text
val contents = listOf(
    VlmContent("image", <your-image-path>),
    VlmContent("text", "<your-text>")
)

val chatList = arrayListOf(VlmChatMessage("user", contents))

vlmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    val baseConfig = GenerationConfig(maxTokens = 2048)
    val configWithMedia = vlmWrapper.injectMediaPathsToConfig(
        chatList.toTypedArray(),
        baseConfig
    )
    vlmWrapper.generateStreamFlow(template.formattedText, configWithMedia).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> println(result.text)
            is LlmStreamResult.Completed -> println("Done!")
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}

Embeddings Usage

Generate vector embeddings for semantic search and RAG applications.

Basic Usage

// Load embedder for NPU inference
EmbedderWrapper.builder()
    .embedderCreateInput(
        EmbedderCreateInput(
            model_name = "embed-gemma",  // Model name for NPU plugin
            model_path = <your-model-folder-path>,
            tokenizer_path = <your-tokenizer-path>,  // Optional
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu",  // Use NPU backend
            device_id = null    // Optional device ID
        )
    )
    .build()
    .onSuccess { embedderWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Generate embeddings for multiple texts
val texts = arrayOf(<your-text1>, <your-text2>, ...)

embedderWrapper.embed(texts, EmbeddingConfig()).onSuccess { embeddings ->
    val dimension = embeddings.size / texts.size
    println("Dimension: $dimension")
    println("First 5 values: ${embeddings.take(5)}")
}
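
For semantic search, you can compare the returned vectors with cosine similarity. A minimal sketch, assuming embed() yields a flat FloatArray of texts.size * dimension values laid out row by row (as the dimension calculation above implies):

import kotlin.math.sqrt

// Cosine similarity between two embedding vectors
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var normA = 0f; var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}

// Slice the vector for one text out of the flat embeddings array
fun vectorAt(embeddings: FloatArray, index: Int, dimension: Int): FloatArray =
    embeddings.copyOfRange(index * dimension, (index + 1) * dimension)

// Rank the remaining texts against the first one, treated here as the query
// (embeddings, texts, and dimension come from the embed() example above)
val queryVec = vectorAt(embeddings, 0, dimension)
(1 until texts.size)
    .map { it to cosineSimilarity(queryVec, vectorAt(embeddings, it, dimension)) }
    .sortedByDescending { it.second }
    .forEach { (i, score) -> println("%.4f  %s".format(score, texts[i])) }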

ASR Usage

Automatic Speech Recognition for audio transcription.

Basic Usage

// Load ASR model for NPU inference
AsrWrapper.builder()
    .asrCreateInput(
        AsrCreateInput(
            model_name = "parakeet",  // Model name for NPU plugin
            model_path = <your-model-folder-path>,
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu"  // Use NPU backend
        )
    )
    .build()
    .onSuccess { asrWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Transcribe audio file
asrWrapper.transcribe(
    AsrTranscribeInput(
        audioPath = <your-audio-path>,  // Path to .wav, .mp3, etc.
        language = "en",                // Language code: "en", "zh", "es", etc.
        timestamps = null               // Optional timestamp format
    )
).onSuccess { result ->
    println("Transcription: ${result.result.transcript}")
}

Rerank Usage

Improve search relevance by reranking documents based on query relevance.

Basic Usage

// Load reranker model for NPU inference
RerankerWrapper.builder()
    .rerankerCreateInput(
        RerankerCreateInput(
            model_name = "jina-rerank",  // Model name for NPU plugin
            model_path = <your-model-folder-path>,
            tokenizer_path = <your-tokenizer-path>,  // Optional
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu",  // Use NPU backend
            device_id = null    // Optional device ID
        )
    )
    .build()
    .onSuccess { rerankerWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Rerank documents based on query relevance
val query = "What is machine learning?"
val docs = arrayOf("ML is AI subset", "Weather forecast", "Deep learning tutorial")

rerankerWrapper.rerank(query, docs, RerankConfig()).onSuccess { result ->
    result.scores?.withIndex()?.sortedByDescending { it.value }?.forEach { (idx, score) ->
        println("Score: ${"%.4f".format(score)} - ${docs[idx]}")
    }
}

CV Usage

Computer Vision models for OCR, object detection, and image classification.

Basic Usage

// Load PaddleOCR model for NPU inference
CvWrapper.builder()
    .createInput(
        CVCreateInput(
            model_name = "paddleocr",  // Model name
            config = CVModelConfig(
                capabilities = CVCapability.OCR,
                det_model_path = <your-det-model-folder-path>,
                rec_model_path = <your-rec-model-path>,
                char_dict_path = <your-char-dict-path>,
                qnn_model_folder_path = <your-qnn-model-folder-path>,  // For NPU
                qnn_lib_folder_path = <your-qnn-lib-folder-path>       // For NPU
            ),
            plugin_id = "npu"  // Use NPU backend
        )
    )
    .build()
    .onSuccess { cvWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Perform OCR on image
cvWrapper.infer(<your-image-path>).onSuccess { results ->
    results.forEach { result ->
        println("Text: ${result.text}, Confidence: ${result.confidence}")
    }
}

Need Help?

Join our community to get support, share your projects, and connect with other developers.