
Model Name Mapping

For all NPU models, we use an internal name mapping; fill in the plugin ID and model name accordingly.
Model Name    | Plugin ID | Hugging Face repository name
omni-neural   | npu       | NexaAI/OmniNeural-4B-mobile
phi3.5        | npu       | NexaAI/phi3.5-mini-npu-mobile
phi4          | npu       | NexaAI/phi4-mini-npu-mobile
granite4      | npu       | NexaAI/Granite-4-Micro-NPU-mobile
embed-gemma   | npu       | NexaAI/embeddinggemma-300m-npu-mobile
qwen3-4b      | npu       | NexaAI/Qwen3-4B-Instruct-2507-npu-mobile
llama3-3b     | npu       | NexaAI/Llama3.2-3B-NPU-Turbo-NPU-mobile
liquid-v2     | npu       | NexaAI/LFM2-1.2B-npu-mobile
paddleocr     | npu       | NexaAI/paddleocr-npu-mobile

Two Ways to Run on NPU

You can run models on Qualcomm Hexagon NPU in two different ways:

1) NEXA models via “npu” plugin

  • Use the npu plugin
  • Pick a supported NEXA model from the table above and set model_name accordingly

2) GGUF models via GGML Hexagon backend

  • Load a GGUF model
  • Use the cpu_gpu plugin
  • Set device_id to HTP0
  • Set nGpuLayers > 0 in ModelConfig
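
To make the difference concrete, here is a minimal sketch of the two create-input configurations, using the same fields as the full examples later in this page (paths are placeholders):

// Option 1: NEXA model via the "npu" plugin
val npuInput = LlmCreateInput(
    model_name = "liquid-v2",                 // any model name from the mapping table above
    model_path = "<your-model-folder-path>",
    config = ModelConfig(max_tokens = 2048),
    plugin_id = "npu"
)

// Option 2: GGUF model via the GGML Hexagon backend
val ggufInput = LlmCreateInput(
    model_name = "",                          // GGUF: keep model_name empty
    model_path = "<your-gguf-model-path>",
    config = ModelConfig(nCtx = 4096, nGpuLayers = 999),  // nGpuLayers > 0 enables NPU offloading
    plugin_id = "cpu_gpu",
    device_id = "HTP0"                        // or "HTP0,HTP1,HTP2,HTP3" for larger models
)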

LLM Usage

Large Language Models for text generation and chat applications.

1) NEXA Models (“npu” plugin)

We support NPU inference for NEXA format models.
LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = "liquid-v2",  // Model name for NPU plugin
            model_path = <your-model-folder-path>,
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu"  // Use NPU backend
        )
    )
    .build()
    .onSuccess { llmWrapper = it }
    .onFailure { error -> println("Error: ${error.message}") }

val chatList = arrayListOf(ChatMessage("user", "What is AI?"))

llmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    llmWrapper.generateStreamFlow(template.formattedText, GenerationConfig()).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> println(result.text)
            is LlmStreamResult.Completed -> println("Done!")
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}
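
For multi-turn chat, collect the streamed tokens into a reply string, append it to chatList, and re-apply the chat template before the next generation. A minimal sketch, assuming an "assistant" role name for model turns (mirroring the "user" role used above):

// reply: full text accumulated from the LlmStreamResult.Token events above,
// e.g. built up with a StringBuilder inside the collect block
val reply = "..."

chatList.add(ChatMessage("assistant", reply))  // assumed role name for model turns
chatList.add(ChatMessage("user", "Can you give a concrete example?"))

llmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    llmWrapper.generateStreamFlow(template.formattedText, GenerationConfig()).collect { result ->
        if (result is LlmStreamResult.Token) print(result.text)
    }
}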

2) GGUF Models on Hexagon NPU (GGML Hexagon backend)

Run a GGUF model on Hexagon NPU by using the cpu_gpu plugin with device_id = "HTP0,HTP1,HTP2,HTP3" and setting nGpuLayers > 0.
LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = "", // GGUF: keep model_name empty
            model_path = "<your-gguf-model-path>", // e.g. /data/data/<com.your.app>/files/models/gpt-oss-GGUF/gpt-oss-Q4_0.gguf
            config = ModelConfig(
                nCtx = 4096,
                nGpuLayers = 999 // > 0 enables offloading; 999 attempts to offload all layers
            ),
            plugin_id = "cpu_gpu",
            device_id = "HTP0,HTP1,HTP2,HTP3"  // Use 4 logical NPU devices to run larger models like GPT-OSS 20B, this is currently only supported for LLM
        )
    )
    .build()
    .onSuccess { llmWrapper = it }
    .onFailure { error -> println("Error: ${error.message}") }

Multimodal Usage

Vision-Language Models for image understanding and multimodal applications.

1) NEXA Models (“npu” plugin)

We support NPU inference for NEXA format models.
VlmWrapper.builder()
    .vlmCreateInput(
        VlmCreateInput(
            model_name = "omni-neural",  // Model name for NPU plugin
            model_path = <your-model-folder-path>,
            config = ModelConfig(
                max_tokens = 2048,
                enable_thinking = false
            ),
            plugin_id = "npu"  // Use NPU backend
        )
    )
    .build()
    .onSuccess { vlmWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Use the loaded VLM with image and text
val contents = listOf(
    VlmContent("image", <your-image-path>),
    VlmContent("text", <your-text>)
)

val chatList = arrayListOf(VlmChatMessage("user", contents))

vlmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    val config = vlmWrapper.injectMediaPathsToConfig(chatList.toTypedArray(), GenerationConfig())
    vlmWrapper.generateStreamFlow(template.formattedText, config).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> println(result.text)
            is LlmStreamResult.Completed -> println("Done!")
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}

2) GGUF Models on Hexagon NPU (GGML Hexagon backend)

Run a GGUF VLM on Hexagon NPU by using the cpu_gpu plugin with device_id = "HTP0" and setting nGpuLayers > 0.
VlmWrapper.builder()
    .vlmCreateInput(
        VlmCreateInput(
            model_name = "",  // GGUF: keep model_name empty
            model_path = <your-gguf-model-path>,
            mmproj_path = <your-mmproj-path>,  // vision projection weights
            config = ModelConfig(
                nCtx = 4096,
                nGpuLayers = 999 // > 0 enables offloading; 999 attempts to offload all layers
            ),
            plugin_id = "cpu_gpu",
            device_id = "HTP0"
        )
    )
    .build()
    .onSuccess { vlmWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Use the loaded VLM with image and text
val contents = listOf(
    VlmContent("image", <your-image-path>),
    VlmContent("text", "<your-text>")
)

val chatList = arrayListOf(VlmChatMessage("user", contents))

vlmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    val baseConfig = GenerationConfig(maxTokens = 2048)
    val configWithMedia = vlmWrapper.injectMediaPathsToConfig(
        chatList.toTypedArray(),
        baseConfig
    )
    vlmWrapper.generateStreamFlow(template.formattedText, configWithMedia).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> println(result.text)
            is LlmStreamResult.Completed -> println("Done!")
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}

Embeddings Usage

Generate vector embeddings for semantic search and RAG applications.

Basic Usage

// Load embedder for NPU inference
EmbedderWrapper.builder()
    .embedderCreateInput(
        EmbedderCreateInput(
            model_name = "embed-gemma",  // Model name for NPU plugin
            model_path = <your-model-folder-path>,
            tokenizer_path = <your-tokenizer-path>,  // Optional
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu",  // Use NPU backend
            device_id = null    // Optional device ID
        )
    )
    .build()
    .onSuccess { embedderWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Generate embeddings for multiple texts
val texts = arrayOf(<your-text1>, <your-text2>, ...)

embedderWrapper.embed(texts, EmbeddingConfig()).onSuccess { embeddings ->
    val dimension = embeddings.size / texts.size
    println("Dimension: $dimension")
    println("First 5 values: ${embeddings.take(5)}")
}
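
For semantic search, you can compare the returned vectors with cosine similarity. A minimal sketch, assuming embed() yields a flat FloatArray of texts.size * dimension values laid out row by row (as the dimension calculation above implies):

import kotlin.math.sqrt

// Cosine similarity between two embedding vectors
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var normA = 0f; var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}

// Slice the vector for one text out of the flat embeddings array
fun vectorAt(embeddings: FloatArray, index: Int, dimension: Int): FloatArray =
    embeddings.copyOfRange(index * dimension, (index + 1) * dimension)

// Rank the remaining texts against the first one, treated here as the query
// (embeddings, texts, and dimension come from the embed() example above)
val queryVec = vectorAt(embeddings, 0, dimension)
(1 until texts.size)
    .map { it to cosineSimilarity(queryVec, vectorAt(embeddings, it, dimension)) }
    .sortedByDescending { it.second }
    .forEach { (i, score) -> println("%.4f  %s".format(score, texts[i])) }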

ASR Usage

Automatic Speech Recognition for audio transcription.

Basic Usage

// Load ASR model for NPU inference
AsrWrapper.builder()
    .asrCreateInput(
        AsrCreateInput(
            model_name = "parakeet",  // Model name for NPU plugin
            model_path = <your-model-folder-path>,
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu"  // Use NPU backend
        )
    )
    .build()
    .onSuccess { asrWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Transcribe audio file
asrWrapper.transcribe(
    AsrTranscribeInput(
        audioPath = <your-audio-path>,  // Path to .wav, .mp3, etc.
        language = "en",                // Language code: "en", "zh", "es", etc.
        timestamps = null               // Optional timestamp format
    )
).onSuccess { result ->
    println("Transcription: ${result.result.transcript}")
}

Rerank Usage

Improve search relevance by reranking documents based on query relevance.

Basic Usage

// Load reranker model for NPU inference
RerankerWrapper.builder()
    .rerankerCreateInput(
        RerankerCreateInput(
            model_name = "jina-rerank",  // Model name for NPU plugin
            model_path = <your-model-folder-path>,
            tokenizer_path = <your-tokenizer-path>,  // Optional
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu",  // Use NPU backend
            device_id = null    // Optional device ID
        )
    )
    .build()
    .onSuccess { rerankerWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Rerank documents based on query relevance
val query = "What is machine learning?"
val docs = arrayOf("ML is AI subset", "Weather forecast", "Deep learning tutorial")

rerankerWrapper.rerank(query, docs, RerankConfig()).onSuccess { result ->
    result.scores?.withIndex()?.sortedByDescending { it.value }?.forEach { (idx, score) ->
        println("Score: ${"%.4f".format(score)} - ${docs[idx]}")
    }
}

CV Usage

Computer Vision models for OCR, object detection, and image classification.

Basic Usage

// Load PaddleOCR model for NPU inference
CvWrapper.builder()
    .createInput(
        CVCreateInput(
            model_name = "paddleocr",  // Model name
            config = CVModelConfig(
                capabilities = CVCapability.OCR,
                det_model_path = <your-det-model-folder-path>,
                rec_model_path = <your-rec-model-path>,
                char_dict_path = <your-char-dict-path>,
                qnn_model_folder_path = <your-qnn-model-folder-path>,  // For NPU
                qnn_lib_folder_path = <your-qnn-lib-folder-path>       // For NPU
            ),
            plugin_id = "npu"  // Use NPU backend
        )
    )
    .build()
    .onSuccess { cvWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Perform OCR on image
cvWrapper.infer(<your-image-path>).onSuccess { results ->
    results.forEach { result ->
        println("Text: ${result.text}, Confidence: ${result.confidence}")
    }
}

Need Help?

Join our community to get support, share your projects, and connect with other developers.