NPU

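Model Name Mapping

All NPU models are addressed through an internal name mapping, and each one requires the matching plugin ID.

Model name	Plugin ID	HuggingFace repository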
omni-neural	npu	NexaAI/OmniNeural-4B-mobile
phi3.5	npu	NexaAI/phi3.5-mini-npu-mobile
phi4	npu	NexaAI/phi4-mini-npu-mobile
granite4	npu	NexaAI/Granite-4-Micro-NPU-mobile
embed-gemma	npu	NexaAI/embeddinggemma-300m-npu-mobile
qwen3-4b	npu	NexaAI/Qwen3-4B-Instruct-2507-npu-mobile
llama3-3b	npu	NexaAI/Llama3.2-3B-NPU-Turbo-NPU-mobile
liquid-v2	npu	NexaAI/LFM2-1.2B-npu-mobile
paddleocr	npu	NexaAI/paddleocr-npu-mobile
parakeet	npu	NexaAI/parakeet-tdt-0.6b-v3-npu-mobile
yolo26x	npu	NexaAI/yolo26x-npu-mobile
yolo26l	npu	NexaAI/yolo26l-npu-mobile
yolo26m	npu	NexaAI/yolo26m-npu-mobile
yolo26s	npu	NexaAI/yolo26s-npu-mobile
yolo26n	npu	NexaAI/yolo26n-npu-mobile
depth-anything-v2	npu	NexaAI/depth-anything-v2-npu-mobile
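
The table can be carried into application code as a simple lookup. The following is a minimal sketch, not part of the Nexa SDK: the NpuModel class, the npuModels map, and the findNpuModel function are hypothetical names introduced here, and only a few rows of the table are copied in.

```kotlin
// Hypothetical helper type for illustration; not an SDK class.
data class NpuModel(val modelName: String, val pluginId: String, val repo: String)

// A few rows copied from the mapping table above.
val npuModels: Map<String, NpuModel> = listOf(
    NpuModel("omni-neural", "npu", "NexaAI/OmniNeural-4B-mobile"),
    NpuModel("liquid-v2", "npu", "NexaAI/LFM2-1.2B-npu-mobile"),
    NpuModel("embed-gemma", "npu", "NexaAI/embeddinggemma-300m-npu-mobile"),
    NpuModel("parakeet", "npu", "NexaAI/parakeet-tdt-0.6b-v3-npu-mobile"),
    NpuModel("paddleocr", "npu", "NexaAI/paddleocr-npu-mobile"),
).associateBy { it.modelName }

// Returns the table entry for a model name, or null if the name is not in the map.
fun findNpuModel(modelName: String): NpuModel? = npuModels[modelName]
```

A null result is a useful early signal that a model_name was mistyped before any model loading is attempted.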

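Besides the NEXA-optimized models above, any community GGUF model can also run on the Qualcomm Hexagon NPU. Use the cpu_gpu plugin, set device_id = "dev0", and set nGpuLayers > 0; inference is then driven by the GGML Hexagon backend.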

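Two Ways to Run on NPU

You can run models on the Qualcomm Hexagon NPU in two different ways:

1) NEXA models (via the "npu" plugin)

Use the npu plugin
Pick a supported NEXA model from the table above and set model_name accordingly

2) GGUF models (via the GGML Hexagon backend)

Load a GGUF model
Use the cpu_gpu plugin
Set device_id to dev0
Set nGpuLayers > 0 in ModelConfig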
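
The two routes differ only in a handful of settings, which can be summarized in code. This is a rough sketch under stated assumptions: NpuSettings, nexaSettings, and ggufSettings are hypothetical names used for illustration, and the SDK's own input classes appear in the usage sections that follow.

```kotlin
// Hypothetical value type for illustration; not an SDK class.
data class NpuSettings(
    val modelName: String,
    val pluginId: String,
    val deviceId: String?,
    val nGpuLayers: Int?,
)

// Route 1: a NEXA model from the mapping table, run through the "npu" plugin.
fun nexaSettings(modelName: String): NpuSettings =
    NpuSettings(modelName = modelName, pluginId = "npu", deviceId = null, nGpuLayers = null)

// Route 2: a community GGUF model, run through the GGML Hexagon backend.
fun ggufSettings(nGpuLayers: Int = 999): NpuSettings {
    require(nGpuLayers > 0) { "nGpuLayers must be > 0 to offload layers to the NPU" }
    return NpuSettings(modelName = "", pluginId = "cpu_gpu", deviceId = "dev0", nGpuLayers = nGpuLayers)
}
```

Keeping both routes behind small factory functions like these makes it harder to mix settings, for example pairing the npu plugin with a GGUF file.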

LLM Usage

Large language models for text generation and chat applications.

1) NEXA models ("npu" plugin)

NPU inference for models in NEXA format.

LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = "liquid-v2",  // model name from the mapping table
            model_path = "<your-model-folder-path>",
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu"  // use the NPU backend
        )
    )
    .build()
    .onSuccess { llmWrapper = it }

val chatList = arrayListOf(ChatMessage("user", "What is AI?"))

llmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    llmWrapper.generateStreamFlow(template.formattedText, GenerationConfig()).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> println(result.text)
            is LlmStreamResult.Completed -> println("Done!")
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}
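
The streaming loop above prints each token as it arrives. Chat applications often need the complete reply as one string as well. The sketch below shows one way to accumulate it. To stay runnable without the SDK, it uses a stand-in StreamEvent type that mirrors the three LlmStreamResult cases, and a plain Iterable in place of the flow; StreamEvent and collectReply are hypothetical names, not SDK APIs.

```kotlin
// Stand-in for the SDK's LlmStreamResult so the sketch compiles on its own.
sealed interface StreamEvent {
    data class Token(val text: String) : StreamEvent
    object Completed : StreamEvent
    data class Error(val throwable: Throwable) : StreamEvent
}

// Concatenates tokens until Completed; returns a failure on the first Error.
fun collectReply(events: Iterable<StreamEvent>): Result<String> {
    val reply = StringBuilder()
    for (event in events) {
        when (event) {
            is StreamEvent.Token -> reply.append(event.text)
            is StreamEvent.Completed -> return Result.success(reply.toString())
            is StreamEvent.Error -> return Result.failure(event.throwable)
        }
    }
    // The stream ended without an explicit Completed event.
    return Result.success(reply.toString())
}
```

With the real flow, the same when branches would sit inside the collect block, appending to a StringBuilder declared outside it.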

2) Running GGUF models on the Hexagon NPU (GGML Hexagon backend)

Run a GGUF model on the Hexagon NPU by using the cpu_gpu plugin, setting device_id = "dev0", and setting nGpuLayers > 0.

LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = "", // GGUF: leave model_name empty
            model_path = "<your-gguf-model-path>", // e.g. /data/data/<com.your.app>/files/models/gpt-oss-GGUF/gpt-oss-Q4_0.gguf
            config = ModelConfig(
                nCtx = 4096,
                nGpuLayers = 999 // > 0 enables offloading; 999 attempts to offload all layers
            ),
            plugin_id = "cpu_gpu",
            device_id = "dev0"  // run the model on the NPU device (e.g. GPT-OSS 20B)
        )
    )
    .build()
    .onSuccess { llmWrapper = it }
    .onFailure { error -> println("Error: ${error.message}") }
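
Because this route accepts any community GGUF file, it can help to confirm that model_path really points at a GGUF file before building the wrapper. GGUF files begin with the four ASCII bytes "GGUF". The following is a minimal sketch of such a pre-check; looksLikeGguf is a hypothetical helper, not an SDK function, and it inspects only the magic bytes, not the rest of the header.

```kotlin
import java.io.File

// Returns true when the file starts with the GGUF magic bytes "GGUF".
fun looksLikeGguf(file: File): Boolean {
    if (!file.isFile || file.length() < 4) return false
    val header = ByteArray(4)
    val read = file.inputStream().use { it.read(header) }
    return read == 4 && header.contentEquals("GGUF".toByteArray(Charsets.US_ASCII))
}
```

A failed check gives a clearer error message than a load failure reported later by the backend.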

Multimodal Usage

Vision-language models for image understanding and multimodal applications.

1) NEXA models ("npu" plugin)

NPU inference for models in NEXA format.

VlmWrapper.builder()
    .vlmCreateInput(
        VlmCreateInput(
            model_name = "omni-neural",  // model name for the NPU plugin
            model_path = "<your-model-folder-path>",
            config = ModelConfig(
                max_tokens = 2048,
                enable_thinking = false
            ),
            plugin_id = "npu"  // use the NPU backend
        )
    )
    .build()
    .onSuccess { vlmWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Run image + text inference with the loaded VLM
val contents = listOf(
    VlmContent("image", "<your-image-path>"),
    VlmContent("text", "<your-text>")
)

val chatList = arrayListOf(VlmChatMessage("user", contents))

vlmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    val config = vlmWrapper.injectMediaPathsToConfig(chatList.toTypedArray(), GenerationConfig())
    vlmWrapper.generateStreamFlow(template.formattedText, config).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> println(result.text)
            is LlmStreamResult.Completed -> println("Done!")
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}

2) Running GGUF models on the Hexagon NPU (GGML Hexagon backend)

Run a GGUF VLM on the Hexagon NPU by using the cpu_gpu plugin, setting device_id = "dev0", and setting nGpuLayers > 0.

VlmWrapper.builder()
    .vlmCreateInput(
        VlmCreateInput(
            model_name = "",  // GGUF: leave model_name empty
            model_path = "<your-gguf-model-path>",
            mmproj_path = "<your-mmproj-path>",  // vision projector weights
            config = ModelConfig(
                nCtx = 4096,
                nGpuLayers = 999 // > 0 enables offloading; 999 attempts to offload all layers
            ),
            plugin_id = "cpu_gpu",
            device_id = "dev0"
        )
    )
    .build()
    .onSuccess { vlmWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Run image + text inference with the loaded VLM
val contents = listOf(
    VlmContent("image", "<your-image-path>"),
    VlmContent("text", "<your-text>")
)

val chatList = arrayListOf(VlmChatMessage("user", contents))

vlmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    val baseConfig = GenerationConfig(maxTokens = 2048)
    val configWithMedia = vlmWrapper.injectMediaPathsToConfig(
        chatList.toTypedArray(),
        baseConfig
    )
    vlmWrapper.generateStreamFlow(template.formattedText, configWithMedia).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> println(result.text)
            is LlmStreamResult.Completed -> println("Done!")
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}

Embedding Usage

Vector embeddings for semantic search and RAG applications.

Basic Usage

// Load an embedding model for NPU inference
EmbedderWrapper.builder()
    .embedderCreateInput(
        EmbedderCreateInput(
            model_name = "embed-gemma",  // model name for the NPU plugin
            model_path = "<your-model-folder-path>",
            tokenizer_path = "<your-tokenizer-path>",  // optional
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu",  // use the NPU backend
            device_id = null    // optional device ID
        )
    )
    .build()
    .onSuccess { embedderWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Generate embeddings for multiple texts
val texts = arrayOf("<your-text1>", "<your-text2>")  // add more texts as needed

embedderWrapper.embed(texts, EmbeddingConfig()).onSuccess { embeddings ->
    val dimension = embeddings.size / texts.size
    println("Dimension: $dimension")
    println("First 5 values: ${embeddings.take(5)}")
}
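
The example above derives the dimension as embeddings.size / texts.size, which suggests that the result is one flat buffer holding all vectors back to back. Assuming that layout and a FloatArray element type (neither is confirmed here), the sketch below splits the buffer into one vector per text and compares two vectors by cosine similarity, the usual building block for semantic search. splitEmbeddings and cosineSimilarity are hypothetical helpers, not SDK functions.

```kotlin
import kotlin.math.sqrt

// Splits a flat buffer of count * dimension floats into one vector per input text.
fun splitEmbeddings(flat: FloatArray, count: Int): List<FloatArray> {
    require(count > 0 && flat.size % count == 0) { "flat size must be a multiple of count" }
    val dimension = flat.size / count
    return List(count) { i -> flat.copyOfRange(i * dimension, (i + 1) * dimension) }
}

// Cosine similarity of two vectors; returns 0 when either vector has zero norm.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "vectors must have the same dimension" }
    var dot = 0f
    var normA = 0f
    var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    val denominator = sqrt(normA) * sqrt(normB)
    return if (denominator == 0f) 0f else dot / denominator
}
```

For a search feature, the query vector would be compared against each stored document vector and the documents ranked by the resulting score.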

ASR Usage

Automatic speech recognition for audio transcription.

Basic Usage

// Load an ASR model for NPU inference
AsrWrapper.builder()
    .asrCreateInput(
        AsrCreateInput(
            model_name = "parakeet",  // model name for the NPU plugin
            model_path = "<your-model-folder-path>",
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu"  // use the NPU backend
        )
    )
    .build()
    .onSuccess { asrWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Transcribe an audio file
asrWrapper.transcribe(
    AsrTranscribeInput(
        audioPath = "<your-audio-path>",  // path to the audio file (.wav, .mp3, etc.)
        language = "en",                  // language code: "en", "zh", "es", etc.
        timestamps = null                 // optional timestamp format
    )
).onSuccess { result ->
    println("Transcription: ${result.result.transcript}")
}

Reranking Usage

Rerank documents by their relevance to a query to improve retrieval quality.

Basic Usage

// Load a reranking model for NPU inference
RerankerWrapper.builder()
    .rerankerCreateInput(
        RerankerCreateInput(
            model_name = "jina-rerank",  // model name for the NPU plugin
            model_path = "<your-model-folder-path>",
            tokenizer_path = "<your-tokenizer-path>",  // optional
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu",  // use the NPU backend
            device_id = null    // optional device ID
        )
    )
    .build()
    .onSuccess { rerankerWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Rerank documents by relevance to the query
val query = "What is machine learning?"
val docs = arrayOf("ML is AI subset", "Weather forecast", "Deep learning tutorial")

rerankerWrapper.rerank(query, docs, RerankConfig()).onSuccess { result ->
    result.scores?.withIndex()?.sortedByDescending { it.value }?.forEach { (idx, score) ->
        println("Score: ${"%.4f".format(score)} - ${docs[idx]}")
    }
}
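
In a RAG pipeline, usually only the best few documents are passed on to the LLM. The sketch below pairs each document with its score and keeps the k highest. It assumes the scores arrive as a FloatArray in the same order as the documents, which the example above suggests but does not confirm; topK is a hypothetical helper, not an SDK function.

```kotlin
// Pairs each document with its score and keeps the k best, highest score first.
fun topK(docs: Array<String>, scores: FloatArray, k: Int): List<Pair<String, Float>> {
    require(docs.size == scores.size) { "need exactly one score per document" }
    require(k >= 0) { "k must not be negative" }
    return docs.indices
        .sortedByDescending { scores[it] }
        .take(k)
        .map { docs[it] to scores[it] }
}
```

If k exceeds the number of documents, all documents are returned in ranked order.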

CV Usage

Computer vision models for OCR, object detection, and image classification.

Basic Usage

// Load the PaddleOCR model for NPU inference
CvWrapper.builder()
    .createInput(
        CVCreateInput(
            model_name = "paddleocr",  // model name
            config = CVModelConfig(
                capabilities = CVCapability.OCR,
                det_model_path = "<your-det-model-folder-path>",
                rec_model_path = "<your-rec-model-path>",
                char_dict_path = "<your-char-dict-path>",
                qnn_model_folder_path = "<your-qnn-model-folder-path>",  // required for NPU
                qnn_lib_folder_path = "<your-qnn-lib-folder-path>"       // required for NPU
            ),
            plugin_id = "npu"  // use the NPU backend
        )
    )
    .build()
    .onSuccess { cvWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Run OCR on an image
cvWrapper.infer("<your-image-path>").onSuccess { results ->
    results.forEach { result ->
        println("Text: ${result.text}, Confidence: ${result.confidence}")
    }
}
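
OCR output often contains low-confidence fragments that are better dropped before the text is shown or indexed. The sketch below filters by confidence and joins the remaining lines. OcrLine is a stand-in type that mirrors the text and confidence fields printed above, joinConfidentText is a hypothetical helper, and the 0.5 default threshold is an arbitrary illustration rather than a recommended value.

```kotlin
// Stand-in for one OCR result; mirrors the text and confidence fields used above.
data class OcrLine(val text: String, val confidence: Float)

// Keeps lines at or above minConfidence and joins them with newlines.
fun joinConfidentText(lines: List<OcrLine>, minConfidence: Float = 0.5f): String =
    lines.filter { it.confidence >= minConfidence }.joinToString("\n") { it.text }
```

The right threshold depends on the images and on how costly a misread is, so it is worth tuning on sample data.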

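Need Help?

Join our community to get support, share your projects, and connect with other developers.

Discord community

Get real-time support and chat with the Nexa AI community

Slack community

Collaborate with developers and access community resources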
