

Model Name Mapping

All NPU models use an internal name mapping; fill in model_name and plugin_id accordingly.
Model Name          Plugin ID   Huggingface repository name
omni-neural         npu         NexaAI/OmniNeural-4B-mobile
phi3.5              npu         NexaAI/phi3.5-mini-npu-mobile
phi4                npu         NexaAI/phi4-mini-npu-mobile
granite4            npu         NexaAI/Granite-4-Micro-NPU-mobile
embed-gemma         npu         NexaAI/embeddinggemma-300m-npu-mobile
qwen3-4b            npu         NexaAI/Qwen3-4B-Instruct-2507-npu-mobile
llama3-3b           npu         NexaAI/Llama3.2-3B-NPU-Turbo-NPU-mobile
liquid-v2           npu         NexaAI/LFM2-1.2B-npu-mobile
paddleocr           npu         NexaAI/paddleocr-npu-mobile
parakeet            npu         NexaAI/parakeet-tdt-0.6b-v3-npu-mobile
yolo26x             npu         NexaAI/yolo26x-npu-mobile
yolo26l             npu         NexaAI/yolo26l-npu-mobile
yolo26m             npu         NexaAI/yolo26m-npu-mobile
yolo26s             npu         NexaAI/yolo26s-npu-mobile
yolo26n             npu         NexaAI/yolo26n-npu-mobile
depth-anything-v2   npu         NexaAI/depth-anything-v2-npu-mobile
Beyond the NEXA-optimized models listed above, any community GGUF model can also run on the Qualcomm Hexagon NPU: use the cpu_gpu plugin and set device_id = "dev0". This path is powered by the GGML Hexagon backend.

Two Ways to Run on NPU

You can run models on Qualcomm Hexagon NPU in two different ways:

1) NEXA models via “npu” plugin

  • Use the npu plugin
  • Pick a supported NEXA model from the table above and set model_name accordingly

2) GGUF models via GGML Hexagon backend

  • Load a GGUF model
  • Use the cpu_gpu plugin
  • Set device_id to dev0
  • Set nGpuLayers > 0 in ModelConfig

LLM Usage

Large Language Models for text generation and chat applications.

1) NEXA Models (“npu” plugin)

We support NPU inference for NEXA-format models.
LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = "liquid-v2",
            model_path = <your-model-folder-path>,
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu"  // Use NPU backend
        )
    )
    .build()
    .onSuccess { llmWrapper = it }
    .onFailure { error -> println("Error: ${error.message}") }

val chatList = arrayListOf(ChatMessage("user", "What is AI?"))

llmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    llmWrapper.generateStreamFlow(template.formattedText, GenerationConfig()).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> println(result.text)
            is LlmStreamResult.Completed -> println("Done!")
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}
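
For multi-turn chat, a common pattern is to accumulate the streamed tokens, append them to the history as an assistant message, and re-apply the chat template for the next turn. A minimal sketch using only the calls shown above (the follow-up question and the history handling are illustrative, not part of the SDK):

val history = arrayListOf(ChatMessage("user", "What is AI?"))
val reply = StringBuilder()

llmWrapper.applyChatTemplate(history.toTypedArray(), null, false).onSuccess { template ->
    llmWrapper.generateStreamFlow(template.formattedText, GenerationConfig()).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> reply.append(result.text)  // accumulate the streamed answer
            is LlmStreamResult.Completed -> {
                // Feed the finished reply back into the history for the next turn
                history.add(ChatMessage("assistant", reply.toString()))
                history.add(ChatMessage("user", "Give a concrete example."))
            }
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}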

2) GGUF Models on Hexagon NPU (GGML Hexagon backend)

Run a GGUF model on Hexagon NPU by using the cpu_gpu plugin with device_id = "dev0" and setting nGpuLayers > 0.
LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = "", // GGUF: keep model_name empty
            model_path = "<your-gguf-model-path>", // e.g. /data/data/<com.your.app>/files/models/gpt-oss-GGUF/gpt-oss-Q4_0.gguf
            config = ModelConfig(
                nCtx = 4096,
                nGpuLayers = 999 // > 0 enables offloading; 999 attempts to offload all layers
            ),
            plugin_id = "cpu_gpu",
            device_id = "dev0"  // Use NPU device to run models like GPT-OSS 20B
        )
    )
    .build()
    .onSuccess { llmWrapper = it }
    .onFailure { error -> println("Error: ${error.message}") }

Multimodal Usage

Vision-Language Models for image understanding and multimodal applications.

1) NEXA Models (“npu” plugin)

We support NPU inference for NEXA-format models.
VlmWrapper.builder()
    .vlmCreateInput(
        VlmCreateInput(
            model_name = "omni-neural",  // Model name for NPU plugin
            model_path = <your-model-folder-path>,
            config = ModelConfig(
                max_tokens = 2048,
                enable_thinking = false
            ),
            plugin_id = "npu"  // Use NPU backend
        )
    )
    .build()
    .onSuccess { vlmWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Use the loaded VLM with image and text
val contents = listOf(
    VlmContent("image", <your-image-path>),
    VlmContent("text", <your-text>)
)

val chatList = arrayListOf(VlmChatMessage("user", contents))

vlmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    val config = vlmWrapper.injectMediaPathsToConfig(chatList.toTypedArray(), GenerationConfig())
    vlmWrapper.generateStreamFlow(template.formattedText, config).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> println(result.text)
            is LlmStreamResult.Completed -> println("Done!")
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}

2) GGUF Models on Hexagon NPU (GGML Hexagon backend)

Run a GGUF VLM on Hexagon NPU by using the cpu_gpu plugin with device_id = "dev0" and setting nGpuLayers > 0.
VlmWrapper.builder()
    .vlmCreateInput(
        VlmCreateInput(
            model_name = "",  // GGUF: keep model_name empty
            model_path = <your-gguf-model-path>,
            mmproj_path = <your-mmproj-path>,  // vision projection weights
            config = ModelConfig(
                nCtx = 4096,
                nGpuLayers = 999 // > 0 enables offloading; 999 attempts to offload all layers
            ),
            plugin_id = "cpu_gpu",
            device_id = "dev0"
        )
    )
    .build()
    .onSuccess { vlmWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Use the loaded VLM with image and text
val contents = listOf(
    VlmContent("image", <your-image-path>),
    VlmContent("text", "<your-text>")
)

val chatList = arrayListOf(VlmChatMessage("user", contents))

vlmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    val baseConfig = GenerationConfig(maxTokens = 2048)
    val configWithMedia = vlmWrapper.injectMediaPathsToConfig(
        chatList.toTypedArray(),
        baseConfig
    )
    vlmWrapper.generateStreamFlow(template.formattedText, configWithMedia).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> println(result.text)
            is LlmStreamResult.Completed -> println("Done!")
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}

Embeddings Usage

Generate vector embeddings for semantic search and RAG applications.

Basic Usage

// Load embedder for NPU inference
EmbedderWrapper.builder()
    .embedderCreateInput(
        EmbedderCreateInput(
            model_name = "embed-gemma",  // Model name for NPU plugin
            model_path = <your-model-folder-path>,
            tokenizer_path = <your-tokenizer-path>,  // Optional
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu",  // Use NPU backend
            device_id = null    // Optional device ID
        )
    )
    .build()
    .onSuccess { embedderWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Generate embeddings for multiple texts
val texts = arrayOf(<your-text1>, <your-text2>, ...)

embedderWrapper.embed(texts, EmbeddingConfig()).onSuccess { embeddings ->
    val dimension = embeddings.size / texts.size
    println("Dimension: $dimension")
    println("First 5 values: ${embeddings.take(5)}")
}
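
Because embed returns a single flat array with all vectors concatenated, slicing by the computed dimension recovers one vector per input text. A small sketch in plain Kotlin (assuming the flat FloatArray layout implied by the dimension arithmetic above; the cosine helper is ours, not part of the SDK):

import kotlin.math.sqrt

// Cosine similarity between two equal-length vectors
fun cosine(a: FloatArray, b: FloatArray): Float {
    var dot = 0f; var na = 0f; var nb = 0f
    for (i in a.indices) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i] }
    return dot / (sqrt(na) * sqrt(nb))
}

embedderWrapper.embed(texts, EmbeddingConfig()).onSuccess { embeddings ->
    val dim = embeddings.size / texts.size
    // Slice the flat array into one vector per input text
    val vectors = texts.indices.map { i -> embeddings.copyOfRange(i * dim, (i + 1) * dim) }
    println("Similarity(text1, text2): ${cosine(vectors[0], vectors[1])}")
}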

ASR Usage

Automatic Speech Recognition for audio transcription.

Basic Usage

// Load ASR model for NPU inference
AsrWrapper.builder()
    .asrCreateInput(
        AsrCreateInput(
            model_name = "parakeet",  // Model name for NPU plugin
            model_path = <your-model-folder-path>,
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu"  // Use NPU backend
        )
    )
    .build()
    .onSuccess { asrWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Transcribe audio file
asrWrapper.transcribe(
    AsrTranscribeInput(
        audioPath = <your-audio-path>,  // Path to .wav, .mp3, etc.
        language = "en",                // Language code: "en", "zh", "es", etc.
        timestamps = null               // Optional timestamp format
    )
).onSuccess { result ->
    println("Transcription: ${result.result.transcript}")
}
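
To transcribe several recordings, you can loop over the same call; this sketch uses nothing beyond the transcribe API shown above (the file paths are placeholders):

val audioFiles = listOf(<your-audio-path-1>, <your-audio-path-2>)

audioFiles.forEach { path ->
    asrWrapper.transcribe(
        AsrTranscribeInput(
            audioPath = path,
            language = "en",
            timestamps = null
        )
    ).onSuccess { result ->
        println("$path -> ${result.result.transcript}")
    }.onFailure { error ->
        println("$path failed: ${error.message}")
    }
}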

Rerank Usage

Improve search relevance by reranking documents based on query relevance.

Basic Usage

// Load reranker model for NPU inference
RerankerWrapper.builder()
    .rerankerCreateInput(
        RerankerCreateInput(
            model_name = "jina-rerank",  // Model name for NPU plugin
            model_path = <your-model-folder-path>,
            tokenizer_path = <your-tokenizer-path>,  // Optional
            config = ModelConfig(
                max_tokens = 2048
            ),
            plugin_id = "npu",  // Use NPU backend
            device_id = null    // Optional device ID
        )
    )
    .build()
    .onSuccess { rerankerWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Rerank documents based on query relevance
val query = "What is machine learning?"
val docs = arrayOf("ML is AI subset", "Weather forecast", "Deep learning tutorial")

rerankerWrapper.rerank(query, docs, RerankConfig()).onSuccess { result ->
    result.scores?.withIndex()?.sortedByDescending { it.value }?.forEach { (idx, score) ->
        println("Score: ${"%.4f".format(score)} - ${docs[idx]}")
    }
}
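
A common pattern is two-stage retrieval: shortlist candidates cheaply with embeddings, then rerank only the shortlist. A sketch combining the embedder from the Embeddings section with the reranker above (the corpus and shortlist size are illustrative; cosine is the helper from the embeddings sketch):

val corpus = arrayOf(
    "ML is AI subset",
    "Weather forecast",
    "Deep learning tutorial",
    "Gradient descent explained"
)

embedderWrapper.embed(arrayOf(query) + corpus, EmbeddingConfig()).onSuccess { flat ->
    val dim = flat.size / (corpus.size + 1)
    fun vec(i: Int) = flat.copyOfRange(i * dim, (i + 1) * dim)
    val queryVec = vec(0)

    // Shortlist the two nearest documents by cosine similarity
    val shortlist = corpus.indices
        .sortedByDescending { cosine(queryVec, vec(it + 1)) }
        .take(2)
        .map { corpus[it] }
        .toTypedArray()

    // Rerank only the shortlist for the final ordering
    rerankerWrapper.rerank(query, shortlist, RerankConfig()).onSuccess { result ->
        result.scores?.withIndex()?.sortedByDescending { it.value }?.forEach { (idx, score) ->
            println("Score: ${"%.4f".format(score)} - ${shortlist[idx]}")
        }
    }
}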

CV Usage

Computer Vision models for OCR, object detection, and image classification.

Basic Usage

// Load PaddleOCR model for NPU inference
CvWrapper.builder()
    .createInput(
        CVCreateInput(
            model_name = "paddleocr",  // Model name
            config = CVModelConfig(
                capabilities = CVCapability.OCR,
                det_model_path = <your-det-model-folder-path>,
                rec_model_path = <your-rec-model-path>,
                char_dict_path = <your-char-dict-path>,
                qnn_model_folder_path = <your-qnn-model-folder-path>,  // For NPU
                qnn_lib_folder_path = <your-qnn-lib-folder-path>       // For NPU
            ),
            plugin_id = "npu"  // Use NPU backend
        )
    )
    .build()
    .onSuccess { cvWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Perform OCR on image
cvWrapper.infer(<your-image-path>).onSuccess { results ->
    results.forEach { result ->
        println("Text: ${result.text}, Confidence: ${result.confidence}")
    }
}

Need Help?

Join our community to get support, share your projects, and connect with other developers.

Discord Community

Get real-time support and chat with the Nexa AI community

Slack Community

Collaborate with developers and access community resources