
Installation

Pull Docker Image

Pull the latest NexaSDK Docker image from Docker Hub:
bash
docker pull nexa4ai/nexasdk:latest
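To confirm the pull succeeded, you can inspect the local image (an optional sanity check):

```shell
# Prints the image ID if nexa4ai/nexasdk:latest is present locally;
# exits non-zero if the pull did not complete.
docker image inspect nexa4ai/nexasdk:latest --format '{{.Id}}'
```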

Usage Modes

NexaSDK Docker supports two usage modes: server mode for REST API access and interactive CLI mode for direct model inference.

CLI Mode (Interactive)

Run NexaSDK in interactive CLI mode for direct model inference:
bash
export NEXA_TOKEN="YOUR_LONG_TOKEN_HERE"
docker run --rm -it --privileged \
  -v /path/to/data:/data \
  -v /etc/machine-id:/etc/machine-id:ro \
  -e NEXA_TOKEN \
  nexa4ai/nexasdk:latest infer NexaAI/Granite-4.0-h-350M-NPU
The -it flags enable interactive mode. Replace NexaAI/Granite-4.0-h-350M-NPU with any supported model name. For a complete list of supported models, see the Overview page.

Server Mode (Detached)

Run NexaSDK in server mode to expose a REST API endpoint. First pull the model you need, then start the server detached so it runs in the background:
bash
export NEXA_TOKEN="YOUR_LONG_TOKEN_HERE"
docker run --rm -it --privileged \
  -v /path/to/data:/data \
  -v /etc/machine-id:/etc/machine-id:ro \
  -e NEXA_TOKEN \
  nexa4ai/nexasdk:latest pull [MODEL_NAME]
docker run --rm -d -p 18181:18181 --privileged \
  -v /path/to/data:/data \
  -v /etc/machine-id:/etc/machine-id:ro \
  -e NEXA_TOKEN \
  nexa4ai/nexasdk:latest serve
The --privileged flag is required for NPU access. Replace YOUR_LONG_TOKEN_HERE with your actual Nexa token. You can obtain a token by creating an account at sdk.nexa.ai and generating one in Deployment → Create Token.
The serve command does not download models automatically, so pre-download any model you intend to use with the pull command first. Once running, the server is accessible at http://localhost:18181. For detailed API documentation, see the REST API page.
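Once the server is up, you can exercise it from the shell. The endpoint path and payload below are assumptions based on common OpenAI-style chat APIs, not taken from this page; check the REST API page for the actual routes and fields before relying on them.

```shell
# Hypothetical request shape -- verify the endpoint path and payload
# fields against the REST API documentation.
curl -s http://localhost:18181/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "NexaAI/Granite-4.0-h-350M-NPU",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```

If the server is not running or the model was not pre-downloaded, the request will fail, so run the pull and serve commands above first.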

Next Steps

REST API

Learn how to use the REST API endpoints for chat completions, embeddings, reranking, and more.

NPU Models

Explore the full collection of NPU-optimized models available for Qualcomm devices.