NexaAI Windows ARM64 Setup Guide

This guide demonstrates how to use the NexaAI SDK for various AI inference tasks on NPU devices, including:
  • LLM (Large Language Model): Text generation and conversation
  • VLM (Vision Language Model): Multimodal understanding and generation
  • Embedder: Text vectorization and similarity computation
  • Reranker: Document reranking
  • ASR (Automatic Speech Recognition): Speech-to-text transcription
  • CV (Computer Vision): OCR/text recognition
  • TTS (Text-to-Speech): Text-to-speech synthesis
  • ImageGen: Image generation from text prompts
  • Diarize: Speaker diarization

Prerequisites

1. Install the correct Python version

A video tutorial for the installation is also available. NexaAI requires Python 3.11 – 3.13 (ARM64 build) on Windows ARM. Download and install the official ARM64 Python installer (e.g., python-3.11.1-arm64.exe) from https://www.python.org/downloads/, and read the instructions below carefully before proceeding.
IMPORTANT: Make sure you select “Add python.exe to PATH” on the first screen of the installation wizard.
🛑 Make sure you restart the terminal or your IDE after installation.
⚠️ Do not use Conda or x86 builds — they are incompatible with native ARM64 binaries. If you are in a conda environment, run conda deactivate first.
Verify the installation: if your PATH has been overridden by an environment manager, we recommend running the following PowerShell commands to restore the PATH variable from your system settings.
$systemPath = [Environment]::GetEnvironmentVariable('Path', 'Machine')
$userPath   = [Environment]::GetEnvironmentVariable('Path', 'User')
$env:Path   = "$userPath;$systemPath"
Then verify that your Python executable has the correct architecture and version (3.11 – 3.13):
python -c "import sys, platform; print(f'Python version: {sys.version}')"
Your output should look like:
Python version: 3.11.0 (main, Oct 24 2022, 18:15:22) [MSC v.1933 64 bit (ARM64)]
The output must show a supported version (3.11 – 3.13) and the ARM64 architecture. If it shows AMD64 or an unsupported version, try the following (a quick PATH check is shown after this list):
  • (If you have conda installed) Run conda deactivate to deactivate the current conda environment.
  • (If your python executable points to the x86 version) You may need to make the ARM64 Python come before the x86 Python in your PATH.
    • Press the Win key, type env, and press Enter to open the Edit the system environment variables setting.
    • Click on Environment Variables... button.
    • Select Path and click Edit....
    • Find your ARM64 Python installation path, and move it to the top of the list.
    • Click OK in each dialog to close them all and save the changes.
  • (If you forgot to select “Add python.exe to PATH” on the first screen of the installation wizard)
    • Run the installation wizard again, follow the instructions to remove the current installation, and then reinstall from the Wizard. Make sure to select “Add python.exe to PATH” this time.
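
To quickly check which Python wins in your PATH and whether it is the ARM64 build, run the following; the second command should print ARM64:
where.exe python
python -c "import platform; print(platform.machine())"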

2. Create and activate a virtual environment

python -m venv nexaai-env
nexaai-env\Scripts\activate
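
Once activated, python should resolve to the copy inside nexaai-env. You can confirm with:
where.exe python
python -c "import sys; print(sys.prefix)"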

3. Install the NexaAI SDK

pip install nexaai
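
If the install succeeds, you can confirm that the package is visible to the active interpreter:
pip show nexaai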

4. Verify your environment

Run the following code to ensure you have the right environment:
import sys
import platform

# ANSI color codes
RED = "\033[91m"
GREEN = "\033[92m"
YELLOW = "\033[93m"
BOLD = "\033[1m"
RESET = "\033[0m"

min_ver = (3, 11)
max_ver = (3, 13)
current_ver = sys.version_info
arch = platform.machine()

if not (min_ver <= (current_ver.major, current_ver.minor) <= max_ver) or arch.lower() != "arm64":
    print("\n" + "=" * 80)
    print(f"{BOLD}{RED}WARNING: Your Python version or architecture is not compatible.{RESET}")
    print(f"Detected version: {current_ver.major}.{current_ver.minor}, architecture: {arch}")
    print(f"{YELLOW}Required: Python 3.11 - 3.13 & architecture 'arm64'.{RESET}")
    print("=" * 80)
    print(f"{RED}DO NOT continue to the following code!{RESET}\n")
    print("To install arm64 Python:")
    print("  - Download Python 3.11-3.13 for arm64 from https://www.python.org/downloads/")
    print("  - Install and verify by running: python3 --version and python3 -c 'import platform; print(platform.machine())'")
    print("  - Launch Jupyter and make sure to select the arm64 Python kernel in 'Kernel > Change kernel'.")
    sys.exit(1)
else:
    print(f"{GREEN}[VERIFICATION PASSED] Python version and architecture are correct. You may continue to the following sections.{RESET}")

Authentication Setup

Before running any examples, you need to set up your NexaAI authentication token.

Set Token in Code

Replace "YOUR_NEXA_TOKEN_HERE" with your actual NexaAI token from https://sdk.nexa.ai/:
import os

# Replace "YOUR_NEXA_TOKEN_HERE" with your actual token from https://sdk.nexa.ai/
os.environ["NEXA_TOKEN"] = "YOUR_NEXA_TOKEN_HERE"

assert os.environ.get("NEXA_TOKEN", "").startswith(
    "key/"), "ERROR: NEXA_TOKEN must start with 'key/'. Please check your token."

1. LLM (Large Language Model) NPU Inference

Using NPU-accelerated large language models for text generation and conversation. Llama3.2-3B-NPU-Turbo is specifically optimized for NPU.
import io
import os
import logging
from nexaai import LLM, GenerationConfig, ModelConfig, LlmChatMessage, setup_logging

setup_logging(level=logging.DEBUG)


def llm_npu_example():
    """LLM NPU inference example"""
    print("=== LLM NPU Inference Example ===")

    # Model configuration
    # Use huggingface Repo ID
    model_name = "NexaAI/Qwen3-0.6B-GGUF"
    # Alternatively, use local path
    # model_name = os.path.expanduser(r"~\.cache\nexa.ai\nexa_sdk\models\NexaAI\Llama3.2-3B-NPU-Turbo\weights-1-3.nexa")

    max_tokens = 128
    system_message = "You are a helpful assistant."

    print(f"Loading model: {model_name}")

    # Create model instance
    config = ModelConfig()
    llm = LLM.from_(model=model_name, config=config)

    # Create conversation history
    conversation = [LlmChatMessage(role="system", content=system_message)]

    # Example conversations
    test_prompts = [
        "What is artificial intelligence?",
        "Explain the benefits of on-device AI processing.",
        "How does NPU acceleration work?"
    ]

    for i, prompt in enumerate(test_prompts, 1):
        print(f"\n--- Conversation {i} ---")
        print(f"User: {prompt}")

        # Add user message
        conversation.append(LlmChatMessage(role="user", content=prompt))

        # Apply chat template
        formatted_prompt = llm.apply_chat_template(conversation)

        # Generate response
        print("Assistant: ", end="", flush=True)
        response_buffer = io.StringIO()

        gen = llm.generate_stream(formatted_prompt, GenerationConfig(max_tokens=max_tokens))
        result = None
        try:
            while True:
                token = next(gen)
                print(token, end="", flush=True)
                response_buffer.write(token)
        except StopIteration as e:
            result = e.value

        # Get profiling data
        if result and hasattr(result, 'profile_data') and result.profile_data:
            print(f"\n{result.profile_data}")

        # Add assistant response to conversation history
        conversation.append(LlmChatMessage(role="assistant", content=response_buffer.getvalue()))
        print("\n" + "=" * 50)


llm_npu_example()
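
If you want to reuse the streaming pattern above in your own code, it can be wrapped into a small helper. The sketch below reuses only the APIs already shown (apply_chat_template, generate_stream, GenerationConfig, LlmChatMessage); the helper name chat_once is ours, not part of the SDK.
def chat_once(llm, conversation, user_text, max_tokens=128):
    """Append a user turn, stream the assistant reply, and return it as a string."""
    conversation.append(LlmChatMessage(role="user", content=user_text))
    formatted_prompt = llm.apply_chat_template(conversation)

    response_buffer = io.StringIO()
    gen = llm.generate_stream(formatted_prompt, GenerationConfig(max_tokens=max_tokens))
    try:
        while True:
            token = next(gen)
            print(token, end="", flush=True)
            response_buffer.write(token)
    except StopIteration:
        pass

    reply = response_buffer.getvalue()
    conversation.append(LlmChatMessage(role="assistant", content=reply))
    return reply

For example, chat_once(llm, conversation, "Summarize NPU acceleration in one sentence.") continues an existing conversation history in place.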

2. VLM (Vision Language Model) NPU Inference

Using NPU-accelerated vision language models for multimodal understanding and generation. OmniNeural-4B supports joint processing of images and text.
import os
import io
import logging
from nexaai import VLM, GenerationConfig, ModelConfig, VlmChatMessage, VlmContent, setup_logging

setup_logging(level=logging.DEBUG)


def vlm_npu_example():
    """VLM NPU inference example"""
    print("=== VLM NPU Inference Example ===")

    # Model configuration
    # Use a local path to cached model weights
    model_name = os.path.expanduser("~/.cache/nexa.ai/nexa_sdk/models/NexaAI/gemma-3n-E4B-it-4bit-MLX/model-00001-of-00002.safetensors")
    # Alternatively, use the NPU-optimized model
    # model_name = os.path.expanduser(r"~\.cache\nexa.ai\nexa_sdk\models\NexaAI\OmniNeural-4B\weights-1-8.nexa")
    
    max_tokens = 128
    system_message = "You are a helpful assistant that can understand images and text."
    image_path = '/your/image/path'  # Replace with actual image path if available

    print(f"Loading model: {model_name}")

    # Check for image existence
    if not (image_path and os.path.exists(image_path)):
        print(f"WARNING: The specified image_path ('{image_path}') does not exist or was not provided. Multimodal prompts will not include image input.")

    # Create model instance
    config = ModelConfig()
    vlm = VLM.from_(model=model_name, config=config)

    # Create conversation history
    conversation = [VlmChatMessage(role="system",
                                    contents=[VlmContent(type="text", text=system_message)])]

    # Example multimodal conversations
    test_cases = [
        {
            "text": "What do you see in this image?",
            "image_path": image_path
        }
    ]

    for i, case in enumerate(test_cases, 1):
        print(f"\n--- Multimodal Conversation {i} ---")
        print(f"User: {case['text']}")

        # Build message content
        contents = [VlmContent(type="text", text=case['text'])]

        # Add image content if available
        if case['image_path'] and os.path.exists(case['image_path']):
            contents.append(VlmContent(type="image", text=case['image_path']))
            print(f"Including image: {case['image_path']}")

        # Add user message
        conversation.append(VlmChatMessage(role="user", contents=contents))

        # Apply chat template
        formatted_prompt = vlm.apply_chat_template(conversation)

        # Generate response
        print("Assistant: ", end="", flush=True)
        response_buffer = io.StringIO()

        # Prepare image and audio paths
        image_paths = [case['image_path']] if case['image_path'] and os.path.exists(case['image_path']) else None
        audio_paths = None

        gen = vlm.generate_stream(formatted_prompt,
                                  GenerationConfig(max_tokens=max_tokens, image_paths=image_paths, audio_paths=audio_paths))
        result = None
        try:
            while True:
                token = next(gen)
                print(token, end="", flush=True)
                response_buffer.write(token)
        except StopIteration as e:
            result = e.value

        # Get profiling data
        if result and hasattr(result, 'profile_data') and result.profile_data:
            print(f"\n{result.profile_data}")

        # Add assistant response to conversation history
        conversation.append(VlmChatMessage(role="assistant",
                                            contents=[VlmContent(type="text", text=response_buffer.getvalue())]))
        print("\n" + "=" * 50)


vlm_npu_example()

3. Embedder NPU Inference

Using NPU-accelerated embedding models for text vectorization and similarity computation. embeddinggemma-300m-npu is a lightweight embedding model specifically optimized for NPU.
import os
import logging
from nexaai import Embedder, setup_logging

setup_logging(level=logging.DEBUG)


def embedder_npu_example():
    """Embedder NPU inference example"""
    print("=== Embedder NPU Inference Example ===")

    # Model configuration
    # Use a local path to cached model weights
    model_name = os.path.expanduser("~/.cache/nexa.ai/nexa_sdk/models/NexaAI/jina-v2-fp16-mlx/model.safetensors")
    # Alternatively, use the NPU-optimized model
    # model_name = os.path.expanduser(r"~\.cache\nexa.ai\nexa_sdk\models\NexaAI\embeddinggemma-300m-npu\weights-1-2.nexa")

    batch_size = None  # Use default or len(texts)

    print(f"Loading model: {model_name}")

    # Create embedder instance
    embedder = Embedder.from_(model=model_name)
    print('Embedder loaded successfully!')

    # Get embedding dimension
    dim = embedder.embedding_dim()
    print(f"Embedding dimension: {dim}")

    # Example texts
    texts = [
        "On-device AI is a type of AI that is processed on the device itself, rather than in the cloud.",
        "Nexa AI allows you to run state-of-the-art AI models locally on CPU, GPU, or NPU — from instant use cases to production deployments.",
        "A ragdoll is a breed of cat that is known for its long, flowing hair and gentle personality.",
        "The capital of France is Paris.",
    ]

    query = "what is on device AI"

    print(f"\n=== Generating Embeddings ===")
    print(f"Processing {len(texts)} texts...")

    # Generate embeddings
    result = embedder.embed(texts=texts, batch_size=batch_size or len(texts))
    embeddings = result.embeddings

    print(f"Successfully generated {len(embeddings)} embeddings")

    # Display embedding information
    print(f"\n=== Embedding Details ===")
    for i, (text, embedding) in enumerate(zip(texts, embeddings)):
        print(f"\nText {i + 1}:")
        print(f"  Content: {text}")
        print(f"  Embedding shape: {len(embedding)} dimensions")
        print(f"  First 10 elements: {embedding[:10]}")
        print("-" * 70)

    # Query processing
    print(f"\n=== Query Processing ===")
    print(f"Query: '{query}'")

    query_result = embedder.embed(texts=[query], batch_size=1)
    query_embedding = query_result.embeddings[0]

    print(f"Query embedding shape: {len(query_embedding)} dimensions")

    # Similarity analysis
    print(f"\n=== Similarity Analysis (Inner Product) ===")
    similarities = []

    for i, (text, embedding) in enumerate(zip(texts, embeddings)):
        inner_product = sum(a * b for a, b in zip(query_embedding, embedding))
        similarities.append((i, text, inner_product))

        print(f"\nText {i + 1}:")
        print(f"  Content: {text}")
        print(f"  Inner product with query: {inner_product:.6f}")
        print("-" * 70)

    # Sort and display most similar texts
    similarities.sort(key=lambda x: x[2], reverse=True)

    print(f"\n=== Similarity Ranking Results ===")
    for rank, (idx, text, score) in enumerate(similarities, 1):
        print(f"Rank {rank}: [{score:.6f}] {text}")

    return embeddings, query_embedding, similarities


embeddings, query_emb, similarities = embedder_npu_example()
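
The scores above are raw inner products, so they depend on vector magnitude. If you prefer scale-invariant scores, cosine similarity is a common alternative. A minimal sketch using numpy (an extra dependency, not required by the SDK) on the embeddings returned above:
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two vectors given as Python lists or arrays."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


for idx, text, _ in similarities:
    print(f"[{cosine_similarity(query_emb, embeddings[idx]):.6f}] {text}")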

4. ASR (Automatic Speech Recognition) NPU Inference

Using NPU-accelerated speech recognition models for speech-to-text transcription. parakeet-npu provides high-quality speech recognition with NPU acceleration.
import os
import logging
from nexaai import ASR, setup_logging

setup_logging(level=logging.DEBUG)


def asr_npu_example():
    """ASR NPU inference example"""
    print("=== ASR NPU Inference Example ===")

    # Model configuration
    # Use huggingface Repo ID
    model_name = "NexaAI/parakeet-npu"
    # Alternatively, use local path
    # model_name = os.path.expanduser(r"~\.cache\nexa.ai\nexa_sdk\models\NexaAI\parakeet-npu\weights-1-5.nexa")

    # Example audio file (replace with your actual audio file)
    audio_file = r"path/to/audio"  # Replace with actual audio file path

    print(f"Loading model: {model_name}")

    if not os.path.exists(audio_file):
        print(f"ERROR: The specified audio_file ('{audio_file}') does not exist. Please provide a valid audio file path to test ASR functionality.")
        return None

    # Create ASR instance
    asr = ASR.from_(model=model_name)
    print('ASR model loaded successfully!')

    print(f"\n=== Starting Transcription ===")

    # Perform transcription
    result = asr.transcribe(
        audio_path=audio_file,
        language="en",
        timestamps="segment",
        beam_size=5
    )

    # Display results
    print(f"\n=== Transcription Results ===")
    print(result.transcript)

    return result


result = asr_npu_example()
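
To transcribe several recordings, the loaded model can be reused across files. A minimal sketch, assuming audio_files is a list of local paths you provide (the file names below are placeholders):
audio_files = [r"path\to\first.wav", r"path\to\second.wav"]  # replace with real paths

asr = ASR.from_(model="NexaAI/parakeet-npu")
for path in audio_files:
    if not os.path.exists(path):
        print(f"Skipping missing file: {path}")
        continue
    transcription = asr.transcribe(audio_path=path, language="en", timestamps="segment", beam_size=5)
    print(f"{path}:\n{transcription.transcript}\n")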

5. Reranker NPU Inference

Using NPU-accelerated reranking models for document reranking. jina-v2-rerank-npu precisely ranks documents by their similarity to a given query.
import os
import logging
from nexaai import Reranker, setup_logging

setup_logging(level=logging.DEBUG)


def reranker_npu_example():
    """Reranker NPU inference example"""
    print("=== Reranker NPU Inference Example ===")

    # Model configuration
    # Use a local path to cached model weights
    model_name = os.path.expanduser("~/.cache/nexa.ai/nexa_sdk/models/NexaAI/jina-v2-rerank-mlx/jina-reranker-v2-base-multilingual-f16.safetensors")
    # Alternatively, use the NPU-optimized model
    # model_name = os.path.expanduser(r"~\.cache\nexa.ai\nexa_sdk\models\NexaAI\jina-v2-rerank-npu\weights-1-4.nexa")

    batch_size = None  # Use default or len(documents)

    print(f"Loading model: {model_name}")

    # Create reranker instance
    reranker = Reranker.from_(model=model_name)
    print('Reranker loaded successfully!')

    # Example query and documents
    query = "Where is on-device AI?"
    documents = [
        "On-device AI is a type of AI that is processed on the device itself, rather than in the cloud.",
        "edge computing",
        "A ragdoll is a breed of cat that is known for its long, flowing hair and gentle personality.",
        "The capital of France is Paris.",
    ]

    print(f"Query: {query}")
    print(f"Documents: {len(documents)} documents")
    print("-" * 50)

    # Perform reranking
    result = reranker.rerank(
        query=query,
        documents=documents,
        batch_size=batch_size or len(documents)
    )
    scores = result.scores

    # Display ranking results
    for i, score in enumerate(scores):
        print(f"[{score:.4f}] : {documents[i]}")

    return reranker


reranker = reranker_npu_example()

6. Computer Vision (CV) NPU Inference

Run NPU-accelerated computer vision tasks (e.g., OCR/text recognition) on images.
import os
import logging
from nexaai import CV, setup_logging

setup_logging(level=logging.DEBUG)


def cv_ocr_example():
    """CV OCR Inference example"""
    print("=== CV OCR Inference Example ===")

    # Use huggingface repo ID
    model_name = "NexaAI/paddleocr-npu"
    # Alternatively, use local path
    # model_name = os.path.expanduser(r"~\.cache\nexa.ai\nexa_sdk\models\NexaAI\paddleocr-npu\weights-1-1.nexa")

    image_path = r"path/to/image.png"

    if not os.path.exists(image_path):
        print(f"ERROR: Image file not found: {image_path}")
        return None

    cv = CV.from_(model=model_name, capabilities=0, plugin_id=None)  # 0=OCR

    results = cv.infer(image_path)

    print(f"Number of results: {len(results.results)}")
    for result in results.results:
        print(f"[{result.confidence:.2f}] {result.text}")


cv_ocr_example()

Next Steps