Run nexa commands from the directory that contains the nexa executable.

nexa pull

Download a model and store it locally. After entering the pull command, you will be guided through a setup process to choose the model type, main model file, tokenizer (optional), and extra files (optional).

General Behavior

After running nexa pull <model-name>, the CLI will prompt:
  1. Quant version selection
    If the model supports multiple quantized versions, you will see a menu like this:
    Quant version selection
    Choose a quant version to download
    > Q4_K_M     [1.2 GiB] (default)
      Q8_0       [2.0 GiB]
      F16        [3.8 GiB]
    
    Select the quant version you prefer.
  2. Download begins
    After selection, the model files will start downloading automatically.

LLM

bash
nexa pull ggml-org/Qwen3-1.7B-GGUF

VLM

bash
nexa pull ggml-org/gemma-3-4b-it-GGUF

AudioLM

bash
nexa pull ggml-org/Qwen2.5-Omni-3B-GGUF

ASR

bash
nexa pull mlx-community/whisper-tiny

TTS

bash
nexa pull nexaml/Kokoro-82M-bf16-MLX

Embedder

bash
nexa pull djuna/jina-embeddings-v2-small-en-Q5_K_M-GGUF

Reranker

bash
nexa pull pqnet/bge-reranker-v2-m3-Q8_0-GGUF

nexa list

Display all downloaded models in a table with their names and sizes.
bash
nexa list

nexa remove

Remove a specific local model by name. For example, the command below removes the locally downloaded model ggml-org/Qwen3-0.6B-GGUF from the cache directory. This frees up disk space and makes the model unavailable for inference until it is downloaded again.
bash
nexa remove ggml-org/Qwen3-0.6B-GGUF

nexa clean

Delete all locally cached models.
bash
nexa clean

nexa infer

Run inference with a specified model. The model must be downloaded and cached locally.

Helper menu

bash
nexa infer -h
Show the help menu for nexa infer.

Flags

bash
-m, --model-type string         specify model type [llm/vlm/embedder/reranker/tts/asr] (default "llm")
-s, --disable-stream            [llm|vlm] disable stream mode
-t, --tool strings              [llm|vlm] add tool to make function call
-p, --prompt strings            [embedder|tts] pass prompt
-q, --query string              [reranker] query
-d, --document strings          [reranker] documents
-i, --input string              [asr] input file (audio for asr)
-o, --output string             [tts] output file (audio for tts)
    --voice-identifier string   [tts] voice identifier
    --speech-speed float        [tts] speech speed (1.0 = normal) (default 1)
-l, --language string           [asr] language code (e.g., en, zh, ja)
-h, --help                      help for infer

LLM

Launch an interactive chat session with the language model.
bash
nexa infer ggml-org/Qwen3-1.7B-GGUF

VLM

Text-only chat, or responses to an image file (interactive image input):
bash
nexa infer ggml-org/gemma-3-4b-it-GGUF
If you only want text input, simply launch the command and begin chatting.
If you’d like the model to respond to an image, provide the absolute path to the image at the end of your message.
Example prompt: Describe this picture </path/to/image.png>

AudioLM

Text-only chat, or responses to an audio file (interactive audio input):
bash
nexa infer ggml-org/Qwen2.5-Omni-3B-GGUF
If you only want text input, start chatting as usual.
If you’d like the model to respond to an audio file, provide the absolute path to the audio file at the end of your message.
Example prompt: Convert this audio into text </path/to/audio.mp3>

ASR

Currently, ASR is only supported on macOS using the mlx runtime.
Use ASR models to transcribe speech from audio files into text.
bash
nexa infer -m asr mlx-community/whisper-tiny --input /path/to/audio.wav --language en
  • -m asr : Sets the model type to ASR.
  • --input : Specifies the input audio file.
  • --language : Sets the language code (e.g., en for English, zh for Chinese).
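
For example, to transcribe Mandarin speech instead of English, change the language code (the input path is a placeholder):
bash
nexa infer -m asr mlx-community/whisper-tiny --input /path/to/audio.wav --language zh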

TTS

Currently, TTS is only supported on macOS using the mlx runtime.
Use TTS models to convert input text into spoken audio.
bash
nexa infer nexaml/Kokoro-82M-bf16-MLX -m tts --voice-identifier zm_yunyang -p "Hello world this is a text to speech test" -o /path/to/audio.wav
  • -m tts : Sets the model type to TTS.
  • --voice-identifier: Specifies the speaker’s voice.
    When no --voice-identifier is provided, NexaCLI will return a full list of supported voices in the error message. This is useful for discovering all available voice options (see the example after this list).
  • -p: The text prompt to synthesize.
  • -o: Output file for the generated .wav audio.
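
For example, omitting --voice-identifier is a quick way to discover voices: the command below is expected to fail with an error that enumerates every supported voice identifier (the prompt text and output path are placeholders).
bash
nexa infer nexaml/Kokoro-82M-bf16-MLX -m tts -p "test" -o /path/to/audio.wav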

Embedder

Generate embeddings for multiple pieces of text using an embedding model.
bash
nexa infer djuna/jina-embeddings-v2-small-en-Q5_K_M-GGUF -m embedder --prompt "translate to text" --prompt "second"
  • -m embedder : Sets the model type to Embedder.
  • --prompt : Provide one or more pieces of text to embed.
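
The prompt text is arbitrary. A sketch that embeds two sentences using the -p shorthand for --prompt:
bash
nexa infer djuna/jina-embeddings-v2-small-en-Q5_K_M-GGUF -m embedder -p "The weather is lovely today." -p "I enjoy long walks on the beach."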

Reranker

Use a reranker model to score and sort documents based on relevance to a query.
bash
nexa infer pqnet/bge-reranker-v2-m3-Q8_0-GGUF -m reranker --query "query" --document "a" --document "query"
  • -m reranker : Sets the model type to Reranker.
  • --query : The main query string used to evaluate document relevance.
  • --document : One or more documents to score against the query.
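
The strings in the example above are placeholders. A more illustrative invocation, where the first document should score higher than the second:
bash
nexa infer pqnet/bge-reranker-v2-m3-Q8_0-GGUF -m reranker --query "What is the capital of France?" --document "Paris is the capital of France." --document "Berlin is the capital of Germany."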

nexa serve

Launch the Nexa inference server, which exposes a REST API.

Helper menu

bash
nexa serve -h
Show the help menu for nexa serve.

Start server

For example, start a local inference server bound to 127.0.0.1:8080. The server supports OpenAI-compatible APIs, and --keepalive 600 keeps models in memory for 10 minutes between requests.
bash
nexa serve --host 127.0.0.1:8080 --keepalive 600
You can test it via:
bash
curl -X POST http://127.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ggml-org/Qwen3-1.7B-GGUF", "prompt": "What is Nexa?", "max_tokens": 100}'
This sends a POST request to the local /v1/completions endpoint, prompting the ggml-org/Qwen3-1.7B-GGUF model with “What is Nexa?” and returning up to 100 tokens.
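
Since the API is OpenAI-compatible, a chat-style request should work as well; this sketch assumes the server also exposes the standard /v1/chat/completions route:
bash
curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ggml-org/Qwen3-1.7B-GGUF", "messages": [{"role": "user", "content": "What is Nexa?"}]}'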

nexa run

Connect to a running Nexa server (via its OpenAI-compatible API) and start a chat interface. Start the server with nexa serve first.

Helper menu

bash
nexa run -h
Show the help menu for nexa run.

Run model

For example, launch an interactive streaming chat session with the ggml-org/Qwen3-1.7B-GGUF model. The model generates and displays output incrementally as tokens are produced.
bash
nexa run ggml-org/Qwen3-1.7B-GGUF
-s, --disable-stream: disable streaming and return the entire response at once.
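
For example, to receive the full response in one piece instead of a token stream:
bash
nexa run ggml-org/Qwen3-1.7B-GGUF --disable-stream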