Run nexa commands from the directory that contains the nexa executable.

nexa pull

Download a model and store it locally. After entering the pull command, you will be guided through a setup process to choose the model type, main model file, tokenizer (optional), and extra files (optional).

General Behavior

After running nexa pull <model-name>, the CLI will prompt:
  1. Quant version selection
    If the model supports multiple quantized versions, you will see a menu like this:
    Quant version selection
    Choose a quant version to download
    > Q4_K_M     [1.2 GiB] (default)
      Q8_0       [2.0 GiB]
      F16        [3.8 GiB]
    
    Select the quant version you prefer.
  2. Download begins
    After selection, the model files will start downloading automatically.

LLM

bash
nexa pull ggml-org/Qwen3-1.7B-GGUF

VLM

bash
nexa pull ggml-org/gemma-3-4b-it-GGUF

AudioLM

bash
nexa pull ggml-org/Qwen2.5-Omni-3B-GGUF

ASR

bash
nexa pull mlx-community/whisper-tiny

TTS

bash
nexa pull nexaml/Kokoro-82M-bf16-MLX

Embedder

bash
nexa pull djuna/jina-embeddings-v2-small-en-Q5_K_M-GGUF

Reranker

bash
nexa pull pqnet/bge-reranker-v2-m3-Q8_0-GGUF

nexa list

Display all downloaded models in a table with their names and sizes.
bash
nexa list

nexa remove

Remove a specific local model by name. For example, the command below removes the locally downloaded model ggml-org/Qwen3-0.6B-GGUF from the cache directory. This frees up disk space and makes the model unavailable for inference until it is downloaded again.
bash
nexa remove ggml-org/Qwen3-0.6B-GGUF

nexa clean

Delete all locally cached models.
bash
nexa clean

nexa infer

Run inference with a specified model. The model must be downloaded and cached locally.

Helper menu

bash
nexa infer -h
Show the help menu for nexa infer.

Flags

bash
-m, --model-type string         specify model type [llm/vlm/embedder/reranker/tts/asr] (default "llm")
-s, --disable-stream            [llm|vlm] disable stream mode
-t, --tool strings              [llm|vlm] add tool to make function call
-p, --prompt strings            [embedder|tts] pass prompt
-q, --query string              [reranker] query
-d, --document strings          [reranker] documents
-i, --input string              [asr] input file (audio for asr)
-o, --output string             [tts] output file (audio for tts)
    --voice-identifier string   [tts] voice identifier
    --speech-speed float        [tts] speech speed (1.0 = normal) (default 1)
-l, --language string           [asr] language code (e.g., en, zh, ja)
-h, --help                      help for infer

LLM

Launch an interactive chat session with the language model.
bash
nexa infer ggml-org/Qwen3-1.7B-GGUF

VLM

Text-only chat, or responses to an image file (interactive image input):
bash
nexa infer ggml-org/gemma-3-4b-it-GGUF
If you only want text input, simply launch the command and begin chatting.
If you’d like the model to respond to an image, provide the absolute path to the image at the end of your message.
Example prompt: Describe this picture </path/to/image.png>

AudioLM

Text-only chat, or responses to an audio file (interactive audio input):
bash
nexa infer ggml-org/Qwen2.5-Omni-3B-GGUF
If you only want text input, start chatting as usual.
If you’d like the model to respond to an audio file, provide the absolute path to the audio file at the end of your message.
Example prompt: Convert this audio into text </path/to/audio.mp3>

ASR

Currently, ASR is only supported on macOS using the mlx runtime.
Use ASR models to transcribe speech from audio files into text.
bash
nexa infer -m asr mlx-community/whisper-tiny --input /path/to/audio.wav --language en
  • -m asr : Sets the model type to ASR.
  • --input : Specifies the input audio file.
  • --language : Sets the language code (e.g., en for English, zh for Chinese).
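
For example, to transcribe Mandarin speech instead of English, change the language code (the input path is a placeholder):
bash
nexa infer -m asr mlx-community/whisper-tiny --input /path/to/audio.wav --language zh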

TTS

Currently, TTS is only supported on macOS using the mlx runtime.
Use TTS models to convert input text into spoken audio.
bash
nexa infer nexaml/Kokoro-82M-bf16-MLX -m tts --voice-identifier zm_yunyang -p "Hello world this is a text to speech test" -o /path/to/audio.wav
  • -m tts : Sets the model type to TTS.
  • --voice-identifier: Specifies the speaker’s voice.
    When no --voice-identifier is provided, NexaCLI will return a full list of supported voices in the error message. This is useful for discovering all available voice options (see the example after this list).
  • -p: The text prompt to synthesize.
  • -o: Output file for the generated .wav audio.
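
For example, omitting --voice-identifier is a quick way to discover voices: the command below is expected to fail with an error that enumerates every supported voice identifier (the prompt text and output path are placeholders).
bash
nexa infer nexaml/Kokoro-82M-bf16-MLX -m tts -p "test" -o /path/to/audio.wav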

Embedder

Generate embeddings for multiple pieces of text using an embedding model.
bash
nexa infer djuna/jina-embeddings-v2-small-en-Q5_K_M-GGUF -m embedder --prompt "translate to text" --prompt "second"
  • -m embedder : Sets the model type to Embedder.
  • --prompt : Provide one or more pieces of text to embed.
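
The prompt text is arbitrary. A sketch that embeds two sentences using the -p shorthand for --prompt:
bash
nexa infer djuna/jina-embeddings-v2-small-en-Q5_K_M-GGUF -m embedder -p "The weather is lovely today." -p "I enjoy long walks on the beach."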

Reranker

Use a reranker model to score and sort documents based on relevance to a query.
bash
nexa infer pqnet/bge-reranker-v2-m3-Q8_0-GGUF -m reranker --query "query" --document "a" --document "query"
  • -m reranker : Sets the model type to Reranker.
  • --query : The main query string used to evaluate document relevance.
  • --document : One or more documents to score against the query.
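
The strings in the example above are placeholders. A more illustrative invocation, where the first document should score higher than the second:
bash
nexa infer pqnet/bge-reranker-v2-m3-Q8_0-GGUF -m reranker --query "What is the capital of France?" --document "Paris is the capital of France." --document "Berlin is the capital of Germany."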

nexa serve

Launch the Nexa inference server, which exposes a REST API.

Helper menu

bash
nexa serve -h
Show the help menu for nexa serve.

Start server

For example, start a local inference server bound to 127.0.0.1:8080. The server supports OpenAI-compatible APIs, and --keepalive 600 keeps models in memory for 10 minutes between requests.
bash
nexa serve --host 127.0.0.1:8080 --keepalive 600
You can test it via:
bash
curl -X POST http://127.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ggml-org/Qwen3-1.7B-GGUF", "prompt": "What is Nexa?", "max_tokens": 100}'
This sends a POST request to the local /v1/completions endpoint, prompting the ggml-org/Qwen3-1.7B-GGUF model with “What is Nexa?” and returning up to 100 tokens.
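
Since the API is OpenAI-compatible, a chat-style request should work as well; this sketch assumes the server also exposes the standard /v1/chat/completions route:
bash
curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ggml-org/Qwen3-1.7B-GGUF", "messages": [{"role": "user", "content": "What is Nexa?"}]}'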

nexa run

Connect to a running Nexa server (via its OpenAI-compatible API) and start a chat interface. Start the server with nexa serve first.

Helper menu

bash
nexa run -h
Show the help menu for nexa run.

Run model

For example, launch an interactive streaming chat session with the ggml-org/Qwen3-1.7B-GGUF model. The model generates and displays output incrementally as tokens are produced.
bash
nexa run ggml-org/Qwen3-1.7B-GGUF
-s, --disable-stream: disable streaming and return the entire response at once.
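
For example, to receive the full response in one piece instead of a token stream:
bash
nexa run ggml-org/Qwen3-1.7B-GGUF --disable-stream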