
# CLI Reference

> This page documents all available CLI commands with usage examples.

<Note>Run `nexa` commands from the directory that contains the `nexa` executable.</Note>

## **nexa pull**

Download a model and store it locally.
After entering the pull command, you will be guided through a setup process to choose the model type, main model file, tokenizer (optional), and extra files (optional).

### **General Behavior**

After running `nexa pull <model-name>`, the CLI will prompt:

1. **Quant version selection**\
   If the model supports multiple quantized versions, you will see a menu like this:

   ```bash Quant version selection theme={"dark"}
   Choose a quant version to download
   > Q4_K_M     [1.2 GiB] (default)
     Q8_0       [2.0 GiB]
     F16        [3.8 GiB]
   ```

   Select a quant version you prefer.

2. **Download begins**\
   After selection, the model files will start downloading automatically.

### **LLM**

```bash bash theme={"dark"}
nexa pull NexaAI/Qwen3-0.6B-GGUF
```

### **VLM**

```bash bash theme={"dark"}
nexa pull NexaAI/Qwen2.5-Omni-3B-GGUF
```

### **Function Call**

```bash bash theme={"dark"}
nexa pull NexaAI/Qwen3-0.6B-GGUF
```

### **Omni Model**

```bash bash theme={"dark"}
nexa pull NexaAI/Qwen2.5-Omni-3B-GGUF
```

### **ASR**

```bash bash theme={"dark"}
nexa pull mlx-community/whisper-tiny
```

### **TTS**

```bash bash theme={"dark"}
nexa pull nexaml/Kokoro-82M-bf16-MLX
```

### **Embedder**

```bash bash theme={"dark"}
nexa pull djuna/jina-embeddings-v2-small-en-Q5_K_M-GGUF
```

### **Reranker**

```bash bash theme={"dark"}
nexa pull pqnet/bge-reranker-v2-m3-Q8_0-GGUF
```

## **nexa list**

Display all downloaded models in a table with their names and sizes.

```bash bash theme={"dark"}
nexa list
```

## **nexa remove**

Remove a specific local model by name.

For example, the command below removes the locally downloaded model `NexaAI/Qwen3-0.6B-GGUF` from the cache directory. This frees disk space, and the model cannot be used for inference again until it is re-downloaded.

```bash bash theme={"dark"}
nexa remove NexaAI/Qwen3-0.6B-GGUF
```

## **nexa clean**

Delete all locally cached models.

```bash bash theme={"dark"}
nexa clean
```

## **nexa infer**

Run inference with a specified model. The model must be downloaded and cached locally.

### **Help menu**

```bash bash theme={"dark"}
nexa infer -h
```

Show the help menu for `nexa infer`.

### **LLM**

Launch an interactive chat session with the language model.

```bash bash theme={"dark"}
nexa infer NexaAI/Qwen3-0.6B-GGUF
```

Use the `--think` option to control whether the model outputs its internal reasoning process.

* `--think=false` : The model responds directly without showing reasoning.
* `--think=true` : The model displays its reasoning steps before the final response.

Example with reasoning enabled:

```bash bash theme={"dark"}
nexa infer NexaAI/Qwen3-0.6B-GGUF --think=true
```

### **VLM**

Chat with text only, or ask about an image file (the image is supplied interactively):

```bash bash theme={"dark"}
nexa infer NexaAI/Qwen2.5-Omni-3B-GGUF
```

If you only want text input, launch the command and start chatting.\
If you want the model to respond to an image, add the **absolute path** to the image at the end of your message.\
Example prompt: `Describe this picture </path/to/image.png>`
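Because the interactive prompt expects an absolute path, it can help to expand a relative path in your shell before pasting it into the chat. The sketch below assumes a bash shell; `abs_path` is a hypothetical helper, not part of NexaCLI. The same approach applies to audio paths for Omni models.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of NexaCLI): print the absolute path of an
# existing file, ready to paste at the end of an interactive prompt.
abs_path() {
  local target="$1" dir base
  if [ ! -e "$target" ]; then
    echo "abs_path: no such file: $target" >&2
    return 1
  fi
  # Resolve the directory part to a physical path; keep the file name as-is.
  dir=$(cd "$(dirname -- "$target")" && pwd -P) || return 1
  base=$(basename -- "$target")
  if [ "$dir" = "/" ]; then
    printf '/%s\n' "$base"
  else
    printf '%s/%s\n' "$dir" "$base"
  fi
}

# Example: print the full prompt text, then paste it into `nexa infer`.
# printf 'Describe this picture %s\n' "$(abs_path ./photos/cat.png)"
```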

### **Omni Models**

Chat with text only, or ask about an audio file (the audio is supplied interactively):

```bash bash theme={"dark"}
nexa infer NexaAI/Qwen2.5-Omni-3B-GGUF
```

If you only want text input, start chatting as usual.\
If you want the model to respond to an audio clip, add the **absolute path** to the audio file at the end of your message.\
Example prompt: `Convert this audio into text </path/to/audio.mp3>`

### **ASR**

<Note>Currently, ASR is only supported on macOS using the mlx runtime.</Note>

Use ASR models to transcribe speech from audio files into text.

```bash bash theme={"dark"}
nexa infer -m asr mlx-community/whisper-tiny --input /path/to/audio.wav --language en
```

* `-m asr` : Sets the model type to ASR.
* `--input` : Specifies the input audio file.
* `--language` : Sets the language code (e.g., en for English, zh for Chinese).
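To transcribe several recordings, you can loop over a directory and run the same command once per file. This is a bash sketch that uses only the flags listed above; `transcribe_dir` is a hypothetical helper, and it assumes (not documented here) that `nexa infer` prints the transcript to standard output, which the loop saves as a `.txt` file next to each recording. Pass an absolute directory so every `--input` path is absolute.

```bash
#!/usr/bin/env bash
# Hypothetical helper: transcribe every .wav file in a directory with the flags
# documented above (-m asr, --input, --language).
# Assumption: the transcript goes to stdout, so it is saved next to each file as .txt.
transcribe_dir() {
  local dir="$1" lang="${2:-en}" wav restore
  restore=$(shopt -p nullglob) || true   # remember the caller's nullglob setting
  shopt -s nullglob                      # no matches -> no loop iterations
  for wav in "$dir"/*.wav; do
    echo "Transcribing $wav" >&2
    nexa infer -m asr mlx-community/whisper-tiny \
      --input "$wav" --language "$lang" > "${wav%.wav}.txt" < /dev/null
  done
  eval "$restore"                        # restore the original nullglob setting
}

# Usage: transcribe_dir /path/to/recordings en
```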

### **TTS**

<Note>Currently, TTS is only supported on macOS using the mlx runtime.</Note>

Use TTS models to convert input text into spoken audio.

```bash bash theme={"dark"}
nexa infer nexaml/Kokoro-82M-bf16-MLX -m tts --voice-identifier zm_yunyang -p "Hello world this is a text to speech test" -o /path/to/audio.wav
```

* `-m tts` : Sets the model type to TTS.
* `--voice-identifier`: Specifies the speaker's voice.
  <Tip>When no `--voice-identifier` is provided, NexaCLI will return a full list of supported voices in the error message. This is useful for discovering all available voice options.</Tip>
* `-p`: The text prompt to synthesize.
* `-o`: Output file for the generated .wav audio.
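For longer scripts, one option is to synthesize each line of a text file into its own numbered `.wav` file. This bash sketch reuses only the flags listed above and the `zm_yunyang` voice from the example; `synthesize_lines` and the `001.wav`, `002.wav` naming scheme are hypothetical choices for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: synthesize each non-empty line of a text file into its own
# numbered .wav file (001.wav, 002.wav, ...) with the flags shown above.
synthesize_lines() {
  local text_file="$1" out_dir="$2" voice="${3:-zm_yunyang}"
  local n=0 line
  mkdir -p "$out_dir"
  # "|| [ -n "$line" ]" also processes a last line that lacks a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    if [ -z "$line" ]; then
      continue                      # skip blank lines
    fi
    n=$((n + 1))
    # </dev/null ensures nexa cannot consume the remaining lines of the loop's input.
    nexa infer nexaml/Kokoro-82M-bf16-MLX -m tts \
      --voice-identifier "$voice" -p "$line" \
      -o "$out_dir/$(printf '%03d' "$n").wav" < /dev/null
  done < "$text_file"
}

# Usage: synthesize_lines script.txt /path/to/out zm_yunyang
```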

### **Embedder**

Generate embeddings for multiple pieces of text using an embedding model.

```bash bash theme={"dark"}
nexa infer djuna/jina-embeddings-v2-small-en-Q5_K_M-GGUF -m embedder --prompt "translate to text" --prompt "second"
```

* `-m embedder` : Sets the model type to Embedder.
* `--prompt` : Provide one or more pieces of text to embed.

### **Reranker**

Use a reranker model to score and sort documents based on relevance to a query.

```bash bash theme={"dark"}
nexa infer pqnet/bge-reranker-v2-m3-Q8_0-GGUF -m reranker --query "query" --document "a" --document "query"
```

* `-m reranker` : Sets the model type to Reranker.
* `--query` : The main query string used to evaluate document relevance.
* `--document` : One or more documents to score against the query.
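When the candidate documents come from a script, a small wrapper can turn "a query followed by any number of documents" into the repeated `--document` flags shown above. A bash sketch; `rerank` is a hypothetical wrapper name, not a NexaCLI command.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: rerank QUERY DOC [DOC...]
# Expands every document argument into its own --document flag.
rerank() {
  if [ "$#" -lt 2 ]; then
    echo "usage: rerank QUERY DOC [DOC...]" >&2
    return 2
  fi
  local query="$1"; shift
  local args=() doc
  for doc in "$@"; do
    args+=(--document "$doc")
  done
  nexa infer pqnet/bge-reranker-v2-m3-Q8_0-GGUF -m reranker --query "$query" "${args[@]}"
}

# Usage:
# rerank "What is NexaCLI?" "NexaCLI runs models locally." "Paris is in France."
```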

## **nexa serve**

Launch the Nexa inference server, which serves models over a REST API.

### **Help menu**

```bash bash theme={"dark"}
nexa serve -h
```

Show the help menu for `nexa serve`.

### **Start serve**

For example, the command below starts a local inference server bound to `127.0.0.1:8080`. The server exposes OpenAI-compatible APIs, and `--keepalive 600` keeps models loaded in memory for 10 minutes (600 seconds) between requests.

```bash bash theme={"dark"}
nexa serve --host 127.0.0.1:8080 --keepalive 600
```

You can test it via:

<CodeGroup>
  ```bash Windows (cmd) theme={"dark"}
  curl -X POST http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"NexaAI/Qwen3-0.6B-GGUF\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}], \"max_tokens\": 64}"
  ```

  ```bash macOS / Linux theme={"dark"}
  curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NexaAI/Qwen3-0.6B-GGUF",
    "messages": [{"role":"user","content":"Hello!"}],
    "max_tokens": 64
  }'
  ```
</CodeGroup>

This sends a POST request to the local `/v1/chat/completions` endpoint, prompting the `NexaAI/Qwen3-0.6B-GGUF` model with "Hello!" and returning a response of at most 64 tokens.
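Hand-escaping JSON gets error-prone as prompts grow, as the escaped quotes in the Windows example show. On macOS or Linux, a small bash function can build the request body from a model name and a prompt. This is a sketch: `json_escape` and `chat_body` are hypothetical helpers, and `json_escape` only handles backslashes, double quotes, newlines, carriage returns, and tabs, so it is not a complete JSON encoder.

```bash
#!/usr/bin/env bash
# Hypothetical helper: escape a string for use inside a JSON string literal.
# Handles backslash, double quote, newline, carriage return, and tab only;
# other control characters are not escaped.
json_escape() {
  local s="$1"
  s=${s//\\/\\\\}      # \  -> \\  (must run first)
  s=${s//\"/\\\"}      # "  -> \"
  s=${s//$'\n'/\\n}    # newline -> \n
  s=${s//$'\r'/\\r}    # carriage return -> \r
  s=${s//$'\t'/\\t}    # tab -> \t
  printf '%s' "$s"
}

# Hypothetical helper: chat_body MODEL PROMPT [MAX_TOKENS]
# Prints the same request body shape as the curl examples above.
chat_body() {
  local model prompt max_tokens="${3:-64}"
  model=$(json_escape "$1")
  prompt=$(json_escape "$2")
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": %d}\n' \
    "$model" "$prompt" "$max_tokens"
}

# Usage against the server started above:
# curl -X POST http://127.0.0.1:8080/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$(chat_body NexaAI/Qwen3-0.6B-GGUF 'Say "hi" in French')"
```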

## **nexa run**

Connect to a running Nexa server through its OpenAI-compatible API and start a chat interface. Start the server with `nexa serve` first.

### **Help menu**

```bash bash theme={"dark"}
nexa run -h
```

Show the help menu for `nexa run`.

### **Run model**

For example, launch an interactive streaming chat session with the `NexaAI/Qwen3-0.6B-GGUF` model. Output is displayed incrementally as the model generates tokens.

```bash bash theme={"dark"}
nexa run NexaAI/Qwen3-0.6B-GGUF
```

`--disable-stream`, `-s`: disable streaming and return the entire JSON response at once.
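Because `nexa run` needs a server that is already running, a script that starts `nexa serve` in the background can wait until the port accepts connections before connecting. This bash-only sketch relies on bash's `/dev/tcp` redirection (a bash feature, not a NexaCLI one); `wait_for_port` is a hypothetical helper, and the host and port simply mirror the `nexa serve` example. Which address `nexa run` connects to by default is not stated on this page, so check `nexa run -h`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: wait until HOST:PORT accepts TCP connections, or give up
# after roughly TIMEOUT seconds. Uses bash's built-in /dev/tcp redirection.
wait_for_port() {
  local host="$1" port="$2" timeout="${3:-30}" waited=0
  # The subshell opens fd 3 to the port and closes it again on exit.
  until (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "wait_for_port: $host:$port not reachable after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
}

# Usage sketch:
# nexa serve --host 127.0.0.1:8080 --keepalive 600 &
# wait_for_port 127.0.0.1 8080 60 && nexa run NexaAI/Qwen3-0.6B-GGUF
```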

