# CLI Reference

> This page documents all available CLI commands with usage examples.

Run nexa commands from the nexa executable directory.

## **nexa pull**

Download a model and store it locally. After entering the pull command, you will be guided through a setup process to choose the model type, main model file, tokenizer (optional), and extra files (optional).

### **General Behavior**

After running `nexa pull`, the CLI will prompt:

1. **Quant version selection**\
   If the model supports multiple quantized versions, you will see a menu like this:

   ```bash Quant version selection
   Choose a quant version to download
   > Q4_K_M [1.2 GiB] (default)
     Q8_0 [2.0 GiB]
     F16 [3.8 GiB]
   ```

   Select the quant version you prefer.

2. **Download begins**\
   After selection, the model files start downloading automatically.

### **LLM**

```bash
nexa pull NexaAI/Qwen3-0.6B-GGUF
```

### **VLM**

```bash
nexa pull NexaAI/Qwen2.5-Omni-3B-GGUF
```

### **Function Call**

```bash
nexa pull NexaAI/Qwen3-0.6B-GGUF
```

### **Omni Model**

```bash
nexa pull NexaAI/Qwen2.5-Omni-3B-GGUF
```

### **ASR**

```bash
nexa pull mlx-community/whisper-tiny
```

### **TTS**

```bash
nexa pull nexaml/Kokoro-82M-bf16-MLX
```

### **Embedder**

```bash
nexa pull djuna/jina-embeddings-v2-small-en-Q5_K_M-GGUF
```

### **Reranker**

```bash
nexa pull pqnet/bge-reranker-v2-m3-Q8_0-GGUF
```

## **nexa list**

Display all downloaded models in a table with their names and sizes.

```bash
nexa list
```

## **nexa remove**

Remove a specific local model by name. For example, remove the locally downloaded model NexaAI/Qwen3-0.6B-GGUF from the cache directory.
Removing a model frees disk space and makes it unavailable for inference until it is downloaded again.

```bash
nexa remove NexaAI/Qwen3-0.6B-GGUF
```

## **nexa clean**

Delete all locally cached models.

```bash
nexa clean
```

## **nexa infer**

Run inference with a specified model. The model must be downloaded and cached locally.

### **Helper menu**

```bash
nexa infer -h
```

Show the help menu for `nexa infer`.

### **LLM**

Launch an interactive chat session with the language model.

```bash
nexa infer NexaAI/Qwen3-0.6B-GGUF
```

Use the `--think` option to control whether the model outputs its internal reasoning process.

* `--think=false`: The model responds directly without showing reasoning.
* `--think=true`: The model displays its reasoning steps before the final response.

Example with reasoning enabled:

```bash
nexa infer NexaAI/Qwen3-0.6B-GGUF --think=true
```

### **VLM**

Chat with text only, or ask about an image file (interactive image input):

```bash
nexa infer NexaAI/Qwen2.5-Omni-3B-GGUF
```

If you only want text input, simply launch the command and begin chatting.\
If you'd like the model to respond to an image, provide the **absolute path** to the image at the end of your message.\
Example prompt: `Describe this picture `

### **Omni Models**

Chat with text only, or ask about an audio file (interactive audio input):

```bash
nexa infer ggml-org/Qwen2.5-Omni-3B-GGUF
```

For text-only chat, start chatting as usual.
If you'd like the model to respond to an audio file, provide the **absolute path** to the file at the end of your message.\
Example prompt: `Convert this audio into text `

### **ASR**

Currently, ASR is only supported on macOS using the MLX runtime.

Use ASR models to transcribe speech from audio files into text.

```bash
nexa infer -m asr mlx-community/whisper-tiny --input < /path/to/audio.wav > --language en
```

* `-m asr`: Sets the model type to ASR.
* `--input`: Specifies the input audio file.
* `--language`: Sets the language code (e.g., en for English, zh for Chinese).

### **TTS**

Currently, TTS is only supported on macOS using the MLX runtime.

Use TTS models to convert input text into spoken audio.

```bash
nexa infer nexaml/Kokoro-82M-bf16-MLX -m tts --voice-identifier zm_yunyang -p "Hello world this is a text to speech test" -o < /path/to/audio.wav >
```

* `-m tts`: Sets the model type to TTS.
* `--voice-identifier`: Specifies the speaker's voice. When no `--voice-identifier` is provided, NexaCLI returns the full list of supported voices in the error message, which is a quick way to discover the available voice options.
* `-p`: The text prompt to synthesize.
* `-o`: Output file for the generated .wav audio.

### **Embedder**

Generate embeddings for multiple pieces of text using an embedding model.

```bash
nexa infer djuna/jina-embeddings-v2-small-en-Q5_K_M-GGUF -m embedder --prompt "translate to text" --prompt "second"
```

* `-m embedder`: Sets the model type to Embedder.
* `--prompt`: Provide one or more pieces of text to embed.

### **Reranker**

Use a reranker model to score and sort documents based on relevance to a query.

```bash
nexa infer pqnet/bge-reranker-v2-m3-Q8_0-GGUF -m reranker --query "query" --document "a" --document "query"
```

* `-m reranker`: Sets the model type to Reranker.
* `--query`: The main query string used to evaluate document relevance.
* `--document`: One or more documents to score against the query.

## **nexa serve**

Launch the Nexa inference server, which exposes a REST API.

### **Helper menu**

```bash
nexa serve -h
```

Show the help menu for `nexa serve`.

### **Start serve**

For example, start a local inference server bound to 127.0.0.1:8080. The server supports OpenAI-compatible APIs, and `--keepalive 600` keeps models in memory for 10 minutes between requests.

```bash
nexa serve --host 127.0.0.1:8080 --keepalive 600
```

You can test it via:

```bash Windows (cmd)
curl -X POST http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"NexaAI/Qwen3-0.6B-GGUF\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}], \"max_tokens\": 64}"
```

```bash macOS
curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NexaAI/Qwen3-0.6B-GGUF",
    "messages": [{"role":"user","content":"Hello!"}],
    "max_tokens": 64
  }'
```

This sends a POST request to the local `/v1/chat/completions` endpoint, prompting the NexaAI/Qwen3-0.6B-GGUF model with "Hello!" and returning a response of up to 64 tokens.

## **nexa run**

Connect to a running Nexa server (via the OpenAI-compatible API) and start a chat interface. Start the server first.

### **Helper menu**

```bash
nexa run -h
```

Show the help menu for `nexa run`.

### **Run model**

For example, launch an interactive streaming chat session with the NexaAI/Qwen3-0.6B-GGUF model. The model generates and displays output incrementally as tokens are produced.

```bash
nexa run NexaAI/Qwen3-0.6B-GGUF
```

`--disable-stream|-s`: Disable streaming and return the entire JSON response at once.
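
The `curl` test from the `nexa serve` section can be wrapped in a small shell helper so that the model name and message are passed as arguments. This is a minimal sketch and not part of the Nexa CLI: `json_escape`, `build_chat_payload`, `nexa_chat`, and the `NEXA_URL` variable are hypothetical names introduced here. It assumes a server is already listening on 127.0.0.1:8080, as in the `nexa serve` example.

```shell
# Escape backslashes and double quotes so a value can sit inside a JSON string.
# Newlines and other control characters are NOT handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the request body for the chat completions endpoint.
# $1 = model name, $2 = user message, $3 = max_tokens (defaults to 64)
build_chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": %d}' \
    "$(json_escape "$1")" "$(json_escape "$2")" "${3:-64}"
}

# Send one message to a running server and print the raw JSON response.
# NEXA_URL is a hypothetical override for this script; the Nexa CLI does not read it.
nexa_chat() {
  curl -sS -X POST "${NEXA_URL:-http://127.0.0.1:8080}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$(build_chat_payload "$@")"
}
```

With the server running, `nexa_chat "NexaAI/Qwen3-0.6B-GGUF" "Hello!"` sends the same request as the curl examples. Building the body in a separate function keeps the quoting in one place, which avoids the hand-escaped JSON that the Windows example needs.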
