# REST API

> Local OpenAI-compatible API for text generation, embeddings, and more

## **Getting Started**

If you haven't already, pull a model before making requests. For example, pull Qwen3:

```bash bash theme={"dark"}
nexa pull NexaAI/Qwen3-0.6B-GGUF
```

To use the API, open a terminal from the project root and start the Nexa server:

```bash bash theme={"dark"}
nexa serve
```

The server runs on `http://127.0.0.1:18181` by default. \
Keep the terminal that runs the server open, and make your requests from another terminal tab. \
To see the full list of configurable server options, run `nexa serve -h`.

You can try out the Nexa server with any HTTP tool, but the quickest way to get started is to run:

```bash bash theme={"dark"}
nexa run NexaAI/Qwen3-0.6B-GGUF
```

You can replace the model name with any model that has been pulled with `nexa pull`. Like `nexa infer`, the `run` command starts a REPL conversation UI, but it fulfills your chat by sending requests to the server started by your `nexa serve` command.

## Model Choice

Certain models run only on specific platforms. For example, MLX models run only on devices with macOS 13+, and OmniNeural runs only on a Qualcomm laptop with an NPU. The table below lists example models to try on each OS:

| OS                         | Modality         | Recommended Model                   |
| -------------------------- | ---------------- | ----------------------------------- |
| **macOS**                  | LLM              | NexaAI/gpt-oss-20b-MLX-4bit         |
| **macOS**                  | VLM              | NexaAI/gemma-3n-E4B-it-4bit-MLX     |
| **macOS**                  | Image Generation | NexaAI/sdxl-turbo                   |
| **macOS**                  | ASR              | NexaAI/whisper-large-v3-turbo-MLX   |
| **macOS**                  | TTS              | NexaAI/Kokoro-82M-bf16-MLX          |
| **Windows x86**            | LLM              | NexaAI/Qwen3-4B-GGUF                |
| **Windows x86**            | VLM              | NexaAI/gemma-3n                     |
| **Windows x86**            | Image Generation | NexaAI/Prefect-illustrious-XL-v2.0p |
| **Windows Qualcomm ARM64** | LLM              | NexaAI/Qwen3-4B-npu                 |
| **Windows Qualcomm ARM64** | VLM              | NexaAI/OmniNeural-4B                |
| **Windows Qualcomm ARM64** | ASR              | NexaAI/parakeet-tdt-0.6b-v3-npu     |
| **Windows AMD NPU**        | Image Generation | NexaAI/sdxl-turbo-amd-npu           |
| **Windows Intel NPU**      | LLM              | NexaAI/llama-3.1-8B-intel-npu       |

## **/v1/chat/completions**

Creates a model response for a given conversation. Supports LLM (text-only) and VLM (image + text) models.

### **Use LLM**

#### **Request body**

```json Example Value theme={"dark"}
{
  "model": "NexaAI/Qwen3-0.6B-GGUF",
  "messages": [
    {"role": "user", "content": "Hello! Briefly introduce yourself."}
  ],
  "max_tokens": 256,
  "temperature": 0.7,
  "stream": false
}
```
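
The request body above sets `"stream": false`. This page does not describe what a streamed response looks like, so the following is only a sketch under the assumption that, with `"stream": true`, the server emits OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`). The function name `iter_stream_text` and the sample lines are hypothetical and exist only for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines.

    ASSUMPTION: each event is a `data: {json}` line whose text lives at
    choices[i].delta.content, and the stream ends with `data: [DONE]`.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hypothetical sample lines, not captured from a real server.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "data: [DONE]",
]
reply = "".join(iter_stream_text(sample))
print(reply)  # Hello!
```

If the real stream differs from this assumed shape, only the key lookups inside the loop should need to change.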

#### **Usage Example**

<CodeGroup>
  ```bash Windows (cmd) theme={"dark"}
  curl -X POST http://127.0.0.1:18181/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"NexaAI/Qwen3-0.6B-GGUF\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}], \"max_tokens\": 64}"
  ```

  ```bash MacOS theme={"dark"}
  curl -X POST http://127.0.0.1:18181/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "NexaAI/Qwen3-0.6B-GGUF",
      "messages": [{"role":"user","content":"Hello!"}],
      "max_tokens": 64
    }'
  ```
</CodeGroup>
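
The same request can be sent from Python with only the standard library. This is a minimal sketch, not an official client: the helper names `build_chat_request` and `chat` are hypothetical, and reading the reply from `choices[0].message.content` assumes the response follows the OpenAI chat completion shape, which this page implies but does not spell out.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:18181"  # default address used by `nexa serve`


def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the reply text.

    ASSUMPTION: the response is OpenAI-shaped, with the generated text
    at choices[0].message.content.
    """
    request = build_chat_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running in another terminal, `print(chat("NexaAI/Qwen3-0.6B-GGUF", "Hello!"))` would send the same request as the curl commands above.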

### **Use VLM**

#### **Request body**

`image_url` can be one of the following:

* A remote URL, e.g., `https://example.com/photo.jpg`
* A base64-encoded string, e.g., `data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD...`
* A local file path on the server machine, which the server must be able to read, e.g., `file:///C:/Users/Username/Pictures/photo.jpg` (Windows) or `file:///Users/Username/Pictures/photo.jpg` (macOS/Linux). The `file://` prefix is optional: you can provide just the path, such as `C:/Users/Username/Pictures/photo.jpg` or `/Users/Username/Pictures/photo.jpg`.

```json Example Value theme={"dark"}
{
  "model": "NexaAI/qwen3vl-GGUF",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image succinctly."},
        {"type": "image_url", "image_url": {"url": "</path/to/image>"}}
      ]
    }
  ]
}
```
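
When the client and the server do not share a filesystem, the base64 form is the practical choice. The sketch below shows one way to turn a local image into a `data:<mime>;base64,...` string and place it in the `messages` array shown above. The helper names `image_to_data_uri` and `build_vlm_messages` are hypothetical, and the MIME type is guessed from the file extension.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_uri(path: str) -> str:
    """Encode a local image file as a `data:<mime>;base64,<payload>` string."""
    mime, _ = mimetypes.guess_type(path)  # guessed from the file extension
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_vlm_messages(prompt: str, image_url: str) -> list:
    """Build the `messages` array for a single text + image user turn."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]
```

For example, `build_vlm_messages("Describe this image succinctly.", image_to_data_uri("photo.jpg"))` produces a `messages` value that can be dropped into the request body.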

#### **Usage Example**

<CodeGroup>
  ```bash Windows (cmd) theme={"dark"}
  curl -X POST http://127.0.0.1:18181/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"NexaAI/qwen3vl-GGUF\", \"messages\": [{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": \"What is the main color of the picture?\"}, {\"type\": \"image_url\", \"image_url\": {\"url\": \"</path/to/image>\"}}]}], \"stream\": false}"
  ```

  ```bash MacOS theme={"dark"}
  curl -X POST http://127.0.0.1:18181/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NexaAI/qwen3vl-GGUF",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What is the main color of the picture?"
          },
          {
            "type": "image_url",
            "image_url": {"url": "</path/to/image>"}
          }
        ]
      }
    ],
    "stream": false
  }'
  ```
</CodeGroup>

## **/v1/images/generations**

Creates an image based on a given prompt.

<Note>
  The example below uses `NexaAI/Prefect-illustrious-XL-v2.0p-fp16-cuda` as the model, which is recommended for most CUDA (NVIDIA GPU) environments.

  If you are running on **Apple Silicon**, use an MLX-compatible model (e.g., `nexaml/sdxl-turbo-ryzen-ai`).

  Always make sure the model you select matches your hardware capabilities.
</Note>

### **Request body**

```json Example Value theme={"dark"}
{
  "model": "NexaAI/Prefect-illustrious-XL-v2.0p-fp16-cuda",
  "prompt": "A white cat with blue eyes",
  "n": 1,
  "size": "512x512",
  "response_format": "url"
}
```

### **Usage Example**

<CodeGroup>
  ```bash Windows (cmd) theme={"dark"}
  curl -X POST http://127.0.0.1:18181/v1/images/generations -H "Content-Type: application/json" -d "{\"model\":\"NexaAI/Prefect-illustrious-XL-v2.0p-fp16-cuda\",\"prompt\":\"A white cat with blue eyes\",\"n\":1,\"size\":\"512x512\",\"response_format\":\"url\"}"
  ```

  ```bash MacOS theme={"dark"}
  curl -X POST http://127.0.0.1:18181/v1/images/generations \
    -H "Content-Type: application/json" \
    -d '{
      "model": "stabilityai/sdxl-turbo",
      "prompt": "A white cat with blue eyes",
      "n": 1,
      "size": "512x512",
      "response_format": "url"
    }'
  ```
</CodeGroup>
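
For scripted use, the request body can be assembled and checked in Python before it is sent. This is a sketch under stated assumptions: `build_image_payload` and `generate_image` are hypothetical helpers, the `WIDTHxHEIGHT` check mirrors the `"512x512"` example rather than a documented rule, and the response is returned as decoded JSON without assuming its fields.

```python
import json
import re
import urllib.request

BASE_URL = "http://127.0.0.1:18181"  # default address used by `nexa serve`


def build_image_payload(model: str, prompt: str, size: str = "512x512", n: int = 1) -> dict:
    """Build the JSON body, checking that `size` looks like WIDTHxHEIGHT."""
    if not re.fullmatch(r"[1-9]\d*x[1-9]\d*", size):
        raise ValueError(f"size must look like '512x512', got {size!r}")
    if n < 1:
        raise ValueError("n must be at least 1")
    return {
        "model": model,
        "prompt": prompt,
        "n": n,
        "size": size,
        "response_format": "url",
    }


def generate_image(payload: dict, timeout: float = 600.0) -> dict:
    """POST the payload to /v1/images/generations and return the decoded JSON."""
    request = urllib.request.Request(
        f"{BASE_URL}/v1/images/generations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Validating on the client side catches a malformed `size` before the server spends time loading the model.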

## **/v1/embeddings**

Creates an embedding vector for the given input.

<Note>
  Use this endpoint when you need to **convert text or document chunks into vectors** for indexing in a retrieval system.

  Make sure you select a model that supports **embeddings** (e.g., `djuna/jina-embeddings-*`). Calling this API with a non-embedding model will result in an error.
</Note>

### **Minimal request body**

```json Example Value theme={"dark"}
{
  "model": "djuna/jina-embeddings-v2-small-en-Q5_K_M-GGUF",
  "input": "Hello, world!"
}
```

### **Usage Example**

<CodeGroup>
  ```bash Windows (cmd) theme={"dark"}
  curl -X POST http://127.0.0.1:18181/v1/embeddings -H "Content-Type: application/json" -d "{\"model\":\"djuna/jina-embeddings-v2-small-en-Q5_K_M-GGUF\",\"input\":\"Hello, world!\"}"
  ```

  ```bash MacOS theme={"dark"}
  curl -X POST http://127.0.0.1:18181/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{
      "model": "djuna/jina-embeddings-v2-small-en-Q5_K_M-GGUF",
      "input": "Hello, world!"
    }'
  ```
</CodeGroup>
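
Once text has been converted to vectors, a retrieval system typically compares a query vector against stored document vectors. The sketch below uses cosine similarity for that step. It is one common choice, not something this API prescribes, and the helper names and toy two-dimensional vectors are hypothetical stand-ins for real embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], docs: Sequence[Sequence[float]], k: int = 3) -> List[Tuple[int, float]]:
    """Return (document index, similarity) pairs for the k most similar vectors."""
    scored = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(docs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors for illustration only; real embeddings have many more dimensions.
query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = top_k(query_vec, doc_vecs, k=2)
print([index for index, _ in best])  # [0, 2]
```

The indices returned by `top_k` can then be used to look up the original text chunks.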

## **/v1/reranking**

Reranks documents by their relevance to a query. Returns a list of relevance scores aligned with the input order (higher = more relevant).

<Note>
  Use this endpoint **after a coarse retrieval step** (e.g., embeddings Top-K) to improve final ranking quality.

  Ensure the selected model supports reranking. Calling this API with a non-reranking model will result in an error.
</Note>

### **Minimal request body**

```json Example Value theme={"dark"}
{
  "model": "NexaAI/jina-v2-rerank-npu",
  "query": "What is machine learning?",
  "documents": [
    "Machine learning is a subset of artificial intelligence.",
    "Machine learning algorithms learn patterns from data.",
    "The weather is sunny today.",
    "Deep learning is a type of machine learning."
  ],
  "batch_size": 4,
  "normalize": true,
  "normalize_method": "softmax"
}
```

### **Usage Example**

<CodeGroup>
  ```bash Windows (cmd) theme={"dark"}
  curl -X POST http://127.0.0.1:18181/v1/reranking -H "Content-Type: application/json" -d "{\"model\":\"NexaAI/jina-v2-rerank-npu\",\"query\":\"What is machine learning?\",\"documents\":[\"Machine learning is a subset of artificial intelligence.\",\"Machine learning algorithms learn patterns from data.\",\"The weather is sunny today.\",\"Deep learning is a type of machine learning.\"],\"batch_size\":4,\"normalize\":true,\"normalize_method\":\"softmax\"}"
  ```

  ```bash MacOS theme={"dark"}
  curl -X POST http://127.0.0.1:18181/v1/reranking \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NexaAI/jina-v2-rerank-npu",
    "query": "What is machine learning?",
    "documents": [
      "Machine learning is a subset of artificial intelligence.",
      "Machine learning algorithms learn patterns from data.",
      "The weather is sunny today.",
      "Deep learning is a type of machine learning."
    ],
    "batch_size": 4,
    "normalize": true,
    "normalize_method": "softmax"
  }'
  ```
</CodeGroup>
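
Because the scores come back in input order rather than sorted, the caller pairs each score with its document and sorts. A minimal sketch follows: `rank_documents` is a hypothetical helper, and the score values are made up for illustration, not real model output.

```python
from typing import List, Sequence, Tuple


def rank_documents(documents: Sequence[str], scores: Sequence[float]) -> List[Tuple[float, str]]:
    """Pair each document with its score and sort from most to least relevant."""
    if len(documents) != len(scores):
        raise ValueError("expected exactly one score per document")
    return sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)


documents = [
    "Machine learning is a subset of artificial intelligence.",
    "Machine learning algorithms learn patterns from data.",
    "The weather is sunny today.",
    "Deep learning is a type of machine learning.",
]
# Hypothetical scores in the same order as `documents`.
scores = [0.40, 0.35, 0.01, 0.24]

ranked = rank_documents(documents, scores)
for score, text in ranked:
    print(f"{score:.2f}  {text}")
```

Sorting on the score alone keeps documents with equal scores in their original input order, since Python's sort is stable.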

