Getting Started

If you haven’t already, pull a model before making requests. For example, pull Qwen3:
```bash
nexa pull ggml-org/Qwen3-1.7B-GGUF
```
Follow the prompts:
  • Model type: llm
  • Select quant version: no quant selection
Then start the server. To use the API, start the Nexa server from the project root:
```bash
nexa serve
```
The server runs on http://127.0.0.1:18181 by default.

API Endpoints

/v1/completions

Generate a single-turn text completion from a prompt.
  • endpoint : the API path appended to the base URL, for example http://127.0.0.1:18181/v1/completions. Available paths:
    • /completions
    • /chat/completions
    • /embeddings
    • /reranking
  • model-name : the identifier of a pulled model, for example ggml-org/Qwen3-0.6B-GGUF
Example Request:
```bash
curl -X POST http://127.0.0.1:18181/v1/completions -H "Content-Type: application/json" -d "{\"model\": \"ggml-org/Qwen3-1.7B-GGUF\", \"prompt\": \"Write a hello world program in Python\", \"max_tokens\": 150}"
```
This sends a prompt to the model asking for a Python hello world program and returns the full generated response.
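The same request can be issued programmatically. The sketch below builds the POST request with only the Python standard library; the host, port, and model name are taken from this page's defaults and are not verified here, and the actual send is left commented out so the snippet runs without a live server.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:18181"  # default Nexa server address

def build_completion_request(model, prompt, max_tokens=150):
    """Build (but do not send) a POST request for /v1/completions."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_completion_request(
    "ggml-org/Qwen3-1.7B-GGUF",
    "Write a hello world program in Python",
)
# With the server running, send it with:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Keeping the payload as a Python dict and serializing it with `json.dumps` avoids the shell-escaping needed in the curl form above.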

/v1/chat/completions

Perform multi-turn conversations with role-based messages. Supports streaming, multimodal input, and function tools. Example:
```bash
curl -X POST http://127.0.0.1:18181/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"ggml-org/Qwen3-1.7B-GGUF\", \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}], \"stream\": false}"
```
Sample Response:
```json
{
  "id": "",
  "choices": [
    {
      "finish_reason": "",
      "index": 0,
      "logprobs": {
        "content": null,
        "refusal": null
      },
      "message": {
        "content": "<think>\nOkay, the user is asking for the capital of France. I need to make sure I recall the correct answer. France's capital is Paris. Let me think... Yes, Paris is the capital city. I should confirm that there isn't any other city that's considered the capital. For example, maybe some other city like Lyon or Marseille? No, those are provinces. Paris is the capital. I should also mention that it's a major city in the country. Let me check if there's any recent change, but I don't think so. The answer is straightforward. Just state Paris as the capital.\n</think>\n\nThe",
        "refusal": "",
        "role": "assistant",
        "annotations": null,
        "audio": {
          "id": "",
          "data": "",
          "expires_at": 0,
          "transcript": ""
        },
        "function_call": {
          "arguments": "",
          "name": ""
        },
        "tool_calls": null
      }
    }
  ],
  "created": 0,
  "model": "",
  "object": "chat.completion",
  "service_tier": "",
  "system_fingerprint": "",
  "usage": {
    "completion_tokens": 0,
    "prompt_tokens": 0,
    "total_tokens": 0,
    "completion_tokens_details": {
      "accepted_prediction_tokens": 0,
      "audio_tokens": 0,
      "reasoning_tokens": 0,
      "rejected_prediction_tokens": 0
    },
    "prompt_tokens_details": {
      "audio_tokens": 0,
      "cached_tokens": 0
    }
  }
}
```
This sends a user question to the model in chat format and returns a streamed or non-streamed response, depending on the stream flag.
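Once a response like the one above is in hand, the assistant text lives at choices[0].message.content. Qwen3 interleaves its reasoning inside a <think>...</think> block, which you may want to strip before display. A minimal sketch (the payload here is abbreviated from the sample response above):

```python
import json
import re

# Abbreviated chat.completion payload, shaped like the sample response above.
raw = json.dumps({
    "object": "chat.completion",
    "choices": [{
        "index": 0,
        "message": {
            "role": "assistant",
            "content": "<think>\nRecalling... Paris is the capital.\n</think>\n\nThe capital of France is Paris.",
        },
    }],
})

def extract_answer(response_json):
    """Return the assistant text with any <think>...</think> block removed."""
    content = json.loads(response_json)["choices"][0]["message"]["content"]
    return re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()

print(extract_answer(raw))
```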

Function Calling

Enable models to call functions by providing tool definitions in your request. First, pull a model that supports function calling:
```bash
nexa pull Qwen/Qwen2.5-1.5B-Instruct-GGUF
```
Follow the prompts:
  • Model type: llm
  • Select quant version: choose any quantization you prefer
Then send requests with tool definitions:
```bash
curl -X POST http://127.0.0.1:18181/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"Qwen/Qwen2.5-1.5B-Instruct-GGUF\", \"messages\": [{\"role\": \"user\", \"content\": \"What is the weather like in Boston today?\"}], \"tools\": [{\"type\": \"function\", \"function\": {\"name\": \"get_current_weather\", \"description\": \"Get the current weather in a given location\", \"parameters\": {\"type\": \"object\", \"properties\": {\"location\": {\"type\": \"string\", \"description\": \"The city and state, e.g. San Francisco, CA\"}, \"unit\": {\"type\": \"string\", \"enum\": [\"celsius\", \"fahrenheit\"]}}, \"required\": [\"location\"]}}}]}"
```
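The tools array in the request above follows the OpenAI-style function schema, and building it in Python keeps the quoting manageable. This sketch only constructs the payload; the weather tool is the illustrative example from the request above, not a real callable function.

```python
import json

def make_tool(name, description, properties, required):
    """Wrap a JSON-schema parameter spec in the OpenAI-style tool envelope."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }

weather_tool = make_tool(
    "get_current_weather",
    "Get the current weather in a given location",
    {
        "location": {"type": "string",
                     "description": "The city and state, e.g. San Francisco, CA"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    ["location"],
)

payload = {
    "model": "Qwen/Qwen2.5-1.5B-Instruct-GGUF",
    "messages": [{"role": "user",
                  "content": "What is the weather like in Boston today?"}],
    "tools": [weather_tool],
}
print(json.dumps(payload, indent=2))
```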

Using Image Models

For vision language models (VLMs), first pull a model with image support:
```bash
nexa pull ggml-org/gemma-3-4b-it-GGUF
```
Follow the prompts:
  • Model type: vlm
  • Select quant version: choose any quantization you prefer
Then send requests with both text and image content. Replace the image path in the code block below with the path to a local image file:
```bash
curl -X POST http://127.0.0.1:18181/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"ggml-org/gemma-3-4b-it-GGUF\", \"messages\": [{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": \"what is main color of the picture\"}, {\"type\": \"image_url\", \"image_url\": \"</path/to/image/file>\"}]}], \"stream\": false}"
```
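The mixed text-and-image content list can likewise be assembled in Python. This sketch only builds the message structure; the image path stays a placeholder exactly as in the curl example, and whether the server accepts other image encodings is not verified here.

```python
import json

def image_message(text, image_path):
    """Build a user message mixing a text part and an image_url part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": image_path},
        ],
    }

payload = {
    "model": "ggml-org/gemma-3-4b-it-GGUF",
    "messages": [image_message("what is main color of the picture",
                               "</path/to/image/file>")],
    "stream": False,
}
print(json.dumps(payload))
```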

Using Audio Models

For audio language models, first pull a model with audio support:
```bash
nexa pull ggml-org/Qwen2.5-Omni-3B-GGUF
```
Follow the prompts:
  • Model type: vlm
  • Select quant version: choose any quantization you prefer
Then send requests with both text and audio content. Replace the audio path in the code block below with the path to a local audio file:
```bash
curl -X POST http://127.0.0.1:18181/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"ggml-org/Qwen2.5-Omni-3B-GGUF\", \"messages\": [{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": \"translate to text\"}, {\"type\": \"input_audio\", \"input_audio\": \"< /path/to/audio >\"}]}], \"stream\": false}"
```
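Audio requests differ from the image case only in the content part type (input_audio instead of image_url). A minimal sketch of the message body, with the audio path kept as the placeholder from the curl example above:

```python
import json

payload = {
    "model": "ggml-org/Qwen2.5-Omni-3B-GGUF",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "translate to text"},
            # Placeholder path, as in the curl example; replace before sending.
            {"type": "input_audio", "input_audio": "< /path/to/audio >"},
        ],
    }],
    "stream": False,
}
print(json.dumps(payload))
```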

/v1/embeddings

Get vector representations of input texts using an embedding model. Prerequisites: Download an embedding model first:
```bash
nexa pull djuna/jina-embeddings-v2-small-en-Q5_K_M-GGUF
```
Follow the prompts:
  • Model type: embedder
  • Select quant version: choose any quantization you prefer
Example Request:
```bash
curl -X POST http://127.0.0.1:18181/v1/embeddings -H "Content-Type: application/json" -d "{\"model\": \"djuna/jina-embeddings-v2-small-en-Q5_K_M-GGUF\", \"input\": [\"hello\", \"world\"]}"
```
This returns numerical embeddings (vector representations) for the input words “hello” and “world”, useful for similarity or retrieval tasks.
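A common follow-up to /v1/embeddings is comparing the returned vectors. The cosine-similarity helper below is self-contained; the two short vectors are dummy stand-ins for the response's embedding fields, not real model output (actual embedding dimensions depend on the model).

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Dummy vectors standing in for the "embedding" fields of the response.
hello = [0.1, 0.3, 0.5]
world = [0.2, 0.1, 0.4]
print(cosine_similarity(hello, world))
```

A score near 1.0 means the inputs are semantically close; identical vectors score exactly 1.0.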

/v1/reranking

Given a query and candidate documents, return a relevance score for each. Prerequisites: Download a reranker model first:
```bash
nexa pull pqnet/bge-reranker-v2-m3-Q8_0-GGUF
```
Follow the prompts:
  • Model type: reranker
  • Select quant version: choose any quantization you prefer
Example Request:
```bash
curl -X POST http://127.0.0.1:18181/v1/reranking -H "Content-Type: application/json" -d "{\"model\": \"pqnet/bge-reranker-v2-m3-Q8_0-GGUF\", \"query\": \"hi\", \"documents\": [\"hello\", \"world\"]}"
```
This evaluates which of the provided documents is most relevant to the query “hi” by returning a list of relevance scores.
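The reranking response is a list of scores aligned with the input documents, so a typical client sorts the documents by score. The scores here are made-up placeholders, and the exact response field names are an assumption not verified against the server.

```python
documents = ["hello", "world"]
scores = [0.87, 0.12]  # placeholder relevance scores, one per document

# Pair each document with its score and sort, most relevant first.
ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
for doc, score in ranked:
    print(f"{score:.2f}  {doc}")
```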