
Model Name Mapping

For all CoreML (ANE) models, we use an internal name mapping, so fill in the plugin ID accordingly. For GGUF-format models (running on CPU/GPU), you do not need to provide the plugin ID or model name; the plugin parameter is not required.
Model Name                        Plugin ID   Huggingface repository name
NexaAI/EmbedNeural-ANE            ane         NexaAI/EmbedNeural-ANE
NexaAI/parakeet-tdt-0.6b-v3-ane   ane         NexaAI/parakeet-tdt-0.6b-v3-ane

ASR Usage

Automatic Speech Recognition for audio transcription.

Basic Usage

import NexaSdk
import Foundation

// Load model
let asr = try Asr()
try await asr.load(from: URL(fileURLWithPath: "<path/to/model/dir>"))

// Transcribe audio file
let result = try await asr.transcribe(options: .init(audioPath: "<your-audio-path>"))
print(result.asrResult)

ASR Stream Mode

import NexaSdk
import Foundation

// Load model
let asr = try Asr()
try await asr.load(from: URL(fileURLWithPath: "<path/to/model/dir>"))

do {
    // Start recording and streaming; defer ensures the stream stops on every exit path
    let stream = try asr.startRecordingStream()
    defer { asr.stopRecordingStream() }
    for try await content in stream {
        print(content)
    }
} catch {
    print("ASR streaming failed: \(error)")
}

API Reference

Core Methods

func load(from repoFolder: URL) async throws
  • Loads an ASR model from a HuggingFace-format local repository folder
  • Parameters:
    • repoFolder: The folder containing the HuggingFace model files
  • Returns: None
  • Throws: Error if the model fails to load
  • Note: This is an async function and must be awaited
func startRecordingStream(config: ASRStreamConfig = .init(), block tapBlock: AVAudioNodeTapBlock? = nil) throws -> AsyncThrowingStream<String, Error>
  • Starts audio recording and ASR streaming simultaneously.
  • Parameters:
    • config: Streaming configuration.
    • tapBlock: Optional tap block to inspect or process audio samples.
  • Returns: A stream that yields partial or final transcription text.
  • Throws: Error if the audio session, engine, or streaming session fails to start.
func stopRecordingStream()
  • Stops both audio recording and ASR streaming.
  • Returns: None
func startRecording(block tapBlock: AVAudioNodeTapBlock? = nil) throws
  • Starts audio recording only.
  • Parameters:
    • tapBlock: Optional tap block to inspect or process audio.
  • Throws: Error if audio session or engine fails to start.
  • Returns: None
func stopRecording()
  • Stops the current audio recording session.
  • Returns: None
func startStream(config: ASRStreamConfig = .init()) throws -> AsyncThrowingStream<String, Error>
  • Starts ASR streaming mode.
  • Parameters:
    • config: Streaming configuration.
  • Returns: A stream that yields partial or final transcription text.
func stopStream(graceful: Bool = true)
  • Stops ASR streaming.
  • Parameters:
    • graceful:
      • true: Process remaining buffered audio before stopping (default).
      • false: Stop immediately.
  • Returns: None
func streamPushSamples(samples: [Float]) throws
  • Pushes raw audio samples into the streaming ASR pipeline for processing
  • Parameters:
    • samples: An array of PCM audio samples
  • Returns: None
  • Throws: Error if the streaming session is not active or the audio buffer cannot be processed
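
The recording-free streaming APIs above can be combined when you capture audio yourself. The sketch below assumes 16 kHz mono Float PCM; the pushed samples are placeholder silence, standing in for your own capture pipeline.

import NexaSdk
import Foundation

let asr = try Asr()
try await asr.load(from: URL(fileURLWithPath: "<path/to/model/dir>"))

// Start a streaming session without the built-in recorder
let stream = try asr.startStream(config: .init())

// Consume transcription text as it arrives
let consumer = Task {
    for try await text in stream {
        print(text)
    }
}

// Push raw PCM samples from your own pipeline (placeholder silence here)
let samples = [Float](repeating: 0, count: 512)
try asr.streamPushSamples(samples: samples)

// Process remaining buffered audio, then stop
asr.stopStream(graceful: true)
try await consumer.value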

AsrResult

/// ASR transcription result
public struct AsrResult: Codable {
    /// Transcribed text
    public let transcript: String
    /// Confidence scores for each unit
    public let confidences: [Float]
    /// Timestamp pairs: [start, end] for each unit
    public let timestamps: [Float]
}

AsrResponse

public struct AsrResponse: Codable {
    public let asrResult: AsrResult
    public let profileData: ProfileData?
}

AsrOptions

public struct AsrOptions: Codable {
    public let modelPath: String
    public let language: Language
}

public enum Language: String, Codable {
    case en
    case ch
}

ASR streaming configuration

public struct ASRStreamConfig {
    // Timestamp mode
    public enum TimestampMode: String {
        case segment
        case word
        case none
    }

    // Language (default: .en)
    public var language: Language = .en
    // Duration in seconds for each chunk (default: 4.0)
    public var chunkDuration: Float
    // Overlap between chunks in seconds (default: 3.0)
    public var overlapDuration: Float
    // Audio sample rate (default: 16000)
    public var sampleRate: Int32
    // Maximum chunks in processing queue (default: 10)
    public var maxQueueSize: Int32
    // Audio buffer size for input (default: 512)
    public var bufferSize: Int32
    // Timestamp mode: "none", "segment", "word" (default: none)
    public var timestamps: TimestampMode
    // Beam search size (default: 5)
    public var beamSize: Int32
}
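
A configuration sketch using the fields above; it assumes ASRStreamConfig's empty initializer (the default used by startRecordingStream) and overrides a few values:

// Sketch: tweak streaming behavior before starting
var config = ASRStreamConfig()
config.language = .en
config.chunkDuration = 4.0       // seconds of audio per chunk
config.overlapDuration = 3.0     // seconds shared between consecutive chunks
config.timestamps = .word        // emit per-word timestamps
let stream = try asr.startRecordingStream(config: config)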

Embeddings Usage

Generate vector embeddings for semantic search and RAG applications.

Basic Usage

import NexaSdk
import Foundation

// Load embedder for ANE model
let repoDir = URL(fileURLWithPath: "<path/to/model/dir>")
let embedder = try Embedder(from: repoDir, plugin: .ane)

// Generate embeddings for multiple texts
let texts = ["<your-text1>", "<your-text2>"]
let result = try embedder.embed(texts: texts, config: .init(batchSize: Int32(texts.count)))
for (i, vec) in result.embeddings.enumerated() {
    let head = vec.prefix(10)
    print("[\(i)]", Array(head))
}

API Reference

Core Methods

convenience init(from repoFolder: URL, plugin: Plugin = .cpu_gpu) throws
  • Initializes an instance using a model stored in a local repository folder
  • Parameters:
    • repoFolder: Path to the local model repository folder
    • plugin: Backend plugin to use (cpu_gpu by default)
  • Returns: An initialized instance
  • Throws: Error if model loading or initialization fails
func embed(inputIds: [[Int32]], config: EmbeddingConfig) throws -> EmbedResult
  • Generates embeddings from pre-tokenized input IDs
  • Parameters:
    • inputIds: Array of tokenized sequences, each inner array is the token IDs for one sample
    • config: Embedding configuration
  • Returns: EmbedResult
  • Note: Supported only on the cpu_gpu plugin
func embed(texts: [String], config: EmbeddingConfig) throws -> EmbedResult
  • Generates embeddings for input text strings
  • Parameters:
    • texts: Array of input texts to embed
    • config: Embedding process configuration (batch size, normalization, etc.)
  • Returns: EmbedResult containing embeddings and profiling data
func embed(imagePaths: [String], config: EmbeddingConfig) throws -> EmbedResult
  • Generates embeddings for input images
  • Parameters:
    • imagePaths: Paths to input images
    • config: Embedding configuration
  • Returns: EmbedResult
func dim() throws -> Int32
  • Returns the embedding dimension for the model
  • Parameters: None
  • Returns: Int32 representing the embedding dimension
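
Embeddings from embed(texts:config:) can be compared with cosine similarity for semantic search. A minimal sketch reusing the embedder and texts from Basic Usage; cosineSimilarity is our own helper, not an SDK API:

// Hypothetical helper: cosine similarity between two embedding vectors
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in 0..<min(a.count, b.count) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (normA.squareRoot() * normB.squareRoot() + 1e-9)
}

let queryResult = try embedder.embed(texts: ["<your-query>"], config: .init(batchSize: 1))
let docResult = try embedder.embed(texts: texts, config: .init(batchSize: Int32(texts.count)))
let scores = docResult.embeddings.map { cosineSimilarity(queryResult.embeddings[0], $0) }
print(scores)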

EmbeddingConfig

public struct EmbeddingConfig {
    /// Processing batch size
    public var batchSize: Int32

    /// Normalization method: "l2" or "mean"
    /// `nil` means embeddings are not normalized
    public var normalizeMethod: NormalizeMethod?
}

EmbedResult

public struct EmbedResult {
    public let embeddings: [[Float]]
    public let profileData: ProfileData
}

LLM Usage

Large Language Models for text generation and chat applications.

Streaming Conversation - CPU/GPU

We support CPU/GPU inference for GGUF-format models.

import NexaSdk

let llm = try LLM()
// load from exact gguf path
try await llm.load(.init(modelPath: "<path/to/model/file>"))

let system = "You are a helpful AI assistant"
let userMsgs = [
    "Tell me a long story, about 100 words",
    "How are you"
]
var messages = [ChatMessage]()
messages.append(.init(role: .system, content: system))
for userMsg in userMsgs {
    messages.append(.init(role: .user, content: userMsg))
    // generation
    let stream = try await llm.generateAsyncStream(messages: messages)
    var response = ""
    for try await token in stream {
        print(token, terminator: "")
        response += token
    }
    messages.append(.init(role: .assistant, content: response))
    print("\n\n")
}


Multimodal Usage

Vision-Language Models for image understanding and multimodal applications.

Streaming Conversation - CPU/GPU

We support CPU/GPU inference for GGUF-format models.

import NexaSdk

let vlm = try VLM()
try await vlm.load(.init(modelPath: "<path/to/model/file>", mmprojPath: "<path/to/mmproj/file>"))

let images = ["<path/to/your/image>"]
var config = GenerationConfig.default
config.imagePaths = images
let message = ChatMessage(role: .user, content: "What do you see in this image?", images: images)
let stream = try await vlm.generateAsyncStream(messages: [message], options: .init(config: config))

for try await token in stream {
    print(token, terminator: "")
}
print()
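
GenerationConfig also carries audioPaths for audio inputs. Below is a sketch mirroring the image flow above, assuming the loaded model accepts audio; the audio path placeholder is ours.

// Sketch: audio input through GenerationConfig.audioPaths
var audioConfig = GenerationConfig.default
audioConfig.audioPaths = ["<path/to/your/audio>"]
let audioMessage = ChatMessage(role: .user, content: "Describe this audio clip")
let audioStream = try await vlm.generateAsyncStream(messages: [audioMessage],
                                                    options: .init(config: audioConfig))
for try await token in audioStream {
    print(token, terminator: "")
}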

API Reference

Core Methods

func load(_ options: ModelOptions) async throws
  • Loads the model with the specified configuration
  • Parameters:
    • options: Model loading options
  • Returns: None
  • Throws: Error if the model fails to load
func load(from repoFolder: URL, modelFileName: String = "", mmprojFileName: String = "") throws
  • Loads the model from a local HuggingFace repository directory
  • Parameters:
    • repoFolder: Local HuggingFace repository directory
    • modelFileName: Model file name; if empty, the default is used
    • mmprojFileName: mmproj file name; if empty, the default is used
  • Returns: None
  • Throws: Error if model loading fails
func applyChatTemplate(messages: [ChatMessage], options: ChatTemplateOptions) async throws -> String
  • Applies the model’s chat template and formats messages accordingly
  • Parameters:
    • messages: Messages to format
    • options: Template configuration options
  • Returns: String representing the formatted prompt
  • Throws: Error if the template cannot be applied
func generateAsyncStream(messages: [ChatMessage], options: GenerationOptions = .init()) async throws -> AsyncThrowingStream<String, any Error>
  • Generates text in streaming fashion from chat messages
  • Parameters:
    • messages: Chat history used for generation
    • options: Generation configuration
  • Returns: An AsyncThrowingStream yielding generated text tokens
  • Throws: Error if generation fails to start
func generate(prompt: String, config: GenerationConfig) async throws -> GenerateResult
  • Generates text for a single prompt
  • Parameters:
    • prompt: Input prompt
    • config: Generation configuration
  • Returns: GenerateResult
  • Throws: Error if generation fails
func reset()
  • Resets the model’s internal state
  • Parameters: None
  • Returns: None
func stopStream()
  • Stops generation streaming session
  • Parameters: None
  • Returns: None
func saveKVCache(to path: String)
  • Saves the current KV cache to the specified file path
  • Parameters:
    • path (String): Destination file path
  • Returns: None
  • Notes: Only available for LLM models
func loadKVCache(from path: String)
  • Loads KV cache from the specified file path
  • Parameters:
    • path (String): Source file path
  • Returns: None
  • Notes: Only available for LLM models
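
As a sketch, the KV-cache methods can persist conversation state between runs; the cache file path below is our own choice for illustration (NSTemporaryDirectory requires Foundation):

// Sketch: persist and restore LLM state (LLM only; path is illustrative)
let cachePath = NSTemporaryDirectory() + "llm_kv_cache.bin"
llm.saveKVCache(to: cachePath)

// Later, after loading the same model again:
llm.loadKVCache(from: cachePath)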

GenerationConfig

public struct GenerationConfig {
    // Maximum tokens to generate
    public var maxTokens: Int32
    // Array of stop sequences
    public var stop: [String]
    // Number of past tokens to consider
    public var nPast: Int32
    // Advanced sampling config
    public var samplerConfig: SamplerConfig
    // Array of image paths for VLM, empty for none
    public var imagePaths: [String]
    // Array of audio paths for VLM, empty for none
    public var audioPaths: [String]
}

SamplerConfig

public struct SamplerConfig {
    // Sampling temperature (0.0-2.0)
    public var temperature: Float
    // Nucleus sampling parameter (0.0-1.0)
    public var topP: Float
    // Top-k sampling parameter
    public var topK: Int32
    // Minimum probability for nucleus sampling
    public var minP: Float
    // Penalty for repeated tokens
    public var repetitionPenalty: Float
    // Penalty for token presence
    public var presencePenalty: Float
    // Penalty for token frequency
    public var frequencyPenalty: Float
    // Random seed (-1 for random)
    public var seed: Int32
    // Optional grammar file path
    public var grammarPath: String?
}
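
A sketch combining the two configs above with the streaming API from Core Methods; the field names follow the structs as documented, while the specific values and stop sequence are illustrative:

// Sketch: tune sampling and generation limits
var config = GenerationConfig.default
config.maxTokens = 256
config.stop = ["</s>"]
config.samplerConfig.temperature = 0.7
config.samplerConfig.topP = 0.9
config.samplerConfig.seed = -1   // -1 = random seed

let stream = try await llm.generateAsyncStream(messages: messages,
                                               options: .init(config: config))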

ProfileData

public struct ProfileData: CustomStringConvertible, Codable {

    /// Time to first token (us)
    public let ttft: Int64

    /// Prompt processing time (us)
    public let promptTime: Int64

    /// Token generation time (us)
    public let decodeTime: Int64

    /// Number of prompt tokens
    public let promptTokens: Int64

    /// Number of generated tokens
    public let generatedTokens: Int64

    /// Audio duration (us)
    public let audioDuration: Int64

    /// Prefill speed (tokens/sec)
    public let prefillSpeed: Double

    /// Decoding speed (tokens/sec)
    public let decodingSpeed: Double

    /// Real-Time Factor (RTF) (1.0 = real-time, >1.0 = faster, <1.0 = slower)
    public let realTimeFactor: Double

    /// Stop reason: "eos", "length", "user", "stop_sequence"
    public let stopReason: String
}
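
ProfileData is shared across modalities; for example, AsrResponse exposes it after transcription. A minimal sketch reusing the ASR setup from the ASR section:

// Sketch: read profiling data from an ASR transcription (times are microseconds)
let response = try await asr.transcribe(options: .init(audioPath: "<your-audio-path>"))
if let profile = response.profileData {
    print("RTF: \(profile.realTimeFactor), decode speed: \(profile.decodingSpeed) tok/s")
    print("Stop reason: \(profile.stopReason)")
}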


Rerank Usage

Improve search relevance by reranking documents based on query relevance.

Basic Usage

import NexaSdk
import Foundation

// Load reranker model for CPU/GPU inference
let repoDir = URL(fileURLWithPath: "<path/to/model/dir>")
let reranker = try Reranker(from: repoDir)
let query = "What is machine learning?"
let documents = [
    "Machine learning is a subset of artificial intelligence that enables computers to learn and make decisions without being explicitly programmed.",
    "Machine learning algorithms build mathematical models based on sample data to make predictions or decisions.",
    "Deep learning is a subset of machine learning that uses neural networks with multiple layers.",
    "Python is a popular programming language for machine learning and data science.",
    "The weather today is sunny and warm."
]
let result = try await reranker.rerank(query, documents: documents)
print(result.scores)

API Reference

Core Methods

init(modelPath: String, tokenizerPath: String? = nil, deviceId: String? = nil, plugin: Plugin = .cpu_gpu) throws
  • Initializes a reranker model from local file paths
  • Parameters:
    • modelPath: Path to the reranker model
    • tokenizerPath: Optional path to the tokenizer
    • deviceId: Device identifier; if nil, uses default backend setup
    • plugin: Backend plugin (default cpu_gpu)
  • Returns: Instance of the reranker
  • Throws: Error if initialization fails
func rerank(_ query: String, documents: [String], config: RerankConfig = .init()) async throws -> RerankerResult
  • Performs document ranking given a query and a list of documents
  • Parameters:
    • query: Query string to evaluate
    • documents: List of documents to rank
    • config: Reranking configuration
  • Returns: RerankerResult
  • Throws: Error during reranking execution
convenience init(from repoFolder: URL, plugin: Plugin = .cpu_gpu) throws
  • Convenience initializer that loads a reranker from a HuggingFace-style local repository
  • Parameters:
    • repoFolder: Local repository directory
    • plugin: Backend plugin (default cpu_gpu)
  • Returns: Instance of the reranker
  • Throws: Error if loading fails

RerankerResult

public struct RerankerResult {
    public let scores: [Float]
    public let profileData: ProfileData
}

RerankConfig

public struct RerankConfig {
    /// Processing batch size
    public var batchSize: Int32

    /// Normalization method: "softmax", "min-max", "l2"
    /// `nil` means scores are not normalized
    public var normalizeMethod: NormalizeMethod?

    public enum NormalizeMethod: String, CaseIterable {
        case softmax
        case minMax = "min-max"
        case l2
    }
}
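
A sketch applying softmax normalization to the Basic Usage example, assuming RerankConfig's memberwise initializer:

// Sketch: softmax-normalized scores, then pick the highest-scoring document
let config = RerankConfig(batchSize: Int32(documents.count), normalizeMethod: .softmax)
let ranked = try await reranker.rerank(query, documents: documents, config: config)
if let best = ranked.scores.enumerated().max(by: { $0.element < $1.element }) {
    print("Top document [\(best.offset)]: \(documents[best.offset])")
}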

How to use CPU/GPU, ANE

Currently, the Nexa iOS/macOS SDK offers two hardware acceleration modes: CPU/GPU and ANE (NPU). Each model runs on only one of these modes, so read the model card on Hugging Face carefully to use it correctly.

CPU/GPU Mode

// CPU/GPU embedder - uses GGUF format models
let embedder = try Embedder(from: URL(fileURLWithPath: "<path/to/model/dir>"), plugin: .cpu_gpu)

ANE Mode

// ANE embedder - uses CoreML format models
let embedder = try Embedder(from: URL(fileURLWithPath: "<path/to/model/dir>"), plugin: .ane)

The Embedder module supports both CPU/GPU and ANE execution. LLM, VLM, and Reranker modules support CPU/GPU only. The ASR module runs exclusively on ANE and does not support CPU/GPU execution.

Need Help?

Join our community to get support, share your projects, and connect with other developers.