
Model Name Mapping

For all CoreML (ANE) models, we use an internal name mapping, so fill in the plugin ID accordingly. For GGUF-format models (running on CPU/GPU), you do not need to provide the plugin ID or model name; the plugin parameter is not required.
Model Name                        Plugin ID   Huggingface repository name
NexaAI/EmbedNeural-ANE            ane         NexaAI/EmbedNeural-ANE
NexaAI/parakeet-tdt-0.6b-v3-ane   ane         NexaAI/parakeet-tdt-0.6b-v3-ane

ASR Usage

Automatic Speech Recognition for audio transcription.

Basic Usage

import NexaSdk
import Foundation

// Load model
let asr = try Asr()
try await asr.load(from: URL(fileURLWithPath: "<path/to/model/dir>"))

// Transcribe audio file
let result = try await asr.transcribe(options: .init(audioPath: "<your-audio-path>"))
print(result.asrResult)

ASR Stream Mode

import NexaSdk
import Foundation

// Load model
let asr = try Asr()
try await asr.load(from: URL(fileURLWithPath: "<path/to/model/dir>"))

do {
    // Start recording and streaming; defer ensures the stream stops on every exit path
    let stream = try asr.startRecordingStream()
    defer { asr.stopRecordingStream() }
    for try await content in stream {
        print(content)
    }
} catch {
    print("ASR streaming failed: \(error)")
}

API Reference

Core Methods

func load(from repoFolder: URL) async throws
  • Loads an ASR model from a HuggingFace-format local repository folder
  • Parameters:
    • repoFolder: The folder containing the HuggingFace model files
  • Returns: None
  • Throws: Error if the model fails to load
  • Note: This is an async function and must be awaited
func startRecordingStream(config: ASRStreamConfig = .init(), block tapBlock: AVAudioNodeTapBlock? = nil) throws -> AsyncThrowingStream<String, Error>
  • Starts audio recording and ASR streaming simultaneously.
  • Parameters:
    • config: Streaming configuration.
    • tapBlock: Optional tap block to inspect or process audio samples.
  • Returns: A stream that yields partial or final transcription text.
  • Throws: Error if the audio session, engine, or streaming session fails to start.
func stopRecordingStream()
  • Stops both audio recording and ASR streaming.
  • Returns: None
func startRecording(block tapBlock: AVAudioNodeTapBlock? = nil) throws
  • Starts audio recording only.
  • Parameters:
    • tapBlock: Optional tap block to inspect or process audio.
  • Throws: Error if audio session or engine fails to start.
  • Returns: None
func stopRecording()
  • Stops the current audio recording session.
  • Returns: None
func startStream(config: ASRStreamConfig = .init()) throws -> AsyncThrowingStream<String, Error>
  • Starts ASR streaming mode.
  • Parameters:
    • config: Streaming configuration.
  • Returns: A stream that yields partial or final transcription text.
func stopStream(graceful: Bool = true)
  • Stops ASR streaming.
  • Parameters:
    • graceful:
      • true: Process remaining buffered audio before stopping (default).
      • false: Stop immediately.
  • Returns: None
func streamPushSamples(samples: [Float]) throws
  • Pushes raw audio samples into the streaming ASR pipeline for processing
  • Parameters:
    • samples: An array of PCM audio samples
  • Returns: None
  • Throws: Error if the streaming session is not active or the audio buffer cannot be processed
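
The recording-free streaming APIs above can be combined when you capture audio yourself. The sketch below assumes 16 kHz mono Float PCM; the pushed samples are placeholder silence, standing in for your own capture pipeline.

import NexaSdk
import Foundation

let asr = try Asr()
try await asr.load(from: URL(fileURLWithPath: "<path/to/model/dir>"))

// Start a streaming session without the built-in recorder
let stream = try asr.startStream(config: .init())

// Consume transcription text as it arrives
let consumer = Task {
    for try await text in stream {
        print(text)
    }
}

// Push raw PCM samples from your own pipeline (placeholder silence here)
let samples = [Float](repeating: 0, count: 512)
try asr.streamPushSamples(samples: samples)

// Process remaining buffered audio, then stop
asr.stopStream(graceful: true)
try await consumer.value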

AsrResult

/// ASR transcription result
public struct AsrResult: Codable {
    /// Transcribed text
    public let transcript: String
    /// Confidence scores for each unit
    public let confidences: [Float]
    /// Timestamp pairs: [start, end] for each unit
    public let timestamps: [Float]
}

AsrResponse

public struct AsrResponse: Codable {
    public let asrResult: AsrResult
    public let profileData: ProfileData?
}

AsrOptions

public struct AsrOptions: Codable {
    public let modelPath: String
    public let language: Language
}

public enum Language: String, Codable {
    case en
    case ch
}

ASR streaming configuration

public struct ASRStreamConfig {
    // Timestamp mode
    public enum TimestampMode: String {
        case segment
        case word
        case none
    }

    // Language (default: .en)
    public var language: Language = .en
    // Duration in seconds for each chunk (default: 4.0)
    public var chunkDuration: Float
    // Overlap between chunks in seconds (default: 3.0)
    public var overlapDuration: Float
    // Audio sample rate (default: 16000)
    public var sampleRate: Int32
    // Maximum chunks in processing queue (default: 10)
    public var maxQueueSize: Int32
    // Audio buffer size for input (default: 512)
    public var bufferSize: Int32
    // Timestamp mode: "none", "segment", "word" (default: none)
    public var timestamps: TimestampMode
    // Beam search size (default: 5)
    public var beamSize: Int32
}
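
A configuration sketch using the fields above; it assumes ASRStreamConfig's empty initializer (the default used by startRecordingStream) and overrides a few values:

// Sketch: tweak streaming behavior before starting
var config = ASRStreamConfig()
config.language = .en
config.chunkDuration = 4.0       // seconds of audio per chunk
config.overlapDuration = 3.0     // seconds shared between consecutive chunks
config.timestamps = .word        // emit per-word timestamps
let stream = try asr.startRecordingStream(config: config)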

Embeddings Usage

Generate vector embeddings for semantic search and RAG applications.

Basic Usage

import NexaSdk
import Foundation

// Load embedder for ANE model
let repoDir = URL(fileURLWithPath: "<path/to/model/dir>")
let embedder = try Embedder(from: repoDir, plugin: .ane)

// Generate embeddings for multiple texts
let texts = ["<your-text1>", "<your-text2>"]
let result = try embedder.embed(texts: texts, config: .init(batchSize: Int32(texts.count)))
for (i, vec) in result.embeddings.enumerated() {
    let head = vec.prefix(10)
    print("[\(i)]", Array(head))
}

API Reference

Core Methods

convenience init(from repoFolder: URL, plugin: Plugin = .cpu_gpu) throws
  • Initializes an instance using a model stored in a local repository folder
  • Parameters:
    • repoFolder: Path to the local model repository folder
    • plugin: Backend plugin to use (cpu_gpu by default)
  • Returns: An initialized instance
  • Throws: Error if model loading or initialization fails
func embed(inputIds: [[Int32]], config: EmbeddingConfig) throws -> EmbedResult
  • Generates embeddings from pre-tokenized input IDs
  • Parameters:
    • inputIds: Array of tokenized sequences, each inner array is the token IDs for one sample
    • config: Embedding configuration
  • Returns: EmbedResult
  • Note: Supported only on the cpu_gpu plugin
func embed(texts: [String], config: EmbeddingConfig) throws -> EmbedResult
  • Generates embeddings for input text strings
  • Parameters:
    • texts: Array of input texts to embed
    • config: Embedding process configuration (batch size, normalization, etc.)
  • Returns: EmbedResult containing embeddings and profiling data
func embed(imagePaths: [String], config: EmbeddingConfig) throws -> EmbedResult
  • Generates embeddings for input images
  • Parameters:
    • imagePaths: Paths to input images
    • config: Embedding configuration
  • Returns: EmbedResult
func dim() throws -> Int32
  • Returns the embedding dimension for the model
  • Parameters: None
  • Returns: Int32 representing the embedding dimension
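
Embeddings from embed(texts:config:) can be compared with cosine similarity for semantic search. A minimal sketch reusing the embedder and texts from Basic Usage; cosineSimilarity is our own helper, not an SDK API:

// Hypothetical helper: cosine similarity between two embedding vectors
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in 0..<min(a.count, b.count) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (normA.squareRoot() * normB.squareRoot() + 1e-9)
}

let queryResult = try embedder.embed(texts: ["<your-query>"], config: .init(batchSize: 1))
let docResult = try embedder.embed(texts: texts, config: .init(batchSize: Int32(texts.count)))
let scores = docResult.embeddings.map { cosineSimilarity(queryResult.embeddings[0], $0) }
print(scores)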

EmbeddingConfig

public struct EmbeddingConfig {
    /// Processing batch size
    public var batchSize: Int32

    /// Normalization method: "l2" or "mean"
    /// `nil` means embeddings are not normalized
    public var normalizeMethod: NormalizeMethod?
}

EmbedResult

public struct EmbedResult {
    public let embeddings: [[Float]]
    public let profileData: ProfileData
}

LLM Usage

Large Language Models for text generation and chat applications.

Streaming Conversation - CPU/GPU

We support CPU/GPU inference for GGUF-format models.

import NexaSdk

let llm = try LLM()
// load from exact gguf path
try await llm.load(.init(modelPath: "<path/to/model/file>"))

let system = "You are a helpful AI assistant"
let userMsgs = [
    "Tell me a long story, about 100 words",
    "How are you"
]
var messages = [ChatMessage]()
messages.append(.init(role: .system, content: system))
for userMsg in userMsgs {
    messages.append(.init(role: .user, content: userMsg))
    // generation
    let stream = try await llm.generateAsyncStream(messages: messages)
    var response = ""
    for try await token in stream {
        print(token, terminator: "")
        response += token
    }
    messages.append(.init(role: .assistant, content: response))
    print("\n\n")
}


Multimodal Usage

Vision-Language Models for image understanding and multimodal applications.

Streaming Conversation - CPU/GPU

We support CPU/GPU inference for GGUF-format models.

import NexaSdk

let vlm = try VLM()
try await vlm.load(.init(modelPath: "<path/to/model/file>", mmprojPath: "<path/to/mmproj/file>"))

let images = ["<path/to/your/image>"]
var config = GenerationConfig.default
config.imagePaths = images
let message = ChatMessage(role: .user, content: "What do you see in this image?", images: images)
let stream = try await vlm.generateAsyncStream(messages: [message], options: .init(config: config))

for try await token in stream {
    print(token, terminator: "")
}
print()
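
GenerationConfig also carries audioPaths for audio inputs. Below is a sketch mirroring the image flow above, assuming the loaded model accepts audio; the audio path placeholder is ours.

// Sketch: audio input through GenerationConfig.audioPaths
var audioConfig = GenerationConfig.default
audioConfig.audioPaths = ["<path/to/your/audio>"]
let audioMessage = ChatMessage(role: .user, content: "Describe this audio clip")
let audioStream = try await vlm.generateAsyncStream(messages: [audioMessage],
                                                    options: .init(config: audioConfig))
for try await token in audioStream {
    print(token, terminator: "")
}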

API Reference

Core Methods

func load(_ options: ModelOptions) async throws
  • Loads the model with the specified configuration
  • Parameters:
    • options: Model loading options
  • Returns: None
  • Throws: Error if the model fails to load
func load(from repoFolder: URL, modelFileName: String = "", mmprojFileName: String = "") throws
  • Loads the model from a local HuggingFace repository directory
  • Parameters:
    • repoFolder: Local HuggingFace repository directory
    • modelFileName: Model file name; if empty, the default is used
    • mmprojFileName: mmproj file name; if empty, the default is used
  • Returns: None
  • Throws: Error if model loading fails
func applyChatTemplate(messages: [ChatMessage], options: ChatTemplateOptions) async throws -> String
  • Applies the model’s chat template and formats messages accordingly
  • Parameters:
    • messages: Messages to format
    • options: Template configuration options
  • Returns: String representing the formatted prompt
  • Throws: Error if the template cannot be applied
func generateAsyncStream(messages: [ChatMessage], options: GenerationOptions = .init()) async throws -> AsyncThrowingStream<String, any Error>
  • Generates text in streaming fashion from chat messages
  • Parameters:
    • messages: Chat history used for generation
    • options: Generation configuration
  • Returns: An AsyncThrowingStream yielding generated text tokens
  • Throws: Error if generation fails to start
func generate(prompt: String, config: GenerationConfig) async throws -> GenerateResult
  • Generates text for a single prompt
  • Parameters:
    • prompt: Input prompt
    • config: Generation configuration
  • Returns: GenerateResult
  • Throws: Error if generation fails
func reset()
  • Resets the model’s internal state
  • Parameters: None
  • Returns: None
func stopStream()
  • Stops generation streaming session
  • Parameters: None
  • Returns: None
func saveKVCache(to path: String)
  • Saves the current KV cache to the specified file path
  • Parameters:
    • path (String): Destination file path
  • Returns: None
  • Notes: Only available for LLM models
func loadKVCache(from path: String)
  • Loads KV cache from the specified file path
  • Parameters:
    • path (String): Source file path
  • Returns: None
  • Notes: Only available for LLM models
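
As a sketch, the KV-cache methods can persist conversation state between runs; the cache file path below is our own choice for illustration (NSTemporaryDirectory requires Foundation):

// Sketch: persist and restore LLM state (LLM only; path is illustrative)
let cachePath = NSTemporaryDirectory() + "llm_kv_cache.bin"
llm.saveKVCache(to: cachePath)

// Later, after loading the same model again:
llm.loadKVCache(from: cachePath)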

GenerationConfig

public struct GenerationConfig {
    // Maximum tokens to generate
    public var maxTokens: Int32
    // Array of stop sequences
    public var stop: [String]
    // Number of past tokens to consider
    public var nPast: Int32
    // Advanced sampling config
    public var samplerConfig: SamplerConfig
    // Array of image paths for VLM, empty for none
    public var imagePaths: [String]
    // Array of audio paths for VLM, empty for none
    public var audioPaths: [String]
}

SamplerConfig

public struct SamplerConfig {
    // Sampling temperature (0.0-2.0)
    public var temperature: Float
    // Nucleus sampling parameter (0.0-1.0)
    public var topP: Float
    // Top-k sampling parameter
    public var topK: Int32
    // Minimum probability for nucleus sampling
    public var minP: Float
    // Penalty for repeated tokens
    public var repetitionPenalty: Float
    // Penalty for token presence
    public var presencePenalty: Float
    // Penalty for token frequency
    public var frequencyPenalty: Float
    // Random seed (-1 for random)
    public var seed: Int32
    // Optional grammar file path
    public var grammarPath: String?
}
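
A sketch combining the two configs above with the streaming API from Core Methods; the field names follow the structs as documented, while the specific values and stop sequence are illustrative:

// Sketch: tune sampling and generation limits
var config = GenerationConfig.default
config.maxTokens = 256
config.stop = ["</s>"]
config.samplerConfig.temperature = 0.7
config.samplerConfig.topP = 0.9
config.samplerConfig.seed = -1   // -1 = random seed

let stream = try await llm.generateAsyncStream(messages: messages,
                                               options: .init(config: config))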

ProfileData

public struct ProfileData: CustomStringConvertible, Codable {

    /// Time to first token (us)
    public let ttft: Int64

    /// Prompt processing time (us)
    public let promptTime: Int64

    /// Token generation time (us)
    public let decodeTime: Int64

    /// Number of prompt tokens
    public let promptTokens: Int64

    /// Number of generated tokens
    public let generatedTokens: Int64

    /// Audio duration (us)
    public let audioDuration: Int64

    /// Prefill speed (tokens/sec)
    public let prefillSpeed: Double

    /// Decoding speed (tokens/sec)
    public let decodingSpeed: Double

    /// Real-Time Factor (RTF) (1.0 = real-time, >1.0 = faster, <1.0 = slower)
    public let realTimeFactor: Double

    /// Stop reason: "eos", "length", "user", "stop_sequence"
    public let stopReason: String
}
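
ProfileData is shared across modalities; for example, AsrResponse exposes it after transcription. A minimal sketch reusing the ASR setup from the ASR section:

// Sketch: read profiling data from an ASR transcription (times are microseconds)
let response = try await asr.transcribe(options: .init(audioPath: "<your-audio-path>"))
if let profile = response.profileData {
    print("RTF: \(profile.realTimeFactor), decode speed: \(profile.decodingSpeed) tok/s")
    print("Stop reason: \(profile.stopReason)")
}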


Rerank Usage

Improve search relevance by reranking documents based on query relevance.

Basic Usage

import NexaSdk
import Foundation

// Load reranker model for CPU/GPU inference
let repoDir = URL(fileURLWithPath: "<path/to/model/dir>")
let reranker = try Reranker(from: repoDir)
let query = "What is machine learning?"
let documents = [
    "Machine learning is a subset of artificial intelligence that enables computers to learn and make decisions without being explicitly programmed.",
    "Machine learning algorithms build mathematical models based on sample data to make predictions or decisions.",
    "Deep learning is a subset of machine learning that uses neural networks with multiple layers.",
    "Python is a popular programming language for machine learning and data science.",
    "The weather today is sunny and warm."
]
let result = try await reranker.rerank(query, documents: documents)
print(result.scores)

API Reference

Core Methods

init(modelPath: String, tokenizerPath: String? = nil, deviceId: String? = nil, plugin: Plugin = .cpu_gpu) throws
  • Initializes a reranker model from local file paths
  • Parameters:
    • modelPath: Path to the reranker model
    • tokenizerPath: Optional path to the tokenizer
    • deviceId: Device identifier; if nil, uses default backend setup
    • plugin: Backend plugin (default cpu_gpu)
  • Returns: Instance of the reranker
  • Throws: Error if initialization fails
func rerank(_ query: String, documents: [String], config: RerankConfig = .init()) async throws -> RerankerResult
  • Performs document ranking given a query and a list of documents
  • Parameters:
    • query: Query string to evaluate
    • documents: List of documents to rank
    • config: Reranking configuration
  • Returns: RerankerResult
  • Throws: Error during reranking execution
convenience init(from repoFolder: URL, plugin: Plugin = .cpu_gpu) throws
  • Convenience initializer that loads a reranker from a HuggingFace-style local repository
  • Parameters:
    • repoFolder: Local repository directory
    • plugin: Backend plugin (default cpu_gpu)
  • Returns: Instance of the reranker
  • Throws: Error if loading fails

RerankerResult

public struct RerankerResult {
    public let scores: [Float]
    public let profileData: ProfileData
}

RerankConfig

public struct RerankConfig {
    /// Processing batch size
    public var batchSize: Int32

    /// Normalization method: "softmax", "min-max", "l2"
    /// `nil` means scores are not normalized
    public var normalizeMethod: NormalizeMethod?

    public enum NormalizeMethod: String, CaseIterable {
        case softmax
        case minMax = "min-max"
        case l2
    }
}
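
A sketch applying softmax normalization to the Basic Usage example, assuming RerankConfig's memberwise initializer:

// Sketch: softmax-normalized scores, then pick the highest-scoring document
let config = RerankConfig(batchSize: Int32(documents.count), normalizeMethod: .softmax)
let ranked = try await reranker.rerank(query, documents: documents, config: config)
if let best = ranked.scores.enumerated().max(by: { $0.element < $1.element }) {
    print("Top document [\(best.offset)]: \(documents[best.offset])")
}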

How to use CPU/GPU, ANE

Currently, the Nexa iOS/macOS SDK offers two hardware acceleration modes: CPU/GPU and ANE (NPU). Each model runs on only one of these modes, so read the model card on Hugging Face carefully to use it correctly.

CPU/GPU Mode

// CPU/GPU embedder - uses GGUF format models
let embedder = try Embedder(from: URL(fileURLWithPath: "<path/to/model/dir>"), plugin: .cpu_gpu)

ANE Mode

// ANE embedder - uses CoreML format models
let embedder = try Embedder(from: URL(fileURLWithPath: "<path/to/model/dir>"), plugin: .ane)

The Embedder module supports both CPU/GPU and ANE execution. LLM, VLM, and Reranker modules support CPU/GPU only. The ASR module runs exclusively on ANE and does not support CPU/GPU execution.

Need Help?

Join our community to get support, share your projects, and connect with other developers.