NexaAI Python SDK API Reference
This document provides comprehensive API documentation for all modules and classes in the NexaAI Python SDK. For platform-specific setup and complete examples, please refer to the platform guides:
- macOS Guide - Apple Silicon optimization
- Windows x64 Guide - CPU/GPU acceleration
- Windows ARM64 Guide - NPU acceleration
Core Modules
LLM (Large Language Model)
The LLM class provides text generation and conversation capabilities.
Initialization
Core Methods
generate_stream(prompt, g_cfg)
- Generates text tokens in streaming fashion
- Parameters:
  - prompt (str): Input prompt
  - g_cfg (GenerationConfig): Generation configuration
- Returns: Generator yielding text tokens
apply_chat_template(conversation)
- Applies chat template to conversation
- Parameters:
  - conversation (List[ChatMessage]): Conversation history
- Returns: Formatted prompt string
get_profiling_data()
- Returns performance profiling information
- Returns: Dict with profiling metrics or None
save_kv_cache(path)
- Saves key-value cache to file
- Parameters:
  - path (str): File path to save cache
load_kv_cache(path)
- Loads key-value cache from file
- Parameters:
  - path (str): File path to load cache
reset()
- Resets conversation state and clears cache
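A minimal usage sketch tying these methods together. The helper name stream_reply is hypothetical, and an already-initialized LLM instance plus a populated GenerationConfig are assumed (see Initialization above and the platform guides for constructor details):

```python
def stream_reply(llm, conversation, g_cfg):
    """Format a conversation with the model's chat template, then stream
    the reply token by token, returning the concatenated text.

    llm          : an initialized LLM instance
    conversation : List[ChatMessage] history
    g_cfg        : GenerationConfig for this generation
    """
    prompt = llm.apply_chat_template(conversation)
    tokens = []
    for tok in llm.generate_stream(prompt, g_cfg):
        tokens.append(tok)  # each yielded item is a text token
    return "".join(tokens)
```

Calling llm.reset() between independent conversations clears the conversation state and KV cache so nothing leaks across sessions.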
VLM (Vision Language Model)
The VLM class provides multimodal understanding and generation capabilities.
Initialization
Core Methods
generate_stream(prompt, g_cfg)
- Generates text tokens in streaming fashion
- Parameters:
  - prompt (str): Input prompt
  - g_cfg (GenerationConfig): Generation configuration with image_paths/audio_paths
- Returns: Generator yielding text tokens
apply_chat_template(conversation)
- Applies chat template to multimodal conversation
- Parameters:
  - conversation (List[MultiModalMessage]): Multimodal conversation history
- Returns: Formatted prompt string
get_profiling_data()
- Returns performance profiling information
- Returns: Dict with profiling metrics or None
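As with the LLM class, the streaming and profiling calls compose naturally. A hedged sketch with a hypothetical helper name, assuming an initialized VLM instance; per the parameter notes above, image and audio inputs travel in the GenerationConfig (image_paths/audio_paths), not in the prompt string:

```python
def stream_with_profiling(vlm, conversation, g_cfg):
    """Stream a multimodal generation and return (text, profiling_data).

    conversation : List[MultiModalMessage] history
    g_cfg        : GenerationConfig carrying image_paths / audio_paths
    """
    prompt = vlm.apply_chat_template(conversation)
    text = "".join(vlm.generate_stream(prompt, g_cfg))
    return text, vlm.get_profiling_data()  # profiling data may be None
```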
Embedder
The Embedder class provides text vectorization and similarity computation.
Initialization
Core Methods
generate(texts, config)
- Generates embeddings for input texts
- Parameters:
  - texts (List[str]): List of input texts
  - config (EmbeddingConfig): Embedding configuration
- Returns: List of embedding vectors
get_embedding_dim()
- Returns the dimension of embeddings
- Returns: Integer dimension size
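generate() returns raw vectors; similarity computation is left to the caller. A self-contained cosine-similarity helper in plain Python (no SDK dependency):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # degenerate zero vector: treat as no similarity
    return dot / (norm_a * norm_b)
```

Typical use: vectors = embedder.generate(texts, config), then cosine_similarity(vectors[0], vectors[1]) gives a score in [-1, 1].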
Reranker
The Reranker class provides document reranking capabilities.
Initialization
Core Methods
rerank(query, documents, config)
- Reranks documents based on query relevance
- Parameters:
  - query (str): Search query
  - documents (List[str]): List of documents to rank
  - config (RerankConfig): Reranking configuration
- Returns: List of relevance scores
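Since rerank() returns one relevance score per document in input order, a small helper (hypothetical name, assuming an initialized Reranker) can pair and sort them:

```python
def top_documents(reranker, query, documents, config, k=3):
    """Return the k most relevant (document, score) pairs, best first.

    Relies on rerank() returning one relevance score per input document,
    in the same order as `documents`.
    """
    scores = reranker.rerank(query, documents, config)
    ranked = sorted(zip(documents, scores), key=lambda p: p[1], reverse=True)
    return ranked[:k]
```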
ASR (Automatic Speech Recognition)
The ASR class provides speech-to-text transcription capabilities.
Initialization
Core Methods
transcribe(audio_path, language, config)
- Transcribes audio file to text
- Parameters:
  - audio_path (str): Path to audio file
  - language (str): Language code ("en", "zh", or "" for auto-detect)
  - config (ASRConfig): ASR configuration
- Returns: Transcription result object
get_profiling_data()
- Returns performance profiling information
- Returns: Dict with profiling metrics or None
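A batch-transcription sketch (hypothetical helper name, assuming an initialized ASR instance); language defaults to "" so the model auto-detects per file, per the parameter notes above:

```python
def transcribe_batch(asr, audio_paths, config, language=""):
    """Transcribe several audio files, mapping each path to its result.

    language defaults to "" so the language is auto-detected per file.
    """
    return {path: asr.transcribe(path, language, config) for path in audio_paths}
```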
CV (Computer Vision)
The CVModel class provides computer vision capabilities including OCR.
Initialization
Core Methods
infer(image_path)
- Performs inference on image
- Parameters:
  - image_path (str): Path to input image
- Returns: CVResults object with detection results
Configuration Classes
ModelConfig
General model configuration.
GenerationConfig
Text generation configuration.
EmbeddingConfig
Text embedding configuration.
RerankConfig
Document reranking configuration.
ASRConfig
Speech recognition configuration.
CVModelConfig
Computer vision model configuration.
Message Classes
ChatMessage
Represents a single message in a conversation.
MultiModalMessage
Represents a multimodal message with multiple content types.
MultiModalMessageContent
Represents individual content within a multimodal message.
Plugin ID Options
The plugin_id parameter supports different backends:
- cpu_gpu: Default, supports both CPU and GPU
- metal: Apple Silicon optimized (macOS)
- mlx: MLX backend (macOS)
- npu: NPU acceleration (Windows ARM64, Snapdragon X Elite)
- nexaml: NexaML optimized backend
- llama_cpp: For GGUF format models
- onnx: ONNX runtime backend
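One way to choose a default plugin_id from the host platform. This is a convenience heuristic mirroring the option list above, not an SDK API; override it when you know your target backend:

```python
import platform

def default_plugin_id():
    """Heuristically map the current host to a plugin_id from the list above."""
    system = platform.system()
    machine = platform.machine().lower()
    if system == "Darwin" and machine == "arm64":
        return "metal"   # Apple Silicon optimized
    if system == "Windows" and machine in ("arm64", "aarch64"):
        return "npu"     # Windows ARM64 / Snapdragon X Elite NPU
    return "cpu_gpu"     # portable default: CPU and GPU
```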
Error Handling
The SDK provides comprehensive error handling with descriptive error messages. Common error scenarios include:
- Invalid model paths
- Unsupported plugin/device combinations
- Authentication token issues
- Memory allocation failures
- Model loading errors
Performance Tips
- Model Selection: Choose models optimized for your platform and use case
- Batch Processing: Use appropriate batch sizes for embedding and reranking tasks
- Memory Management: Monitor memory usage, especially for large models
- Caching: Use KV cache for conversational applications
- Profiling: Use get_profiling_data() to monitor performance
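The KV-cache tip can be made concrete with a resume helper (hypothetical name, assuming an initialized LLM instance; the cache-file layout is up to the caller):

```python
import os

def resume_session(llm, cache_path):
    """Load a saved KV cache if present; otherwise start fresh.

    Returns True when a previous session was restored from cache_path.
    """
    if os.path.exists(cache_path):
        llm.load_kv_cache(cache_path)
        return True
    llm.reset()  # ensure no stale conversation state remains
    return False
```

Pair it with llm.save_kv_cache(cache_path) at the end of a conversation so long-running chat sessions can be resumed without re-processing the full history.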
Next Steps
- Explore platform-specific examples in the macOS Guide
- Check out the Windows x64 Guide for CPU/GPU optimization
- Visit the Windows ARM64 Guide for NPU acceleration