Our Projects
Explore our portfolio of open-source contributions, including state-of-the-art models, large-scale datasets, and specialized tools for the AI research community.
SparkEmbedding-300m
A state-of-the-art multilingual text embedding model by XenArcAI, fine-tuned from EmbeddingGemma. Optimized for cross-lingual retrieval and semantic search, with support for Matryoshka Representation Learning (MRL), which lets embeddings be truncated to smaller sizes. Supports 119 languages with a 2048-token context window and is designed for efficient, scalable deployment.
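To illustrate what MRL support means in practice, here is a minimal sketch of truncating an embedding to its leading dimensions and re-normalizing it before computing cosine similarity. It does not call the model: the 8-dimensional vectors are made-up stand-ins for real outputs, and the helper names `truncate_and_normalize` and `cosine_similarity` are hypothetical, not part of any XenArcAI API.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, for illustration only).
query = [0.9, 0.1, 0.3, -0.2, 0.05, 0.0, -0.1, 0.02]
doc_a = [0.8, 0.2, 0.25, -0.1, -0.3, 0.4, 0.1, -0.2]
doc_b = [-0.5, 0.7, -0.1, 0.6, 0.05, 0.0, -0.1, 0.02]

if __name__ == "__main__":
    # Compare rankings at full size and at two truncated sizes.
    for dim in (8, 4, 2):
        q = truncate_and_normalize(query, dim)
        a = truncate_and_normalize(doc_a, dim)
        b = truncate_and_normalize(doc_b, dim)
        print(dim, cosine_similarity(q, a), cosine_similarity(q, b))
```

Smaller dimensions reduce storage and search cost. With an MRL-trained model the ranking is expected to degrade gracefully as the dimension shrinks; how much quality is retained at each size depends on the model and should be measured on your own data.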
MathX Dataset
A high-quality, synthetically curated, and meticulously filtered dataset designed for advanced mathematical reasoning and AI model training.
CodeX
A massive collection of pre-curated coding datasets by XenArcAI, featuring over 9 million samples. Includes 'CodeX-2M-Thinking' with step-by-step reasoning for instruction tuning and 'CodeX-7M-Non-Thinking' for raw pattern learning.
AIRealNet
A binary image classifier that distinguishes AI-generated images from authentic, human-captured photographs. Built on Microsoft's SwinV2-Tiny (Swin Transformer V2) architecture for high accuracy and efficient deployment. Prioritizes privacy by excluding personal data from training.
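As a rough sketch of how the output of a two-class classifier like this is typically turned into a decision, the snippet below applies a softmax to two logits and compares the AI-generated probability to a threshold. The label names, their order, and the default threshold of 0.5 are assumptions for illustration; they are not taken from the AIRealNet model card, and no model is loaded here.

```python
import math

# Hypothetical label order; check the actual model configuration before relying on it.
LABELS = ("ai_generated", "real")


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, threshold=0.5):
    """Return (label, probability of AI-generated) for a pair of logits."""
    if len(logits) != len(LABELS):
        raise ValueError("expected one logit per label")
    p_ai = softmax(logits)[0]
    label = LABELS[0] if p_ai >= threshold else LABELS[1]
    return label, p_ai


if __name__ == "__main__":
    # Made-up logits standing in for real model outputs.
    print(classify([2.0, 0.0]))
    print(classify([2.0, 0.0], threshold=0.9))
```

Raising the threshold trades missed AI-generated images for fewer real photographs flagged by mistake, so the right value depends on which error is more costly in a given deployment.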