Our Projects

Explore our comprehensive portfolio of open-source contributions, including state-of-the-art models, massive datasets, and specialized tools for the AI research community.

SparkEmbedding-300m

Text Embeddings, Multilingual

A state-of-the-art multilingual text embedding model by XenArcAI, fine-tuned from EmbeddingGemma. Optimized for cross-lingual retrieval and semantic search, with support for Matryoshka Representation Learning (MRL). Supports 119 languages with a 2048-token context window and is designed for efficient, scalable deployment.

Parameters: 0.3B

Context: 2048 tokens

Languages: 119

View on Hugging Face
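
The MRL support mentioned in the description generally means an embedding can be shortened to its leading dimensions and re-normalized while staying useful for similarity search. The following is a minimal sketch of that idea in plain Python, not the model's own API: the toy 8-dimensional vectors stand in for real model output, and the helper names (`truncate_embedding`, `cosine_similarity`, `best_match`) are hypothetical.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query, docs, dim):
    """Return the name of the document closest to the query at `dim` dimensions."""
    q = truncate_embedding(query, dim)
    scores = {
        name: cosine_similarity(q, truncate_embedding(vec, dim))
        for name, vec in docs.items()
    }
    return max(scores, key=scores.get)


# Toy vectors (assumed for illustration; a real model would produce these).
query = [0.9, 0.1, 0.3, 0.2, 0.05, 0.01, 0.02, 0.03]
docs = {
    "doc_a": [0.8, 0.2, 0.4, 0.1, 0.01, 0.04, 0.03, 0.02],
    "doc_b": [0.1, 0.9, 0.1, 0.7, 0.02, 0.03, 0.01, 0.05],
}

for dim in (8, 4):
    print(dim, best_match(query, docs, dim))
```

In this toy setup the same document ranks first at both the full and the halved dimensionality, which is the trade-off MRL aims for: smaller vectors for cheaper storage and search, with ranking quality that degrades gradually rather than abruptly.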

MathX Dataset

Mathematical Reasoning, Training Data

A high-quality, synthetically curated, and meticulously filtered dataset designed for advanced mathematical reasoning and AI model training.

Size: 5M

Items: 1

View on Hugging Face

CodeX

Synthetic Data, Reasoning, Code

A massive collection of pre-curated coding datasets by XenArcAI, featuring over 9 million samples. Includes 'CodeX-2M-Thinking' with step-by-step reasoning for instruction tuning and 'CodeX-7M-Non-Thinking' for raw pattern learning.

Samples: 9.5M+

Datasets: 2

Size: 10GB+

View on Hugging Face

AIRealNet

AI Detection, Computer Vision

A binary image classifier designed to distinguish AI-generated images from real human photographs. Built on Microsoft's SwinV2 Tiny architecture for high accuracy and efficient deployment. Prioritizes privacy by excluding personal data from training.

Downloads: 10k/month

Parameters: 0.2B

View on Hugging Face
