Cao Labs - Compact MoE n-gram

Matt-38 • Created November 22, 2025

115 views

Instructions

N-gram MoE + Backoff Model (191 blocks)

11/23/25: fixed a bug in the MoE (the block count went from 188 to 191).

This model is an optimized, general-purpose n-gram architecture that uses a Mixture-of-Experts (MoE) and adaptive backoff.

Context window: 100 characters (sliding window)
Global context: virtually unlimited (streaming-compatible)
Generation mode: character-level, with optional multi-token output (3 characters at once)

Key Features

• Mixture-of-Experts (MoE): A new expert is selected every 100 tokens, using a 10 000-character chunk of data. This lets the model scale to very large datasets (e.g., 1 260 468 characters in the Cao Phi / Magic dataset), possibly making it one of the largest n-gram models ever built on Scratch.

• Infinite-context generation: The model can process arbitrarily long inputs without a significant slowdown, thanks to streaming n-gram computation and lightweight expert switching.

• Multi-token generation: Although it is character-based, the model can generate several characters at once ("hel" instead of "h"), which greatly increases generation speed.

• On-the-fly inference (no training needed): The model computes n-grams directly from the dataset during generation. This keeps it adaptable, but very large datasets may slow generation down unless multi-token mode is used.

(Data by @Matt-38, text by GPT-5.1)
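
The features described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the project's actual Scratch blocks. The 100-character window, the 10 000-character chunks, the switch every 100 characters, and the 3-character output come from the description. Everything else is an assumption made for the example: the names `pick_expert`, `next_chars`, and `generate`, the `MAX_ORDER` limit, the trigram-overlap rule for choosing an expert, the greedy (most frequent) choice of continuation, and the reading of "tokens" as characters.

```python
from collections import Counter

WINDOW = 100         # context window in characters (from the description)
CHUNK_SIZE = 10_000  # characters of data per expert (from the description)
SWITCH_EVERY = 100   # characters generated between expert switches (assumes token = character)
MAX_ORDER = 5        # assumption: longest context suffix tried before backing off


def pick_expert(dataset: str, context: str) -> str:
    """Return the chunk of the dataset that shares the most trigrams with the context.

    Assumption: this scoring rule is illustrative; the description does not
    say how an expert is chosen.
    """
    chunks = [dataset[i:i + CHUNK_SIZE] for i in range(0, len(dataset), CHUNK_SIZE)] or [""]
    grams = {context[i:i + 3] for i in range(max(0, len(context) - 2))}
    return max(chunks, key=lambda chunk: sum(g in chunk for g in grams))


def next_chars(expert: str, context: str, n_out: int = 1) -> str:
    """Predict the next n_out characters by counting n-grams on the fly, with backoff."""
    context = context[-WINDOW:]
    # Try the longest suffix first, then back off to shorter ones.
    for order in range(min(MAX_ORDER, len(context)), 0, -1):
        suffix = context[-order:]
        counts = Counter()
        start = expert.find(suffix)
        while start != -1:
            follow = expert[start + order:start + order + n_out]
            if len(follow) == n_out:
                counts[follow] += 1
            start = expert.find(suffix, start + 1)
        if counts:
            # Greedy choice: the most frequent continuation of this suffix.
            return counts.most_common(1)[0][0]
    # Last resort: the most frequent single character in the expert.
    return Counter(expert).most_common(1)[0][0] if expert else ""


def generate(dataset: str, prompt: str, length: int, n_out: int = 3) -> str:
    """Generate `length` characters, n_out at a time, re-picking the expert periodically."""
    text = prompt
    expert = pick_expert(dataset, text[-WINDOW:])
    since_switch = 0
    while len(text) - len(prompt) < length:
        piece = next_chars(expert, text, n_out)
        if not piece:
            break  # empty dataset: nothing to generate from
        text += piece
        since_switch += len(piece)
        if since_switch >= SWITCH_EVERY:
            expert = pick_expert(dataset, text[-WINDOW:])
            since_switch = 0
    return text[len(prompt):len(prompt) + length]
```

For example, `generate("hello world. " * 50, "hel", 6)` extends the prompt three characters at a time. In this sketch each step scans a single 10 000-character chunk rather than the whole dataset, and only the last 100 characters of the text are ever consulted, which is one plausible reading of why the input length itself does not slow generation down.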

Project Details

Project ID: 1246423889

Created: November 22, 2025

Last Modified: November 23, 2025

Shared: November 22, 2025

Comments: Allowed