BPE Tokenizer

SESenievol-PIayground • Shared April 6, 2025

112 views

Love/View Ratio: 7.14%

Instructions

This is a BPE (byte pair encoding) tokenizer that uses the same vocabulary and merge rules as GPT-4o, GPT-4.5, o1, o3, and other recent OpenAI models. The vocabulary is called o200k_base because it has about 200k tokens, which is just at the size limit for Scratch lists. However, it can be slow for very long texts, so if you need a faster tokenizer, use a smaller vocabulary. For example, GPT-2's vocabulary contains only about 50k tokens.
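To illustrate what "vocabulary and merge rules" means here, the following is a minimal Python sketch of rank-based BPE encoding. It is not this project's Scratch code and not tiktoken's implementation. The function name bpe_encode and the tiny TOY_RANKS table are made up for this example; the real o200k_base table has roughly 200k entries. The idea is to start from single bytes and repeatedly merge the adjacent pair whose merged form has the lowest rank in the vocabulary.

```python
def bpe_encode(word: bytes, ranks: dict[bytes, int]) -> list[int]:
    """Encode one chunk of bytes with a rank-based BPE table (toy sketch)."""
    # Start with one part per byte.
    parts = [bytes([b]) for b in word]
    while len(parts) > 1:
        best_rank = None
        best_i = None
        # Find the adjacent pair whose merged form has the lowest rank.
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_i is None:
            break  # no remaining pair is in the vocabulary
        # Replace the two parts with their merged form.
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    # Every remaining part must be a vocabulary entry; its rank is its token ID.
    return [ranks[p] for p in parts]


# Hypothetical toy vocabulary (assumption for illustration, NOT o200k_base).
TOY_RANKS = {
    b"l": 0, b"o": 1, b"w": 2, b"e": 3, b"r": 4,
    b"lo": 5, b"low": 6, b"er": 7,
}

print(bpe_encode(b"lower", TOY_RANKS))  # [6, 7], i.e. "low" + "er"
```

Each pass of this naive loop rescans every adjacent pair, which is one reason simple implementations get slow on very long inputs.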

Notes & Credits

Text Engine by @PixelBuzz
See tiktoken by OpenAI: https://github.com/openai/tiktoken
See Tiktokenizer: https://tiktokenizer.vercel.app/?model=o200k_base
Learn about tokenization: https://platform.openai.com/tokenizer

Project Details

Project ID: 1158023980

Search Index: Unindexed / NFE

Created: April 6, 2025

Last Modified: January 7, 2026

Shared: April 6, 2025

Comments: Allowed