Press start and enter a short prompt (fewer than 20 characters; only letters and spaces are allowed).

This project implements a real transformer-based language model entirely in Scratch. The model has around 9,400 parameters, which is tiny by modern standards, but every part of the architecture is there: token and positional embeddings, causal multi-head self-attention (2 heads), layer normalization, a feed-forward network with ReLU activation, residual connections, and a weight-tied output layer. It generates one token (a single character) at a time, the same token-by-token process that full-scale GPT models use. With so few parameters and a 32-character context window, the responses aren't great: the model can barely string a sentence together, but it does produce real English words and occasionally a coherent reply.
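The character-at-a-time generation described above can be sketched in a few lines of Python. This is only an illustration, not the project's Scratch code: the vocabulary (lowercase letters plus space), the greedy choice of the next character, and the `toy_logits` stand-in for the transformer forward pass are all assumptions made for the example. Only the 32-character context window comes from the description.

```python
import string

# Assumed vocabulary: lowercase letters plus space. The project only says
# "letters and spaces", so the real token set may differ.
VOCAB = string.ascii_lowercase + " "
CONTEXT = 32  # context window size stated in the description


def encode(text):
    """Map each character to its integer token id."""
    return [VOCAB.index(ch) for ch in text.lower()]


def decode(ids):
    """Map token ids back to a string."""
    return "".join(VOCAB[i] for i in ids)


def toy_logits(context_ids):
    """Hypothetical stand-in for the transformer forward pass.

    Returns one score per vocabulary entry. This toy version just favours
    the character that follows the last one in VOCAB order.
    """
    nxt = (context_ids[-1] + 1) % len(VOCAB)
    return [1.0 if i == nxt else 0.0 for i in range(len(VOCAB))]


def generate(prompt, n_new, logits_fn=toy_logits):
    """Append n_new characters to the prompt, one token per step."""
    ids = encode(prompt)
    for _ in range(n_new):
        window = ids[-CONTEXT:]  # the model only sees the last 32 tokens
        logits = logits_fn(window)
        # Greedy decoding (an assumption): pick the highest-scoring token.
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
    return decode(ids)


print(generate("ab", 3))
```

Swapping `toy_logits` for a real forward pass would leave the loop unchanged: each step feeds the most recent 32 characters to the model and appends exactly one new character.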
The model was trained in PyTorch 2.10 on the roskoN/dailydialog dataset. The trained weights are exported to text files, which are then imported into Scratch lists.
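As a rough sketch of the export step, the snippet below flattens a nested list of weights and writes one number per line, a layout that Scratch's list import can read. The file name, the six-decimal precision, and the `demo` matrix are hypothetical; the project's actual export script and file layout are not shown here.

```python
import os
import tempfile


def flatten(values):
    """Yield the numbers of an arbitrarily nested list in row-major order."""
    for v in values:
        if isinstance(v, (list, tuple)):
            yield from flatten(v)
        else:
            yield float(v)


def export_weights(tensor, path, digits=6):
    """Write one number per line and return how many were written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for x in flatten(tensor):
            f.write(f"{x:.{digits}f}\n")
            count += 1
    return count


# Hypothetical 2x3 weight matrix standing in for a real trained tensor.
demo = [[0.5, -1.25, 2.0], [0.0, 3.5, -0.125]]
path = os.path.join(tempfile.gettempdir(), "demo_weights.txt")
n = export_weights(demo, path)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()

print(n, lines[:3])
```

With a PyTorch tensor, `tensor.flatten().tolist()` would supply the same flat sequence of numbers. Because the text file carries no shape information, the Scratch side has to know each matrix's dimensions to index into the list correctly.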