Explaining Tokenization to Freshers: From Pizza Slices 🍕 to Data 💻🧠✨

👋 Senior Software Engineer with 9+ years of expertise in building scalable backends with Node.js, AWS, Microservices, MongoDB, and Angular. I cut through the AI hype and show you how to practically integrate AI into your Node.js applications. But here’s what makes my content different: I specialise in AI storytelling — turning complex concepts like transformers, vector embeddings, and LLMs into relatable stories and analogies (like explaining AI to my mom using her recipe box 👩🍳📦).

🎉 Welcome, brave fresher, ⚔️ to the fascinating and occasionally bewildering world of Machine Learning! You’ve probably heard the mantra: “Computers don’t understand words; they understand numbers.” It’s the fundamental law of the land, the secret sauce behind every AI marvel from chatbots to image generators.

But this raises a critical, pizza-related question: how on earth do we take a beautiful, nuanced sentence like “I’m craving a pepperoni pizza!” and transform it into a language of cold, hard numbers that a computer can compute?

The answer is a two-step dance, and the all-important first step is Tokenization.

  1. Step 1: Tokenization: chops text into pieces (words/sub-words)

  2. Step 2: Vectorisation: converts words into a special code of numbers, allowing the computer to understand what words mean and how they relate to each other.

Diagram: 2-Step process (Tokenization + Vectorisation)

This blog focuses entirely on Step 1: Tokenization.
This is the process of taking our whole text pizza and slicing it into manageable pieces, called tokens. But the story doesn't end with simple slices. To handle the infinite variety of human language, we need a smarter, more efficient way to chop: a method known as Byte-Pair Encoding (BPE), which I'll call the Pepperoni Salami Method throughout this guide to make it more intuitive.

Let’s walk through this entire process, from the initial chop to the final conversion into numbers, step-by-step.

Step 0: The Whole Pizza (The Text)

Imagine you have a full pizza. You’re hungry. Do you shove the whole thing in your mouth? (Please say no.) You slice it up! Now think of a sentence like:

"I want a pepperoni pizza!"

Our mission is to take this textual pizza and prepare it for our computer friend. But remember the golden rule: the computer's ultimate goal is to convert every slice into a number (a vector) it can use in its mathematical models. This whole process starts with something called tokenization.

Step 1: The Naive Approach: Word-Level Tokenization

This is the straightforward part. You grab your trusty pizza cutter (the tokenizer) and slice wherever you see natural breaks: in practice, you split the sentence at spaces and punctuation.

The Chop: "I want a pepperoni pizza!"["I", "want", "a", "pepperoni", "pizza", "!"]

Boom! We have tokens based on spaces and punctuation. These are our basic slices. This is called word-level tokenization.

What Simple Tokenization Does:
Space-based: "I want a pepperoni pizza!" → ["I", "want", "a", "pepperoni", "pizza", "!"]

Character-based: "pepperoni" → ["p", "e", "p", "p", "e", "r", "o", "n", "i"]

Rule-based: Might split on punctuation: "can't" → ["can", "'", "t"]

Diagram: Simple word-level tokenization splits text at spaces and punctuation marks.

The Numbering (Vectorisation): Now, we hand these slices to the computer. It looks up each token in a giant dictionary (its vocabulary) and replaces it with a unique ID number. So, our sentence might become:
[101, 245, 7, 30482, 456, 999]. This is the crucial bridge from words to numbers. The computer now happily works with [101, 245, 7, 30482, 456, 999], not the original words.
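
To make this concrete, here's a minimal Python sketch of word-level tokenization followed by the dictionary lookup. The tiny vocabulary, the ID numbers, and the UNK_ID value are invented purely for illustration; real models use vocabularies with tens of thousands of entries.

import re

# A tiny, made-up vocabulary: word → ID
vocab = {"I": 101, "want": 245, "a": 7, "pepperoni": 30482, "pizza": 456, "!": 999}
UNK_ID = 0  # reserved ID for words the vocabulary doesn't know

def word_tokenize(text):
    """Naive word-level tokenizer: split at spaces and punctuation."""
    return re.findall(r"\w+|[^\w\s]", text)

def encode(tokens):
    """Look each token up in the vocabulary; unknown words fall back to UNK_ID."""
    return [vocab.get(token, UNK_ID) for token in tokens]

tokens = word_tokenize("I want a pepperoni pizza!")
print(tokens)          # ['I', 'want', 'a', 'pepperoni', 'pizza', '!']
print(encode(tokens))  # [101, 245, 7, 30482, 456, 999]
print(encode(word_tokenize("I want a pepperonilicious pizza!")))
# [101, 245, 7, 0, 456, 999] ← "pepperonilicious" collapses to the unknown ID

That last line previews the problem we tackle next: any word missing from the dictionary is simply lost.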

The Fatal Flaw: The Out-of-Vocabulary (OOV) Problem

But what's the problem? Vocabulary size. If a new, rare word like “pepperonilicious” appears, it won’t be in the dictionary. The computer throws its hands up and replaces it with an <UNK> (unknown) token. This is the Out-of-Vocabulary (OOV) Problem. A model has a pre-built dictionary, called a vocabulary. Imagine it's a chef's pantry with jars, each labelled with a known word (token).

  • "I" -> Jar #101

  • "want" -> Jar #245

  • "pepperoni" -> Jar #30482

This works until you encounter a word not in the pantry, like "pepperonilicious". The chef has no jar for this. The system fails, marking it as <UNK> (Unknown). This is rigid and inefficient, requiring a nearly infinite pantry to handle every possible word. We need a smarter way to slice that builds a more efficient vocabulary.

Diagram: When a word isn't in the vocabulary, it becomes a <UNK> token, causing processing failures.

Step 2: Pepperoni Salami Method: A Tasty Guide to Subword Tokenization (Byte-Pair Encoding)

Let me introduce you to the "Pepperoni Salami Method" - the secret sauce behind how AI understands language. This analogy will change how you see every AI interaction.

The Core Insight: Why We Don't Use Raw Ingredients

Imagine you're a pizza chef. A customer orders a pepperoni pizza. What do you do?

The Wrong Way (Word-Level Thinking):

  • Grab raw spices, meat and casings

  • Throw them separately onto the pizza

  • Hope they magically become pepperoni in the oven

The Right Way (BPE Thinking):

  • Use pre-made pepperoni slices from a prepared salami log

  • Get consistent, delicious results every time

Byte-Pair Encoding (BPE) follows the "prepared slices" approach. Instead of working with raw characters, it builds reusable language components.

Step 1: The Basic Ingredients (Start with Raw Text)
First, you gather your raw, basic ingredients: ground meat, salt, paprika, garlic powder, and other spices. Similarly, BPE starts with the most basic units of text: every individual character—every letter, space, and punctuation mark. At this stage, a word is just a messy pile of raw ingredients. The word "pepperoni" is just a sequence of characters: p, e, p, p, e, r, o, n, i.

Technical Diagram: Start with raw text

Step 2: The First Mix (Finding Common Pairs):

In the Kitchen: You notice that paprika and garlic powder are almost always used together. Instead of measuring them separately every time, you create a "spice blend" jar.

In BPE: The algorithm analyses billions of words and finds that p and e frequently appear together. It merges them into a new token: pe

Technical Diagram: Creating the First Combinations

Step 3: Iterative Refining (Building the Salami Log):

In the Kitchen:
You don't stop with one blend. You systematically build up your pepperoni until you have a final, seasoned sausage log—the pepperoni.

  1. Spice blend (paprika and garlic powder mix) + meat = seasoned mixture

  2. seasoned mixture + curing = sausage log

  3. sausage log + ageing = pepperoni salami

In BPE:
The algorithm iteratively builds more complex tokens. Let's watch the training process in action.

BPE Training Process

Training Data = The large collection of text (books, websites, articles) that BPE analyses to learn which character combinations are most common.

Why it's required: BPE cannot decide which pairs to merge without seeing real-world text patterns. The training data tells it what's actually frequent and useful.

training_data = [
    "pepper pizza",    # Shows "pepper" is common
    "pepperoni pizza", # Shows "pepperoni" is common
    "hot pepper",      # Reinforces "pepper" frequency
    "macaroni cheese", # Provides "oni" pattern
    "delicious"        # Provides "licious" pattern
]

Step-by-Step BPE Process:

STEP 1: Start with Individual Characters

# All text split into single characters
# It doesn't know "pepper" or "pepperoni" exist as words yet!

Complete character breakdown of training data
"pepper pizza",    # → ['p','e','p','p','e','r',' ','p','i','z','z','a']
"pepperoni pizza", # → ['p','e','p','p','e','r','o','n','i',' ','p','i','z','z','a']
"hot pepper",      # → ['h','o','t',' ','p','e','p','p','e','r']
"macaroni cheese", # → ['m','a','c','a','r','o','n','i',' ','c','h','e','e','s','e']
"delicious"        # → ['d','e','l','i','c','i','o','u','s']

Initial Vocabulary (All Unique Characters)
{'p', 'e', 'r', 'o', 'n', 'i', 'a', 'z', 'h', 't', 'm', 'c', 's', 'd', 'l', 'u', ' '}
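
If you want to see this step in code, the initial vocabulary is just the set of unique characters in the training_data list above (a tiny sketch; real tokenizers also track word frequencies):

# The starting vocabulary is simply every distinct character in the training data
initial_vocab = set("".join(training_data))
print(sorted(initial_vocab))
# [' ', 'a', 'c', 'd', 'e', 'h', 'i', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'z']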

STEP 2: Count All Character Pairs

# Count how many times each pair appears in ALL training data:
# 'p'+'e' appears 6 times ← MOST FREQUENT
# Breakdown: 
# - "pepper" has 'p'+'e' at positions 0-1 and 3-4 = 2 times per occurrence
# - "pepper" appears twice in data = 2 × 2 = 4 times
# - "pepperoni" has 'p'+'e' at positions 0-1 and 3-4 = 2 times
# - Total: 4 + 2 = 6 times
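
Here's a small sketch of how those counts can be computed in Python, reusing the training_data list from above. For simplicity it splits on spaces first, so only pairs inside words are counted.

from collections import Counter

def count_pairs(corpus):
    """Count every adjacent token pair across all tokenized words."""
    pairs = Counter()
    for word in corpus:
        for left, right in zip(word, word[1:]):
            pairs[(left, right)] += 1
    return pairs

# Split the training data into words, then into characters
corpus = [list(word) for text in training_data for word in text.split()]
pair_counts = count_pairs(corpus)
print(pair_counts.most_common(1))
# [(('p', 'e'), 6)] ← exactly the count worked out above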

MERGE 1: Create 'pe' (Most Frequent Pair)

vocabulary.add('pe')
# BPE replaces every occurrence of the winning pair, so the words become:
# "pepper" → "pe" + "p" + "pe" + "r"
# "pepperoni" → "pe" + "p" + "pe" + "r" + "o" + "n" + "i"

STEP 3: Recount with New Tokens

# Now count pairs including the new 'pe' token:
# 'pe'+'p' appears 3 times ← TIED FOR MOST FREQUENT (we merge this one)
# 'p'+'pe' appears 3 times
# 'pe'+'r' appears 3 times

Why 'pe' + 'p' appears 3 times:
After the first merge, we look at the new token sequences:

  • "pepper" from "pepper pizza": ["pe", "p", "p", "e", "r"]1 occurrence of ('pe', 'p')

  • "pepper" from "hot pepper": ["pe", "p", "p", "e", "r"]1 occurrence of ('pe', 'p')

  • "pepperoni" from "pepperoni pizza": ["pe", "p", "p", "e", "r", "o", "n", "i"]1 occurrence of ('pe', 'p')

Total = 3 occurrences of the pair ('pe', 'p'), making it the new most frequent pair.

MERGE 2: Create 'pep'

vocabulary.add('pep')
# "pepper" → "pep" + "p" + "e" + "r"
# "pepperoni" → "pep" + "p" + "e" + "r" + "o" + "n" + "i"

CONTINUE Merging Most Frequent Pairs:

  • Next might merge 'pep'+'pe' to get 'peppe'

  • Then 'peppe'+'r' to get the full word token 'pepper'

  • Then 'o'+'n' and 'on'+'i' to get 'oni' (seen in both "pepperoni" and "macaroni")

  • Then 'pepper'+'oni' to get 'pepperoni', and, with enough text, 'licious' from "delicious"

  • Training never creates a 'pepperonilicious' token; that brand-new word is handled later by reusing the learned pieces 'pepperoni' + 'licious', as we'll see in a moment
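
Putting the counting and merging together, here's a deliberately simplified sketch of the whole BPE training loop, reusing count_pairs and corpus from the earlier snippet. Real implementations add end-of-word markers, word frequencies, and tens of thousands of merges; this toy version only shows the mechanics.

def merge_pair(word, pair):
    """Replace every adjacent occurrence of `pair` in `word` with one merged token."""
    merged, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
            merged.append(word[i] + word[i + 1])
            i += 2
        else:
            merged.append(word[i])
            i += 1
    return merged

def train_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules from a corpus of character-split words."""
    merges = []
    for _ in range(num_merges):
        pair_counts = count_pairs(corpus)
        if not pair_counts:
            break
        best_pair = pair_counts.most_common(1)[0][0]   # most frequent adjacent pair
        merges.append(best_pair)
        corpus = [merge_pair(word, best_pair) for word in corpus]
    return merges, corpus

merges, corpus = train_bpe(corpus, num_merges=5)
print(merges)     # e.g. [('p', 'e'), ('pe', 'p'), ('pep', 'pe'), ('peppe', 'r'), ('p', 'i')]
print(corpus[0])  # e.g. ['pepper'] (ties are broken arbitrarily, so your exact order may differ)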

The Magic: Handling Never-Seen-Before Words

Now for the cool part. What if someone orders a "pepperonilicious" pizza? You've never seen that word before! But you don't panic. You break it down using the efficient, pre-made chunks you've already mastered: "pepperoni" + "licious"

Technical Diagram: Handling Novel Words

Why This Works Brilliantly:

  • pepperoni = we know this from BPE Step 3

  • licious = known from "delicious" in BPE Step 3

  • Combined meaning = "extremely delicious like pepperoni"

This is the real power of BPE. The AI might never have seen the word "pepperonilicious" in its training data. But because it has learned efficient chunks, it doesn't need to start from scratch. It breaks the new word into meaningful pieces it already understands—pepperoni and licious—allowing it to handle the new concept with ease.
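
To see the never-seen-before trick in code, here's a short continuation of the sketch above: encoding a new word just means replaying the learned merge rules in order. With only our five toy merges the leftover pieces are single characters, but a tokenizer trained on real text would produce big chunks like "pepperoni" and "licious" instead of an <UNK>.

def bpe_encode(word, merges):
    """Tokenize a brand-new word by replaying the learned merges in order."""
    tokens = list(word)          # start from raw characters
    for pair in merges:          # apply each merge rule the tokenizer learned
        tokens = merge_pair(tokens, pair)
    return tokens

print(bpe_encode("pepperonilicious", merges))
# With the toy merges above (tie-breaking may vary), typically:
# ['pepper', 'o', 'n', 'i', 'l', 'i', 'c', 'i', 'o', 'u', 's']
# With a fully trained merge list: something like ['pepperoni', 'licious'] (never <UNK>)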

Step 3: Serving the Numbered Pizza (The Final Token IDs)

After applying our trained BPE rules, our original sentence is tokenized into efficient subword chunks. The final, crucial step is, once again, converting these chunks to numbers.

Original: "I want a pepperonilicious pizza!"

After BPE Slicing: ["I</w>", "want</w>", "a</w>", "pepperoni", "licious</w>", "pizza</w>", "!</w>"] (the </w> marker simply means "end of word")

Final Number Conversion: [101, 245, 7, 23481, 21540, 456, 999]

The beauty is that subword tokens like "pepperoni" and "licious" have their own IDs (here 23481 and 21540) that can be reused to construct countless other words, making the vocabulary compact and powerful.

The Power of Reusable Components

The Old Way (Word-Based Tokenization):

Imagine a kitchen that needs a separate, pre-made jar for every single word imaginable. It would need one jar for "pepperoni", a different jar for "extra-pepperoni", and a whole new jar it has never seen before for "pepperonilicious" (here, a jar is a token). This kitchen would need an infinite, impossible-to-manage warehouse. It's rigid and inefficient.

Diagram: Traditional Word-Based Approach

The Smart Way (BPE/Subword Tokenization):

This kitchen has a compact, smart pantry. It keeps jars of the most useful word parts, like the common compound "pepperoni" and the reusable suffix "licious". When a new order for "pepperonilicious" comes in, the chef simply grabs the "pepperoni" jar and the "licious" jar (again, a jar is a token). It's flexible, efficient, and ready for anything. The diagram below illustrates this efficient reuse: the same token (ID #5000 for 'pepperoni') is used across different contexts, avoiding the need for new tokens, as shown by the three jars representing the subword units 'pepperoni', 'extra', and 'licious'.

Diagram: BPE (Pepperoni Slices) Approach

So next time you see "pepperoni," remember: it's not just a tasty topping 😋 —it's a masterclass in how AI learns to speak our language by discovering the perfect, reusable chunks. 🍕

The Takeaway: You're Now a Tokenization Chef

So, to recap your new culinary skills:

  1. Step 0: You have a whole-text pizza.

  2. Step 1 (Basic Slicing): Use simple tokenization to get word slices, then convert them to numbers.

  3. Step 2 (Pepperoni & BPE): Use BPE to break words into efficient, reusable subword pieces. This builds a smarter vocabulary that minimises unknown words.

  4. Step 3 (The Final Serve): Convert these subword tokens into their numerical IDs, creating the perfect, machine-readable meal.

Real-World Examples: The Salami Method in Action

Example 1: Breaking Down Complex Words

The word "antidisestablishmentarianism" is broken into pieces the model already knows: ["anti", "dis", "establish", "ment", "arian", "ism"].

  • "anti" appeared frequently (in "anti-war", "antibiotic", "antivirus")

  • "dis" appeared frequently (in "dislike", "disagree", "disable")

  • "establish" appeared frequently (in "establishment", "established")

  • "ment" appeared frequently (in "government", "development")

  • "ism" appeared frequently (in "capitalism", "socialism")

    The "meaningful" breakdown is a coincidental byproduct of statistical frequency.

"antidisestablishmentarianism" → ["anti", "dis", "establish", "ment", "arian", "ism"]

Example 2: Handling Misspellings & Variations

The chunks ["pe", "per", "oni"] overlap statistically with the correct spelling's chunks ["pepper", "oni"] enough that the model can still generate a reasonable response based on pattern recognition.

"peperoni" (misspelled) → ["pe", "per", "oni"]  # Still understood!
"pepperonipizza" (no space) → ["pepperoni", "pizza"]

What Simple Tokenizer Doesn't Do (That BPE Does):

1. Doesn't Break Words Internally

  • Simple: "pepperoni" → Stays as one chunk ["pepperoni"]

  • BPE: "pepperoni" → Can break into ["pepper", "oni"] . It doesn't "realise" pepperoni contains pepper → but it statistically learns that "pepper" is a frequent character sequence

2. Doesn't Reuse Word Parts

  • Simple: "pepperoni" and "peppermint" are completely separate, unrelated tokens

  • BPE: "pepperoni" and "peppermint" can share the same reusable "pepper" token (and its ID) for the part they have in common

3. Doesn't Handle Unknown Words

  • Simple: "pepperonilicious" → Fails with [UNK]

  • BPE: "pepperonilicious" → Builds from known parts ["pepperoni", "licious"]. It doesn't "understand" food relationships → but it learns that "pepperoni" and "pizza" often appear together in training data

4. Doesn't Learn from Data Patterns

  • Simple: Depends on fixed rules (spaces, punctuation only), not on training data

  • BPE: Learns which character combinations are most frequent and useful

5. Doesn't Handle Misspellings Gracefully

  • Simple: "peperoni" (misspelled) → Fails completely

  • BPE: "peperoni" → Approximates with ["pe", "per", "oni"]

In essence, the Simple tokenizer just splits text, while BPE analyses and reuses internal word structure based on statistical patterns in the training data.

Why This Revolutionised AI Language Understanding

Before BPE (The Dark Ages):

  • Vocabulary size: 200,000+ words

  • Couldn't handle new words

  • Wasted storage on rare words

  • Rigid and brittle

After BPE (The Enlightenment):

  • Vocabulary size: ~50,000 subword units

  • Handles infinite new words

  • Efficient and compact

  • Flexible and robust

Technical Diagram: The Efficiency Revolution

Why AI Chokes on Pizza, Math, and Made-Up Words: The Delicious Limits of Tokenization

Understanding this isn't just nerdy trivia. It helps you see why AI acts the way it does:

Why does it stop mid-word?

The chef has a small countertop. He can only fit 10 jars on it at a time.

This small countertop is the "token limit."

If your order is too long, he can only line up 10 jars. He has to make the pizza with just those, and then his counter is full. He can't even grab the "oni" to finish the word "pepperoni" because there's no space!
GPT, in this case, produces output like: "It's like a sentence that gets cut off mid-w".
It's forced to stop exactly when its token limit is maxed out, even if that means cutting a word or a block of code in half. The response you get is simply whatever was generated before the limit was hit.

There are two main types of solutions for this issue:

  1. Technical Solutions (Handled by the System):

    • Sliding Window: The model processes the text in chunks, like a sliding window, keeping only the most recent tokens within the limit. It loses the broader context from the beginning.

    • Summarisation/Abstraction: A smarter system first summarises or extracts key information from a long document and then feeds only that condensed version into the model, staying under the token limit.

    • Hierarchical Processing: The system breaks the long text into parts, processes each part separately, and then combines the results. While each part has its own set of tokens, no single part exceeds the model's token limit.

  2. User Solutions (What You Can Do):

    • Better Prompts: The most common and effective fix. You can instruct the model to "be concise", "summarise your next answer", or "continue from where you left off."

    • Provide a Summary: If you have a long document, you can provide a summary yourself and ask the model to work with that.

    • Chunking: You break your long query into smaller parts and have multiple conversations (each with its own token limit) with the model, piecing the answers together yourself.

In short, the solution isn't to make the countertop bigger (which is hardware-limited) but to be smarter about what we put on it—prioritising the most crucial information and using techniques to manage long context.
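
Here's a hedged sketch of the chunking idea from the list above: count tokens with a real tokenizer and split a long text into overlapping windows so no single request blows past the limit. The 50-token budget and 10-token overlap are arbitrary illustration values, and tiktoken is again assumed to be installed.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_by_tokens(text, max_tokens=50, overlap=10):
    """Split text into overlapping windows of at most max_tokens tokens each."""
    ids = enc.encode(text)
    step = max_tokens - overlap                 # how far the window slides each time
    windows = [ids[start:start + max_tokens] for start in range(0, len(ids), step)]
    return [enc.decode(window) for window in windows]

long_text = "I want a pepperonilicious pizza with extra pepperoni! " * 40
chunks = chunk_by_tokens(long_text)
print(len(chunks), "chunks, each built from at most 50 tokens")
print(chunks[0])

Each chunk can then be summarised or answered separately and the results stitched back together, which is exactly the hierarchical-processing idea described above.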

Why is tokenization bad at math?

  • The Problem: Numbers Get Chopped Into Pieces

    You ask: "What is 1,234 + 5,678?"

    What the AI actually sees after tokenization:

      Original: "What is 1,234 + 5,678?"
    
      Tokenized: ["What", " is", " 1", ",", "234", " +", " 5", ",", "678", "?"]
    
      Token IDs: [200, 201, 202, 203, 204, 205, 206, 203, 207, 208]
    

    Why This Creates Math Problems

    The AI's Challenge:

    • It receives: [202, 203, 204] for "1,234"

    • These are three separate tokens, not one number

    • The model has to work out that, together, these pieces mean "one thousand two hundred thirty-four"

    • The comma , token appears in many contexts (lists, thousands separators)

  • Compare to Human Thinking:

        Human: "1,234" → Single concept: One thousand two hundred thirty-four
        AI: "1,234" → Three concepts: ["1", ",", "234"]
    
  • Real Calculation Example

    Let's trace what happens with a simpler problem: Problem: "Calculate 25 + 38"

      ["Calculate", " 25", " +", " 38"]
    
      Token IDs: [150, 251, 205, 252]
    
  • The AI's Thought Process:

    The Risk: If the AI recognises an addition pattern here, it works fine and adds the numbers (25 + 38 = 63). But if it hasn't seen enough examples of addition with numbers like these, it may calculate incorrectly, because it's working with token patterns rather than true mathematical understanding.

  • The Silver Lining:

    GPT now has to do math by looking at a number jar ("1"), a comma jar (","), and another number jar ("234"). It doesn't see "one thousand two hundred thirty-four" as one thing; it sees three separate, weird ingredients. Trying to add numbers this way is awkward, which is why it often results in incorrect math.

    The silver lining: while tokenization initially breaks numbers into awkward pieces, that is only the first processing step. The model's deeper layers can often reassemble those pieces and handle everyday arithmetic correctly from there.
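
You can watch the number-chopping happen with the same kind of inspection used earlier (again assuming the tiktoken package is installed; the exact splits vary by encoding):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for expr in ["1,234 + 5,678", "25 + 38"]:
    pieces = [enc.decode([i]) for i in enc.encode(expr)]
    print(f"{expr!r} → {pieces}")
# "1,234" does not arrive as a single token: it is usually split into digit and
# comma pieces, which is exactly why token-level pattern matching makes arithmetic hard.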

So, Why Should You Care?

🤖 Because you're no longer just an AI user — you're an AI understander. 🧠

Every time you chat with ChatGPT, use a translator, or ask Siri a question, you're tapping into the invisible tokenization engine that turns human language into machine numbers. 🔤➡️🔢
What once seemed like magic ✨ now has a clear, logical process behind it.

You now hold the secret decoder ring for AI's quirks.
When it cuts off mid-sentence, you know it hit a token limit. 🛑
When it struggles with math, you get why. 🧮
When it understands a word it’s never seen, you appreciate the elegance of reusable pieces. 🧩

This knowledge transforms you from a passive consumer into an informed user.
You can write better prompts, debug weird replies, and truly appreciate the engineering marvel behind modern AI. 🚀

So next time you see a slice of 🍕 pepperoni pizza, you’ll see more than just a tasty topping — you’ll see the core principle that lets computers understand our world.
You're not just eating lunch — you're glimpsing the secret sauce behind the AI revolution. 🍅🤖

Welcome to the club of those who know how the magic works. 🪄🎩

💬 What surprised you most about how AI understands language? Share your thoughts below! 👇
