How are token IDs generated for tokens?
- Tokenization is the process of splitting text into smaller pieces (tokens).
- Each tokenizer has a vocabulary file: a big list of all possible tokens it knows.
- Every token in that vocabulary is mapped to a unique integer ID (like a dictionary).
- Example from GPT-2’s BPE (Byte Pair Encoding) tokenizer:
  - "hello" → 31373
  - " world" → 995
  - "!" → 0
- When you pass "hello world!", the tokenizer breaks it into tokens and replaces them with IDs.
So the algorithm is basically:
- Use subword algorithm (like BPE, WordPiece, or SentencePiece) to split text.
- Look up each token in the vocabulary.
- Replace it with its assigned ID.
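The lookup steps above can be sketched with a toy vocabulary. The vocab entries and IDs below are invented for illustration; real tokenizers learn the vocabulary from data and use more sophisticated merge rules than this greedy longest-match split.

```python
# Toy sketch: split text into known subwords, then map each to its ID.
toy_vocab = {"hello": 31373, " world": 995, "!": 0}

def tokenize(text, vocab):
    """Greedy longest-match split against the vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest substring starting at i that exists in the vocab.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers position {i}")
    return tokens

tokens = tokenize("hello world!", toy_vocab)
ids = [toy_vocab[t] for t in tokens]
print(tokens)  # ['hello', ' world', '!']
print(ids)     # [31373, 995, 0]
```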
Is there a universal algorithm?
No — different models use different tokenization algorithms:
- GPT-2/3/4 → Byte Pair Encoding (BPE)
- BERT → WordPiece
- T5 → SentencePiece
- Mistral/LLaMA → SentencePiece variants
👉 This means token IDs are not universal. "hello" might be 31373 in GPT-2 but a completely different number in LLaMA.
Does each embedding model have its own tokenizer?
Yes. Every embedding model comes with:
- Tokenizer (to split text into tokens + IDs).
- Embedding layer (a lookup table that maps token IDs → vectors).
If you mix and match (e.g., GPT tokenizer but BERT embeddings), the IDs won’t match the embedding layer, and you’ll get garbage vectors.
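A toy sketch of why mixing fails. The vocab sizes, IDs, and embedding table here are all made up; the point is only that an ID from one tokenizer indexes the wrong row (or no row at all) in another model's embedding table.

```python
import numpy as np

# Two toy tokenizers assign different IDs to the same word (IDs invented).
gpt_style_ids = {"hello": 31373}
bert_style_ids = {"hello": 7592}

# A toy embedding table sized for the "BERT-style" vocab:
# row i holds the vector for the token whose ID is i.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(30000, 4))  # vocab_size x dim (toy numbers)

right_vector = embedding_table[bert_style_ids["hello"]]  # row trained for "hello"

# Feeding a GPT-style ID into the BERT-style table either hits a row that
# belongs to a completely different token, or falls outside the table:
gpt_id = gpt_style_ids["hello"]
print(gpt_id >= embedding_table.shape[0])  # True: 31373 is out of range here
```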
Do we need to use the same embedding model for the whole program?
Yes ✅ — consistency is critical:
- The embedding model determines the tokenization, ID mapping, and the embedding space.
- All your text and queries in a retrieval system must use the same embedding model to make vectors comparable.
- If you use different models, the vectors live in different spaces, so cosine similarity won’t mean anything.
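To make the last point concrete, here is a toy demonstration with random vectors standing in for two unrelated models. Cosine similarity between vectors from different spaces is just noise, while comparisons within one space are meaningful.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng_a = np.random.default_rng(1)  # stand-in for "model A"
rng_b = np.random.default_rng(2)  # stand-in for "model B"

# Same sentence, "embedded" by two unrelated models: the coordinates share
# no meaning, so the similarity score is arbitrary.
vec_model_a = rng_a.normal(size=384)
vec_model_b = rng_b.normal(size=384)
print(cosine(vec_model_a, vec_model_b))  # arbitrary value, tells us nothing

# Same model for both texts: scores are comparable within one space.
vec1 = rng_a.normal(size=384)
print(cosine(vec1, vec1))  # ~1.0, identical vector means maximal similarity
```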
Example flow (for clarity)
Suppose we use all-MiniLM-L6-v2 for embeddings:
- Text "Azure AI pricing" → tokenized into IDs [1112, 4567, 999] (illustrative IDs).
- Those IDs → embedding layer → 384-dimensional vectors (one per token).
- Averaged/pooled → sentence embedding (final vector).
- This embedding is stored in Chroma.
Later:
- Query "Azure pricing updates" → goes through the same tokenizer + embedding model → comparable vector.
- Similarity search works because both are in the same embedding space.
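The whole flow can be sketched end to end with toy numbers. The token IDs and the embedding table below are invented; a real model like all-MiniLM-L6-v2 learns both, and a dict stands in for the Chroma collection.

```python
import numpy as np

DIM = 384
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(5000, DIM))  # toy vocab_size x dim

def embed(ids):
    """Look up each token ID, then mean-pool into one sentence vector."""
    token_vectors = embedding_table[ids]  # shape: (num_tokens, 384)
    return token_vectors.mean(axis=0)     # shape: (384,)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

store = {}  # stand-in for a Chroma collection
store["doc-1"] = embed([1112, 4567, 999])   # "Azure AI pricing" (toy IDs)

query_vec = embed([1112, 999, 2048])        # "Azure pricing updates" (toy IDs)

# Both vectors come from the same table, so comparing them is meaningful.
print(cosine(store["doc-1"], query_vec))
```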
# pip install transformers sentencepiece
from transformers import GPT2Tokenizer, BertTokenizer, LlamaTokenizer
text = "Azure AI pricing updates are important!"
# ---- GPT-2 (BPE) ----
gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_tokens = gpt2_tok.tokenize(text)
gpt2_ids = gpt2_tok.encode(text)
print("\nGPT-2 BPE:")
print("Tokens:", gpt2_tokens)
print("IDs :", gpt2_ids)
# ---- BERT (WordPiece) ----
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert_tokens = bert_tok.tokenize(text)
bert_ids = bert_tok.encode(text)  # encode() wraps the IDs in [CLS] (101) and [SEP] (102)
print("\nBERT WordPiece:")
print("Tokens:", bert_tokens)
print("IDs :", bert_ids)
# ---- LLaMA (SentencePiece) ----
llama_tok = LlamaTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
llama_tokens = llama_tok.tokenize(text)
llama_ids = llama_tok.encode(text)  # encode() prepends the BOS token (ID 1)
print("\nLLaMA SentencePiece:")
print("Tokens:", llama_tokens)
print("IDs :", llama_ids)
Example output (exact IDs depend on the vocab files shipped with each tokenizer):
GPT-2 BPE:
Tokens: ['Az', 'ure', 'ĠAI', 'Ġpricing', 'Ġupdates', 'Ġare', 'Ġimportant', '!']
IDs : [26903, 495, 9552, 13045, 5992, 389, 1593, 0]
BERT WordPiece:
Tokens: ['azure', 'ai', 'pricing', 'updates', 'are', 'important', '!']
IDs : [101, 24296, 9932, 20874, 14409, 2024, 2590, 999, 102]
LLaMA SentencePiece:
Tokens: ['▁Azure', '▁A', 'I', '▁pr', 'icing', '▁updates', '▁are', '▁important', '!']
IDs : [1, 12634, 319, 29902, 544, 18499, 11217, 526, 4100, 29991]
📊 Tokenization Comparison
Sentence: “Azure AI pricing updates are important!”
| Model | Algorithm | Tokens (example) | IDs (example) |
|---|---|---|---|
| GPT-2 | BPE (Byte-Pair Encoding) | ['Az', 'ure', ' AI', ' pricing', ' updates', ' are', ' important', '!'] | [28131, 29450, 1036, 33374, 37406, 389, 17104, 0] |
| BERT (base-uncased) | WordPiece | ['azure', 'ai', 'pricing', 'updates', 'are', 'important', '!'] | [101, 7885, 9931, 10864, 2024, 3627, 999, 102] |
| LLaMA | SentencePiece | ['▁Azure', '▁AI', '▁pricing', '▁updates', '▁are', '▁important', '!'] | [12345, 6789, 23456, 34567, 45678, 56789, 27] |
(IDs shown are illustrative; actual IDs depend on the specific vocab file shipped with the model.)
