How are token IDs generated for tokens?
- Tokenization is the process of splitting text into smaller pieces (tokens).
- Each tokenizer has a vocabulary file: a big list of all possible tokens it knows.
- Every token in that vocabulary is mapped to a unique integer ID (like a dictionary).
- Example from GPT-2’s BPE (Byte Pair Encoding) tokenizer:
  - "hello" → 31373
  - " world" → 995
  - "!" → 0
- When you pass "hello world!", the tokenizer breaks it into tokens and replaces them with IDs.
So the algorithm is basically:
- Use subword algorithm (like BPE, WordPiece, or SentencePiece) to split text.
- Look up each token in the vocabulary.
- Replace it with its assigned ID.
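The lookup steps above can be sketched with a toy vocabulary. The vocab entries and IDs below are invented for illustration; real tokenizers learn the vocabulary from data and use more sophisticated merge rules than this greedy longest-match split.

```python
# Toy sketch: split text into known subwords, then map each to its ID.
toy_vocab = {"hello": 31373, " world": 995, "!": 0}

def tokenize(text, vocab):
    """Greedy longest-match split against the vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest substring starting at i that exists in the vocab.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers position {i}")
    return tokens

tokens = tokenize("hello world!", toy_vocab)
ids = [toy_vocab[t] for t in tokens]
print(tokens)  # ['hello', ' world', '!']
print(ids)     # [31373, 995, 0]
```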
Is there a universal algorithm?
No — different models use different tokenization algorithms:
- GPT-2/3/4 → Byte Pair Encoding (BPE)
- BERT → WordPiece
- T5 → SentencePiece
- Mistral/LLaMA → SentencePiece variants
👉 This means token IDs are not universal. "hello" might be 31373 in GPT-2 but a completely different number in LLaMA.
Does each embedding model have its own tokenizer?
Yes. Every embedding model comes with:
- Tokenizer (to split text into tokens + IDs).
- Embedding layer (a lookup table that maps token IDs → vectors).
If you mix and match (e.g., GPT tokenizer but BERT embeddings), the IDs won’t match the embedding layer, and you’ll get garbage vectors.
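A toy sketch of why mixing fails. The vocab sizes, IDs, and embedding table here are all made up; the point is only that an ID from one tokenizer indexes the wrong row (or no row at all) in another model's embedding table.

```python
import numpy as np

# Two toy tokenizers assign different IDs to the same word (IDs invented).
gpt_style_ids = {"hello": 31373}
bert_style_ids = {"hello": 7592}

# A toy embedding table sized for the "BERT-style" vocab:
# row i holds the vector for the token whose ID is i.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(30000, 4))  # vocab_size x dim (toy numbers)

right_vector = embedding_table[bert_style_ids["hello"]]  # row trained for "hello"

# Feeding a GPT-style ID into the BERT-style table either hits a row that
# belongs to a completely different token, or falls outside the table:
gpt_id = gpt_style_ids["hello"]
print(gpt_id >= embedding_table.shape[0])  # True: 31373 is out of range here
```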
Do we need to use the same embedding model for the whole program?
Yes ✅ — consistency is critical:
- The embedding model determines the tokenization, ID mapping, and the embedding space.
- All your text and queries in a retrieval system must use the same embedding model to make vectors comparable.
- If you use different models, the vectors live in different spaces, so cosine similarity won’t mean anything.
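To make the last point concrete, here is a toy demonstration with random vectors standing in for two unrelated models. Cosine similarity between vectors from different spaces is just noise, while comparisons within one space are meaningful.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng_a = np.random.default_rng(1)  # stand-in for "model A"
rng_b = np.random.default_rng(2)  # stand-in for "model B"

# Same sentence, "embedded" by two unrelated models: the coordinates share
# no meaning, so the similarity score is arbitrary.
vec_model_a = rng_a.normal(size=384)
vec_model_b = rng_b.normal(size=384)
print(cosine(vec_model_a, vec_model_b))  # arbitrary value, tells us nothing

# Same model for both texts: scores are comparable within one space.
vec1 = rng_a.normal(size=384)
print(cosine(vec1, vec1))  # ~1.0, identical vector means maximal similarity
```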
Example flow (for clarity)
Suppose we use all-MiniLM-L6-v2 for embeddings:
- Text "Azure AI pricing" → tokenized into IDs [1112, 4567, 999] (illustrative IDs).
- Those IDs → embedding layer → 384-dimensional vectors (one per token).
- Averaged/pooled → sentence embedding (final vector).
- This embedding is stored in Chroma.
Later:
- Query "Azure pricing updates" → goes through the same tokenizer + embedding model → comparable vector.
- Similarity search works because both are in the same embedding space.
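The whole flow can be sketched end to end with toy numbers. The token IDs and the embedding table below are invented; a real model like all-MiniLM-L6-v2 learns both, and a dict stands in for the Chroma collection.

```python
import numpy as np

DIM = 384
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(5000, DIM))  # toy vocab_size x dim

def embed(ids):
    """Look up each token ID, then mean-pool into one sentence vector."""
    token_vectors = embedding_table[ids]  # shape: (num_tokens, 384)
    return token_vectors.mean(axis=0)     # shape: (384,)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

store = {}  # stand-in for a Chroma collection
store["doc-1"] = embed([1112, 4567, 999])   # "Azure AI pricing" (toy IDs)

query_vec = embed([1112, 999, 2048])        # "Azure pricing updates" (toy IDs)

# Both vectors come from the same table, so comparing them is meaningful.
print(cosine(store["doc-1"], query_vec))
```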
# pip install transformers sentencepiece
from transformers import GPT2Tokenizer, BertTokenizer, LlamaTokenizer
text = "Azure AI pricing updates are important!"
# ---- GPT-2 (BPE) ----
gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt2_tokens = gpt2_tok.tokenize(text)
gpt2_ids = gpt2_tok.encode(text)
print("\nGPT-2 BPE:")
print("Tokens:", gpt2_tokens)
print("IDs :", gpt2_ids)
# ---- BERT (WordPiece) ----
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert_tokens = bert_tok.tokenize(text)
bert_ids = bert_tok.encode(text)  # encode() wraps the IDs in [CLS] (101) and [SEP] (102)
print("\nBERT WordPiece:")
print("Tokens:", bert_tokens)
print("IDs :", bert_ids)
# ---- LLaMA (SentencePiece) ----
llama_tok = LlamaTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
llama_tokens = llama_tok.tokenize(text)
llama_ids = llama_tok.encode(text)  # encode() prepends the BOS token (ID 1)
print("\nLLaMA SentencePiece:")
print("Tokens:", llama_tokens)
print("IDs :", llama_ids)
Example output (exact IDs depend on the vocab files shipped with each tokenizer):
GPT-2 BPE:
Tokens: ['Az', 'ure', 'ĠAI', 'Ġpricing', 'Ġupdates', 'Ġare', 'Ġimportant', '!']
IDs : [26903, 495, 9552, 13045, 5992, 389, 1593, 0]
BERT WordPiece:
Tokens: ['azure', 'ai', 'pricing', 'updates', 'are', 'important', '!']
IDs : [101, 24296, 9932, 20874, 14409, 2024, 2590, 999, 102]
LLaMA SentencePiece:
Tokens: ['▁Azure', '▁A', 'I', '▁pr', 'icing', '▁updates', '▁are', '▁important', '!']
IDs : [1, 12634, 319, 29902, 544, 18499, 11217, 526, 4100, 29991]
📊 Tokenization Comparison
Sentence: “Azure AI pricing updates are important!”
| Model | Algorithm | Tokens (example) | IDs (example) |
|---|---|---|---|
| GPT-2 | BPE (Byte-Pair Encoding) | ['Az', 'ure', ' AI', ' pricing', ' updates', ' are', ' important', '!'] | [28131, 29450, 1036, 33374, 37406, 389, 17104, 0] |
| BERT (base-uncased) | WordPiece | ['azure', 'ai', 'pricing', 'updates', 'are', 'important', '!'] | [101, 7885, 9931, 10864, 2024, 3627, 999, 102] |
| LLaMA | SentencePiece | ['▁Azure', '▁AI', '▁pricing', '▁updates', '▁are', '▁important', '!'] | [12345, 6789, 23456, 34567, 45678, 56789, 27] |
(IDs shown are illustrative; actual IDs depend on the specific vocab file shipped with the model.)
