The loss function is how LLMs learn in the first place
At the heart of every machine learning model, from simple linear regression to massive LLMs like GPT-4, is a loss function.
It measures how wrong the model’s predictions are.
For language models, the most common loss function is cross-entropy loss, which measures how well the model’s predicted probability distribution over words matches the actual next word.
Example with an LLM
During training, suppose the model sees the following sentence and has to predict the next word:
"The cat sat on the ____"
It might predict:
{"mat": 0.6, "table": 0.2, "sofa": 0.1, ...}
The true word is "mat", so the loss function (cross-entropy) calculates a score based on how close this predicted distribution is to the perfect one-hot truth (100% probability on "mat").
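Against a one-hot target, cross-entropy reduces to the negative log of the probability the model assigned to the correct word. A minimal sketch using the made-up probabilities from the example above (not a real model's output):

```python
import math

# The model's predicted distribution over candidate next words
# (illustrative numbers from the example, not real model output)
predicted = {"mat": 0.6, "table": 0.2, "sofa": 0.1}

# Cross-entropy against the one-hot truth reduces to the negative
# log probability assigned to the correct word.
true_word = "mat"
loss = -math.log(predicted[true_word])
print(f"loss = {loss:.4f}")  # ≈ 0.5108
```

Notice that a more confident correct prediction (say 0.9 on "mat") would give a lower loss, while putting most probability on the wrong word would send the loss sharply upward.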
The optimizer (like Adam or SGD) then tweaks billions of parameters to reduce this loss, improving future predictions.
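To make that loop concrete, here is a toy sketch of gradient descent on cross-entropy: three adjustable logits stand in for the billions of real parameters, and plain SGD stands in for Adam. The gradient of cross-entropy with respect to a logit is simply (predicted probability − one-hot target):

```python
import math

def softmax(scores):
    # Convert raw logits into a probability distribution
    exps = {w: math.exp(s) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

# Toy "parameters": one logit per candidate word (illustrative values)
logits = {"mat": 1.0, "table": 0.0, "sofa": -0.5}
lr = 0.1  # learning rate (hypothetical value)
losses = []

for step in range(3):
    probs = softmax(logits)
    loss = -math.log(probs["mat"])  # cross-entropy vs. one-hot truth
    losses.append(loss)
    # Gradient of cross-entropy w.r.t. each logit: prob - one_hot
    for w in logits:
        grad = probs[w] - (1.0 if w == "mat" else 0.0)
        logits[w] -= lr * grad  # SGD update
    print(f"step {step}: loss = {loss:.4f}")
```

Each update nudges the logits so "mat" gets more probability, and the loss drops step by step. An LLM does the same thing, just with transformer parameters instead of raw logits and Adam instead of vanilla SGD.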
Why is it important for LLMs?
LLMs are literally trained by minimizing this loss over trillions of tokens.
- The lower the loss, the better the model has learned to predict the next token in context.
No loss function = no way to measure improvement = no learning.
Why should you care?
Because understanding loss functions gives you insight into both the power and limitations of LLMs.
- A language model minimizes token prediction error, not “truth” or “logical consistency.”
- That’s why it can still hallucinate — its primary training goal was to guess the next token, not to guarantee factual correctness.
Takeaway:
Every amazing sentence an LLM generates was made possible by minimizing a loss function millions of times over.
In the world of AI, loss isn't bad. It's how models learn.
