~/ How an LLM processes text in the backend

Everyone uses ChatGPT to generate text today; let's understand how it actually works and how it processes text behind the scenes.

April 2, 2026 | 12 min read

AI
LLM
ChatGPT
Generative AI
Natural Language Processing
Text Processing
Machine Learning
Deep Learning

Before a Large Language Model (LLM) can understand or generate text, it first needs to break it down into smaller units called tokens.

A token is not always a word. It can be:

  • A full word (apple)
  • A subword (app + le)
  • A character (a, p, p, l, e)
  • Even punctuation (. , !)

Plain Text

Input: "ChatGPT is amazing!" Tokens: ["Chat", "G", "PT", " is", " amazing", "!"]

Different models split the same text differently, depending on the tokenizer they were trained with.


Tokenization is the process of converting raw text into tokens. Most modern LLMs use subword tokenization techniques like:

  • Byte Pair Encoding (BPE)
  • WordPiece
  • SentencePiece

Why subword tokenization? Because it:

  • Reduces vocabulary size
  • Handles unknown and rare words gracefully
  • Works across multiple languages

Plain Text

Word: "unbelievable" Tokenized: ["un", "believ", "able"]

A tokenizer typically works like this:

  1. Text is normalized (e.g. lowercased, extra spaces removed)
  2. The tokenizer matches known patterns from its vocabulary
  3. Text is split into the smallest meaningful units
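
You can see this in practice with OpenAI's open-source tiktoken library. A minimal sketch (the exact token IDs depend on the tokenizer, so treat the printed values as illustrative):

Python

# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models

text = "ChatGPT is amazing!"
token_ids = enc.encode(text)                   # text → integer token IDs
tokens = [enc.decode([t]) for t in token_ids]  # decode each ID back to its text

print(token_ids)  # a list of integers; exact values depend on the tokenizer
print(tokens)     # e.g. ['Chat', 'G', 'PT', ' is', ' amazing', '!'] (varies by tokenizer)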

Token training refers to how a model learns relationships between tokens during training.

The model is trained on one simple objective: predict the next token given the previous tokens.

Plain Text

Input: "The sky is" Target: "blue"

During training:

  • Input tokens → model
  • Model predicts a probability distribution over the vocabulary
  • Loss is calculated (how far the prediction was from the actual token)
  • Weights are updated using backpropagation

This process is repeated over billions of tokens.
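
As a toy illustration of one training step (the model, sizes, and token IDs below are all made up; a real LLM is a transformer trained at vastly larger scale):

Python

import torch
import torch.nn as nn

# Toy stand-in for an LLM: embeds 3 tokens and predicts the 4th.
vocab_size = 50_000
model = nn.Sequential(
    nn.Embedding(vocab_size, 128),   # token IDs → vectors
    nn.Flatten(),                    # (1, 3, 128) → (1, 384)
    nn.Linear(3 * 128, vocab_size),  # → logits over the whole vocabulary
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

input_ids = torch.tensor([[101, 2342, 88]])  # "The sky is" (made-up IDs)
target_id = torch.tensor([3391])             # "blue" (made-up ID)

logits = model(input_ids)                              # predicted distribution
loss = nn.functional.cross_entropy(logits, target_id)  # distance from the actual token
loss.backward()        # backpropagation computes gradients
optimizer.step()       # weights are updated
optimizer.zero_grad()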


PyTorch and TensorFlow are the two most popular deep learning frameworks used to build LLMs.

PyTorch:

  • Developed by Facebook (Meta)
  • Dynamic computation graph
  • Easy debugging
  • Widely used in research

TensorFlow:

  • Developed by Google
  • Static + dynamic graphs
  • Production-ready tools
  • Strong deployment ecosystem

Python

import torch
import torch.nn as nn

linear = nn.Linear(10, 5)  # fully connected layer: 10 inputs → 5 outputs
x = torch.randn(1, 10)     # one random input vector
output = linear(x)
print(output)              # tensor of shape (1, 5)

Transformers are the core architecture behind LLMs.

Main components:

  1. Token Embedding
  2. Positional Encoding
  3. Multi-Head Attention
  4. Feed Forward Network
  5. Transformer Block (stacked layers)

Token IDs are just integers, and a model cannot learn meaning from raw integers directly.

So we convert each token into a vector, called an embedding.

Plain Text

Token ID: 101 → [0.21, -0.33, 0.89, ..., 0.12]

This vector:

  • Captures meaning
  • Places similar words closer in vector space
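
In PyTorch this lookup is a single layer; a minimal sketch (the vocabulary and vector sizes are illustrative):

Python

import torch
import torch.nn as nn

# A learnable lookup table: one vector per token ID.
embedding = nn.Embedding(num_embeddings=50_000, embedding_dim=768)

token_ids = torch.tensor([101])  # a single token ID
vector = embedding(token_ids)    # look up its vector
print(vector.shape)              # torch.Size([1, 768])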

Transformers do not understand order naturally.

So we add positional information.

Plain Text

"dog bites man" ≠ "man bites dog"

Add positional encoding to embeddings.

Plain Text

Final Input = Token Embedding + Positional Encoding

This helps the model understand sequence order.
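
One common scheme is the fixed sinusoidal encoding from the original transformer paper; a sketch is below (many modern LLMs use learned or rotary position embeddings instead):

Python

import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    position = torch.arange(seq_len).unsqueeze(1)                  # (seq_len, 1)
    div_term = 10_000.0 ** (torch.arange(0, d_model, 2) / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)
    pe[:, 1::2] = torch.cos(position / div_term)
    return pe

# Final input = token embedding + positional encoding
pe = positional_encoding(seq_len=4, d_model=768)
print(pe.shape)  # torch.Size([4, 768])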


This is the most important part of transformers.

Each token looks at other tokens to understand context.

Plain Text

Sentence: "The bank of the river" "bank" attends to "river" → meaning = river bank

  1. Create Query (Q), Key (K), Value (V)
  2. Compute attention scores:

Plain Text

Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V

  3. Multiple heads → multiple perspectives
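
The formula translates almost line for line into PyTorch; a single-head sketch with toy sizes and no masking:

Python

import math
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)  # how much each token attends to the others
    weights = F.softmax(scores, dim=-1)              # each row sums to 1
    return weights @ V                               # weighted mix of the value vectors

seq_len, d = 5, 64
Q, K, V = (torch.randn(seq_len, d) for _ in range(3))
print(attention(Q, K, V).shape)  # torch.Size([5, 64])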

After attention, the data passes through a feed-forward neural network.

Plain Text

FFN(x) = max(0, xW1 + b1)W2 + b2

  • Applies non-linearity
  • Helps the model learn complex patterns
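
Expressed as PyTorch layers (the 4x hidden expansion is a typical but illustrative choice):

Python

import torch
import torch.nn as nn

d_model = 512
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),  # xW1 + b1
    nn.ReLU(),                        # max(0, ·)
    nn.Linear(4 * d_model, d_model),  # (·)W2 + b2
)

x = torch.randn(5, d_model)  # 5 token positions
print(ffn(x).shape)          # torch.Size([5, 512])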

A transformer block combines:

  1. Multi-head attention
  2. Add & Norm
  3. Feed Forward
  4. Add & Norm

Plain Text

Input
  ↓
Multi-Head Attention
  ↓
Add & Normalize
  ↓
Feed Forward
  ↓
Add & Normalize
  ↓
Output

LLMs stack dozens or hundreds of these blocks.
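
Putting the pieces together, a sketch of one post-norm block (real LLMs vary: pre-norm layouts, causal masking, different activations):

Python

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)  # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)      # add & normalize (residual connection)
        x = self.norm2(x + self.ffn(x))   # feed forward, then add & normalize
        return x

block = TransformerBlock()
x = torch.randn(1, 5, 512)  # (batch, seq_len, d_model)
print(block(x).shape)       # torch.Size([1, 5, 512])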


Now comes the most exciting part: text generation.

1. Input Prompt

Plain Text

"Once upon a time"

2. Tokenization

Plain Text

["Once", " upon", " a", " time"]

3. Forward Pass

  • Tokens → embeddings
  • Pass through transformer layers
  • Output = probability distribution over vocabulary

4. Next Token Prediction

Plain Text

Possible outputs:

"there" → 40%
"was"   → 35%
"a"     → 10%

5. Sampling Strategy

  • Greedy (pick highest)
  • Top-k
  • Top-p (nucleus sampling)
  • Temperature scaling
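
A sketch combining temperature scaling with top-k (greedy would simply take the argmax; top-p keeps tokens by cumulative probability instead of a fixed count):

Python

import torch

def sample_next_token(logits, temperature=1.0, top_k=50):
    logits = logits / temperature                     # <1 sharpens, >1 flattens the distribution
    top_logits, top_ids = torch.topk(logits, top_k)   # keep only the k most likely tokens
    probs = torch.softmax(top_logits, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)  # sample among the survivors
    return top_ids[choice].item()

logits = torch.randn(100)  # fake logits over a 100-token vocabulary
print(sample_next_token(logits, temperature=0.8, top_k=10))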

6. Append Token

Plain Text

"Once upon a time there"

7. Repeat

This loop continues until:

  • End token is reached
  • Max length is hit

Python

# Simplified pseudocode of the full generation loop
tokens = tokenizer("Once upon a time")  # steps 1–2: tokenize the prompt
for _ in range(max_length):             # step 7: repeat
    logits = model(tokens)              # step 3: forward pass → distribution
    next_token = sample(logits)         # steps 4–5: predict and sample
    tokens.append(next_token)           # step 6: append the token
text = tokenizer.decode(tokens)

LLMs do not "think" like humans.

They:

  • Predict the next token
  • Based on statistical patterns
  • Learned from massive amounts of text

Yet, this simple mechanism leads to:

  • Conversations
  • Code generation
  • Creativity

Understanding LLM internals reveals:

  • It's all math + probability
  • Transformers enable context understanding
  • Token prediction powers everything

From a simple next-token prediction system emerges something that feels intelligent.

And that is the beauty of modern AI.