Deep Learning 10 min read

Understanding Attention: The Core of Modern AI

Explore the fundamental concept of attention, its mathematical underpinnings, and its pivotal role in the evolution of large language models.

Anubhav Banerjee
Deep Learning Engineer & Software Developer

"Attention", as the name suggests, is a simple yet powerful idea. As humans, when we work on a task, our focus shifts depending on which stage of the task we are in. For instance, suppose you are making a short film. In the writing phase you give more attention to literary sources, writing styles, dialogue, and so on. In the shooting phase you give more "attention" to camera attributes like shutter speed, exposure, and frame rate.

What I want to convey here is that, no matter what you do, the importance you give to the other aspects of a project depends directly on how much influence they have on the task you are doing at that very moment.

This, my friend, is all that "Attention" really boils down to: a weighted average over all the other aspects of the task, where each weight reflects how strongly that aspect influences, or relates to, the task at hand.

Going from Words to Numbers

Now let me take you from this task-based intuition into the language that computers understand: numbers! When we read a sentence like "The child did not cross the road because it was too tired," we humans instantly know that "it" refers to the child, not the road. We give more attention to "child" when interpreting "it."

This is exactly what Attention does in transformers. For every word (or token) in a sequence, it calculates how much attention to pay, or importance to give, to every other word, and then takes a weighted average, as I mentioned earlier.

Example:

In 2A + 3B = C

We say that C gives more attention to B than to A, since 3 > 2.
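In vector terms, the toy equation above is just a weighted combination. Here is a minimal sketch, where the vectors chosen for A and B are invented purely for illustration:

```python
import numpy as np

# Hypothetical vectors for A and B (numbers invented for illustration)
A = np.array([1.0, 0.0, 2.0])
B = np.array([0.0, 1.0, 1.0])

# C "attends" more to B than to A, because B's weight (3) is larger
C = 2 * A + 3 * B
print(C)  # [2. 3. 7.]
```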

But here is the thing: before we can compute these weighted averages for words, we have to convert them into something numerical, something machines can actually work with. To do this, we represent every token by a vector: basically just a list of numbers.

Why Vectors?

Words are complex at their core: they carry context, meaning, emotion, and much more. The more numbers we use to represent them, the better our chances of capturing that complexity. Early LLMs like GPT-2 used around 1,200 numbers to represent each token. Today's models like DeepSeek-V3 and GPT-5 reportedly use around 16,500 numbers (the embedding dimension) per token.

Vector Representations

Word Vector
"Child" [15, 24, 1]

Similar Words

Plant: [1, 2, 3]
Tree: [1, 4, 6]

Different Word

Computer: [100, 150, 260]

Modern LLMs

GPT-2: ~1,200 dimensions
GPT-5: ~16,500 dimensions

But here is where it gets interesting. These vectors are not just random numbers. They are learned representations that encode the semantic essence of words through numbers. This means similar words that share related meanings end up having vectors that cluster together in this abstract high-dimensional space.

The Dot Product and Softmax

If we want to figure out whether two words share similar meanings or are closer to each other in context, like "it" and "Child" from earlier, we just need to measure how similar their vector representations are. And guess what? Mathematics has gifted us the perfect tool for exactly this: the dot product.
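To make this concrete, here is a tiny sketch with made-up 3-dimensional vectors. Real embeddings have thousands of dimensions, and these numbers are purely illustrative:

```python
import numpy as np

# Hypothetical embeddings for three tokens (invented numbers)
it    = np.array([3.0, 1.0, 2.0])
child = np.array([3.0, 1.0, 1.0])
road  = np.array([-1.0, 2.0, 0.0])

# A large dot product means the two vectors point in similar directions
print(np.dot(it, child))  # 12.0 -> high score: "it" is close to "child"
print(np.dot(it, road))   # -1.0 -> low (even negative) score for "road"
```

Note that the scores can come out negative, which is exactly why the next step normalizes them.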

The Process

Step 1 Take the dot product of the current word's vector with every other word's vector
Step 2 Normalize the scores using the softmax function
Step 3 Compute the weighted sum of all the vectors

But wait. Before taking the weighted sum, we first normalize all the scores into the range 0-1 using the softmax function. Why? Because raw dot-product scores can be negative, and some can be much larger than others, creating huge discrepancies. Softmax maps them all into 0-1 (and makes them sum to 1), while preserving their relative ordering.

Softmax Example

Original scores: [3, 6, 7]
After softmax: [0.013, 0.265, 0.721]

Notice how the relative order is preserved; the scores are simply normalized into probabilities that sum to 1.
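Softmax just exponentiates each score and divides by the sum of the exponentials. A minimal implementation reproduces the numbers above:

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; the result is unchanged
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

weights = softmax(np.array([3.0, 6.0, 7.0]))
print(weights.round(3))        # [0.013 0.265 0.721]
print(round(weights.sum(), 6)) # 1.0 -> the weights form a probability distribution
```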

The Final Vector

Now the final vector for, say, the token "it" becomes a weighted sum of all the vectors, including "it" itself (which, naturally, tends to receive a large weight). The result is a condensed, transformed representation of "it" that captures not just the meaning of the word in isolation, but also the context in which it is used, encoding, in a sense, the information about the child and the road.
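Putting the three steps together, here is a bare-bones sketch of attention for a single token. The embeddings are made up, and real transformers additionally apply learned query/key/value projections that are omitted here:

```python
import numpy as np

def softmax(scores):
    # Exponentiate and normalize so the weights sum to 1
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

# Hypothetical embeddings, one row per token (invented numbers)
tokens  = ["it", "child", "road"]
vectors = np.array([[ 3.0, 1.0, 2.0],   # it
                    [ 3.0, 1.0, 1.0],   # child
                    [-1.0, 2.0, 0.0]])  # road

query   = vectors[0]         # the vector for "it"
scores  = vectors @ query    # Step 1: dot product with every token's vector
weights = softmax(scores)    # Step 2: normalize the scores into 0-1 weights
new_it  = weights @ vectors  # Step 3: weighted sum of all the vectors

# "it" keeps most of the weight, "child" gets a meaningful share,
# and "road" is nearly ignored; new_it is the context-aware vector.
print(dict(zip(tokens, weights.round(3))))
print(new_it)
```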

Key Insight

What we have talked about so far is the fundamental concept of how Attention works. But this is just the beginning. Real-life transformers use more advanced variants such as "Multi-Head Self-Attention", which I will explain in my next blog.

Also, if you don't know the softmax function, I highly recommend studying it. It's super simple, yet arguably one of the most important mathematical functions out there; without it, the ML of today would simply not be possible.

Cheers!! 🎉

For the curious: If you're interested in the mathematical details, I highly recommend reading the original Transformer paper "Attention Is All You Need". The elegance of the approach becomes even more apparent when you see how the full mathematical formulation gives rise to these context-dependent representations.

What's Next?

Explaining Multi-Head Self-Attention

Building in Public

Are you also building machine learning systems from scratch? I'd love to hear about your journey and exchange insights.

Get in Touch