ch03: MultiHeadAttention vs. MultiHeadAttentionWrapper context vectors computation efficiency #559

Answered by rasbt
pmilliotte asked this question in Q&A
Hi there,

I am glad you are liking the book, and that's a good question regarding the computational efficiency of the two implementations:

However, based on my understanding, isn't the key matrix in MultiHeadAttention num_heads times larger than the key matrices in each of the num_heads CausalAttention modules instantiated in MultiHeadAttentionWrapper?

You are right about that. But a single large matrix multiplication is often more efficient than multiple smaller matrix multiplications executed one after the other. In your case, I think the difference you observed is because your matrices are now too large 😅.
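To illustrate the point, here is a minimal sketch (not the book's code; the shapes are hypothetical placeholders) that compares one fused key projection against num_heads sequential per-head projections. The per-head weight matrices are slices of the fused one, so both paths produce the same keys and only the execution pattern differs.

```python
import time
import numpy as np

# Hypothetical shapes for illustration only.
num_tokens, d_in, head_dim, num_heads = 256, 768, 64, 12

rng = np.random.default_rng(0)
x = rng.standard_normal((num_tokens, d_in)).astype(np.float32)

# One fused weight matrix (MultiHeadAttention-style) vs. the same
# parameters split into num_heads per-head matrices
# (MultiHeadAttentionWrapper-style).
W_fused = rng.standard_normal((d_in, head_dim * num_heads)).astype(np.float32)
W_heads = [W_fused[:, i * head_dim:(i + 1) * head_dim] for i in range(num_heads)]

def fused():
    # Single large matmul.
    return x @ W_fused

def per_head():
    # num_heads smaller matmuls executed sequentially, then concatenated.
    return np.concatenate([x @ W for W in W_heads], axis=1)

# Both paths compute the same keys (up to float32 rounding).
assert np.allclose(fused(), per_head(), atol=1e-3)

def bench(fn, reps=50):
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

print(f"fused:    {bench(fused) * 1e3:.3f} ms")
print(f"per-head: {bench(per_head) * 1e3:.3f} ms")
```

The fused version typically wins because one large matmul gives the underlying BLAS kernel more work per call and avoids per-call overhead plus the final concatenation; the exact gap depends on the hardware and matrix sizes.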

I remember testing it with the training code in Chapter 5 and found that the MultiHeadAttentionWrapper was definitely slower. (You can try this experiment by opening the Ch05…

Answer selected by pmilliotte