ch03: MultiHeadAttention vs. MultiHeadAttentionWrapper context vectors computation efficiency #559

Answered by rasbt
pmilliotte asked this question in Q&A
Hi there,

I am glad you are liking the book, and that's a good question regarding the computational efficiency of the two implementations:

However, based on my understanding, isn't the key matrix in MultiHeadAttention num_heads times larger than the key matrices in each of the num_heads CausalAttention modules instantiated in MultiHeadAttentionWrapper?

You are right about that. But a single large matrix multiplication is often more efficient than multiple smaller matrix multiplications executed one after the other. In your case, I think the difference you observed is because your matrices are now too large 😅.
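To illustrate the point, here is a minimal sketch (not the book's code; the shapes are hypothetical placeholders) that compares one fused key projection against num_heads sequential per-head projections. The per-head weight matrices are slices of the fused one, so both paths produce the same keys and only the execution pattern differs.

```python
import time
import numpy as np

# Hypothetical shapes for illustration only.
num_tokens, d_in, head_dim, num_heads = 256, 768, 64, 12

rng = np.random.default_rng(0)
x = rng.standard_normal((num_tokens, d_in)).astype(np.float32)

# One fused weight matrix (MultiHeadAttention-style) vs. the same
# parameters split into num_heads per-head matrices
# (MultiHeadAttentionWrapper-style).
W_fused = rng.standard_normal((d_in, head_dim * num_heads)).astype(np.float32)
W_heads = [W_fused[:, i * head_dim:(i + 1) * head_dim] for i in range(num_heads)]

def fused():
    # Single large matmul.
    return x @ W_fused

def per_head():
    # num_heads smaller matmuls executed sequentially, then concatenated.
    return np.concatenate([x @ W for W in W_heads], axis=1)

# Both paths compute the same keys (up to float32 rounding).
assert np.allclose(fused(), per_head(), atol=1e-3)

def bench(fn, reps=50):
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - t0) / reps

print(f"fused:    {bench(fused) * 1e3:.3f} ms")
print(f"per-head: {bench(per_head) * 1e3:.3f} ms")
```

The fused version typically wins because one large matmul gives the underlying BLAS kernel more work per call and avoids per-call overhead plus the final concatenation; the exact gap depends on the hardware and matrix sizes.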

I remember testing it with the training code in Chapter 5 and found that the MultiHeadAttentionWrapper was definitely slower. (You can try this experiment by opening the Ch05…

Answer selected by pmilliotte