Grouped-Query Attention (GQA) is a transformer attention variant that reduces inference memory and compute by sharing key and value projections across groups of query heads. Instead of each query head attending with its own independent key and value heads, multiple query heads reuse a smaller number of key/value heads, lowering KV cache size and speeding up decoding.
What is GQA (Grouped-Query Attention)?
In multi-head attention (MHA), each head has its own query, key, and value projections. During autoregressive inference, the keys and values computed for past tokens are stored in the KV cache so they need not be recomputed at every step. KV cache memory grows linearly with the number of key/value heads, the number of layers, and the context length.
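To make the scale concrete, here is a back-of-the-envelope sketch of per-sequence KV cache size under standard MHA. The model dimensions are hypothetical, chosen only to resemble a mid-size transformer:

```python
# Back-of-the-envelope KV cache size for standard multi-head attention.
# All parameters below are hypothetical, for illustration only.
n_layers = 32       # transformer layers
n_kv_heads = 32     # in standard MHA, every query head has its own K/V head
head_dim = 128      # dimension per head
seq_len = 8192      # context length in tokens
bytes_per_elem = 2  # fp16/bf16

# Two tensors (K and V) are cached per layer, per head, per token.
kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem
print(f"KV cache per sequence: {kv_cache_bytes / 2**30:.1f} GiB")  # 4.0 GiB
```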
GQA changes the head structure. It keeps many query heads for expressivity but uses fewer key/value heads. Query heads are partitioned into groups, and all queries in a group attend using the same key and value head outputs. This shrinks the cached key/value tensors by a factor equal to the grouping ratio (query heads per KV head) while keeping much of the quality benefit of multi-head queries.
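A minimal sketch of the grouped attention computation in PyTorch; the head counts are illustrative assumptions, and a real implementation would add causal masking, positional encoding, and an actual KV cache:

```python
import torch
import torch.nn.functional as F

# Minimal GQA forward pass: a sketch, not a production implementation.
# Head counts below are illustrative assumptions.
batch, seq_len = 2, 16
n_q_heads, n_kv_heads, head_dim = 8, 2, 64   # 8 query heads share 2 KV heads
group_size = n_q_heads // n_kv_heads          # 4 query heads per KV head

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand each KV head so every query head in its group sees the same K/V.
k = k.repeat_interleave(group_size, dim=1)    # -> (batch, n_q_heads, seq, dim)
v = v.repeat_interleave(group_size, dim=1)

scores = q @ k.transpose(-2, -1) / head_dim**0.5
attn = F.softmax(scores, dim=-1)
out = attn @ v                                # (batch, n_q_heads, seq, head_dim)
```

Note that only the un-expanded k and v (with n_kv_heads heads each) would live in the KV cache; the repeat_interleave expansion is transient compute-time work, so the memory savings are preserved.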
GQA generalizes multi-query attention (MQA). MQA is the extreme case where all query heads share a single key head and a single value head. GQA provides a middle ground between full MHA and MQA: it typically offers a better quality–efficiency trade-off than MQA, especially for large models and long contexts.
Where it’s used and why it matters
GQA is used in modern LLM architectures and serving stacks to improve throughput and enable longer context windows. It matters most for inference, where the KV cache is often the dominant GPU memory consumer. A smaller KV cache allows more concurrent sequences per GPU, reduces paging pressure, and lowers cost per token. GQA can also reduce latency, because decoding attention is typically memory-bandwidth bound and each step reads fewer key/value bytes.
Types
- Standard MHA (Multi-Head Attention): each query head has its own key/value head.
- GQA: several KV heads, each shared by a group of query heads.
- MQA (Multi-Query Attention): one KV head shared across all query heads.
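All three are points on a single spectrum controlled by the number of KV heads. A small sketch of the resulting per-layer KV cache ratio (the head counts are hypothetical):

```python
# KV cache size relative to MHA depends only on n_kv_heads / n_q_heads.
# Head counts below are hypothetical examples.
n_q_heads = 32

for name, n_kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    ratio = n_kv_heads / n_q_heads
    print(f"{name}: {n_kv_heads:2d} KV heads -> KV cache is {ratio:.0%} of MHA's")
# MHA: 32 KV heads -> KV cache is 100% of MHA's
# GQA:  8 KV heads -> KV cache is 25% of MHA's
# MQA:  1 KV heads -> KV cache is 3% of MHA's
```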
FAQs
- Does GQA reduce model quality?
  It can, depending on the grouping ratio, but many models preserve quality well with moderate grouping.
- Why is KV cache smaller with GQA?
  Because the number of cached key/value heads is reduced while keeping many query heads.
- Is GQA a training-only trick?
  No. It is an architectural choice that primarily benefits inference and serving.
- When should you prefer GQA over MQA?
  When MQA's quality loss is too high; GQA often recovers quality with a modest increase in memory.