FFPA (Split-D): yet another faster Flash Prefill Attention with a Split-D strategy, achieving O(1) SRAM complexity and O(d/4) register complexity for large headdim (> 256), 1.8x~3x 🎉 faster than SDPA. FFPA currently supports self-attention, cross-attention, grouped/multi-query attention, and causal attention with large headdim (D=320~1024), whereas standard FlashAttention-2 only supports headdim <= 256.
| Self Attention | Cross/Decode Attention | GQA/MQA Attention | Causal Attention | Headdim |
|---|---|---|---|---|
| ✔️(Nq = Nkv) | ✔️(Nq != Nkv) | ✔️(Nh_q % Nh_kv == 0) | ✔️(causal mask) | 320~1024 |
Note
FFPA has so far only been tested on Ampere and Ada architectures (e.g. A30, RTX 3080, L20, RTX 4090). It should also work on Hopper and other newer architectures, but performance may not be optimal there since FFPA does not yet leverage TMA for further optimization.
First, clone the repo and build the package from source (recommended: PyTorch>=2.11.0, CUDA>=13.0; run `pip uninstall ffpa-attn -y` first if you want to reinstall after code changes):
```bash
git clone https://github.com/xlite-dev/ffpa-attn.git
cd ffpa-attn
export MAX_JOBS=32 && python3 setup.py bdist_wheel
pip3 install dist/*.whl  # pip uninstall ffpa-attn -y
```

Note
FFPA supports:

- Cross-attention, where the query seqlen Nq may differ from the key/value seqlen Nkv.
- GQA/MQA attention, where Q has Nh_q heads and K/V have Nh_kv heads (requires Nh_q % Nh_kv == 0; group size = Nh_q / Nh_kv).
- Causal attention (pass causal=True); queries are aligned to the KV tail, i.e. Q row r attends to k <= r + (Nkv - Nq), which requires Nkv >= Nq.

K/V must share the same Nh_kv and Nkv.
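The constraints above can be summarized in a small pre-flight check. This is a hypothetical helper for illustration only, not part of the ffpa_attn API; it inspects only the `(B, H, N, D)` shape tuples, so it works with any tensor's `.shape`:

```python
def check_ffpa_inputs(q_shape, k_shape, v_shape, causal=False):
    """Validate (B, H, N, D) shapes against the constraints listed above.

    Hypothetical helper for illustration; not part of the ffpa_attn API.
    Returns the GQA group size (Nh_q // Nh_kv).
    """
    B, Nh_q, Nq, D = q_shape
    Bk, Nh_kv, Nkv, Dk = k_shape
    if k_shape != v_shape:
        raise ValueError("K and V must share the same (B, Nh_kv, Nkv, D)")
    if B != Bk or D != Dk:
        raise ValueError("Q and K/V must agree on batch size and headdim")
    if Nh_q % Nh_kv != 0:
        raise ValueError("GQA/MQA requires Nh_q % Nh_kv == 0")
    if causal and Nkv < Nq:
        raise ValueError("causal tail alignment requires Nkv >= Nq")
    return Nh_q // Nh_kv  # group size: 1 = MHA, Nh_q = MQA
```

For example, `check_ffpa_inputs((1, 32, 1024, 512), (1, 8, 4096, 512), (1, 8, 4096, 512))` returns a group size of 4.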
Minimal usage example — Self-Attention (B=1, H=32, N=8192, D=512):
```python
import torch
import torch.nn.functional as F
from ffpa_attn import ffpa_attn_func

# D: 32, 64, ..., 320, ..., 1024 (FA-2 supports <= 256; FFPA supports up to 1024).
B, H, N, D = 1, 32, 8192, 512  # batch_size, num_heads, seq_len, head_dim
q = torch.randn(B, H, N, D, dtype=torch.bfloat16, device="cuda")
k = torch.randn(B, H, N, D, dtype=torch.bfloat16, device="cuda")
v = torch.randn(B, H, N, D, dtype=torch.bfloat16, device="cuda")

# FFPA self-attention; layout follows SDPA: (B, H, N, D).
out = ffpa_attn_func(q, k, v)  # -> torch.Tensor of shape (B, H, N, D)
print(out.shape, out.dtype)
ref = F.scaled_dot_product_attention(q, k, v)
print(f"vs SDPA max_abs_err={(out - ref).abs().max().item():.4e}")
```

Cross-Attention or Decoding-Attention example (short query, long KV cache; Nq != Nkv):
```python
import torch
import torch.nn.functional as F
from ffpa_attn import ffpa_attn_func

# Short-query / long-KV, e.g. incremental decoding or cross-attention:
# Q: [B, H, Nq, D], K/V: [B, H, Nkv, D]; Nq can differ from Nkv but Nk == Nv is required.
B, H, D = 1, 8, 512
Nq, Nkv = 128, 8192
q = torch.randn(B, H, Nq, D, dtype=torch.bfloat16, device="cuda")
k = torch.randn(B, H, Nkv, D, dtype=torch.bfloat16, device="cuda")
v = torch.randn(B, H, Nkv, D, dtype=torch.bfloat16, device="cuda")
out = ffpa_attn_func(q, k, v)  # -> (B, H, Nq, D) = (1, 8, 128, 512)
print(out.shape, out.dtype)
ref = F.scaled_dot_product_attention(q, k, v)
print(f"vs SDPA max_abs_err={(out - ref).abs().max().item():.4e}")
```

Grouped-Query / Multi-Query Attention example (Q has more heads than K/V):
```python
import torch
import torch.nn.functional as F
from ffpa_attn import ffpa_attn_func

# GQA: Q has Nh_q heads, K/V share Nh_kv heads; group_size = Nh_q / Nh_kv.
# Typical Llama-3-style 32/8 ratio; MQA is the Nh_kv == 1 special case.
# FFPA targets large headdim, so we use D=512 here (FA-2 tops out at D=256).
B, D, Nq, Nkv = 1, 512, 1024, 4096
Nh_q, Nh_kv = 32, 8  # group_size = 4
q = torch.randn(B, Nh_q, Nq, D, dtype=torch.bfloat16, device="cuda")
k = torch.randn(B, Nh_kv, Nkv, D, dtype=torch.bfloat16, device="cuda")
v = torch.randn(B, Nh_kv, Nkv, D, dtype=torch.bfloat16, device="cuda")
out = ffpa_attn_func(q, k, v)  # -> (B, Nh_q, Nq, D) = (1, 32, 1024, 512)
print(out.shape, out.dtype)

# Reference: replicate K/V along the head dim to match Q's head count.
group_size = Nh_q // Nh_kv
k_ref = k.repeat_interleave(group_size, dim=1)
v_ref = v.repeat_interleave(group_size, dim=1)
ref = F.scaled_dot_product_attention(q, k_ref, v_ref)
print(f"vs SDPA max_abs_err={(out - ref).abs().max().item():.4e}")
```

Causal Attention example (self-attention causal; also supports chunked / decoding prefill with Nkv > Nq):
```python
import torch
import torch.nn.functional as F
from ffpa_attn import ffpa_attn_func

# Causal self-attention: Q row r attends to k <= r (standard triangular mask).
# FFPA is tuned for large headdim, so we keep D=512 as in the self-attn example.
B, H, N, D = 1, 8, 4096, 512
q = torch.randn(B, H, N, D, dtype=torch.bfloat16, device="cuda")
k = torch.randn(B, H, N, D, dtype=torch.bfloat16, device="cuda")
v = torch.randn(B, H, N, D, dtype=torch.bfloat16, device="cuda")
out = ffpa_attn_func(q, k, v, causal=True)
print(out.shape, out.dtype)
ref = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(f"vs SDPA max_abs_err={(out - ref).abs().max().item():.4e}")

# Chunked / decoding prefill: Nq < Nkv, queries aligned to the KV tail,
# so Q row r attends to k <= r + (Nkv - Nq). Requires Nkv >= Nq.
Nq, Nkv = 128, 8192
q = torch.randn(B, H, Nq, D, dtype=torch.bfloat16, device="cuda")
k = torch.randn(B, H, Nkv, D, dtype=torch.bfloat16, device="cuda")
v = torch.randn(B, H, Nkv, D, dtype=torch.bfloat16, device="cuda")
out = ffpa_attn_func(q, k, v, causal=True)
print(out.shape, out.dtype)  # (1, 8, 128, 512)
```

A runnable end-to-end example (with SDPA accuracy/perf comparison, covering both the aligned N=8192 and non-aligned N=8191 cases) is provided under examples/run_ffpa_attn.py:
```bash
CUDA_VISIBLE_DEVICES=0 python3 examples/run_ffpa_attn.py
```

We have extended FlashAttention for large headdim (D > 256) by implementing fine-grained tiling at the MMA level (GEMM style) for the Q@K^T and P@V matmuls. This keeps the SRAM usage for Q, K, and V constant at Br * 16 or Bc * 16 per tile (Br = Bc), giving an overall SRAM complexity of O(2 * Br * 16) ≈ O(1) and a register complexity of O(d/4). Consequently, FFPA can extend headdim beyond 256 while remaining faster than SDPA, with or without MMA Accumulation F32 (1.8x~3x 🎉 faster than SDPA EA).
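To see the Split-D idea outside the kernel, here is a minimal NumPy sketch (single head, single tile, plain softmax instead of the online softmax the real kernel uses): both matmuls stream over headdim in 16-wide slices, so the Q/K/V working set per step is Br * 16 regardless of d. This is an illustration of the tiling scheme only, not the CUDA implementation:

```python
import numpy as np

def splitd_attention(q, k, v, dslice=16):
    """Reference Split-D attention: stream both matmuls over d in 16-wide slices."""
    N, d = q.shape
    assert d % dslice == 0
    # S = (Q @ K^T) / sqrt(d), accumulated one d-slice at a time:
    # each step only touches an (N, dslice) slab of Q and of K.
    s = np.zeros((N, N), dtype=np.float32)
    for i in range(0, d, dslice):
        s += q[:, i:i + dslice] @ k[:, i:i + dslice].T
    s /= np.sqrt(d)
    # Row-wise softmax (the real kernel computes this online, block by block).
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    # O = P @ V, produced one 16-wide output slice at a time.
    o = np.empty((N, d), dtype=np.float32)
    for i in range(0, d, dslice):
        o[:, i:i + dslice] = p @ v[:, i:i + dslice]
    return o
```

The result is mathematically identical to unsliced attention; only the order in which the d dimension is visited changes, which is what bounds the per-step SRAM footprint.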
We have named this new attention tiling technique FFPA: Faster Flash Prefill Attention. FFPA does not introduce any additional VRAM requirement, so the HBM memory complexity remains the same as FlashAttention.
By leveraging this approach, we can achieve better performance than SDPA EA for very large headdim (D > 256, not supported by FA-2). Approximate SRAM and register complexity analysis for FFPA is as follows: (d=headdim, C,Br,Bc=Constant, Br=Bc, let O(C)≈O(1)) 👇
| 📚Complexity Analysis | 📚FFPA Attention (Split-D) | 📚FlashAttention-2 |
|---|---|---|
| SRAM | O(2xBrx16)≈O(1) | ≈O(3xBrxd), d↑ |
| Register | ≈O(d/4), d↑ | ≈O(d/2), d↑ |
| HBM | ≈FA2≈O(Nd), O | ≈O(Nd), O |
| Extra HBM | ≈FA2≈O(N), m,l | ≈O(N), m,l |
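To make the table concrete, here is a back-of-envelope calculation of per-tile SRAM for fp16 (2 bytes/element) at an assumed tile size Br = Bc = 128, following the O(2xBrx16) and O(3xBrxd) terms above. These are illustrative numbers, not measured kernel footprints:

```python
def tile_sram_bytes(d, Br=128, elem_bytes=2):
    """Approximate per-tile Q/K/V SRAM, following the complexity table above."""
    ffpa = 2 * Br * 16 * elem_bytes  # Split-D: constant in d
    fa2 = 3 * Br * d * elem_bytes    # FA-2 style: Q, K, V tiles of full width d
    return ffpa, fa2

for d in (256, 512, 1024):
    ffpa, fa2 = tile_sram_bytes(d)
    print(f"d={d}: FFPA ~{ffpa // 1024} KiB vs FA-2-style ~{fa2 // 1024} KiB")
```

At d = 512 the FA-2-style tiles would need roughly 384 KiB, far beyond the shared memory of an Ampere SM (on the order of 100~164 KB), while the Split-D working set stays at 8 KiB for every d.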
📚Implementation: FFPA is implemented using pure MMA PTX instructions and supports many features, such as Split-Q, SMEM Swizzle/Padding, QKV Multi-Stages (1~4), Tile MMAs/Warps, Mixed MMA F32/F16 Acc (Q@K^T MMA Acc F32 + P@V MMA Acc F16), Fully Shared QKV SMEM, Prefetch QKV g2s, Persist Q s2r/g2s, Fully QKV Fine-grained Tiling (GEMM style), Collective Store, etc.
| ✔️Tensor Cores | ✔️MMA(m16n8k16) | ✔️Tile Block(Br, Bc) | ✔️Tile MMA/Warp |
|---|---|---|---|
| ✔️Split Q(FA-2) | ✔️Pack LDST(128 bits) | ✔️SMEM Swizzle/Pad | ✔️Copy Async |
| ✔️Reg Double Buffers | ✔️QKV Multi-Stages(1~4) | ✔️Collective Store(Shfl) | ✔️Prefetch QKV g2s |
| ✔️QKV Fine-grained Tiling | ✔️Shared QKV SMEM | ✔️Mixed MMA Acc | ✔️Persist Q s2r/g2s |
GNU General Public License v3.0
How to contribute? Welcome to star⭐️ this repo to support me👆🏻 ~
```bibtex
@misc{ffpa-attn@2025,
  title={FFPA: Yet another Faster Flash Prefill Attention for large headdim.},
  url={https://github.com/xlite-dev/ffpa-attn.git},
  note={Open-source software available at https://github.com/xlite-dev/ffpa-attn.git},
  author={DefTruth},
  year={2025}
}
```

