#gpu
1 paper
-
inspiration
Flash Attention Is an IO Problem
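Standard attention is slow not because of arithmetic but because of memory traffic: it writes the full N×N score matrix out to GPU high-bandwidth memory (HBM) and reads it back for the softmax and the value product. Flash Attention solves the IO problem, not the compute problem. It computes the same exact attention in tiles that stay in on-chip SRAM, so the N×N matrix is never materialized in HBM. That distinction matters for how you think about every inference optimization that follows.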
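To make the IO framing concrete, here is a minimal NumPy sketch of the tiling idea, not the actual Flash Attention kernel. It assumes single-head, unmasked attention and tiles only over keys and values. The real kernel also tiles the queries and runs fused on the GPU. The function names and the `block` parameter are hypothetical, chosen for illustration. NumPy has no notion of HBM versus SRAM, so the sketch shows only that a running max and a running denominator give the exact result without ever holding the full score matrix.

```python
import numpy as np

def standard_attention(Q, K, V):
    # Builds the full (N, N) score matrix. On a GPU this is the tensor
    # that gets written to and read back from HBM.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=32):
    # Illustrative sketch: processes K/V in blocks with an online softmax,
    # so only an (N, block) slice of scores exists at any time.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, V.shape[1]))      # unnormalized output accumulator
    m = np.full((N, 1), -np.inf)       # running row-wise max of the scores
    l = np.zeros((N, 1))               # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb = K[start:start + block]
        Vb = V[start:start + block]
        S = (Q @ Kb.T) * scale                         # (N, block) tile of scores
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        correction = np.exp(m - m_new)                 # rescales earlier partial sums
        P = np.exp(S - m_new)
        l = l * correction + P.sum(axis=-1, keepdims=True)
        O = O * correction + P @ Vb
        m = m_new
    return O / l

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((128, 16))
    K = rng.standard_normal((128, 16))
    V = rng.standard_normal((128, 16))
    # Both paths compute the same softmax(QK^T / sqrt(d)) V.
    assert np.allclose(standard_attention(Q, K, V), tiled_attention(Q, K, V))
```

The arithmetic in the two functions is essentially the same. What changes is how much intermediate state has to exist at once, which is the quantity that turns into memory traffic on real hardware.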