Visualize the attention weight matrices of each attention block within the GPT-2 (small) model as it processes a given prompt. Attention heads are stacked along the y-axis, while token-to-token interactions are displayed on the x- and z-axes.
Drag and zoom in to explore different parts of each block. Hover over specific points to see the actual attention weight values and the query-key pairs they represent.
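If you want to reproduce the tensors being rendered, here is a minimal sketch of how the per-block attention weights can be pulled out of GPT-2 small. Note this uses the Hugging Face Transformers API for illustration; this tool itself runs an ONNX export of the model in the browser instead.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

# GPT-2 small: 12 blocks, 12 heads per block.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)

inputs = tokenizer("Hello, attention!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple of 12 tensors (one per block), each of
# shape (batch=1, heads=12, query_tokens, key_tokens) -- the same
# head-by-query-by-key structure shown on the y-, x-, and z-axes.
for block_idx, attn in enumerate(outputs.attentions):
    print(f"block {block_idx}: {tuple(attn.shape)}")
```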
Acknowledgements
This project was inspired by Cho et al.'s Transformer Explainer (my tool actually uses the ONNX file from that project!), Brendan Bycroft's 3D LLM, and other great ML web visualizations.
I want to dedicate this project to my mother, who has always been my biggest supporter and was the first person to get me interested in AI/ML. As I write this, the date is April 18th, 2025... which is her birthday! Happy birthday, Mama — I hope you enjoy this little project! ❤️