Replacing GPU Compute Dies With PNM-Enabled HBM Cubes For Long-Context Decode Attention (UCSD, Columbia, Yonsei U., NVIDIA, Samsung)
Researchers from UCSD, Columbia, Yonsei University, NVIDIA, and Samsung propose AMMA, a memory-centric architecture that replaces GPU compute dies with HBM cubes enabled for processing-near-memory (PNM) to cut latency for long-context attention at 1M-token contexts. The design doubles available memory bandwidth and introduces a hybrid parallelism scheme tailored to the memory-bound decode phase of inference, a bottleneck that becomes critical as agentic workloads push toward million-token contexts.
Semiconductor Eng · 6 min read