Forest: Access-aware GPU UVM Management

Author	Mao Lin, Yuan Feng, Guilherme Cox, Hyeran Jeon
Conference	ISCA
Year	2025

Overview

This paper examines why the existing Tree-based Neighboring Prefetcher (TBNp) in GPU Unified Virtual Memory causes unnecessary page migration and page thrashing when it applies one tree configuration to every memory block without knowing the access pattern, and how a heterogeneous TBNp configuration matched to each data object's access pattern reduces both. It repurposes the GPU hardware page access counters to track access recency, and the UVM driver uses that information to adjust the prefetch tree and the eviction policy dynamically.

Motivation

Unified Virtual Memory lets CPU memory serve as an extension of GPU memory, but when the GPU first touches a page that resides in CPU memory, far-fault handling and page migration land on the critical path. Modern GPU workloads and Deep Learning models keep growing in memory footprint, so UVM performance under memory oversubscription matters more and more.

One Configuration Doesn't Fit All: TBNp, as used in NVIDIA GPUs, manages each 2MB VABlock as a full binary tree and migrates data in units of 64KB leaf nodes. This reduces far-faults for workloads with strong locality, but it applies the same tree size and leaf size to every application and every data object. The paper reports experimentally that the baseline configuration (2MB tree, 64KB leaf) is not optimal for any of the 15 workloads. Even within a single application, access patterns such as LS, HCHI, HCLI, and LC differ across kernels and data objects, so a single homogeneous prefetcher is not enough.

TBNp, A Hidden Source of Memory Oversubscription: Beyond prefetching accuracy and timeliness, the paper shows that the prefetcher itself can worsen memory pressure under GPU memory oversubscription. According to the analysis in Figure 6, in some applications 5% to 48% of migrated pages are never accessed before eviction, and in some workloads the number of repeatedly thrashed pages exceeds 5.7x the memory footprint. This framing is useful as evidence that future UVM work cannot treat the prefetcher and the eviction policy separately.

Driver-driven Page Eviction and Thrashing: The other problem is eviction. The LRU in the existing UVM driver is based on the order of far-fault events, not on actual GPU-side access recency. A page that the device has used heavily and recently can therefore be evicted if it looks old in the fault history, after which it is migrated back in and page thrashing results.
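
The gap between fault order and access order can be illustrated with a small sketch. This is not the driver's code: the event trace and the function names `fault_order_victim` and `access_order_victim` are hypothetical, and each policy is reduced to an ordered dictionary for illustration.

```python
from collections import OrderedDict

def fault_order_victim(events):
    """Victim under a fault-driven LRU: only 'fault' events refresh a page."""
    order = OrderedDict()
    for kind, page in events:
        if kind == "fault":
            order.pop(page, None)
            order[page] = True  # move to most-recent position
    return next(iter(order))    # least recently faulted page

def access_order_victim(events):
    """Victim under true access recency: every access refreshes a page."""
    order = OrderedDict()
    for _kind, page in events:
        order.pop(page, None)
        order[page] = True
    return next(iter(order))    # least recently accessed page

# Hypothetical trace: A is migrated first, then stays hot without faulting again.
events = [
    ("fault", "A"),
    ("fault", "B"),
    ("fault", "C"),
    ("hit", "A"), ("hit", "A"), ("hit", "A"),
]

print(fault_order_victim(events))   # A: the hot page is evicted
print(access_order_victim(events))  # B: the hot page survives
```

In this trace the fault-driven policy evicts the page that the GPU is actively using, which is the situation that leads to re-migration and thrashing.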

TBNp

The paper does not discard TBNp; it makes TBNp's tree structure access-aware. Forest's reasoning is as follows.

TBNp itself is effective at reducing UVM far-faults.
However, a fixed 2 MB tree with 64 KB leaf nodes does not fit every access pattern.
Because the UVM driver alone cannot see enough of the GPU-side access pattern, TBNp can cause unnecessary migration and thrashing.
Therefore the access pattern of each data object should be detected, and the TBNp tree size and leaf size should be set per object.

Main Idea

The key idea is that access patterns differ across UVM-managed data objects, so instead of one fixed TBNp tree, the tree size and leaf size should vary per object. Forest observes page access order through GPU-side access counters, and the UVM driver classifies each object into one of four access patterns and selects the matching tree configuration.

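Recency matters in Forest because the existing UVM driver has little visibility into actual GPU access recency. The existing driver makes an LRU-like decision from page fault order, but a page that has not faulted may still be hot on the GPU. Forest's ATT uses the page access counters to record when each page and object was actually accessed. Put differently, access frequency answers “how many times was it accessed” while access recency answers “when was it last accessed”. With this information the driver can approximate an LRU order that reflects hardware-level accesses.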

With the information from ATT, patterns map to configurations as follows. Prior work classified accesses by regularity alone; this paper adds intensity as a second axis for a more fine-grained classification.

Linear/Streaming (LS): Pages are accessed sequentially with little reuse. A large tree and large leaves are used for aggressive prefetching.
Non-Linear High-Coverage High-Intensity (HCHI): Many pages across a wide address range are accessed quickly, but not linearly. A small tree with the default leaf size limits the prefetch range.
Non-Linear High-Coverage Low-Intensity (HCLI): A wide range is accessed sparsely. A small tree and small leaves reduce unnecessary migration.
Non-Linear Low Coverage (LC): Coverage is low or there is no clear pattern. The default TBNp configuration is used.
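
The four categories can be read as a decision procedure. The sketch below follows the criteria named in the Design section (an R^2 test for linearity, then coverage and intensity thresholds), but the threshold values, the exact definitions of coverage and intensity, and the names `r_squared` and `classify` are assumptions made for illustration, not the paper's parameters.

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple linear fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0
    return (sxy * sxy) / (sxx * syy)

def classify(samples, object_pages, r2_min=0.9, coverage_min=0.5, intensity_min=0.5):
    """Classify one object from (page offset, access time) samples.

    All thresholds are hypothetical. Coverage is taken here as the touched
    page span over the object size, and intensity as distinct touched pages
    over that span; both definitions are assumptions of this sketch.
    """
    if len(samples) < 2:
        return "LC"
    vpns = [v for v, _ in samples]
    times = [t for _, t in samples]
    if r_squared(vpns, times) >= r2_min:
        return "LS"
    span = max(vpns) - min(vpns) + 1
    if span / object_pages < coverage_min:
        return "LC"
    intensity = len(set(vpns)) / span
    return "HCHI" if intensity >= intensity_min else "HCLI"

# Hypothetical traces: access time is simply the sample index.
streaming = [(i, i) for i in range(16)]
dense = [(v, t) for t, v in enumerate([0, 15, 1, 14, 2, 13, 3, 12, 4, 11, 5, 10, 6, 9, 7, 8])]
sparse = [(v, t) for t, v in enumerate([0, 60, 8, 52, 16, 44])]
local = [(v, t) for t, v in enumerate([2, 0, 3, 1])]

print(classify(streaming, 16), classify(dense, 16), classify(sparse, 64), classify(local, 64))
```

Testing linearity first means a streaming object is never mistaken for a high-coverage irregular one, which matters because the two call for opposite prefetch aggressiveness.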

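SpecForest is an extension that reduces Forest's profiling delay. It reuses the pattern recorded for earlier runs of a repeated kernel, marks fixed-stride LS accesses ahead of time through compiler static analysis, and groups data objects with similar indirect indexing into similarity groups so that one object's classification propagates to the rest of the group.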
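
A rough sketch of how the three shortcuts could combine into one lookup. The class `PatternCache`, its methods, and the keying by kernel name and object id are hypothetical; the note does not specify how SpecForest stores these results.

```python
class PatternCache:
    """Hypothetical store for recorded patterns and similarity groups."""

    def __init__(self):
        self.recorded = {}  # (kernel name, object id) -> pattern
        self.members = {}   # group id -> list of object ids
        self.group_of = {}  # object id -> group id

    def set_group(self, group_id, object_ids):
        """Register a similarity group, e.g. as marked by the compiler."""
        self.members[group_id] = list(object_ids)
        for obj in object_ids:
            self.group_of[obj] = group_id

    def record(self, kernel, obj, pattern):
        """Store a classification and propagate it to the object's group."""
        group = self.group_of.get(obj)
        targets = self.members[group] if group is not None else [obj]
        for target in targets:
            self.recorded[(kernel, target)] = pattern

    def lookup(self, kernel, obj, static_ls=False):
        """Return a known pattern, or None if profiling is still required."""
        if static_ls:  # compiler marked a fixed-stride access
            return "LS"
        return self.recorded.get((kernel, obj))

cache = PatternCache()
cache.set_group("g0", ["a", "b"])
cache.record("spmv", "a", "HCLI")
print(cache.lookup("spmv", "b"))  # HCLI, inherited from "a"
print(cache.lookup("spmv", "c"))  # None, so "c" still needs profiling
```

Only a `None` result falls back to runtime profiling, which is how the shortcuts would reduce the number of profiling steps.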

Design

Access Time Tracker (ATT): An object table on the GPU GMMU side holds, for each UVM object, its VPN range, access timer, recency order, and cease bit. Where the existing page access counter holds access frequency, Forest writes the object-local access timer value into the page counter so that it encodes page access order. This is the key hardware change that makes device-side access recency readable by the driver.
Access Pattern Detector (APD): A UVM driver module that fetches the access counter information at each profiling interval and classifies the pattern per object. LS is decided by applying an R^2 threshold to a linear regression of page number against access time, and HCHI/HCLI/LC are separated by thresholds on the coverage of the accessed VPN range and on intensity, measured by the accessed page count. Once the pattern is decided, profiling of that object stops and the result is recorded in the pattern table.
Prefetch Engine (PE) extension: Keeps the existing UVM driver's TBNp traversal and migration path, but adds an isolation bit and a motion bit to every non-leaf node. The isolation bit splits the child subtrees into independent prefetch trees, which controls tree size; the motion bit treats the child nodes as a single basic block, which controls leaf size. These two bits build per-object heterogeneous trees on top of a base partition of 16KB units.
Access-aware eviction: Replaces the far-fault-based LRU with ATT's object recency order and the recency values in the page access counters. It first finds the least recently accessed object, then evicts the leaf node containing the least recently accessed page within that object. Searching per object instead of scanning all of global memory shrinks the search space while still reflecting GPU-side recency.
SpecForest: Adds pattern recording, static LS detection, and access similarity detection. The compiler marks UVM objects that have simple stride accesses or that belong to a similarity group based on an identical index expression, and the driver receives this information through an extended cudaMallocManaged flag so that it can set the initial tree configuration earlier.
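
A minimal sketch of how the isolation and motion bits could encode a tree size and leaf size inside one 2MB block, assuming the bits are set uniformly per tree level over the 16KB base partition. The functions `level_of`, `bits_for`, and `geometry` and the 512KB example size are hypothetical; the paper's actual encoding and sizes may differ, and trees larger than 2MB are not modeled here.

```python
KB = 1024
BASE = 16 * KB         # base partition unit
ROOT = 2 * 1024 * KB   # one 2MB VABlock

def level_of(size):
    """Tree level of a node of the given size; level 0 is one 16KB unit."""
    return (size // BASE).bit_length() - 1

def bits_for(tree_size, leaf_size):
    """Per-level bit settings for the non-leaf levels 1..root.

    Assumption: isolation is set on nodes larger than the tree size, so
    their children become independent trees; motion is set on nodes up to
    the leaf size, so their children move together as one block.
    """
    tree_level = level_of(tree_size)
    leaf_level = level_of(leaf_size)
    return {
        level: {"isolation": level > tree_level, "motion": level <= leaf_level}
        for level in range(1, level_of(ROOT) + 1)
    }

def geometry(tree_size, leaf_size):
    """Resulting number of prefetch trees per block and leaves per tree."""
    return {
        "trees_per_block": ROOT // tree_size,
        "leaves_per_tree": tree_size // leaf_size,
    }

# Default configuration from the text: 2MB tree, 64KB leaf.
print(geometry(ROOT, 64 * KB))
# Hypothetical small-tree, small-leaf configuration of the kind used for HCLI.
print(geometry(512 * KB, 16 * KB))
```

Under these assumptions the default configuration sets only motion bits (levels 1 and 2) and the small configuration sets only isolation bits (levels 6 and 7), so both shapes come from the same underlying tree without changing the traversal code.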

Bringing It All Together

At kernel launch, the UVM driver registers the VPN ranges of the managed memory objects that the kernel will access in the ATT object table.
While the kernel runs, GPU memory accesses update the access timer and recency order in the ATT attached to the GMMU.
When an object's access count reaches the profiling threshold, ATT sends an interrupt to the driver, and the driver fetches the object's page access timing information through the existing access counter copy path.
APD classifies this information as one of LS/HCHI/HCLI/LC. If the pattern remains undecided after several rounds of profiling, Forest treats the object as the default LC pattern.
Once the pattern is decided, the driver records the result in the pattern table, and PE sets the motion bits and isolation bits in the object's TBNp tree to change its tree size and leaf size.
The driver sets the cease bit in ATT to stop further profiling interrupts for objects whose pattern is already decided.
On subsequent page faults, the UVM driver uses the newly configured tree for fault handling and prefetching.
When oversubscription forces an eviction, the driver finds the least recently accessed object in the ATT object table and evicts the leaf node containing the least recently accessed page within that object.
In short, Forest's runtime loop has ATT record access recency, APD decide the pattern, PE reconfigure the per-object TBNp, and the eviction policy reuse the same recency information.
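
The eviction step in this loop can be sketched as a two-level search. The data layout and the name `pick_victim_leaf` are hypothetical, and `leaf_pages=16` assumes a 64KB leaf with 4KB pages.

```python
def pick_victim_leaf(objects, leaf_pages=16):
    """Two-level victim search: oldest object first, then its oldest page.

    objects maps a name to {"last_access": int, "pages": {vpn: timestamp}},
    where "pages" holds only GPU-resident pages. Returns the object name and
    the first VPN of the leaf that contains the oldest page.
    """
    resident = {name: obj for name, obj in objects.items() if obj["pages"]}
    name = min(resident, key=lambda n: resident[n]["last_access"])
    pages = resident[name]["pages"]
    oldest_vpn = min(pages, key=pages.get)
    leaf_start = (oldest_vpn // leaf_pages) * leaf_pages
    return name, leaf_start

# Hypothetical state: "x" was touched most recently but holds one stale page.
objects = {
    "x": {"last_access": 50, "pages": {0: 5, 1: 50}},
    "y": {"last_access": 30, "pages": {16: 30, 35: 12, 36: 25}},
}

print(pick_victim_leaf(objects))  # ('y', 32)
```

The globally oldest page here belongs to "x", yet the search evicts from "y" because it only descends into the least recently used object. That is the sense in which the policy is a pseudo-LRU: it trades exactness for a smaller search space.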

Result

The evaluation extends UVM-enabled GPGPU-Sim 4.0 and a UVM emulator. The default setup is an NVIDIA Turing-style GPU, 4KB pages, 45us far-fault handling latency, PCIe 3.0 x16, and 150% memory oversubscription. The benchmarks are 15 workloads drawn from UVMBench, InterplayUVM, and Tango, and AlexNet, ResNet50, BERT, and Whisper are additionally evaluated in an Accel-Sim integrated environment.

The main results are as follows.

Overall speedup: Forest achieves an average speedup of 1.72x over baseline TBNp, and SpecForest 1.86x. The abstract and introduction summarize SpecForest's gain over the state of the art as up to about 1.39x.
Far-fault reduction: For linear workloads, a large tree reduces far-faults on sequential accesses that cross the 2MB boundary. For mixed-pattern workloads, applying small trees and leaves to HCHI/HCLI objects reduces unnecessary migration and thrashing.
Thrashing reduction: Applying the optimal tree configuration alone reduces page thrashing by 25% on average, and adding access-aware LRU reduces it by a further 7%.
Profiling overhead reduction: SpecForest cuts the average number of profiling steps from 223 to 10. Pattern recording is especially effective on mixed-pattern benchmarks, and static analysis and similarity detection each add about 2% speedup on average.
Sensitivity: From 125% to 200% oversubscription, SpecForest sustains a 1.57x to 1.95x speedup over the baseline. The gains also hold across Pascal, Volta, Turing, Ampere, and Hopper-style GPU configurations.
Real-world DL: On AlexNet, ResNet50, BERT, and Whisper, SpecForest achieves an average speedup of 1.51x and a maximum of 1.62x. CNNs are dominated by LS, while Transformer models have a larger HCHI share because of the irregular dense accesses of self-attention, which makes per-object heterogeneous prefetching more valuable.

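These results support the paper's thesis that the performance gain does not come from simply making the prefetcher more aggressive, but from a tree shape matched to the access pattern together with eviction based on GPU-side recency, which reduce far-faults and memory thrashing at the same time.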

Contribution

Analyzed, at the workload, kernel, and data object levels, how TBNp's homogeneous configuration causes unnecessary migration and page thrashing.
Proposed object-level access pattern detection by combining ATT, which repurposes the GPU page access counters as an access recency tracker, with the UVM driver-based APD.
Designed a heterogeneous TBNp mechanism that uses isolation bits and motion bits to adjust tree size and leaf size per object while keeping the existing TBNp semantics.
Introduced access counter-based pseudo-LRU eviction, which bases eviction on actual GPU-side access recency instead of far-fault history.
Reduced runtime profiling delay through SpecForest, using compiler-assisted static detection, pattern recording, and similarity group propagation.
Evaluated in simulation that Forest/SpecForest outperform the baseline and prior UVM optimizations on general-purpose GPU benchmarks and real-world DL workloads.

Criticisms

Because it requires hardware modifications, it cannot be used on the setups currently deployed in production.
The paper feels somewhat harder to read than it needs to be. A top-down, component-by-component presentation would likely make it easier to follow. For example, names such as APD, ATT, and Prefetch Engine read like labels introduced mainly for classification (e.g., rename Speculative Forest to Compiler Optimization, rename APD to Forest APIs, insert an overview before the paper's Design section). There is also a lot of repeated explanation.
It may be insensitive to objects whose access pattern changes dynamically at runtime. Once APD decides an object's pattern, Forest sets the cease bit in ATT and stops further profiling interrupts for that object. An intra-kernel phase change, where an object first behaves like LS and later switches to a sparse or irregular pattern within the same kernel execution, may therefore not be tracked well. This is the tradeoff of early classification, which exists to reduce profiling overhead.

Conclusion

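Forest frames the GPU UVM bottleneck less as "how quickly can a page fault be handled" and more as "which data object should be fetched ahead of time, in what unit, and when should it be evicted". From this viewpoint it turns TBNp's tree structure from a fixed policy into a policy substrate that varies by access pattern, and it connects device-side recency to driver decisions.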

The point to remember is that Forest is not a paper that simply makes the prefetcher more aggressive. It raises aggressiveness for LS, lowers it for irregular sparse or high-coverage accesses, and aligns eviction with actual access recency. In other words, it shows by design that under UVM oversubscription, prefetching and eviction share the same memory pressure budget.

Assumptions and Verification

The metadata is based on the ACM reference format and first-page information in the text extracted from the PDF.
The PDF source is `/home/jhyohan/MPDK/mBPF-Usecase/.nori/runs/focused-signal/sources/3695053.3731047.pdf`, and the extracted text is `3695053.3731047.txt` in the same directory.
Some figure and table text had layout noise from the `pdftotext` extraction, so numbers and mechanisms were checked mainly against body sentences and captions.
No MediaWiki search connector was available, so link names were written as reusable concept names that are likely to exist on noriwiki.