CuCheck

개요

Compute Sanitizer는 NVIDIA CUDA 프로그램의 기능적 correctness bug를 런타임에서 찾는 도구 모음이다. 핵심 역할은 GPU kernel 실행 중에 발생한 메모리 접근 오류, shared memory race, 초기화되지 않은 memory read, synchronization primitive 오용을 관찰하고, 가능한 경우 thread/block 좌표와 host/device backtrace로 원인을 돌려주는 것이다.

이 문서는 NVIDIA 공식 문서 v2026.2.1 기준으로 Compute Sanitizer의 사용자 관점 동작 원리를 요약한다. 내부 구현 전체가 아니라 문서에 드러난 실행 모델과 check semantics에 초점을 둔다.

Motivation

CUDA 프로그램은 수천 개 이상의 thread가 같은 kernel 안에서 병렬로 실행되므로, CPU 프로그램보다 오류가 재현되기 어렵다. 특히 다음 오류는 일반적인 crash log만으로 원인을 찾기 힘들다.

잘못된 index 계산으로 인한 global/shared/local memory out-of-bounds 접근
misaligned access, device-side malloc/free 오용, CUDA API error
shared memory를 통한 thread 간 communication에서 생기는 read/write ordering bug
__syncthreads(), __syncwarp(), Cooperative Groups, cuda::barrier의 잘못된 사용
초기화되지 않은 device memory read

Compute Sanitizer는 이런 문제를 실행 중인 CUDA application 위에서 직접 검사한다. 즉, 정적 분석처럼 가능한 모든 경로를 증명하는 도구라기보다, 실제로 실행된 kernel과 CUDA API event를 instrumentation과 runtime metadata로 추적하는 dynamic checking suite에 가깝다.

Main Idea

핵심 아이디어는 target application을 compute-sanitizer frontend 아래에서 실행하거나 attach한 뒤, 선택한 tool의 rule에 맞게 GPU kernel과 CUDA runtime/driver event를 관찰하는 것이다.

기본 모델은 runtime binary instrumentation이다. Compute Sanitizer는 user kernel code를 instrument하여 memory access, synchronization, allocation lifetime 같은 event를 수집한다. NV_COMPUTE_SANITIZER_BINARY_PATCHING 환경변수가 user kernel instrumentation을 제어한다는 점에서 이 동작이 드러난다.

memcheck는 추가로 compile-time instrumentation을 지원한다. nvcc -fdevice-sanitize=memcheck 또는 ptxas -sanitize=memcheck로 instrument된 code를 만들면, 실행 시 Compute Sanitizer가 이를 감지한다. 이 경로는 runtime binary instrumentation보다 빠를 수 있고, pointer가 어느 allocation을 가리키는지 추적하는 base-and-bounds analysis로 인접 allocation 침범을 더 잘 잡을 수 있다. 현재 compile-time instrumentation은 memcheck 전용이며, racecheck, initcheck, synccheck는 runtime binary instrumentation이 필요하다.

실행 모델

기본 실행 형태는 다음과 같다.

compute-sanitizer [options] app_name [app_options]
compute-sanitizer --tool memcheck ./app
compute-sanitizer --tool racecheck ./app
compute-sanitizer --tool initcheck ./app
compute-sanitizer --tool synccheck ./app

실행 모델에서 중요한 점은 다음과 같다.

기본 tool은 memcheck이다.
기본적으로 application의 모든 kernel을 검사하지만, --kernel-name과 --kernel-name-exclude로 kernel scope를 제한할 수 있다.
standalone Compute Sanitizer는 기본적으로 nonblocking mode로 kernel을 launch하므로 concurrent kernel error reporting을 지원한다. --force-blocking-launches는 kernel launch를 직렬화하지만, 보고되는 error 수와 precision에 영향을 줄 수 있다.
source attribution을 좋게 하려면 -lineinfo 또는 -G가 필요하다. host backtrace의 symbol name은 Linux 기준 -Xcompiler -rdynamic 같은 host symbol 보존 option에 영향을 받는다.
error action은 error type에 따라 다르다. host-side API/leak report는 보통 application을 계속 실행시키지만, device-side memory/synchronization error는 kernel 또는 CUDA context termination으로 이어질 수 있다.

Tool 구성

Tool	주된 검사 대상	핵심 원리
`memcheck`	memory access, allocation lifetime, CUDA API error, leak	global/local/shared memory access를 allocation boundary와 alignment rule에 대해 검사한다. precise error는 thread/block 좌표, PC, 접근 주소, 접근 크기, source line을 보고할 수 있다.
`racecheck`	shared memory data race	on-chip shared memory에 대해 WAW, WAR, RAW hazard를 찾는다. 현재 핵심 범위는 shared memory이며, global memory race detector로 이해하면 안 된다.
`initcheck`	uninitialized device memory read	device-side write, `cudaMemcpy`, `cudaMemset` 등으로 초기화되었는지를 추적하여 초기화되지 않은 memory read를 보고한다. 기본 address space는 global memory이며 option으로 shared 또는 all을 선택할 수 있다.
`synccheck`	synchronization primitive misuse	`__syncthreads()`, `__syncwarp()`, Cooperative Groups API, `cuda::barrier` 사용이 architecture와 mask/barrier rule을 만족하는지 검사한다.

Memcheck

memcheck는 Compute Sanitizer의 중심 tool이다. 다음 오류를 주로 다룬다.

out-of-bounds 또는 misaligned global/local/shared/global atomic access
GPU hardware exception
device code의 malloc/free error: double free, invalid pointer free, heap corruption 등
CUDA API error
cudaMalloc allocation leak과 device heap leak
stream-ordered allocation race: cudaMallocAsync/cudaFreeAsync가 stream synchronization 없이 다른 stream에서 사용되는 use-before-alloc, use-after-free

기본 allocation boundary check는 인접한 allocation 사이의 overflow를 놓칠 수 있다. 예를 들어 두 global allocation이 virtual address space에서 붙어 있으면, 첫 번째 buffer overflow가 두 번째 buffer 내부 접근처럼 보일 수 있다. 이를 줄이기 위해 --padding option은 CUDA allocation 뒤에 invalid padding buffer를 붙여 global memory OOB detection을 강화한다. 다만 padding은 device memory pressure를 높일 수 있다.

compile-time memcheck instrumentation의 핵심은 base-and-bounds이다. CUDA allocation API는 서로 다른 allocation의 locality를 보장하지 않으므로, 한 allocation을 가리키는 pointer로 다른 allocation을 접근하는 것은 illegal하다. compile-time instrumentation은 pointer가 참조하는 allocation identity를 추적하여 인접 allocation 접근을 더 잘 보고한다.

Memcheck instrumentation

조사 결론부터 말하면, Compute Sanitizer memcheck를 AddressSanitizer처럼 shadow = (addr >> k) + offset 형태의 공개된 shadow-address mapping으로 설명할 근거는 없다. NVIDIA 공식 문서가 공개하는 것은 다음 세 가지다.

NV_COMPUTE_SANITIZER_BINARY_PATCHING은 Compute Sanitizer가 user kernel code를 instrument하는지를 제어한다.
Sanitizer Patching API는 global/shared/local memory access instruction에 callback patch를 넣을 수 있음을 공개한다.
memcheck compile-time instrumentation은 base-and-bounds analysis로 pointer가 참조하는 allocation identity를 추적한다고 설명한다.

따라서 공개 근거로 쓸 수 있는 내부 모델은 "VA를 shift해서 shadow byte를 직접 읽는 inline check"가 아니라, "memory instruction을 patch하여 callback으로 실제 access metadata를 넘기고, tool runtime이 allocation/bounds/alignment metadata를 조회한다"에 가깝다.

공개 API가 보여주는 patch point

Sanitizer Patching API에는 다음 instrumentation point가 있다.

SANITIZER_INSTRUCTION_GLOBAL_MEMORY_ACCESS
SANITIZER_INSTRUCTION_SHARED_MEMORY_ACCESS
SANITIZER_INSTRUCTION_LOCAL_MEMORY_ACCESS
SANITIZER_INSTRUCTION_DEVICE_SIDE_MALLOC
SANITIZER_INSTRUCTION_DEVICE_SIDE_FREE

global/shared/local memory access는 load, store, atomic instruction에 대응하며 callback type은 다음 형태다.

typedef SanitizerPatchResult (*SanitizerCallbackMemoryAccess)(
    void *userdata,
    uint64_t pc,
    void *ptr,
    uint32_t accessSize,
    uint32_t flags,
    const void *pData);

여기서 pc는 patched instruction의 program counter이고, ptr은 접근 주소이다. local/shared memory access에서는 이 값이 해당 memory window 안의 offset으로 전달된다. accessSize는 1, 2, 4, 8, 16 byte 중 하나이고, flags는 read/write/atomic/scope 정보를 담는다. write의 경우 pData는 쓰려는 새 값에 대한 pointer로 전달된다.

즉 공개 API 기준의 instrumentation은 다음처럼 생긴다.

// Tool initialization side: CUDA module에 patch point를 등록한다.
sanitizerAddPatchesFromFile("memcheck_callbacks.fatbin", ctx);

sanitizerPatchInstructions(SANITIZER_INSTRUCTION_GLOBAL_MEMORY_ACCESS,
                           module, "__memcheck_global_access");
sanitizerPatchInstructions(SANITIZER_INSTRUCTION_SHARED_MEMORY_ACCESS,
                           module, "__memcheck_shared_access");
sanitizerPatchInstructions(SANITIZER_INSTRUCTION_LOCAL_MEMORY_ACCESS,
                           module, "__memcheck_local_access");
sanitizerPatchInstructions(SANITIZER_INSTRUCTION_DEVICE_SIDE_MALLOC,
                           module, "__memcheck_device_malloc");
sanitizerPatchInstructions(SANITIZER_INSTRUCTION_DEVICE_SIDE_FREE,
                           module, "__memcheck_device_free");

sanitizerPatchModule(module);
sanitizerSetCallbackData(kernel, device_side_metadata_buffer);

patched instruction은 개념적으로 다음과 같이 callback을 호출한 뒤 원래 memory instruction을 수행한다고 볼 수 있다. 실제 patch sequence와 register save/restore 방식은 공개되어 있지 않다.

// Original SASS/PTX-level meaning
st.global.u32 [addr], value;

// Public API에서 추론 가능한 patch shape
call __memcheck_global_access(userdata,
                              pc_of_original_store,
                              addr,
                              4,
                              SANITIZER_MEMORY_DEVICE_FLAG_WRITE,
                              &value);
st.global.u32 [addr], value;

callback 내부의 check는 다음처럼 allocation metadata와 access metadata를 결합한다고 보는 편이 공개 자료와 가장 잘 맞는다. 아래 helper 이름과 자료구조는 설명용 pseudo-code이다.

extern "C" __device__
SanitizerPatchResult __memcheck_global_access(void *userdata,
                                              uint64_t pc,
                                              void *ptr,
                                              uint32_t size,
                                              uint32_t flags,
                                              const void *pData) {
    MemcheckState *s = (MemcheckState *)userdata;
    uintptr_t addr = (uintptr_t)ptr;

    AccessKind kind = decode_read_write_atomic(flags);

    AllocationMeta *m = lookup_allocation_containing(s, addr);
    bool in_bounds =
        m != NULL &&
        addr >= m->base &&
        addr + size <= m->base + m->size &&
        m->state == ALLOCATED;

    bool aligned = is_aligned_for_instruction(addr, size, flags);

    if (!in_bounds || !aligned) {
        record_precise_error(s, pc, addr, size, flags,
                             blockIdx, threadIdx);
        return SANITIZER_PATCH_ERROR;
    }

    return SANITIZER_PATCH_SUCCESS;
}

이 모델에서 핵심은 ptr 자체가 callback argument로 전달된다는 점이다. ASan처럼 모든 load/store 앞에 shadow_addr = (addr >> 3) + offset을 inline으로 계산한다는 공개 증거는 없다. 물론 내부 metadata table을 shadow memory처럼 구성할 수는 있지만, 그 mapping 산식은 문서화되어 있지 않다.

Shadow memory와의 관계

ASan의 공개 algorithm은 8 byte application memory를 1 byte shadow memory에 매핑하고, 대표적으로 다음 형태의 check를 사용한다.

shadow_address = (address >> 3) + SHADOW_OFFSET;
shadow_value = *shadow_address;
if (shadow_value != 0 &&
    ((address & 7) + access_size - 1) >= shadow_value) {
    ReportError(address, access_size, is_write);
}

Compute Sanitizer memcheck도 어떤 형태의 shadow/metadata memory를 내부적으로 둘 가능성은 높다. padding, allocation lifetime, initializedness, device heap state, stream-ordered allocation state를 추적하려면 원본 program memory와 별도의 metadata가 필요하기 때문이다. 하지만 "VA를 몇 bit shift해서 shadow address를 얻는다"는 ASan식 mapping을 Compute Sanitizer에 그대로 적용하는 것은 현재 공개 근거로는 부정확하다.

공개 API와 문서로부터 더 보수적으로 쓸 수 있는 모델은 다음이다.

// Compute Sanitizer memcheck에 대해 공개 근거가 있는 수준의 모델
metadata_key = classify_address_space_and_address(ptr, flags);
metadata = lookup_metadata(metadata_key);

check_bounds(metadata, ptr, accessSize);
check_alignment(ptr, accessSize, flags);
check_lifetime_or_stream_order(metadata, current_stream);
report(pc, ptr, accessSize, flags, blockIdx, threadIdx);

만약 내부 구현이 shadow-address 방식이라면, 그 코드는 다음과 비슷한 형태일 수 있다. 그러나 이것은 Compute Sanitizer의 확인된 코드가 아니라 가능한 구현 선택지다.

// Hypothetical only: Compute Sanitizer 문서에는 이 산식이 공개되어 있지 않다.
shadow = shadow_base + ((addr - device_va_base) >> GRANULE_SHIFT);
meta = *shadow;

if (!meta.addressable ||
    access_crosses_granule(addr, size, meta) ||
    meta.alloc_id != expected_alloc_id) {
    report_error(pc, addr, size, flags);
}

Compile-time base-and-bounds

compile-time memcheck instrumentation은 shadow byte 하나만 보는 방식보다 allocation identity에 더 가깝다. NVIDIA 문서는 CUDA allocation API가 서로 다른 allocation의 locality를 보장하지 않으므로, 한 allocation을 가리키는 pointer로 다른 allocation을 접근하는 것은 illegal이며, compiler instrumentation이 pointer가 참조하는 allocation을 추적한다고 설명한다.

이를 pseudo-code로 쓰면 다음과 같다.

__global__ void saxpy_compile_time_model(float *y, const float *x,
                                         float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    Bounds bx = __memcheck_bounds_of_pointer(x);  // base, size, allocation id
    Bounds by = __memcheck_bounds_of_pointer(y);

    const float *xp = &x[i];
    if (!within_same_allocation(xp, sizeof(float), bx) ||
        !is_aligned(xp, sizeof(float))) {
        __memcheck_report_load(CURRENT_PC, xp, sizeof(float));
    }
    float xv = *xp;

    float *yp = &y[i];
    if (!within_same_allocation(yp, sizeof(float), by) ||
        !is_aligned(yp, sizeof(float))) {
        __memcheck_report_store(CURRENT_PC, yp, sizeof(float));
    }
    *yp = a * xv + *yp;
}

이 방식이 중요한 이유는 adjacent allocation 때문이다.

float *a;
float *b;
cudaMalloc(&a, 1024 * sizeof(float));
cudaMalloc(&b, 1024 * sizeof(float));

// Bug: a[1024]가 VA상으로는 b 안에 떨어질 수 있다.
// 단순 "주소가 어떤 allocation 안인가"만 보면 놓칠 수 있다.
// base-and-bounds는 a에서 파생된 pointer가 a의 bounds를 넘었는지를 본다.

따라서 Compute Sanitizer memcheck를 설명할 때는 "ASan처럼 shift 기반 shadow address를 직접 계산한다"고 단정하기보다, "binary patching으로 memory instruction callback을 삽입하고, callback/runtime metadata가 bounds/alignment/lifetime을 검사한다. compile-time mode에서는 pointer-to-allocation identity를 추가로 추적한다"고 쓰는 것이 현재 공개 자료에 맞다.

Racecheck

racecheck는 shared memory를 thread 간 communication buffer로 사용할 때 생기는 ordering bug를 찾는다. 보고하는 canonical hazard는 다음 세 가지다.

WAW: 두 thread가 같은 location에 write하여 최종 값이 실행 순서에 의존한다.
WAR: 한 thread가 read한 location을 다른 thread가 write하면서 ordering assumption이 깨진다.
RAW: write가 commit되기 전에 다른 thread가 같은 location을 read한다.

핵심은 "같은 shared memory location에 대한 access가 synchronization으로 결정적으로 ordering되어 있는가"이다. CUDA kernel에서는 보통 필요한 위치에 __syncthreads() 등을 넣거나, 여러 thread가 같은 shared state를 쓰지 않도록 algorithm을 바꾸는 방식으로 수정한다.

Initcheck

initcheck는 memory access error checker가 아니라 uninitialized read checker이다. 따라서 NVIDIA 문서는 먼저 memcheck로 memory access error가 없는지 확인한 뒤 initcheck를 쓰는 흐름을 권장한다.

기본적으로 global memory의 uninitialized access를 검사한다. --initcheck-address-space shared 또는 --initcheck-address-space all을 사용하면 shared memory도 검사 범위에 넣을 수 있다. --track-unused-memory는 allocation 중 실제로 write되지 않은 영역을 보고하여 over-allocation이나 dead buffer를 찾는 데 사용할 수 있다.

Synccheck

synccheck는 barrier와 warp-level synchronization이 올바르게 사용되는지 확인한다. 대표적으로 다음 오류를 보고한다.

block 또는 warp 안의 divergent thread가 barrier에 도달하는 경우
__syncwarp() mask와 실제 도달 thread 집합이 맞지 않는 경우
Cooperative Groups 또는 cuda::barrier 사용이 필요한 initialization/arrival/wait rule을 만족하지 않는 경우

이 tool도 memory access checker가 아니므로, synchronization error를 보기 전에 memcheck를 먼저 통과시키는 편이 좋다.

사용 패턴

일반적인 debugging sequence는 다음처럼 잡을 수 있다.

nvcc -lineinfo -Xcompiler -rdynamic -o app app.cu
compute-sanitizer --tool memcheck --leak-check full ./app
compute-sanitizer --tool initcheck ./app
compute-sanitizer --tool racecheck ./app
compute-sanitizer --tool synccheck ./app

큰 application에서는 다음 option들이 실용적이다.

--kernel-name, --kernel-name-exclude: 문제가 의심되는 kernel만 검사한다.
--launch-skip, --launch-count: 반복 launch 중 일부만 검사한다.
--print-limit 0: error print 제한을 해제한다.
--save, --read, --xml: 결과를 저장하거나 후처리한다.
--suppressions: 알려진 false positive report를 suppression file로 숨긴다.
--force-synchronization-limit, --force-blocking-launches: memory footprint 문제를 완화하지만 concurrency와 error report 양상을 바꿀 수 있다.

한계와 해석 주의

Compute Sanitizer는 실행된 path를 검사한다. 따라서 test input이 문제 경로를 실행하지 않으면 bug를 찾지 못한다.

memcheck는 invalid GPU access가 실제로 발생했을 때 강하다. 하지만 high-level semantic bug, 예를 들어 host-side shape 계산 오류나 logical sub-allocation boundary 침범은 GPU address가 여전히 어떤 valid allocation 안에 들어가면 직접 보고되지 않을 수 있다. 이런 경우 compile-time base-and-bounds, --padding, 더 작은 focused test가 detection 가능성을 높인다.

racecheck는 shared memory hazard detector이지 모든 CUDA data race detector가 아니다. global memory ordering bug, atomic protocol bug, inter-kernel ordering bug는 별도의 reasoning이나 test가 필요하다.

tool들은 runtime tracking data를 유지하므로 memory footprint가 커질 수 있다. concurrent kernel launch가 많은 application에서는 internal out-of-memory 때문에 일부 launch가 추적되지 않을 수 있으며, 이때는 synchronization limit이나 blocking launch option을 검토해야 한다.

Conclusion

Compute Sanitizer의 본질은 CUDA kernel을 실행하면서 "이 access와 synchronization이 CUDA memory/lifetime/ordering rule을 만족하는가"를 runtime metadata로 검증하는 것이다. memcheck가 allocation boundary와 access attribution을 담당하고, racecheck, initcheck, synccheck가 각각 shared memory ordering, initialization state, synchronization validity를 보완한다. 따라서 CUDA correctness debugging에서는 crash 재현 이후 가장 먼저 돌려볼 dynamic checker로 이해하는 것이 좋다.

참고 자료

NVIDIA Compute Sanitizer Documentation - v2026.2.1, 2026-07-02 확인.
NVIDIA Sanitizer Patching API - memory access callback과 instruction patch point 확인.
NVIDIA Sanitizer Memory API - callback 내부에서 사용할 수 있는 sanitizer-side device/host memory API 확인.
AddressSanitizer Algorithm - shadow memory shift mapping 비교용. Compute Sanitizer의 내부 산식 근거로 사용하면 안 된다.