[Docker] docker nvidia: GPU memory access error? "misaligned address"
I was serving a model with triton + torchscript on an on-premises server with multiple GPUs when it suddenly crashed with memory access errors.

I0624 02:40:37.124328 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
I0624 02:40:37.124367 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
I0624 02:40:37.124393 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
I0624 02:40:37.124409 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
I0624 02:40:37.124435 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
I0624 02:40:37.124461 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
I0624 02:40:37.124467 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
I0624 02:40:37.124467 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
I0624 02:40:37.124462 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
I0624 02:40:37.124461 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
I0624 02:40:37.124520 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
I0624 02:40:37.124569 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
I0624 02:40:37.124534 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
I0624 02:40:37.124653 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
I0624 02:40:37.125171 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
I0624 02:40:37.125211 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
ERROR: [Torch-TensorRT] - 1: [convBaseRunner.cpp::execute::295] Error Code 1: Cask (Cask convolution execution)
I0624 02:40:37.144032 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
I0624 02:40:37.144052 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
ERROR: [Torch-TensorRT] - 1: [convBaseRunner.cpp::execute::295] Error Code 1: Cask (Cask convolution execution)
ERROR: [Torch-TensorRT] - 1: [checkMacros.cpp::catchCudaError::203] Error Code 1: Cuda Runtime (misaligned address)
I0624 02:40:37.150279 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
I0624 02:40:37.150297 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
ERROR: [Torch-TensorRT] - 1: [convBaseRunner.cpp::execute::295] Error Code 1: Cask (Cask convolution execution)
ERROR: [Torch-TensorRT] - 1: [convBaseRunner.cpp::execute::295] Error Code 1: Cask (Cask convolution execution)
ERROR: [Torch-TensorRT] - 1: [checkMacros.cpp::catchCudaError::203] Error Code 1: Cuda Runtime (misaligned address)
I0624 02:40:37.156748 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
I0624 02:40:37.156769 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
ERROR: [Torch-TensorRT] - 1: [checkMacros.cpp::catchCudaError::203] Error Code 1: Cuda Runtime (misaligned address)
I0624 02:40:37.166519 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
I0624 02:40:37.166541 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
ERROR: [Torch-TensorRT] - 1: [checkMacros.cpp::catchCudaError::203] Error Code 1: Cuda Runtime (misaligned address)
I0624 02:40:37.167166 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
I0624 02:40:37.167179 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
I0624 02:40:37.176777 43 libtorch.cc:2487] Failed to capture elapsed time: Internal - Failed to capture elapsed time: misaligned address
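The model itself had been working fine, and then it suddenly started crashing while in use.
I spent quite a while unable to find the cause,
until I realized that the `gpus` option was the problem.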
I don't know exactly why, but specifying devices directly through `gpus` seemed to trigger invalid memory accesses, perhaps because it does not provide strict isolation.
So at the Docker level I just passed `all` so the container can use every GPU, and selected the GPU inside the container through an environment variable.
That fixed it.
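
For reference, the workaround can be sketched roughly like this. This is not the exact command from my setup: the environment variable is assumed to be `CUDA_VISIBLE_DEVICES` (the one the CUDA runtime reads), and `build_run_cmd`, the GPU indices, and the image name `my-triton-image` are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch only: prints a docker command instead of running it.
# Assumption: the in-container selection uses CUDA_VISIBLE_DEVICES.
# build_run_cmd and "my-triton-image" are hypothetical names.

build_run_cmd() {
  # $1: comma-separated GPU indices to use inside the container
  # $2: image name
  # "--gpus all" exposes every GPU at the Docker level;
  # the env var narrows what the process inside actually uses.
  printf 'docker run --rm --gpus all -e CUDA_VISIBLE_DEVICES=%s %s\n' "$1" "$2"
}

build_run_cmd "0,1" "my-triton-image"
```

The point is only that device selection moves out of the `gpus` option and into the container's environment. Adjust the indices and image to your own setup before running the printed command.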