Older driver, newer CUDA Toolkit leads to container startup failure - any configuration workarounds?


Starting with CUDA 11.x, NVIDIA in theory guarantees compatibility between the CUDA Toolkit libraries (typically shipped inside Docker containers) and the driver library libcuda.so (installed on the host). This should hold at least across the minor versions of a single major release (CUDA 11.0 through 11.8), thanks to minor version compatibility.

It should therefore be possible to run containers built with newer CUDA versions on hosts whose pre-installed GPU drivers target older CUDA versions. In practice, though, this does not work: CUDA-enabled containers (including the official nvidia/cuda images) fail to start in such scenarios.

Are there any configuration workarounds that would at least let the containers start (so that apps can be tested for GPU access), given that upgrading the driver libraries on the host is not feasible, and downgrading the containerized CUDA Toolkit is time consuming and would potentially reduce functionality?

CodePudding user response:

Workarounds such as NVIDIA_DISABLE_REQUIRE (recommended by an NVIDIA employee on GitHub here) will ultimately fail to deliver GPU access for your apps (as documented here). You need to synchronize the CUDA versions between the driver libraries (on the host) and the CUDA Toolkit (in the container), by doing one of two things:

  • upgrade the host driver libraries (preferred),
  • downgrade the container CUDA Toolkit.
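
Either way, the first step is to check how far apart the two sides actually are. A rough sketch, assuming Docker and an illustrative nvidia/cuda image tag (nvcc ships in the -devel images, not the -base ones):

# on the host: the nvidia-smi header shows the highest CUDA version the driver supports ("CUDA Version: X.Y")
nvidia-smi
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# in the container: the bundled CUDA Toolkit version (no --gpus needed here, so the startup version check cannot block it)
docker run --rm nvidia/cuda:11.2.2-devel-ubuntu20.04 nvcc --version

If nvcc reports a higher version than the "CUDA Version" shown by nvidia-smi on the host, you are in the mismatched situation described in the question.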

CodePudding user response:

According to the NVIDIA docs, setting this environment variable to true (or 1) should disable the CUDA version check at container startup, and should work within the same major CUDA version (thanks to minor version compatibility):

NVIDIA_DISABLE_REQUIRE=1
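
A minimal sketch of passing it at container start, assuming Docker with the NVIDIA Container Toolkit and an illustrative nvidia/cuda image tag:

# start the container despite the newer Toolkit, then check basic GPU visibility
docker run --rm --gpus all -e NVIDIA_DISABLE_REQUIRE=1 nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi

The variable can just as well be set via ENV in a Dockerfile or in a compose file; it only suppresses the startup version check.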

I must warn you, however, that this workaround works only superficially, letting your container with the mismatched (newer) CUDA Toolkit start (it no longer crashes on the failing CUDA version check). In my case the workaround helped start a container with the CUDA 11.8 Toolkit on a machine with CUDA 11.2 driver libraries. But the workaround ultimately fails as soon as you try to run some ML algorithms on the GPU: they fail to train the model, printing error messages of varying specificity (LightGBM even appears to "work", but at... 0% GPU utilization, i.e. it fails silently). The most specific error message was given by CatBoost:

CatBoostError: catboost/cuda/cuda_lib/cuda_base.h:281: CUDA error 803: system has unsupported display driver / cuda driver combination

while XGBoost errored with a rather misleading message:

XGBoostError: [17:49:24] ../src/gbm/gbtree.cc:554: Check failed: common::AllVisibleGPUs() >= 1 (0 vs. 1) : No visible GPU is found for XGBoost.

(Both of the above algorithms start working correctly on the GPU once the CUDA Toolkit in the container is downgraded to match the CUDA version of the host driver.)
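
To distinguish real GPU training from a silent CPU fallback (as with LightGBM above), a simple check is to watch utilization from the host while training runs; for example:

# refresh GPU utilization and memory usage once per second
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1

Sustained non-zero utilization (and growing memory use) means the library is really running on the GPU; a flat 0% means it silently fell back to the CPU or failed.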
