multiple Tensorflow / CUDA versions again #263

@bertsky

To my knowledge, despite our efforts to work around the Tensorflow dependency hell (each TF version being tied closely to a narrow range of CUDA / Python / Numpy versions, and CUDA in turn depending on certain libcudnn / nvidia-driver versions), we have not yet tackled the problem of providing GPU access to multiple OCR-D processors that rely on different TF versions at the same time.

However, for native installations, the solution is not far away: Since Nvidia put the version numbers into all the package names, it is in principle possible to install multiple versions of CUDA runtime and cuDNN at the same time – as long as they all can agree on a suitable nvidia-driver (which is usually the newest; luckily, this one appears to be largely backwards compatible). The problem is that TF loads the libcudart dynamically and to that end, needs the right version in the dynamic linker/loader's search path. But the CUDA packages seem to only activate the last installed CUDA toolkit in ld.so.conf. This is easily fixed, however:

# get them all
apt install cuda-10-0 cuda-10-1 cuda-10-2 cuda-11-0 cuda-11-1 libcudnn7 libcudnn8
# /etc/ld.so.conf.d/cuda.conf:
/usr/local/cuda-10.0/lib64
/usr/local/cuda-10.0/targets/x86_64-linux/lib
/usr/local/cuda-10.1/lib64
/usr/local/cuda-10.1/targets/x86_64-linux/lib
/usr/local/cuda-10.2/lib64
/usr/local/cuda-10.2/targets/x86_64-linux/lib
/usr/local/cuda-11.0/lib64
/usr/local/cuda-11.0/targets/x86_64-linux/lib
/usr/local/cuda-11.1/lib64
/usr/local/cuda-11.1/targets/x86_64-linux/lib
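
The file above follows a regular pattern (two directories per CUDA version), so it could also be generated instead of written by hand. A minimal sketch, assuming the toolkits live under /usr/local/cuda-X.Y as in the listing; the function name `cuda_ld_conf` and the `VERSIONS` list are only illustrative:

```python
def cuda_ld_conf(versions):
    """Build the contents of an ld.so.conf.d file for the given CUDA versions."""
    lines = []
    for version in versions:
        # Assumed install prefix, matching the listing above.
        prefix = f"/usr/local/cuda-{version}"
        lines.append(f"{prefix}/lib64")
        lines.append(f"{prefix}/targets/x86_64-linux/lib")
    return "\n".join(lines) + "\n"


# The versions installed by the apt command above.
VERSIONS = ["10.0", "10.1", "10.2", "11.0", "11.1"]

# Print to stdout; redirect to /etc/ld.so.conf.d/cuda.conf as root.
print(cuda_ld_conf(VERSIONS), end="")
```

The output would have to be written to /etc/ld.so.conf.d/cuda.conf with root privileges, followed by a run of ldconfig so the linker cache picks up the new directories.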

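This _does_ work (changes under /etc/ld.so.conf.d only take effect after running ldconfig to refresh the linker cache):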

for venv in venv/local/sub-venv/headless-tf*; do . $venv/bin/activate && python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"; done

2021-06-16 22:07:03.732233: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-06-16 22:07:03.755631: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3399905000 Hz
2021-06-16 22:07:03.756510: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4769990 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-06-16 22:07:03.756579: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-06-16 22:07:03.766686: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-06-16 22:07:03.864881: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-16 22:07:03.867961: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x47f8450 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-06-16 22:07:03.867978: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce GTX 1080, Compute Capability 6.1
2021-06-16 22:07:03.868119: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-16 22:07:03.868427: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: 
name: NVIDIA GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.797
pciBusID: 0000:01:00.0
2021-06-16 22:07:03.868621: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-06-16 22:07:03.869537: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-06-16 22:07:03.870339: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-06-16 22:07:03.870557: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-06-16 22:07:03.871640: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-06-16 22:07:03.872446: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-06-16 22:07:03.875056: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-06-16 22:07:03.875163: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-16 22:07:03.875540: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-16 22:07:03.875825: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2021-06-16 22:07:03.875868: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-06-16 22:07:03.876405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-16 22:07:03.876430: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0 
2021-06-16 22:07:03.876435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N 
2021-06-16 22:07:03.876515: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-16 22:07:03.876831: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-16 22:07:03.877132: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:0 with 7611 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
True
2021-06-16 22:07:04.128308: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:From <string>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2021-06-16 22:07:05.085985: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-16 22:07:05.086493: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-06-16 22:07:05.087099: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-06-16 22:07:05.125844: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-16 22:07:05.126506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA GeForce GTX 1080 computeCapability: 6.1
coreClock: 1.797GHz coreCount: 20 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 298.32GiB/s
2021-06-16 22:07:05.126547: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-06-16 22:07:05.140285: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-06-16 22:07:05.140365: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-06-16 22:07:05.142732: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-06-16 22:07:05.143184: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-06-16 22:07:05.146045: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-06-16 22:07:05.148365: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-06-16 22:07:05.148606: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-06-16 22:07:05.148751: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-16 22:07:05.149475: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-16 22:07:05.150075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-06-16 22:07:05.150119: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-06-16 22:07:05.518880: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-16 22:07:05.518911: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2021-06-16 22:07:05.518918: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
2021-06-16 22:07:05.519067: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-16 22:07:05.519449: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-16 22:07:05.519766: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-16 22:07:05.520066: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/device:GPU:0 with 7424 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
True
2021-06-16 22:07:06.659267: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-06-16 22:07:06.683630: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3399905000 Hz
2021-06-16 22:07:06.684752: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x44ef1f0 executing computations on platform Host. Devices:
2021-06-16 22:07:06.684824: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2021-06-16 22:07:06.690851: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-06-16 22:07:06.774100: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-16 22:07:06.774496: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4578500 executing computations on platform CUDA. Devices:
2021-06-16 22:07:06.774513: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): NVIDIA GeForce GTX 1080, Compute Capability 6.1
2021-06-16 22:07:06.774626: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-16 22:07:06.774907: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: NVIDIA GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.797
pciBusID: 0000:01:00.0
2021-06-16 22:07:06.775077: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-06-16 22:07:06.775975: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-06-16 22:07:06.776779: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-06-16 22:07:06.776996: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-06-16 22:07:06.778054: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-06-16 22:07:06.778882: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-06-16 22:07:06.781527: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-06-16 22:07:06.781637: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-16 22:07:06.782008: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-16 22:07:06.782294: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-06-16 22:07:06.782340: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-06-16 22:07:06.782879: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-16 22:07:06.782904: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2021-06-16 22:07:06.782910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2021-06-16 22:07:06.782991: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-16 22:07:06.783319: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-16 22:07:06.783620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/device:GPU:0 with 7611 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
True
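
The deprecation warning in the middle run suggests `tf.config.list_physical_devices('GPU')` as the replacement for `tf.test.is_gpu_available()`. Since the venvs mix old and new TF releases, a check that works for both might look like the sketch below. The helper name `gpu_available` is hypothetical, and taking the module as a parameter is just a choice to keep the function independent of which TF is installed:

```python
def gpu_available(tf):
    """Return True if the given TensorFlow module sees at least one GPU."""
    config = getattr(tf, "config", None)
    list_devices = getattr(config, "list_physical_devices", None)
    if list_devices is not None:
        # Newer releases: the call recommended by the deprecation warning.
        return len(list_devices("GPU")) > 0
    # Older releases without that function: fall back to the deprecated call.
    return bool(tf.test.is_gpu_available())
```

Inside each venv this would be called as `gpu_available(tf)` after `import tensorflow as tf`.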

(I am not entirely sure whether we need all of cuda-XY or just individual parts like cuda-cudart-XY cuda-curand-XY cuda-cusolver-XY cuda-cusparse-XY cuda-cublas-XY cuda-cufft-XY, though.)
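
One way to find out which parts are really needed would be to install only a subset and then test whether the dynamic loader can still resolve every library that TF opened in the log above. A rough sketch using ctypes; the sonames are copied from the dso_loader lines, while the helper names `loadable` and `missing` and the grouping into two lists are my own:

```python
import ctypes

# Sonames taken from the dso_loader lines of the CUDA 10.0 runs in the log.
CUDA10_LIBS = [
    "libcudart.so.10.0", "libcublas.so.10.0", "libcufft.so.10.0",
    "libcurand.so.10.0", "libcusolver.so.10.0", "libcusparse.so.10.0",
    "libcudnn.so.7",
]
# Sonames taken from the dso_loader lines of the CUDA 11 run in the log.
CUDA11_LIBS = [
    "libcudart.so.11.0", "libcublas.so.11", "libcublasLt.so.11",
    "libcufft.so.10", "libcurand.so.10", "libcusolver.so.10",
    "libcusparse.so.11", "libcudnn.so.8",
]


def loadable(soname):
    """Return True if the dynamic loader can find and open the library."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        return False
    return True


def missing(sonames):
    """Return the subset of sonames that cannot be loaded."""
    return [name for name in sonames if not loadable(name)]


for label, libs in (("CUDA 10.0 / cuDNN 7", CUDA10_LIBS),
                    ("CUDA 11 / cuDNN 8", CUDA11_LIBS)):
    print(label, "missing:", missing(libs))
```

Note that this actually loads the libraries into the running process, so it is best run in a fresh interpreter, and an empty list only shows that the loader finds the files, not that TF works with them.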

Thus, all we have to do is document this in the README (and maybe add rules to deps-ubuntu).

For the Docker option, it's the same story: As long as we need to build a fat image accommodating all modules, we have to do the same as above within Docker. Until now, we chose the oldest base image nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04 for ocrd/core-cuda, because we usually needed the TF1 processors to have GPU access more than the TF2 processors. However, with the knowledge from above, we can work our way backwards from an image with the newest nvidia-driver, and install the older CUDA versions in there – via the same extended makefile rules.
