Unable to start Heavy 6.0 in Docker

Hi, we tried to upgrade from OmniSci 5.10.2 to Heavy 6.0 but got stuck at the server initialization.

We normally use a customized image based on heavyai/heavyai-ee-cuda:v6.0.0, but to narrow the problem down I reproduced it with the unmodified base image.

In GCP, we have a VM with an NVIDIA Tesla V100 GPU. On this instance we executed the following commands.

To run the container, I executed:

docker run -it --rm \
  --name heavyai-test \
  --gpus all \
  --entrypoint /bin/bash \
  -v /var/lib/omnisci/omnisci-storage:/omnisci-storage \
  heavyai/heavyai-ee-cuda:v6.0.0

Then I tried to start the server, but a CUDA error stops the process. It appears to be related to a library version mismatch.

root@e309cd4ea29c:/opt/heavyai# /opt/heavyai/bin/heavydb /omnisci-storage/storage --config /omnisci-storage/heavy.conf --log-severity-clog INFO
2022-06-21T05:36:58.420016 I 38 0 0 CommandLineOptions.cpp:2009  Max import threads 32
2022-06-21T05:36:58.420680 I 38 0 0 CommandLineOptions.cpp:2012  cuda block size 0
2022-06-21T05:36:58.420699 I 38 0 0 CommandLineOptions.cpp:2013  cuda grid size  0
2022-06-21T05:36:58.420708 I 38 0 0 CommandLineOptions.cpp:2014  Min CPU buffer pool slab size 268435456
2022-06-21T05:36:58.420717 I 38 0 0 CommandLineOptions.cpp:2015  Max CPU buffer pool slab size 4294967296
2022-06-21T05:36:58.420725 I 38 0 0 CommandLineOptions.cpp:2016  Min GPU buffer pool slab size 268435456
2022-06-21T05:36:58.420733 I 38 0 0 CommandLineOptions.cpp:2017  Max GPU buffer pool slab size 4294967296
2022-06-21T05:36:58.420741 I 38 0 0 CommandLineOptions.cpp:2018  calcite JVM max memory  1024
2022-06-21T05:36:58.420750 I 38 0 0 CommandLineOptions.cpp:2019  HeavyDB Server Port  6274
2022-06-21T05:36:58.420758 I 38 0 0 CommandLineOptions.cpp:2020  HeavyDB Calcite Port  6279
2022-06-21T05:36:58.420766 I 38 0 0 CommandLineOptions.cpp:2021  Enable Calcite view optimize true
2022-06-21T05:36:58.420775 I 38 0 0 CommandLineOptions.cpp:2023  Allow Local Auth Fallback: enabled
2022-06-21T05:36:58.420787 I 38 0 0 CommandLineOptions.cpp:2025  ParallelTop min threshold: 100000
2022-06-21T05:36:58.420795 I 38 0 0 CommandLineOptions.cpp:2026  ParallelTop watchdog max: 20000000
2022-06-21T05:36:58.420803 I 38 0 0 CommandLineOptions.cpp:2028  Enable Data Recycler: enabled
2022-06-21T05:36:58.420811 I 38 0 0 CommandLineOptions.cpp:2031          Use hashtable cache: enabled
2022-06-21T05:36:58.420820 I 38 0 0 CommandLineOptions.cpp:2034                  Total amount of bytes that hashtable cache keeps: 4096 MB.
2022-06-21T05:36:58.420830 I 38 0 0 CommandLineOptions.cpp:2036                  Per-hashtable size limit: 2048 MB.
2022-06-21T05:36:58.420839 I 38 0 0 CommandLineOptions.cpp:2039          Use query resultset cache: enabled
2022-06-21T05:36:58.420848 I 38 0 0 CommandLineOptions.cpp:2042                  Total amount of bytes that query resultset cache keeps: 4096 MB.
2022-06-21T05:36:58.420857 I 38 0 0 CommandLineOptions.cpp:2044                  Per-query resultset size limit: 2048 MB.
2022-06-21T05:36:58.420866 I 38 0 0 CommandLineOptions.cpp:2047                  Use auto query resultset caching: disabled
2022-06-21T05:36:58.420875 I 38 0 0 CommandLineOptions.cpp:2054                  Use query step skipping: enabled
2022-06-21T05:36:58.420884 I 38 0 0 CommandLineOptions.cpp:2056          Use chunk metadata cache: enabled
2022-06-21T05:36:58.420893 I 38 0 0 CommandLineOptions.cpp:2059          Use chunk metadata cache: enabled
2022-06-21T05:36:58.420902 I 38 0 0 CommandLineOptions.cpp:2070                  Runtime UDF/UDTF Registration Policy:  ALLOWED for superusers only
2022-06-21T05:36:58.421933 I 38 0 0 CommandLineOptions.cpp:1503 License will expire at: 2999-12-31 23:59:59+0000 [MODIFIED]
2022-06-21T05:36:58.421983 I 38 0 0 CommandLineOptions.cpp:1514 HeavyDB started with data directory at '/omnisci-storage/storage'
2022-06-21T05:36:58.421997 I 38 0 0 CommandLineOptions.cpp:1524  Server read-only mode is false
2022-06-21T05:36:58.422008 I 38 0 0 CommandLineOptions.cpp:1528  Threading layer: TBB
2022-06-21T05:36:58.422018 I 38 0 0 CommandLineOptions.cpp:1532  Watchdog is set to true
2022-06-21T05:36:58.422027 I 38 0 0 CommandLineOptions.cpp:1533  Dynamic Watchdog is set to false
2022-06-21T05:36:58.422037 I 38 0 0 CommandLineOptions.cpp:1537  Runtime query interrupt is set to true
2022-06-21T05:36:58.422046 I 38 0 0 CommandLineOptions.cpp:1539  A frequency of checking pending query interrupt request is set to 1000 (in ms.)
2022-06-21T05:36:58.422057 I 38 0 0 CommandLineOptions.cpp:1541  A frequency of checking running query interrupt request is set to 0.1 (0.0 ~ 1.0)
2022-06-21T05:36:58.422075 I 38 0 0 CommandLineOptions.cpp:1544  Non-kernel time query interrupt is set to true
2022-06-21T05:36:58.422085 I 38 0 0 CommandLineOptions.cpp:1547  Debug Timer is set to false
2022-06-21T05:36:58.422094 I 38 0 0 CommandLineOptions.cpp:1548  LogUserId is set to false
2022-06-21T05:36:58.422104 I 38 0 0 CommandLineOptions.cpp:1549  Maximum idle session duration 60
2022-06-21T05:36:58.422114 I 38 0 0 CommandLineOptions.cpp:1550  Maximum active session duration 43200
2022-06-21T05:36:58.422129 I 38 0 0 CommandLineOptions.cpp:1551  Maximum number of sessions -1
2022-06-21T05:36:58.422139 I 38 0 0 CommandLineOptions.cpp:1553 Legacy delimited import is set to true
2022-06-21T05:36:58.422149 I 38 0 0 CommandLineOptions.cpp:1555 Legacy parquet import is set to false
2022-06-21T05:36:58.422158 I 38 0 0 CommandLineOptions.cpp:1558 FSI ODBC import is set to true
2022-06-21T05:36:58.422168 I 38 0 0 CommandLineOptions.cpp:1560 FSI regex parsed import is set to true
2022-06-21T05:36:58.422178 I 38 0 0 CommandLineOptions.cpp:1562 Allowed import paths is set to ["/omnisci-storage"]
2022-06-21T05:36:58.422187 I 38 0 0 CommandLineOptions.cpp:1563 Allowed export paths is set to ["/omnisci-storage"]
2022-06-21T05:36:58.422264 I 38 0 0 DdlUtils.cpp:823 Parsed allowed-import-paths: (/omnisci-storage/storage/import /omnisci-storage)
2022-06-21T05:36:58.422294 I 38 0 0 DdlUtils.cpp:823 Parsed allowed-export-paths: (/omnisci-storage/storage/export /omnisci-storage)
2022-06-21T05:36:58.422337 I 38 0 0 CommandLineOptions.cpp:1634 Disk cache enabled for foreign tables only
2022-06-21T05:36:58.422350 I 38 0 0 CommandLineOptions.cpp:1688 Vacuum Min Selectivity: 0.1
2022-06-21T05:36:58.422362 I 38 0 0 CommandLineOptions.cpp:1690 Enable system tables is set to true
2022-06-21T05:36:58.422371 I 38 0 0 CommandLineOptions.cpp:1699 Enable FSI is set to true
2022-06-21T05:36:58.422386 I 38 0 0 HeavyDB.cpp:430 HeavyDB starting up
2022-06-21T05:36:58.426400 I 38 0 0 DBHandler.cpp:376 OmniSci Server 6.0.0-20220418-d4d1c2a42c
2022-06-21T05:36:58.539412 I 38 0 0 CudaMgr.cpp:369 Using 1 Gpus.
2022-06-21T05:36:58.539662 I 38 0 0 CudaMgr.cpp:68 Warming up the GPU JIT Compiler... (this may take several seconds)
2022-06-21T05:36:58.644111 F 38 0 0 NvidiaKernel.cpp:95 Check failed: cuLinkAddFile_v2( link_state, CU_JIT_INPUT_FATBINARY, gpu_rt_path.c_str(), 0, nullptr, nullptr) == CUDA_SUCCESS (222 == 0) ptxas application ptx input, line 9; fatal   : Unsupported .version 7.4; current version is '7.3'
2022-06-21T05:36:59.425464 I 38 0 1 HeavyDB.cpp:380 Interrupt signal (6) received.
Aborted (core dumped)

Any idea why this is happening? I would expect the libraries bundled in the base image to be compatible and tested, unless this problem is related to the underlying hardware, namely the GPU.

Hi,

This kind of error generally means that you are using an outdated driver, or that there is a mismatch between CUDA and the driver itself.
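To see the mismatch at a glance, the two PTX versions can be pulled out of the fatal line in the log. This is only a minimal sketch: the function name ptx_versions is made up for illustration, and it assumes the message keeps the wording shown in your log.

```shell
# Print "<required> <supported>" PTX versions found in a ptxas error line.
ptx_versions() {
  # $1: a log line containing the "Unsupported .version" message
  printf '%s\n' "$1" |
    sed -n "s/.*Unsupported \.version \([0-9.]*\); current version is '\([0-9.]*\)'.*/\1 \2/p"
}

line="fatal   : Unsupported .version 7.4; current version is '7.3'"
ptx_versions "$line"   # prints: 7.4 7.3
```

The first number is what the bundled GPU runtime was built for, the second is what the installed driver can accept. Lines without the message print nothing.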

The minimum supported driver version is 470; as far as I know, this requirement has changed since 5.10. Could you give me the output of nvidia-smi?

Depending on the image and the OS, this command should be the right one:

sudo docker run --gpus=all \
--rm nvidia/cuda:11.0-runtime-ubuntu20.04 nvidia-smi
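
If it helps, here is a minimal sketch for checking the host driver against the 470 minimum. The function name driver_meets_minimum is hypothetical; the nvidia-smi query flags are the standard ones, and the check only runs when nvidia-smi is installed.

```shell
# Succeeds when the driver's major version is at least the given minimum.
driver_meets_minimum() {
  # $1: full driver version string as printed by nvidia-smi
  # $2: minimum major version, e.g. 470
  major="${1%%.*}"
  [ "$major" -ge "$2" ]
}

# Query the host driver only when nvidia-smi is actually available.
if command -v nvidia-smi >/dev/null 2>&1; then
  version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  if driver_meets_minimum "$version" 470; then
    echo "driver $version is new enough"
  else
    echo "driver $version is older than 470"
  fi
fi
```

Only the major number is compared, which should be enough here since the requirement is stated per driver branch.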

Thanks for your reply. Here’s the output of the command you suggested:

[root@omnisci-prod-0-vm omnisci-storage]# sudo docker run --gpus=all \
> --rm nvidia/cuda:11.0-runtime-ubuntu20.04 nvidia-smi
Unable to find image 'nvidia/cuda:11.0-runtime-ubuntu20.04' locally
11.0-runtime-ubuntu20.04: Pulling from nvidia/cuda
d72e567cc804: Pull complete
0f3630e5ff08: Pull complete
b6a83d81d1f4: Pull complete
651c4abefb41: Pull complete
dfde59c9d941: Pull complete
9b2bcdc98b8a: Pull complete
3c0d268a007b: Pull complete
598190a71a49: Pull complete
Digest: sha256:74be12403e480fe1120f2fc16efef36fa4cb0165d3a3c96d2c09d8652b7312ef
Status: Downloaded newer image for nvidia/cuda:11.0-runtime-ubuntu20.04
Tue Jun 21 15:10:09 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Tesla V1...  On   | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    22W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Additionally, here are the OS details of the VM instance.

[root@omnisci-prod-0-vm omnisci-storage]# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Hi,

Unfortunately, the 465 drivers aren’t supported anymore, so you need to upgrade to at least 470 (CUDA version 11.4) to run version 6.0 of the software.
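
For context, the two PTX versions in the error line up with these driver branches. The sketch below is only a lookup for the two versions that appear in the log; the function name ptx_requirements is made up, and the pairings reflect my reading of NVIDIA's PTX release notes, so treat them as an assumption to verify.

```shell
# Map a PTX ISA version to the CUDA release and driver branch that introduced it.
# Only the two versions from the error message are covered.
ptx_requirements() {
  case "$1" in
    7.3) echo "CUDA 11.3, driver branch 465" ;;
    7.4) echo "CUDA 11.4, driver branch 470" ;;
    *)   echo "unknown"; return 1 ;;
  esac
}

ptx_requirements 7.4   # prints: CUDA 11.4, driver branch 470
```

So the image asks for PTX 7.4, while a 465 driver stops at 7.3.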

We tested 6.0 with the 495 and 510 drivers, but 470 and 510 are somewhat more popular.

Can you upgrade your drivers, or are you using specific software that needs 465?

Regards,
Candido

I see. No, we only use this instance to run Heavy software, so I’ll proceed to upgrade the drivers now. I’m not very familiar with Nvidia or GPU resource management in the cloud, so I assumed the drivers in use came from the container libraries, not from the ones shipped with the OS.

By any chance, do you know which packages need to be upgraded in CentOS 7? In either case, I’ll read through the GCP documentation and post the solution if I find it first.

Thanks for your help!

Hi,

There are some instructions for upgrading the drivers on our website, although the best source would be the Nvidia website.

On Ubuntu, I personally use the apt command, and I think you can do the same on CentOS with yum.

While talking with a colleague who was working on your issue in Zendesk, we noticed that you are using /omnisci-storage as the mount point; that has been changed to /var/lib/heavyai in version 6.0,

so you will probably have to change

-v /var/lib/omnisci/omnisci-storage:/omnisci-storage \

into

-v /var/lib/omnisci/omnisci-storage:/var/lib/heavyai \
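
Put together, the full command would look roughly like the sketch below. It reuses the flags from your original docker run, leaves out the --entrypoint override, and only prints the command so it can be reviewed before running. The function name heavy_docker_cmd is hypothetical. I am also assuming that any paths passed to heavydb or set in heavy.conf that start with /omnisci-storage would need to move to /var/lib/heavyai as well.

```shell
# Build the docker command with the 6.0 mount point and print it for review.
heavy_docker_cmd() {
  # $1: host directory that holds the existing storage
  echo "docker run -it --rm --name heavyai-test --gpus all" \
       "-v $1:/var/lib/heavyai" \
       "heavyai/heavyai-ee-cuda:v6.0.0"
}

heavy_docker_cmd /var/lib/omnisci/omnisci-storage
```

Once the printed command looks right, it can be pasted into the shell as is.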

Regards,
Candido.

Thank you Candido, the Heavy server is running successfully now.

In the end, I needed to upgrade the host OS and the Nvidia drivers got updated to a very recent version. After a reboot, the new drivers were recognized.

[root@omnisci-prod-0-vm omnisci-storage]# docker run --gpus=all --rm nvidia/cuda:11.0-runtime-ubuntu20.04 nvidia-smi
Tue Jun 21 15:54:56 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    40W / 300W |    312MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

By the way, I had some problems trying to upgrade the Nvidia packages in CentOS. I found out that I needed to update the public GPG keys from Nvidia and found this post useful: "Updating the CUDA Linux GPG Repository Key" (#49 by kmittman, Technical Blog, NVIDIA Developer Forums)

Hi,

I thought I had sent you a private message with instructions on how to upgrade the drivers, but I was probably wrong.

That said, I’m happy that everything is working now.

Candido