When I try to start the OmniSci server, it enters CPU-only mode because it can't instantiate CudaMgr.
This is the error shown: “DBHandler.cpp:266 Unable to instantiate CudaMgr, falling back to CPU-only mode. CUDA Error (999): unknown error”
That happens because the system cannot correctly detect the GPUs.
Could you post the output of the nvidia-smi command?
What system are you on (OS, hardware)?
Which version of OmniSciDB have you installed?
After the CUDA unknown error, are you getting a message along the lines of "no GPUs detected"?
I am sorry to ask so many questions, but the 999 error is quite generic.
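If it helps, here is a rough sketch of a script that gathers most of what I asked for in one go: the OS details and the nvidia-smi output. The function name `collect_gpu_diagnostics` is just something I made up for illustration, and it does not look up the OmniSciDB version, so please still post that separately.

```python
import platform
import shutil
import subprocess


def collect_gpu_diagnostics():
    """Gather OS details and the raw nvidia-smi output into a dict."""
    report = {
        "os": platform.platform(),
        "machine": platform.machine(),
        "nvidia_smi": None,
    }
    # nvidia-smi is installed with the NVIDIA driver; it may not be on PATH.
    smi = shutil.which("nvidia-smi")
    if smi is None:
        report["nvidia_smi"] = "nvidia-smi not found on PATH"
        return report
    try:
        result = subprocess.run([smi], capture_output=True, text=True, timeout=30)
        # When the driver is unhealthy the message usually lands on stderr.
        report["nvidia_smi"] = result.stdout or result.stderr
    except (OSError, subprocess.TimeoutExpired) as exc:
        report["nvidia_smi"] = f"nvidia-smi failed to run: {exc}"
    return report


if __name__ == "__main__":
    for key, value in collect_gpu_diagnostics().items():
        print(f"== {key} ==\n{value}\n")
```

Pasting the printed report into your reply should cover the first two questions.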
I did some tests, also using a driver similar to yours (455.23.04; I can't find the 05 anywhere), and I can't reproduce your issue.
It looks like something is preventing you from using the GPUs. We recently had trouble with Nvidia Fabric Manager on DGX and HGX systems, but I don't think your system has an NVLink switch; maybe I'm wrong.
Which kind of hardware are you using? Is it an on-premise physical machine or an AWS instance? (On an AWS instance I could reproduce it.)
Also, the 999 could mean that the Nvidia driver is in a bad state and a reboot (or a driver reset) is needed. Can you reboot the machine and try again?
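To check whether the driver itself is unhealthy, independent of OmniSci, a small probe like the one below may help. It is only a sketch: it assumes the driver library is loadable as `libcuda.so.1` (the usual name on Linux) and calls the CUDA driver API functions `cuInit` and `cuDeviceGetCount` through ctypes. The helper name `probe_cuda_driver` is hypothetical. In the driver API, `CUDA_ERROR_UNKNOWN` has the value 999, so if this probe also returns 999 the problem most likely sits below OmniSci.

```python
import ctypes

CUDA_SUCCESS = 0
CUDA_ERROR_UNKNOWN = 999  # same numeric code as in the log message above


def probe_cuda_driver(lib_name="libcuda.so.1"):
    """Return (status, device_count_or_None), or None if the library can't be loaded."""
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError:
        return None  # driver library missing or not loadable

    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    status = libcuda.cuInit(0)
    if status != CUDA_SUCCESS:
        return (status, None)

    count = ctypes.c_int(0)
    libcuda.cuDeviceGetCount.argtypes = [ctypes.POINTER(ctypes.c_int)]
    libcuda.cuDeviceGetCount.restype = ctypes.c_int
    status = libcuda.cuDeviceGetCount(ctypes.byref(count))
    return (status, count.value if status == CUDA_SUCCESS else None)


if __name__ == "__main__":
    outcome = probe_cuda_driver()
    if outcome is None:
        print("Could not load libcuda.so.1; is the NVIDIA driver installed?")
    elif outcome[0] == CUDA_ERROR_UNKNOWN:
        print("cuInit/cuDeviceGetCount returned 999; try a reboot or driver reset.")
    else:
        print(f"status={outcome[0]}, device count={outcome[1]}")
```

Running it before and after the reboot would show whether the reboot actually cleared the driver state.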
I had the same issue on a Dell Cauldron with 8x T4s and an HPE Apollo 6500 with 8x A100s sxm2. Running Ubuntu 20.04, I uninstalled all nvidia-* and cuda-* packages that had been installed with apt, then rebooted. I then installed the latest 460.73.01 driver from the run file, and cuda-toolkit-11.2 from the run file so that it did not install drivers. Next I installed nvidia-fabricmanager and started its service daemon, started the nv-hostengine and persistenced daemons, and made sure the post-installation actions from the CUDA Toolkit Documentation (Installation Guide Linux) were completed. Finally I rebooted and started omnisci_server.
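For anyone repeating these steps, a quick check along these lines might confirm the daemons are up before starting omnisci_server. Treat it as a sketch: the systemd unit names `nvidia-fabricmanager` and `nvidia-persistenced` are my assumption of what the packages register and may differ on your system, nv-hostengine is not covered, and `unit_is_active` is a made-up helper name.

```python
import shutil
import subprocess

# Assumed unit names; adjust to whatever your packages actually install.
UNITS = ["nvidia-fabricmanager", "nvidia-persistenced"]


def unit_is_active(unit):
    """Return True/False from `systemctl is-active`, or None if it can't be run."""
    systemctl = shutil.which("systemctl")
    if systemctl is None:
        return None  # no systemd tooling on this machine
    try:
        result = subprocess.run(
            [systemctl, "is-active", "--quiet", unit],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=15,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # `systemctl is-active` exits with 0 only when the unit is active.
    return result.returncode == 0


if __name__ == "__main__":
    for unit in UNITS:
        print(f"{unit}: {unit_is_active(unit)}")
```

If a unit shows up as False, starting it and rebooting as described above would be the next thing to try.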