Tried to install this container but seem to have directory issue #5

@Bodge-IT

Running CoreOS as a VM on an XCP-ng Xen server with full PCI device passthrough. The GPU is a Dell-branded Nvidia Quadro P400.

When I run docker run --name=nvidia-drivers -v /:/rootfs --privileged bugroger/coreos-nvidia-driver:2135.5.0-390.77-geforce, I get:

+ ROOT_MOUNT_DIR=/root
+ NVIDIA_DRIVER_VERSION=390.77
+ NVIDIA_DRIVER_COREOS_VERSION=2135.5.0
+ NVIDIA_PRODUCT_TYPE=geforce
+ [[ ! -f /root/etc/os-release ]]
+ error 'File /root/etc/os-release not found, /etc/os-release must be mounted into this container.'
/install.sh: line 20: error: command not found
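
Reading the trace, the script looks for the host filesystem under /root (ROOT_MOUNT_DIR=/root), not /rootfs, and the `error` helper it calls is not defined at that point, hence "command not found". Below is a minimal sketch of how such a check could look with a helper in place. This is not the actual install.sh: the function names `error` and `check_root_mount` are my own assumptions for illustration.

```shell
#!/usr/bin/env bash

# Hypothetical helper: the trace calls `error`, but no such command exists.
error() {
  echo "[ERROR] $*" >&2
}

# Hypothetical wrapper around the os-release check seen in the trace.
# $1 is the directory where the host root filesystem is mounted.
check_root_mount() {
  local root_mount_dir="${1:-/root}"
  if [[ ! -f "${root_mount_dir}/etc/os-release" ]]; then
    error "File ${root_mount_dir}/etc/os-release not found, /etc/os-release must be mounted into this container."
    return 1
  fi
  echo "Found ${root_mount_dir}/etc/os-release"
}
```

With the host root mounted at /rootfs, `check_root_mount /root` would fail in the same place as the trace above; it would only pass if the mount target matches the directory the script checks.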

So I changed the mount in the docker run command to "-v /:/root" and that seemed to work, but when I test for nvidia-smi:
nvidia-smi
-bash: nvidia-smi: command not found

So I'm pretty sure something is not right.

Update: OK, so I realised I was hitting the module loading issue reported here. After applying @rikatz's modprobe fix, my docker build gets past the module insertion step (although the fix does not persist after a reboot), but it now fails with:

Unable to determine the device handle for GPU 0000:00:05.0: Unknown Error
+ umount /lib/modules/4.19.50-coreos-r1/video
+ umount /usr/lib/x86_64-linux-gnu
+ umount /usr/bin

Is this related to my device? Further up in the log I can see:
[INFO 2019-07-22 12:20:24 UTC] Driver compatible! NVIDIA 390.77 (geforce) compiled for CoreOS 2135.5.0
Checking for the nvidia modules:

lsmod | grep -i nvidia
nvidia_modeset       1110016  0
nvidia_drm             16384  0
nvidia_uvm            884736  0
nvidia              14393344  2 nvidia_uvm,nvidia_modeset
ipmi_msghandler        57344  2 ipmi_devintf,nvidia
i2c_core               61440  3 nvidia,psmouse,i2c_piix4

but nvidia-smi is still not found on the system.

I get this from dmesg on CoreOS:

[ 1076.714161] nvidia: module license 'NVIDIA' taints kernel.
[ 1076.718835] Disabling lock debugging due to kernel taint
[ 1076.727040] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 1076.737696] nvidia: Unknown symbol ipmi_create_user (err -2)
[ 1076.744516] nvidia: Unknown symbol ipmi_destroy_user (err -2)
[ 1076.749470] nvidia: Unknown symbol ipmi_validate_addr (err -2)
[ 1076.754442] nvidia: Unknown symbol ipmi_free_recv_msg (err -2)
[ 1076.759226] nvidia: Unknown symbol ipmi_set_my_address (err -2)
[ 1076.764128] nvidia: Unknown symbol ipmi_request_settime (err -2)
[ 1076.769042] nvidia: Unknown symbol ipmi_set_gets_events (err -2)
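
All of the unresolved symbols above are ipmi_* functions, which the kernel's ipmi_msghandler module exports, and err -2 is ENOENT (symbol not found at load time). To pull the symbol list out of a long dmesg dump, something like the following could help. It is a rough sketch: `unknown_symbols` is a name I made up, and the pattern assumes the log lines look exactly like the ones pasted here.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read dmesg-style text on stdin and print the unique
# symbol names that the given module failed to resolve.
# $1 is the module name as it appears in the log prefix, e.g. "nvidia".
unknown_symbols() {
  local module="$1"
  sed -n "s/.*${module}: Unknown symbol \([A-Za-z0-9_]*\) (err [-0-9]*).*/\1/p" | sort -u
}
```

Usage would be `dmesg | unknown_symbols nvidia`. If every name it prints starts with ipmi_, then loading ipmi_msghandler before the nvidia module (for example with `modprobe ipmi_msghandler`) is presumably what resolves them, which would match the lsmod output above where ipmi_msghandler is listed as used by nvidia.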

and then this further down...

[ 1094.540821] xen: --> pirq=16 -> irq=36 (gsi=36)
[ 1094.541314] nvidia 0000:00:05.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 1094.549301] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  390.77  Tue Jul 10 18:28:52 PDT 2018 (using threaded interrupts)
[ 1095.011419] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 245
[ 1095.056452] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  390.77  Tue Jul 10 22:10:46 PDT 2018
[ 1095.107396] NVRM: RmInitAdapter failed! (0x23:0x56:470)
[ 1095.112302] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 1095.129106] NVRM: RmInitAdapter failed! (0x23:0x56:470)
[ 1095.133799] NVRM: rm_init_adapter failed for device bearing minor number 0

I also tested the stable-396.44-tesla drivers and got the same issue.
