A Compute node with four NVIDIA A100 GPUs and one Mellanox InfiniBand adapter is part of an OpenStack Platform deployment.

The Compute node has been configured to provide its GPUs to VMs in passthrough. Scheduling a VM with a passthrough GPU succeeds; however, creating the VM fails on the first attempt.

On the GPU Compute node, we found the following error message in /var/log/containers/nova/nova-compute.log:

2021-07-05 18:32:00.954 7 ERROR nova.compute.manager [instance: 6c1c1fec-8da7-41cc-809f-069fb3dc49ed] 2021-09-08T11:01:26.416056Z qemu-kvm: -device vfio-pci,host=0000:2f:00.0,id=hostdev0,bus=pci.0,addr=0x5: vfio 0000:2f:00.0: group 19 is not viable
2021-09-08 13:02:00.954 7 ERROR nova.compute.manager [instance: 6c1c1fec-8da7-41cc-809f-069fb3dc49ed] Please ensure all devices within the iommu_group are bound to their vfio bus driver.

We discovered that GPU0 (PCI device 0000:2f:00.0) is in the same IOMMU group (group 19) as the InfiniBand device on the Compute host (PCI device 0000:25:00.0).

If our understanding is correct, all PCI devices in the same IOMMU group must be assigned together, as a block, either to the host or to a single guest VM.
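
To check which devices share an IOMMU group, the group membership can be read from sysfs. The following is a minimal sketch, not the exact procedure we used: the function name iommu_group_members and the SYSFS_ROOT variable are hypothetical, and SYSFS_ROOT exists only so the function can be pointed at a test directory (on a real host it defaults to /sys).

```shell
# List every PCI device that shares an IOMMU group with the given device.
# Hypothetical helper: SYSFS_ROOT is an illustrative override, default /sys.
iommu_group_members() {
    local dev="$1"
    # Each PCI device directory holds an "iommu_group" symlink to its group.
    local group_link="${SYSFS_ROOT:-/sys}/bus/pci/devices/${dev}/iommu_group"

    if [ ! -e "${group_link}" ]; then
        echo "no IOMMU group found for ${dev}" >&2
        return 1
    fi

    # The group's "devices" directory has one entry per member device.
    local member
    for member in "${group_link}"/devices/*; do
        basename "${member}"
    done
}
```

Calling `iommu_group_members 0000:2f:00.0` on the GPU Compute node prints one PCI address per line. In the situation described above, the list would be expected to contain both the GPU and the InfiniBand device.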

Indeed, after unbinding the InfiniBand device from its mlx5_core driver on the host with the command below, we successfully created the instance.

echo -n "0000:25:00.0" > /sys/bus/pci/drivers/mlx5_core/unbind
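
Before and after unbinding, it can be useful to confirm which kernel driver currently holds a device. The sketch below reads the driver symlink from sysfs. The function name current_driver and the SYSFS_ROOT override are hypothetical and only there for illustration and testing.

```shell
# Print the kernel driver bound to a PCI device, or "none" if unbound.
# Hypothetical helper: SYSFS_ROOT is an illustrative override, default /sys.
current_driver() {
    local dev="$1"
    # A bound device has a "driver" symlink pointing at its driver directory.
    local link="${SYSFS_ROOT:-/sys}/bus/pci/devices/${dev}/driver"

    if [ -L "${link}" ]; then
        basename "$(readlink "${link}")"
    else
        echo "none"
    fi
}
```

For example, `current_driver 0000:25:00.0` should report mlx5_core while the InfiniBand device is still bound to its host driver, and none once it has been unbound.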

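A possible solution is to set a persistent driver override for the InfiniBand device with driverctl, so that the device is no longer claimed by its host driver after a reboot. The device can be bound to vfio-pci:

sudo driverctl set-override 0000:25:00.0 vfio-pci

or, alternatively, to pci-stub:

sudo driverctl set-override 0000:25:00.0 pci-stub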


Start-up problem when using A100 GPU in pass-through
