File systems became read-only on virtual guests on VMware

The file system entered read-only mode, and the following SCSI I/O errors were recorded in syslog:

kernel: sd 0:0:0:0: timing out command, waited 1080s
kernel: sd 0:0:0:0: Unhandled error code
kernel: sd 0:0:0:0: SCSI error: return code = 0x06000008
kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK

If the storage failover completes before the SCSI timeout expires, I/O proceeds as usual. Increasing the timeout gives the failover more time to complete.

Increase the SCSI timeout of each disk presented from VMware

When increasing the device timeout value, also increase the hung task timeout value to roughly two or three times the new device timeout. This delays the hung task logic until after the I/O has timed out and been retried at least once. There are several ways to set the device timeout.

Option 1

Install the most recent version of VMware-tools, which includes udev rules that modify the SCSI timeout default. The rules are often available in the file 99-vmware-scsi-udev.rules in /etc/udev/rules.d.

Option 2

Add a command to /etc/rc.d/rc.local that sets the timeout value of all system disks during system startup.

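# cat /etc/rc.d/rc.local
#!/bin/sh
#
# This script will be executed *after* all the other init scripts.
# You can put your own initialization stuff in here if you don't
# want to do the full Sys V style init stuff.

touch /var/lock/subsys/local

# Increase SCSI timeout on all SCSI disks attached to the system at boot time.
# A redirection target cannot be a glob that matches several disks, so write
# to each timeout file in turn.
for timeout_file in /sys/block/sd*/device/timeout; do
    if [ -w "$timeout_file" ]; then
        echo 180 > "$timeout_file"
    fi
done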
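
To confirm that the boot-time setting took effect, the current timeout of every disk can be listed. The following is a minimal sketch only: the function name show_scsi_timeouts and the SYSFS_BLOCK override are illustrative assumptions, not part of any VMware or CentOS tooling.

```shell
#!/bin/sh
# Print "<disk> <timeout-in-seconds>" for every sd* disk.
# SYSFS_BLOCK is a hypothetical override so the function can be pointed at
# a test directory; it defaults to the real /sys/block.
show_scsi_timeouts() {
    root="${SYSFS_BLOCK:-/sys/block}"
    for timeout_file in "$root"/sd*/device/timeout; do
        # Skip the unexpanded glob when no sd* disks exist.
        [ -r "$timeout_file" ] || continue
        disk=${timeout_file#"$root"/}   # strip the leading directory
        disk=${disk%%/*}                # keep only the disk name, e.g. sda
        printf '%s %s\n' "$disk" "$(cat "$timeout_file")"
    done
}

show_scsi_timeouts
```

Each line of output pairs a disk name with its timeout in seconds, so a disk that still shows the old default stands out immediately.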

Option 3

Make your own udev rule to modify the timeout each time a disk is connected to the system. This applies to both the boot process and system operation.

For CentOS 7 and CentOS 8 (systemd):

# cat /etc/udev/rules.d/99-scsi-timeout.rules
KERNEL=="sd*", ENV{DEVTYPE}=="disk", ATTRS{vendor}=="VMware*", ATTRS{model}=="Virtual [Dd]isk*", ATTR{device/timeout}="180"

For CentOS 4, 5, and 6 (SysV init):

# cat /etc/udev/rules.d/99-scsi-timeout.rules
KERNEL=="sd*", SYSFS{vendor}=="VMware", SYSFS{model}=="Virtual [Dd]isk", NAME="%k", PROGRAM="/bin/sh -c 'echo 180 > /sys/block/%k/device/timeout'"

Option 4

Use device-mapper-multipath with the queue_if_no_path option set, even if there is only one path. This will stop errors from spreading back to the filesystem by forcing the device mapper to retry failed I/Os endlessly until they succeed.

The drawback to this strategy is that if the storage never returns, there will be I/O requests locked in the multipath layer, making it impossible to unmount the disk without restarting the OS. Additionally, more data will likely be lost since more data may be queued before any program detects an issue.

Maximum allowed device timeout value

A suggested starting point is 180 seconds. Larger values may be required depending on the storage type, the number of guests, their priority, the total I/O burst load across all guests, and other variables.

The kernel's maximum timeout value is 0x7FFFFFFF. The internal kernel variable is in milliseconds, however, while the sysfs value is in seconds.

0x7FFFFFFF = 2147483647 ms; 2147483647 / 1000 = 2147483 seconds, or about 25 days.

# echo 2147483 > /sys/block/sdc/device/timeout

# cat /sys/block/sdc/device/timeout

2147483
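
The limit above can be reproduced with plain shell arithmetic. This is only a sketch of the calculation; the variable names are illustrative.

```shell
#!/bin/sh
# Largest value the kernel's internal timeout variable can hold, in milliseconds.
max_timeout_ms=$(( 0x7FFFFFFF ))
# Integer division converts milliseconds to whole seconds.
max_timeout_s=$(( max_timeout_ms / 1000 ))
# 86400 seconds per day; integer division rounds 24.86 days down to 24.
max_timeout_days=$(( max_timeout_s / 86400 ))

echo "max timeout: ${max_timeout_ms} ms = ${max_timeout_s} s (${max_timeout_days} full days)"
```

The seconds figure is the largest value that can be written to /sys/block/<disk>/device/timeout, as the echo and cat commands above demonstrate.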

Of course, you would not want to set it that high, but in VMware setups, timeout values of 180, 360, 600, or even somewhat higher are not uncommon. Which value is required depends on the factors mentioned above: storage type, number of guests, I/O load, guest priority, and so on.

Changing /proc/sys/kernel/hung_task_timeout_secs to track new timeout value

A process remains in the "D" (uninterruptible sleep) state while it waits for I/O to finish. If a process stays in the D state for longer than hung_task_timeout_secs seconds, the kernel writes a warning message and a stack trace to the messages log. hung_task_timeout_secs is typically set to 120 seconds, with a default I/O timeout value of 60 seconds.

To prevent hung task output while the I/O timeout is still running, increase hung_task_timeout_secs whenever the I/O timeout is increased. It is advised to set hung_task_timeout_secs to at least the I/O timeout plus 60 seconds and at most twice the I/O timeout.

# echo <value in seconds> > /proc/sys/kernel/hung_task_timeout_secs
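
The sizing rule above (at least the I/O timeout plus 60 seconds, at most twice the I/O timeout) can be sketched as a small helper. The function name hung_task_range is hypothetical and it only performs the arithmetic; it does not change any kernel setting.

```shell
#!/bin/sh
# Print the suggested lower and upper bounds for hung_task_timeout_secs,
# given a device I/O timeout in seconds: "<timeout + 60> <timeout * 2>".
# hung_task_range is a hypothetical helper name used for illustration.
hung_task_range() {
    io_timeout="$1"
    echo "$(( io_timeout + 60 )) $(( io_timeout * 2 ))"
}

hung_task_range 180
```

For the suggested 180-second device timeout this prints 240 360, so any value in that range satisfies the rule. For device timeouts below 60 seconds the two bounds cross, so the rule only makes sense for larger values.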
