Opened 5 years ago
Closed 5 years ago
#19311 closed defect (invalid)
Changes on Archlinux 1/27/20 to support Linux 5.5 cause 10 min. boot of guest after downgrade to Linux 5.4
| Reported by: | David C Rankin | Owned by: | |
|---|---|---|---|
| Component: | other | Version: | VirtualBox 5.2.36 |
| Keywords: | | Cc: | |
| Guest type: | Linux | Host type: | Linux |
Description
This issue was first discussed on the virtualbox-users list on 2/4/20 under the thread "[VBox-users] 5.5 kernel and 5.2.36 - Does this not work with the new 5.2.36?"
Following the discussion at https://www.virtualbox.org/ticket/19145, I was directed to file a separate bug here and to attach the strace and dmesg output for the guest.
I build the 5.2.x branch for the AUR repository. https://aur.archlinux.org/packages/virtualbox-bin-5/
During an update on 2/4/20, Archlinux moved to the Linux 5.5 kernel. The virtualbox drivers built, and on reboot, after starting my Archlinux guest headless, virtualbox reported that the guest had started. Five minutes later the boot had yet to complete, so a poweroff was issued. I rebuilt with 5.2.36 -- same result -- and then with testcase 5.2.37 (135942), same result.
After posting to the vbox-users list and being informed of https://www.virtualbox.org/ticket/19145 and that 5.2 does not yet support the Linux 5.5 kernel, I downgraded my host. The same problem remained, but this time I simply let the boot of a simple Archlinux guest that boots to a text console run to completion. After 10 minutes the boot succeeded, but the guest was unusably slow.
For example, users cannot log in because the 90 second login timeout expires between the time the user enters their user name and presses [Enter] and the display of the 'password:' prompt. root can log in, as there is no 90 second timeout (it takes about 120 seconds for 'password:' to display after entering the user name).
All was fine on 1/27/20 when I last updated the guest to the Linux 5.4.14 kernel. I have downgraded both the host and guest multiple times to the package/kernel state they were in on 1/14/20, 1/21/20, 1/24/20, and 1/27/20. The guests are still unusable. It is as if a config file or something similar was modified to support the 5.5 kernel and is not reverted on downgrade, and that is causing the problem. That is the only explanation I can think of for why downgrading the entire system to a point before the problem appeared does not resolve the issue. I have a number of guests I use, and all was well before the 5.5 update.
I have attached a strace of the communication between the Archlinux host and Archlinux guest. I am no expert at interpreting strace output, but when each bit of I/O takes more than 10 seconds, that looks like an issue to me. I have also attached the dmesg output from the Archlinux guest after waiting through the 10 minute boot (it normally boots fully in less than 20 seconds).
I'm happy to provide anything else you can think of that may identify the issue.
Attachments (5)
Change History (14)
by , 5 years ago
Attachment: dmesg+strace_vbox-5.2.tar.xz added
by , 5 years ago
Attachment: dmesg+strace_vbox-headless-5.2.tar.xz added
Additional dmesg + strace from vbox-headless session
comment:1 by , 5 years ago
In the new attachment from the headless session, the strace shows what looks like some type of race condition. There are a few hundred lines at the beginning, ending with:

    write(1, "Oracle VM VirtualBox Headless In"..., 1471) = 1471
    exit_group(1)
    +++ exited with 1 +++

Then 40,000 lines of:

    read(9, "\372", 1)

followed by:

    select(10, [9], NULL, [9], NULL)

with an occasional:

    futex(0x23e2f88, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)

Then at the end, roughly 100 lines of:

    FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1581693996, tv_nsec=146944000}, FUTEX_BITSET_MATCH_ANY) = -1 EAGAIN (Resource temporarily unavailable)
Something is spinning in circles and it is bringing the system to its knees. During this last session I ran an update on the guest that normally takes less than 2 minutes to complete; 4 hours later, the update finally finished.
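For what it is worth, a trace captured with timing enabled makes this kind of spin easy to quantify. The sketch below is hypothetical and not taken from the attached trace: the file name, the sample lines, and the awk filter are all invented for illustration. It totals the elapsed-time field that `strace -T` appends to each call, which would show at a glance whether the `read(9, ...)` loop dominates the wall-clock time:

```shell
# Hypothetical sketch, not from the attached log. Capturing with timing,
# e.g. `strace -f -T -o vbox.strace <command>`, appends each call's
# elapsed seconds in <angle brackets>. The heredoc below stands in for
# the real trace so the filter is self-contained.
cat > sample.strace <<'EOF'
read(9, "\372", 1)                      = 1 <10.214321>
select(10, [9], NULL, [9], NULL)        = 1 <0.000042>
read(9, "\372", 1)                      = 1 <12.003117>
futex(0x23e2f88, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000008>
EOF

# Total elapsed time per syscall: the call name is everything before the
# first '(', the duration is the last <...> field on the line.
awk -F'[(<>]' '/</ { total[$1] += $(NF-1) }
    END { for (s in total) printf "%s %.6f s\n", s, total[s] }' sample.strace
```

On a real trace, a dominant `read` total like this would confirm that the process spends essentially all of its time blocked on that one descriptor.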
by , 5 years ago
Attachment: 20200222-025921_vbox_err.log.xz added
strace between Linux Host/Guest during upgrade of guest to Linux 5.5
comment:2 by , 5 years ago
Troubleshooting this issue, I managed to get the Archlinux guest updated to Linux 5.5 while running testcase 5.2.37 (135942), but the guest additions failed to build. I have opened a separate bug attaching the vboxadd-setup.log to https://www.virtualbox.org/ticket/19337.
I captured the strace of the host process, which I have included as an additional attachment here. Guests under 5.5 continue to be unusably slow. Several hours were required for this last update of the guest, which would normally have taken 5 minutes. Let me know if I can provide anything further.
comment:3 by , 5 years ago
If someone wants to play with this, here is the Archlinux installation guide:
comment:4 by , 5 years ago
Here is my package that I build for AUR - https://aur.archlinux.org/packages/virtualbox-bin-5/
I have updated it to 5.2.38 and the same issue remains. I am currently on minute 10 of trying to boot the Archlinux guest. I'll add a screenshot of the guest failures. Thank you for any help you can give.
by , 5 years ago
Attachment: vbox_5.2.38.png added
Screenshot of headless guest attempting to boot (about 6 minutes)
by , 5 years ago
Attachment: virtualbox-ext-oracle-5-5.2.38-1.src.tar.gz added
Archlinux Source Build for virtualbox-ext-oracle_5.2.38 to go along with AUR build
comment:5 by , 5 years ago
When I look at top on both the host and guest, cpu utilization is generally less than 4% on both, with occasional spikes up to 30% or so when data is processed. Those spikes generally last no more than a second or two before utilization drops back to idle. It is as if there is an I/O deadlock between the host and guest that only lets a few bytes through at a time.
iostat -c on the guest shows:

    Linux 5.5.5-arch1-1 (vl1)   02/24/2020   _x86_64_   (2 CPU)

    avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
               0.14   0.00     0.18    54.55    0.00  45.14
iostat -d on the guest shows:

    Linux 5.5.5-arch1-1 (vl1)   02/24/2020   _x86_64_   (2 CPU)

    Device    tps  kB_read/s  kB_wrtn/s  kB_dscd/s  kB_read  kB_wrtn  kB_dscd
    sda      3.82     102.85      61.57       0.00   323790   193848        0
    scd0     0.01       0.02       0.00       0.00       71        0        0
I'm not sure what that adds, but the %iowait looks high. I don't know what it normally is for this guest; I'm trying to run just about every tool I can to rule different issues in or out as related to this problem.
If you can think of anything else that I can post that would help -- just let me know.
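Since the %iowait is so high, the extended per-device view (`iostat -dx`, from the sysstat package) is probably more telling than the totals above: its `await` column gives the average milliseconds each I/O spends waiting, per device. A minimal sketch, with invented sample output standing in for a real run (the numbers and the 100 ms threshold are assumptions, not measurements from this system):

```shell
# Hypothetical sketch: flag devices whose average I/O wait looks pathological.
# The heredoc imitates `iostat -dx` output (sysstat); all values are invented.
cat > iostat.sample <<'EOF'
Device   r/s   w/s  rkB/s  wkB/s  await  %util
sda     3.10  1.20  102.8   61.6    4.2   1.5
sdc     0.80  0.40   12.4    8.1  812.7  99.8
sdd     0.85  0.42   12.6    8.3    5.1   2.0
EOF

# Report any device whose await (column 6) exceeds 100 ms.
awk 'NR > 1 && $6 > 100 { printf "%s: await %.1f ms\n", $1, $6 }' iostat.sample
# -> sdc: await 812.7 ms
```

Run on the host rather than the guest, a filter like this would separate "the whole system is slow" from "one disk is slow".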
comment:6 by , 5 years ago
Here is a new piece of information. I have another server, a backup, that has not been updated since 1/25/20 and has the same Archlinux host/guest setup (along with openSUSE guests). It is running the 5.4.14 kernel. I booted it today to check, and it runs PERFECTLY. No slowness, the Archlinux guest boots in 18 seconds, iostat waits are minimal, and it is a joy to use.
I will not update that box until we have resolved this bug, as it can provide whatever information may be requested about the pre-update state when things were working. I haven't been able to make any further determination of what the problem is. I know the host/guest I/O is broken for 5.2 under whatever Archlinux changes were made to support the 5.5 kernel (and since Arch is bleeding edge, others will likely hit this problem when their distros move forward). I just do not know enough about the internal host/guest communications to know where or what to look at.
Any help would be appreciated and I'm happy to pull whatever information from the working or non-working systems you may care to see.
comment:7 by , 5 years ago
Hold off testing; this may be an issue with the Linux 5.5 kernel and the mdadm Raid1 array the guest is on, where the Raid1 is comprised of 2 whole disks instead of 2 partitions. I have posted to the linux-raid mailing list (under the title "Linux 5.5 Breaks Raid1 on Device instead of Partition, Unusable I/O") and will provide the feedback I receive. (Gotta love kernel version updates...) A Raid1 I/O issue would explain all the slowdown symptoms seen:
For what it is worth the Raid1 details are:
    [14:17 valkyrie:/home/david/tmp] # mdadm -E /dev/sdc
    /dev/sdc:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x1
         Array UUID : 6e520607:f152d8b9:dd2a3bec:5f9dc875
               Name : valkyrie:4  (local to host valkyrie)
      Creation Time : Mon Mar 21 02:27:21 2016
         Raid Level : raid1
       Raid Devices : 2

     Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
         Array Size : 2930135488 (2794.39 GiB 3000.46 GB)
      Used Dev Size : 5860270976 (2794.39 GiB 3000.46 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
       Unused Space : before=262056 sectors, after=48 sectors
              State : clean
        Device UUID : e15f0ea7:7e973d0c:f7ae51a1:9ee4b3a4

    Internal Bitmap : 8 sectors from superblock
        Update Time : Sun Mar  1 14:18:07 2020
      Bad Block Log : 512 entries available at offset 72 sectors
           Checksum : 62472be - correct
             Events : 8193

        Device Role : Active device 0
        Array State : AA ('A' == active, '.' == missing, 'R' == replacing)

    [14:18 valkyrie:/home/david/tmp] # mdadm -E /dev/sdd
    /dev/sdd:
              Magic : a92b4efc
            Version : 1.2
        Feature Map : 0x1
         Array UUID : 6e520607:f152d8b9:dd2a3bec:5f9dc875
               Name : valkyrie:4  (local to host valkyrie)
      Creation Time : Mon Mar 21 02:27:21 2016
         Raid Level : raid1
       Raid Devices : 2

     Avail Dev Size : 5860271024 (2794.39 GiB 3000.46 GB)
         Array Size : 2930135488 (2794.39 GiB 3000.46 GB)
      Used Dev Size : 5860270976 (2794.39 GiB 3000.46 GB)
        Data Offset : 262144 sectors
       Super Offset : 8 sectors
       Unused Space : before=262056 sectors, after=48 sectors
              State : clean
        Device UUID : f745d11a:c323f477:71f8a0d9:27d8c717

    Internal Bitmap : 8 sectors from superblock
        Update Time : Sun Mar  1 14:18:15 2020
      Bad Block Log : 512 entries available at offset 72 sectors
           Checksum : 9101220e - correct
             Events : 8194

        Device Role : Active device 1
        Array State : AA ('A' == active, '.' == missing, 'R' == replacing)

    [14:18 valkyrie:/home/david/tmp] # mdadm -D /dev/md4
    /dev/md4:
               Version : 1.2
         Creation Time : Mon Mar 21 02:27:21 2016
            Raid Level : raid1
            Array Size : 2930135488 (2794.39 GiB 3000.46 GB)
         Used Dev Size : 2930135488 (2794.39 GiB 3000.46 GB)
          Raid Devices : 2
         Total Devices : 2
           Persistence : Superblock is persistent

         Intent Bitmap : Internal

           Update Time : Sun Mar  1 14:18:32 2020
                 State : clean, checking
        Active Devices : 2
       Working Devices : 2
        Failed Devices : 0
         Spare Devices : 0

    Consistency Policy : bitmap

          Check Status : 1% complete

                  Name : valkyrie:4  (local to host valkyrie)
                  UUID : 6e520607:f152d8b9:dd2a3bec:5f9dc875
                Events : 8194

        Number   Major   Minor   RaidDevice State
           0       8       32        0      active sync   /dev/sdc
           2       8       48        1      active sync   /dev/sdd
comment:8 by , 5 years ago
This bug can be closed. The problem was the underlying disk /dev/sdc. As you can see above, there are no errors regarding the array, and there were no errors in the journal, but when drilling down with iostat, the /dev/sdc disk was suffering high io_wait, which gave the guest stored on the array fits, essentially bringing it to its knees. The disk was replaced, the array rebuilt, and all guests are now happy on 5.2.38 and the Linux 5.5.7 kernel. The fact that no errors were generated regarding the disk failure, together with the coincidence of the problem appearing at the time of the Linux 5.4 -> 5.5 update, made diagnosing this issue a bit of a rabbit trail. Thank you for your help.
Please close this ticket as (invalid/not-a-bug).
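For anyone who hits a similar rabbit trail: the per-disk wait that finally exposed the failing drive can also be estimated without sysstat, straight from /proc/diskstats. A rough sketch; the two snapshots below are invented numbers standing in for two real reads of the file taken 5 seconds apart:

```shell
# Hypothetical sketch: estimate device utilization from /proc/diskstats.
# Field 3 is the device name; field 13 ("io_ticks") is milliseconds spent
# with I/O in flight. The two heredocs (invented values) stand in for
# `cat /proc/diskstats; sleep 5; cat /proc/diskstats`.
cat > t0 <<'EOF'
   8      32 sdc 323790 100 2590320 41000 193848 50 1550784 38000 0 52000 90000
EOF
cat > t1 <<'EOF'
   8      32 sdc 323810 100 2590480 51900 193852 50 1550816 38900 4 56950 101800
EOF

# Utilization = delta io_ticks / elapsed ms. Near-100% while moving almost
# no data is the signature of a dying disk like /dev/sdc here.
awk 'NR==FNR { t0[$3] = $13; next }
     { printf "%s: %.0f%% utilized\n", $3, ($13 - t0[$3]) / 5000 * 100 }' t0 t1
# -> sdc: 99% utilized
```

A mdadm array happily reports "clean" in this state, since the slow member still completes every I/O eventually; only the latency gives it away.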
comment:9 by , 5 years ago
| Resolution: | → invalid |
|---|---|
| Status: | new → closed |
strace of I/O between Host/Guest, dmesg from Guest