VirtualBox

Opened 2 days ago

Last modified 95 minutes ago

#22364 new defect

NAT is completely broken for ALL published 7.1.x versions

Reported by: tempdrive
Owned by:
Component: network/NAT
Version: VirtualBox-7.1.6
Keywords: NAT
Cc:
Guest type: Windows
Host type: Windows

Description (last modified by tempdrive)

Please treat this ticket as a consolidated summary of the bug and of the workarounds tried so far.

Here is a summary of what you will find here:

  • A way to trigger the bug in 3-5 minutes on every published version of the 7.1 branch (script attached; the demonstration video is linked because of its size)
  • Observations regarding the bug and the failed workarounds (log files and network trace files are linked because of their size)
  • The steps performed to narrow down the origin of the bug

My apologies for the lengthy description in advance, but I would like to make sure that everyone has a clear understanding of what I am trying to describe.

It usually takes a while before people hit the wall with this seemingly obscure issue. It has been present in exactly the same form since the very first published version (7.1.0 BETA1 r164171), and it has even driven some users to switch to other products in desperation. I have spent a few dozen hours trying to make progress on this topic, as the developers seem to be struggling to fix it. Some of the trial and error listed here may seem irrelevant, but I wanted to narrow the issue down to its core as much as possible.

I have come up with steps that trigger a complete failure of NAT networking within 3-5 minutes, depending on the hardware and system load. The attached PowerShell script triggers the issue that everyone is facing in one way or another. I am using a Windows host and a Windows guest, but the script can easily be adapted for other systems, as it relies only on the most basic concepts. All it does is send continuous API calls to a public endpoint that does not seem to object to them. Another endpoint I used earlier got me temporarily banned because of the number of attempts within a certain time frame, but even when that happens the bug is still triggered, because all that is required is heavy traffic, which the outgoing calls and their responses provide. The amount of transferred data is irrelevant: as the demonstration video shows, just under 200 MB had been transferred before the network went down. The script exits the loop after 10 consecutive errors. Everything else is normal networking behavior on both the sending and the receiving side.

To achieve the advertised runtime, I recommend running 9 threads in parallel. A single execution takes about 20-25 minutes, but the result is exactly the same no matter how many parallel processes there are. The primary goal was to minimize the time required to trigger the failure, so that workaround attempts can be ruled out quickly and proposed fixes can be tested. Please note that parallel execution requires more RAM (5-6 GB is recommended) and network bandwidth (a 10 Mbit/s connection). Fewer resources only mean that it takes longer to trigger the issue.
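As a rough illustration of the approach described above (this is not the attached NetStabilityTester.ps1), the loop can be sketched in Python. The endpoint URL, the function names and the `max_requests` cap are assumptions made for this sketch; only the exit after 10 consecutive errors and the 9 parallel workers are taken from the description.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical placeholder: the endpoint used by the attached script is not named here.
ENDPOINT = "https://example.com/"


def http_probe(url=ENDPOINT, timeout=10):
    """Perform one GET request and return the number of bytes received."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return len(response.read())


def run_until_failure(probe, max_consecutive_errors=10, max_requests=None):
    """Call probe() repeatedly until it fails max_consecutive_errors times in a row.

    Returns (requests_made, bytes_received, gave_up). gave_up is True when the
    error limit was reached. max_requests is an optional cap for dry runs.
    """
    requests_made = 0
    bytes_received = 0
    consecutive_errors = 0
    while max_requests is None or requests_made < max_requests:
        requests_made += 1
        try:
            bytes_received += probe()
            consecutive_errors = 0  # any success resets the error streak
        except Exception:
            consecutive_errors += 1
            if consecutive_errors >= max_consecutive_errors:
                return requests_made, bytes_received, True
    return requests_made, bytes_received, False


def run_parallel(probe, workers=9, **kwargs):
    """Run `workers` independent loops at once and collect their results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_until_failure, probe, **kwargs)
                   for _ in range(workers)]
        return [future.result() for future in futures]


# Example (inside the guest; loops until the network fails):
# results = run_parallel(http_probe)
```

Passing the probe as a parameter keeps the loop testable without network access; inside the guest, `run_parallel(http_probe)` would generate the sustained request traffic that the description says is sufficient to trigger the failure.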

The demonstration video is over 100 MB, so I can only link it: https://mega.nz/file/xUcw1QZb#zTFZVItcLdKg17UYChw8EcZK-TG2sSXYDaZ3FGqJXkg

The video demonstrates the use of the script and the bug being triggered. It also shows that the network remains broken after simply rebooting the guest system. Before recording the video, I also collected log files and network traces, which likewise exceed the size limit because the complete traffic was captured. The larger .pcap file covers the period from the start of the VM until it is rebooted, while the smaller .pcap file covers the unavailable network after the reboot has completed. https://mega.nz/file/Vc8H3JpK#qutus4zdEfjlVa1D2mlk1pBfKBRCGh2aapsAFUZm6iA

Key points to consider regarding the bug itself:

  • As demonstrated in the video, it can be triggered in the same way on all versions between 7.1.0 BETA1 r164171 and Oracle 7.1.7 r168159, in well under 10 minutes and with minimal effort
  • No version of the 7.0.x branch showed signs of this bug, even after letting the VMs run the script in parallel execution for 1 hour

The tested 7.0.x versions were a 7.0.11.x Beta build that I initially had on my test machine; the official releases 7.0.18 r162988 and 7.0.20 r163906, which date from around the time of the initial development work and the BETA 1 release of the 7.1.x branch (to check whether the bug existed in the previous branch at that time); and 7.0.25 r167944 as the most recent test build

  • When the bug happens, simply restarting the machine does not fix the networking; the process has to be removed from memory and started clean to get it working again
  • Setting the IP manually on the Guest machine does not prevent the issue and can cause additional problems with shutting down the machine when the bug is triggered
  • Disabling IPv6 on the Guest machine network properties does not prevent the issue
  • When the bug happens, disconnecting and reconnecting the network adapter does not solve the issue
  • When the bug happens, setting the network adapter to "Not Attached" and back to "NAT" does not solve the issue
  • When the bug happens, using the ipconfig /release and /renew commands does not solve the issue
  • The only workaround to get the network back up is to completely shut down the Guest machine and start it again, which will only work until the bug is triggered again

Once I had reached these conclusions, I started looking into ways to narrow down the origin of the bug. Given my limited knowledge of the code base, I started replacing binaries of both 7.1.0 BETA1 r164171 and 7.1.7 r168159 with versions from the 7.0 branch to see whether the bug remained.

Here is the list of drivers and binaries that I was able to replace while keeping the VM able to start, thus creating a hybrid version. The list focuses only on the latest test build.

These tests were performed cumulatively (each line represents a change made on top of the previous ones):

  • replaced "netadp6" driver 7.1.7.18159 (r168159) installation with 7.0.25.17944 (r167944): network broke
  • replaced "netlwf" driver 7.1.7.18159 (r168159) installation with 7.0.25.17944 (r167944): network broke
  • replaced "VBoxNetDHCP.exe" and "VBoxNetDHCP.dll" of 7.1.7-168159 with the files from 7.0.25-167944: network broke
  • replaced "VBoxNetNAT.exe" and "VBoxNetNAT.dll" of 7.1.7-168159 with the files from 7.0.25-167944: network broke
  • replaced "VBoxLibSsh.dll" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxHostChannel.dll" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxRes.dll" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxAuth.dll" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxAuthSimple.dll" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxGuestControlSvc.dll" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxGuestPropSvc.dll" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxAudioTest.exe" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxBalloonCtrl.exe" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxBugReport.exe" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxDragAndDropSvc.dll" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxDrvInst.exe" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxDTrace.exe" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxExtPackHelperApp.exe" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxHeadless.dll" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxHeadless.exe" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxAutostartSvc.exe" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxWebSrv.exe" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "DbgPlugInDiggers.dll" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxDbg.dll" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxManage.exe" of 7.1.7-168159 with the file from 7.0.25-167944: network broke

partial replacements:

  • replaced "VirtualBoxVM.exe" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxSupLib.dll" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxCAPI.dll" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxSDS.exe" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • replaced "VBoxDDU.dll" of 7.1.7-168159 with the file from 7.0.25-167944: network broke
  • renamed "vbox-img.exe" of 7.1.7-168159: network broke

additionally tried:

  • uninstalled Extension Pack: network broke
  • uninstalled Guest Additions from the Guest machine: network broke

I did the same for the first BETA to see whether things would behave differently, but they did not. I assume that the Qt libraries and UICommon.dll are merely responsible for the visuals, which leaves VBoxC.dll, VBoxDD.dll, VBoxDD2.dll, VBoxDDR0.r0, VBoxRT.dll, VBoxSVC.exe, VBoxVMM.dll, VirtualBox.exe, VirtualBoxVM.dll, VMMR0.r0 and VBoxSup.sys as candidates for containing the bug. I have highlighted the files I suspect to be the most relevant.

I hope my efforts will help you finally eliminate this system-breaking bug once and for all. Please consider including network tests like the one demonstrated here for future releases.

Attachments (1)

NetStabilityTester.ps1 (2.2 KB) - added by tempdrive 2 days ago.
The approach incorporated in this script can be used to quickly test the stability of the network


Change History (4)

by tempdrive, 2 days ago

Attachment: NetStabilityTester.ps1 added

The approach incorporated in this script can be used to quickly test the stability of the network

comment:1 by tempdrive, 2 days ago

Description: modified (diff)

comment:2 by Klaus Espenlaub, 33 hours ago

Much appreciated, this will help with speeding up the bug hunt.

Note that your file replacement experiments had essentially no chance to produce a hybrid with non-broken NAT. The NAT driver lives inside VBoxDD.dll, which also contains a ton of other device emulation and driver code which has subtle but incompatible changes between 7.0 and 7.1.

The NAT engine up to 7.0 was a heavily tweaked version of slirp, and with 7.1 we switched to an up-to-date version of libslirp. The latter is derived/improved from the former, but something with the integration appears to have gone subtly sideways.

comment:3 by tempdrive, 95 minutes ago

After thorough testing, I can confirm that r168266 contains a permanent fix for the issue.

The point of the file replacement was merely to limit the focus to the relevant files, not to assemble a working hybrid program. I have yet to learn how to browse the code repository properly, let alone come up with actual advice or a solution. I was pretty much at the limit of what I could do, and I am glad that nothing more was needed.

Thank you so much once again for all your efforts.
