Opened 11 years ago
Closed 10 years ago
#12076 closed defect (fixed)
QueryPerformanceCounter() Anomalies => Fixed in SVN
Reported by: | asdel | Owned by: | |
---|---|---|---|
Component: | other | Version: | VirtualBox 4.1.18 |
Keywords: | QueryPerformanceCounter counter timer | Cc: | |
Guest type: | Windows | Host type: | Linux |
Description
Host: Ubuntu 10.10
Guest: Windows XP Pro, v5.1 SP3
Anomalies are observed with 64-bit counter value provided via QueryPerformanceCounter(). These anomalies being to appear after the VirtualBox is running continuously for ~4 days. The counter frequency is 3,579,545 Hz; host/guest time synchronization is disabled (GetHostTimeDisabled = '1')
The guest calls QueryPerformanceCounter() every 50mS (0x2BB21 counts), logging the previous and current counter values when:
(a) the previous count > current count, or
(b) the current count is much larger than previous count (increase of 1 second or more, i.e., > 0x370000 counts)
The following counter query results are logged:
1. Last > Cur Time: 0x0000011B:7FFFC526 > 0x0000011A:80001858 2. Last > Cur Time: 0x00000135:7FFFE61B > 0x00000134:80002C01 3. Last > Cur Time: 0x00000136:7FFFDB9F > 0x00000135:800021C0 4. Last > Cur Time: 0x0000013E:7FFFEBC2 > 0x0000013D:80003983 5. Last > Cur Time: 0x00000146:7FFFBCFE > 0x00000145:800003E8 6. Last > Cur Time: 0x00000147:7FFFF8EA > 0x00000146:80003EDA 7. Last << Cur Time: 0x00000183:FFFFC9DC << 0x00000185:00005E92 8. Last > Cur Time: 0x00000185:7FFFD751 > 0x00000184:8000402C 9. Last > Cur Time: 0x00000189:7FFFD519 > 0x00000188:80006D8F 10. Last << Cur Time: 0x00000188:FFFFC076 << 0x0000018A:0000071D
At 3.579 MHz, the event of d31..d0 going from 0x7FFF:FFFF to 0x8000:0000 occurs every 1200 seconds (20 minutes).
Note the third and fifth cases, where the bits d35..d32 went from 0x6 (b0110) to 0x5 (b0101) -- this is a subtraction, not just a clearing of bit d32.
Log entries 7 and 10 appear to show bit d32 double-incrementing on the roll-over of d31..d0 from 0xFFFFFFFF to 0x1:00000000
The effect is, to the guest, time appears to jump forward or backward by 20 minutes.
-- end of description
Attachments (7)
Change History (24)
follow-up: 2 comment:1 by , 11 years ago
follow-up: 3 comment:2 by , 11 years ago
Replying to ramshankar:
As far as I can see, the latest TSC fixes in 4.2.x wasn't backported to 4.1.x release.
Could you, if possible, try upgrading to 4.2.18?
Yes, we will upgrade and rerun the test. Since the anomaly required several days to manifest, we will need to let the upgraded version run for a while before reporting back.
comment:3 by , 11 years ago
Now running v4.2.18. After approx 24 hrs, anomaly 'Type 1' has again appeared:
Last > Cur Time: 0x0000051:7FFFD7E2 > 0x00000050:80001F38
comment:4 by , 11 years ago
Thanks. Could you please upload the VBox.log for the VM you are seeing this issue.
by , 11 years ago
comment:6 by , 11 years ago
To reiterate: updated to 4.2.18 and the both forward and backward-jump anomalies are still present.
We are seeing the anomaly on some hardware platforms but not others (same VirtualBox host/guest arrangement on all platforms).
Note, the original problem description says QueryPerformanceCounter() is being called every 50mS, but it should say every 5mS (the count values reflect the 5mS period).
comment:7 by , 11 years ago
I can't find any official documentation that guarantees that QueryPerformanceCounter() is monotonically increasing nor do I know which time source it is looking at in the VM. In fact, from my understanding it determines the time source(s) at runtime. It's unlikely that it's reading the TSC in this case though.
I'm not ruling out a bug in our device emulation/interrupt delivery but we'll need a better understanding of what happens here (wraparound/sign extension/interrupt latency issues etc.)
From the frequency, it's most likely using the ACPI timer in which case the 32-bit wraparound is expected but how correctly Microsoft deals with this is a bit fuzzy. See also https://www.virtualbox.org/ticket/11951#comment:2
comment:8 by , 11 years ago
Thank you for the feedback. To close one thought: since my last posing, the anomaly has been observed on both our hardware platforms, so a hardware-unique issue appears unlikely.
by , 11 years ago
Attachment: | QPC_Check.cpp added |
---|
Source for Windows app demonstrating QueryPerformanceCounter anomaly
comment:9 by , 11 years ago
Update to report that the anomaly is also present with Mac OSX host (the companion .log file has been attached). I have also attached source code for a simple program that detects and reports any anomalies found.
An interesting consistency between the Linux and Mac platforms is that the anomaly only appears after 24+ hours of operation, and when it does appear, it is at first sporadic (i.e. occurring once every several hours). Eventually, it becomes more regular, and ultimately establishes a reliable 10-minute period.
Are there specific suggestions on how else we might assist in achieving a better understanding of the fundamental issue?
comment:10 by , 11 years ago
It would help to get the result from such a test program on real hardware which uses the same type of time source, i.e. the ACPI timer (sometimes called PMTIMER). The hardware implements only a 32 bit counter, and to me it looks like there is some quirk with extending it to 64 bits (honestly I have a hard time imagining how to do this safely at all with this brain-damaged hardware)...
comment:11 by , 11 years ago
I can confirm the issue with 4.3.6. The counter goes backward very often
Host: Windows 7 x64
Guest: Windows Server 2008 R2 x64
Core i5 2540M HT enabled, VT-x enabled, Nested paging enabled, 2 CPU
Here is the sample C# code:
using System; using System.Diagnostics; using System.Threading; namespace PerformanceCounterTest { class Program { static void Main(string[] args) { Stopwatch sw = new Stopwatch(); sw.Start(); long lastValue = sw.ElapsedTicks; for (int i = 0; i < 1000; i++) { Thread.Sleep(1); long elapsed = sw.ElapsedTicks; long diff = elapsed - lastValue; if (diff <= 0) Console.WriteLine("Loop: {0}, Error diff: {1}", i, diff); lastValue = elapsed; } sw.Stop(); Console.WriteLine("Stop"); } } }
Sample output (Release configuration, no debugging):
Loop: 2, Error diff: -263820 Loop: 13, Error diff: -239370 Loop: 16, Error diff: -261904 Loop: 20, Error diff: -261749 Loop: 51, Error diff: -261078 Loop: 57, Error diff: -261983 Loop: 61, Error diff: -259806 Loop: 78, Error diff: -261652 Loop: 85, Error diff: -261453 Loop: 100, Error diff: -262647 Loop: 113, Error diff: -262865 Loop: 116, Error diff: -261389 Loop: 145, Error diff: -261829 Loop: 152, Error diff: -261208 Loop: 169, Error diff: -261680 Loop: 186, Error diff: -262212 Loop: 218, Error diff: -261561 Loop: 240, Error diff: -261688 Loop: 246, Error diff: -261788 Loop: 250, Error diff: -259956
The same sample code has zero error output on the same machine running in host system. It has also zero error output on Virtual PC or VMWare virtualized machines.
comment:12 by , 11 years ago
Additional information, after restart of the virtual machine the issue has disappeared. Log of the broken session added.
After saving the machine state and restoring the session the issue has appeared again with less occurence:
Loop: 1, Error diff: -17926 Loop: 205, Error diff: -18108 Loop: 208, Error diff: -17730 Loop: 528, Error diff: -12459 Loop: 848, Error diff: -17473 Loop: 935, Error diff: -18373 Stop
comment:13 by , 11 years ago
Too bad that no one provided the information I tried to get in my previous reply - are any of the systems (physical or virtual) where the first test program runs without glitches which use PMTIMER (i.e. show a base frequency of 3.579 MHz)? I'm asking because this timer is only 32 bit, and it could be an OS bug which is hidden by the fact that Windows usually doesn't pick this time source.
Also, the 2nd test program clearly shows a very different pattern and thus there might be a different issue behind it - but as there is no information what timer is used for the C# this is even harder to make any progress than with the original one. The only bit of useful information I can take from the log is that the TSC rate is very unusual... it is measured as 2.491681920 GHz, whereas the CPU declares itself to run at 2.60GHz. That's over 4% off. Also, there are suspicious RTC timer rate changes (first few very short, just ~15 msecs, but the 2nd last one 5 seconds and the last one over 83 seconds) - but they don't match your test program, which takes samples every millisecond, until it reaches 1000 loops, so overall it'll run for a bit more than a second.
comment:14 by , 11 years ago
I am not that familiar with all these timers so what I am writing might be completely wrong... If this is the case please correct me...
Wouldn't it be sufficient to specify /USEPMTIMER in windows xp boot.ini file. As far as I understood it this will force windows to use the PMTIMER. I have done this on one pysical test machine and the frequency returned from QueryPerformanceFrequency has changed from something close to cpu frequency which indicates that TSC is used to 0x369e99 which is 3.579 MHz which indicates that PMTIMER is used. I have started our test program yesterday evening. As we only see the time jumping after more than 24 hours uptime I will report about the results tomorrow or maybe even better the day after tomorrow. I will post the code of our test program together with the results but it is almost the same compared to the test program of asdel.
If I am wrong it would be nice if someone can post a comment so that I can stop this and spare some time...
comment:15 by , 11 years ago
Here are the logs both "Virtual Machine 1" and "Virtual Machine 2" are affected by the problem the "Physical Machine" works like a charm.
I will let the machines run so that we can see that the frequency of the time jump increases with increasing uptime of the virtual machines.
Virtual Machine 1
Frequency: 369e99 2014-02-17 14:16:20.978 Overflow from lower part are 1199 seconds ActTime=1cf51e ActPerfTime=195b7eab3 2014-02-20 00:14:15.868: PerfError -1198.862613 ActPerfTime=ae8013a59c 2014-02-20 01:14:16.075: PerfError -1198.863378 ActPerfTime=b08035319e snapshot: 201402200739
Virtual Machine 2
Frequency: 369e99 2014-02-17 14:15:30.512 Overflow from lower part are 1199 seconds ActTime=1c284b ActPerfTime=18a892e82 2014-02-20 05:34:16.174: PerfError -1198.862579 ActPerfTime=be802f9ccf 2014-02-20 07:14:14.829: PerfError -1198.862586 ActPerfTime=c2800b4370 snapshot: 201402200739
Physical Machine
Frequency: 369e99 2014-02-17 16:08:45.828 Overflow from lower part are 1199 seconds ActTime=4c2c ActPerfTime=45f11bd snapshot: 201402200740
by , 11 years ago
Attachment: | virtualMachine1.log added |
---|
by , 11 years ago
Attachment: | virtualMachine2.log added |
---|
by , 11 years ago
Attachment: | checkTime.cpp added |
---|
comment:16 by , 10 years ago
Summary: | QueryPerformanceCounter() Anomalies → QueryPerformanceCounter() Anomalies => Fixed in SVN |
---|
We think we finally found and fixed the problem. The next 4.3.x maintenance release will contain the fix.
As far as I can see, the latest TSC fixes in 4.2.x wasn't backported to 4.1.x release.
Could you, if possible, try upgrading to 4.2.18?