RCU CPU stall warning in a multi-core system simulation #51
Comments
Although implementing multi-threaded system emulation can significantly alleviate this issue, I think it does not mean the problem will no longer occur after multi-threaded system emulation is complete. Perhaps we can start by optimizing the timer. Currently, the function In #49, it was suggested that lowering the frequency set in To strike a balance between these two extremes, I think we can maintain a dedicated emulator timer and updating it at the start of each emulation cycle. This way, I made some modifications to test this approach, and it did result in a slight performance improvement for the emulator. On my machine, with SMP=6, the RCU CPU Stall warning no longer appears. However, with SMP=8, the warning still occurs. |
I completely agree that your implementation can help avoid RCU CPU stall warnings. However, as the number of simulated cores increases, the warning is likely to occur again unless the emulation period per cycle is also increased proportionally to the number of cores. Is my understanding correct? I rebuilt the Linux kernel with
But preemption may increase the frequency of context switches, causing overall CPU usage to increase. |
Yes, the improvement is limited; as the number of cores grows, the warning will occur again. |
The accuracy of the timer primarily affects user programs. I think that during the boot process, it is unnecessary to use such a high-precision timer. Instead, a less precise timer, or even one that simply increments in a straightforward manner, could be used. After the boot process is complete, the system can switch back to a more precise timer. Here is a simple example. I added a global flag
static void op_sret(hart_t *vm)
{
/* Restore from stack */
vm->pc = vm->sepc;
mmu_invalidate(vm);
vm->s_mode = vm->sstatus_spp;
vm->sstatus_sie = vm->sstatus_spie;
/* After the booting process is complete, initrd will be loaded. At this
* point, the system will switch to U mode for the first time. Therefore,
* by checking whether the switch to U mode has already occurred, we can
* determine if the boot process has been completed.
*/
if (!boot_complete && !vm->s_mode)
boot_complete = true;
/* Reset stack */
vm->sstatus_spp = false;
vm->sstatus_spie = true;
}
Before the boot process is complete, I didn't use
bool boot_complete = false;
static struct timespec host_time;
// ...
void semu_timer_init(semu_timer_t *timer, uint64_t freq)
{
timer->freq = freq;
clock_gettime(CLOCKID, &host_time);
semu_timer_rebase(timer, 0);
}
static uint64_t semu_timer_clocksource(uint64_t freq)
{
#if defined(HAVE_POSIX_TIMER)
if (boot_complete) {
clock_gettime(CLOCKID, &host_time);
return host_time.tv_sec * freq +
mult_frac(host_time.tv_nsec, freq, 1e9);
} else {
return host_time.tv_sec * freq +
mult_frac(host_time.tv_nsec++, freq / 1000, 1e9);
}
// ...
#endif
}
uint64_t semu_timer_get(semu_timer_t *timer)
{
/* Rebase the timer to the current time after the boot process. */
static bool first = true;
if (first && boot_complete) {
first = false;
timer->begin = semu_timer_clocksource(timer->freq);
}
return semu_timer_clocksource(timer->freq) - timer->begin;
}
// ...
This is the sample output:
|
I agree that your changes can quickly and easily resolve the RCU stall warning issue! However, your log cannot represent the actual boot time in this situation ( I tried to reproduce your work, and the RCU stall warning is resolved when simulating SMP=32
[ 0.007006] Run /init as init process
[ 0.026121] hrtimer: interrupt took 12000723 ns
Starting syslogd: OK
another message A quick search suggests that this warning is produced by I'm not sure whether there is any side effect here |
Yes, the primary reason for incrementing Once the system switches to U mode for the first time, the As for the HRT warning, I think it was triggered due to a sudden jump in the system clock right after the boot process completed. I'm not sure is any side effect here too. |
This option was set for the sake of benchmarking purpose. RT-Tests relies on HRT features. |
The timer increments should align with the frequency defined in the device tree. We could use an approach similar to BogoMips to make the necessary adjustments. |
After multiple attempts, I realize that independently maintaining In contrast, I believe maintaining a frequency scaling factor is a better solution. As mentioned in #49, this achieves the purpose of slowing down time during the boot process. Since it’s merely a scaling factor, we can still derive real-time values from it. As for the Interestingly, on my machine, the warning doesn't appear at all with SMP=16, regardless of how the frequency is adjusted. Moreover, the current implementation only affects the timer during the boot process; after switching to U-mode, the timer behaves exactly as before. Therefore, I believe this warning is due to the current sequential emulation approach. After the multi-threaded emulation is implemented, I think the situation would be mitigated a lot. Here’s a diagram of the overall flow: Below is a simple example:
static uint64_t semu_timer_clocksource(uint64_t freq)
{
#if defined(HAVE_POSIX_TIMER)
struct timespec t;
clock_gettime(CLOCKID, &t);
if (boot_complete)
return t.tv_sec * freq + mult_frac(t.tv_nsec, freq, 1e9);
else
return t.tv_sec * (freq / 100) + mult_frac(t.tv_nsec, (freq / 100), 1e9);
#elif defined(HAVE_MACH_TIMER)
static mach_timebase_info_data_t t;
if (t.denom == 0)
(void) mach_timebase_info(&t);
return mult_frac(mult_frac(mach_absolute_time(), t.numer, t.denom), freq,
1e9);
#else
return time(0) * freq;
#endif
}
uint64_t semu_timer_get(semu_timer_t *timer)
{
static bool first = true;
if (first && boot_complete) {
first = false;
semu_timer_rebase(timer, 0);
printf("\033[1;33m[SEMU LOG]: Switch to real time\033[0m\n");
}
return semu_timer_clocksource(timer->freq) - timer->begin;
}
I think this approach is better than maintaining two separate timers during the boot process. Dividing the frequency by 100 means the boot process operates at one one-hundredth of real-time, allowing us to easily derive the actual boot time. This scaling factor can also be configured in the Makefile. I attempted to dynamically measure the cost of Here is the output of the test:
Another output for the factor set to 50:
Also another output of the factor set to 10:
In my environment, even with varying scale factors, the hrtimer warning consistently appeared at approximately 60000000 ns. I think this observation supports my hypothesis. |
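The scaled-frequency idea from this comment can be sketched in isolation. This is a minimal illustration, not the semu implementation: `scaled_ticks` and `guest_ns_to_host_ns` are hypothetical helper names, and the host time is passed in as a plain nanosecond count so the arithmetic can be checked without reading a real clock.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical helper: guest ticks for a host timestamp (in ns) when the
 * timer frequency is divided by `scale` during boot. With scale == 1 this
 * is the ordinary, unscaled conversion. */
static uint64_t scaled_ticks(uint64_t host_ns, uint64_t freq, uint64_t scale)
{
    assert(scale > 0 && freq >= scale);
    uint64_t f = freq / scale;
    /* Convert whole seconds and the sub-second remainder separately so the
     * intermediate product stays well inside 64 bits. */
    return (host_ns / NSEC_PER_SEC) * f +
           (host_ns % NSEC_PER_SEC) * f / NSEC_PER_SEC;
}

/* Hypothetical helper: because boot time is only scaled, a guest-visible
 * duration can be mapped back to host time by multiplying by the factor. */
static uint64_t guest_ns_to_host_ns(uint64_t guest_ns, uint64_t scale)
{
    return guest_ns * scale;
}
```

With a factor of 100, a guest log timestamp of 0.007006 s would correspond to roughly 0.7 s of host time, which is the property that makes the scaled approach easy to interpret.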
Here is a summary of two potential approaches to mitigate RCU CPU stalls under the current sequential emulation scenario.
Methods
1. Scale Frequency
This method involves calling
Pros
Cons
2. Manually Increment
SMP | calls to semu_timer_clocksource | boot process time (sec) | hrtimer warning |
---|---|---|---|
1 | 223,992,364 | 3.40001 | |
2 | 382,486,686 | 8.01002 | |
3 | 577,491,593 | 13.44003 | |
4 | 774,125,110 | 17.85185 | |
5 | 973,274,729 | 22.94007 | |
6 | 1,174,038,398 | 27.11009 | |
7 | 1,377,244,622 | 31.80010 | |
8 | 1,605,001,986 | 37.52011 | |
9 | 1,793,136,295 | 41.41014 | |
10 | 2,005,988,752 | 45.53015 | |
11 | 2,220,126,569 | 51.66018 | |
12 | 2,440,897,255 | 56.13018 | |
13 | 2,651,860,790 | 60.71019 | |
14 | 2,882,701,067 | 65.92020 | |
15 | 3,103,978,838 | 70.30022 | |
16 | 3,343,030,072 | 76.31025 | |
17 | 3,566,365,881 | 80.24026 | |
18 | 3,800,214,669 | 86.59028 | |
19 | 4,031,961,176 | 92.00030 | |
20 | 4,280,331,336 | 94.47030 | |
21 | 4,516,731,902 | 101.68033 | |
22 | 4,883,959,327 | 104.95035 | |
23 | 5,143,022,258 | 110.69036 | |
24 | 5,260,058,753 | 118.59098 | |
25 | 5,526,277,854 | 125.30041 | |
26 | 5,790,681,086 | 132.98045 | 50000184 ns |
27 | 6,044,658,240 | 140.04046 | 80000307 ns |
28 | 6,328,119,424 | 146.18047 | 60000231 ns |
29 | 6,598,156,499 | 154.15050 | 80000261 ns |
30 | 6,868,480,625 | 159.83052 | 90000308 ns |
31 | 7,129,979,196 | 163.82054 | 50000169 ns |
32 | 7,410,129,712 | 170.80054 | 80000508 ns |
Tests were also conducted on my workstation:
SMP | calls to semu_timer_clocksource | boot process time (sec) | hrtimer warning |
---|---|---|---|
1 | 223,450,834 | 15.21302 | |
2 | 388,551,174 | 31.45406 | |
3 | 586,279,749 | 48.33009 | |
4 | 791,644,232 | 68.00714 | |
5 | 1,003,639,012 | 83.64418 | |
6 | 1,216,761,778 | 99.95122 | 12000031 ns |
7 | 1,438,276,507 | 120.21144 | 14000047 ns |
8 | 1,704,344,789 | 122.50440 | 11000030 ns |
9 | 1,900,605,464 | 156.91848 | 10000031 ns |
10 | 2,140,147,966 | 176.43249 | 11000031 ns |
11 | 2,451,031,756 | 179.20599 | 12000062 ns |
12 | 2,633,717,918 | 217.70393 | 14000046 ns |
13 | 2,993,790,985 | 216.13076 | 15000046 ns |
14 | 3,165,383,012 | 262.75081 | 14000046 ns |
15 | 3,437,855,090 | 286.43180 | 15000015 ns |
Since the workstation was slow, the execution time was long, so I only collected statistics up to SMP=15.
Target Time Configuration
To use the second method, a target time needs to be determined. If a target boot time of 10 seconds is set, nsec increment values can be calculated based on the SMP parameter.
For example, with SMP=4 and a target time of 10 seconds (semu_timer_clocksource adds approximately: to nsec.
However, this method may introduce timing discrepancies across different environments. For instance, with SMP=1, the boot process takes approximately 3 seconds on my personal computer but 18 seconds on my workstation, resulting in a sixfold difference.
This leads to an implicit problem: if adding a core increases the number of semu_timer_clocksource calls by an approximate number semu_timer_clocksource will increment nsec by approximately:
where SMPs represents the number that follows the SMP parameter.
Under this method, the time in emulator during boot process is calculated as:
If the assumption of semu_timer_clocksource calls exceeds
The value $2 \times 10^8$ was derived from tests on my personal computer and workstation. Despite the sixfold difference in execution times, the number of semu_timer_clocksource calls was remarkably consistent, leading to this assumption. The corresponding numbers can be checked in the tables above: for each increment of the SMP parameter, the number of calls to semu_timer_clocksource increases by roughly $2 \times 10^8$.
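The per-call increment described above can be worked through in a few lines of C. This is a sketch of one plausible reading of the calculation, assuming the increment is the target boot time in nanoseconds divided by the expected number of calls, with $2 \times 10^8$ calls per hart as stated in the comment. The function names are hypothetical and do not come from semu.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption from the discussion: each additional hart adds roughly
 * 2 * 10^8 calls to semu_timer_clocksource during boot. */
#define CALLS_PER_HART 200000000ULL

/* Expected number of timer reads for a given hart count. */
static uint64_t expected_calls(uint64_t smp)
{
    assert(smp > 0);
    return smp * CALLS_PER_HART;
}

/* Nanoseconds to add to nsec per call so that the expected number of
 * calls adds up to the target boot time. */
static double nsec_increment(uint64_t target_sec, uint64_t smp)
{
    return (double) target_sec * 1e9 / (double) expected_calls(smp);
}

/* Emulated boot time (seconds) given how many calls actually happened. */
static double emulated_boot_sec(uint64_t actual_calls, double increment_ns)
{
    return (double) actual_calls * increment_ns / 1e9;
}
```

For SMP=4 and a 10-second target this gives 12.5 ns per call. Plugging in the 774,125,110 calls measured for SMP=4 in the first table yields an emulated boot time slightly under 10 seconds, since the measured count is below the assumed $8 \times 10^8$.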
If we still want a coarse-grained timer during the boot process to roughly approximate real-world time, clock_gettime could be called at specific intervals (e.g., every
However, if the actual number of semu_timer_clocksource calls exceeds clock_gettime may lead to time regression if real-world time is less than emulation time.
In contrast, if the number of semu_timer_clocksource calls is too low, time will continue to increment, leading only to a deviation that can be corrected via rebase.
Although manually incrementing nsec avoids calling clock_gettime, differences in execution time across environments reduce the stability of RCU CPU stall warning mitigation. This also eliminates the ability to correlate boot process logs with real-world time.
Nonetheless, boot process timing may not be critical, and the boot process clearly takes longer as the number of harts increases, so I think the benefit of avoiding clock_gettime calls remains attractive.
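The time-regression concern above can be illustrated with a small sketch. It assumes a counter that advances by a fixed step on every call and is compared against a host reading every `resync_every` calls; the guard that ignores a host reading lagging behind emulated time is my own addition for illustration, not something proposed in the thread. The struct, its fields, and `boot_clock_tick` are all hypothetical, and the host time is passed in as a parameter rather than read with clock_gettime so the behaviour is reproducible.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical boot-time clock: advances by a fixed step per call and
 * is only occasionally compared against a host timestamp. */
typedef struct {
    uint64_t now_ns;       /* emulated time seen by the guest */
    uint64_t calls;        /* calls since the last resync point */
    uint64_t step_ns;      /* fixed per-call increment */
    uint64_t resync_every; /* calls between host clock comparisons */
} boot_clock_t;

/* host_ns would come from clock_gettime in a real emulator; here the
 * caller supplies it. */
static uint64_t boot_clock_tick(boot_clock_t *c, uint64_t host_ns)
{
    assert(c->resync_every > 0);
    c->now_ns += c->step_ns;
    if (++c->calls >= c->resync_every) {
        c->calls = 0;
        /* Only jump forward. If the host reading is behind the emulated
         * time, adopting it would make guest time run backwards. */
        if (host_ns > c->now_ns)
            c->now_ns = host_ns;
    }
    return c->now_ns;
}
```

When more calls occur than expected, emulated time runs ahead and the host reading is skipped; when fewer occur, the clock jumps forward to the host reading, which matches the "deviation that can be corrected" case described above.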
In my opinion,
- If boot time accuracy matters: Use scaled frequency and continue relying on clock_gettime to update nsec. There is simple example code in the previous discussion of this issue.
- If boot time accuracy does not matter: Manually update nsec without scaling frequency. Increment nsec by the method mentioned above.
  - If the boot process timing is completely irrelevant, I think even just updating nsec by a really small PRNG number like 1~3 is okay.
Maybe we can discuss which method to adopt or any better modifications here. Once decided, I think I can start to submit a PR.
@chiangkd and @RinHizakura, please comment on the above. |
I tend to prefer using "scaled frequency", based on your analysis.
In my opinion, using real-world time offers valuable benefits for developers. It aids in analyzing and identifying potential improvements to accelerate the booting process (e.g., multi-threaded simulations). |
If we just consider the description, scaling frequency seems better to me because of its timing accuracy. However, I am still hesitating after thinking further about it. For example, considering a debug mode (yes, we haven't implemented this for semu, but the possibility still exists), the situation can be more complicated: the guest can be stopped, but the host timer will keep going. The clock_gettime() method may lead to more complexity in this case. On the other hand, most of the time we don't evaluate software (Linux) performance on the emulator, because there are too many biases in the emulation even if we can solve this timer issue. In summary, do we have any real experience in evaluating boot time accuracy on an emulator (e.g. qemu)? If we have never done this (in my experience this is true), the time accuracy does not matter. I'll support just using the increment counter-based method. |
If I understand correctly, QEMU by default provides a rough(coarse) mapping to the host real-time. This is sufficient for the Guest OS to perceive "time passing," but it does not guarantee accurate or stable synchronization. At the same time, QEMU offers another clock type( For semu, since we aim to allow user programs to leverage benchmarking tools like Dhrystone or CoreMark for performance testing, it is essential to read accurate time. However, reading Thus, as mentioned earlier, I think time during the boot process is relatively less critical. During this period, simply ensuring that the
At this stage, I believe we should focus solely on addressing the RCU CPU stall warning issue. Our current implementation is relatively simplified. Though more complex debugging scenarios could be resolved (e.g., using rebase to adjust time offsets), doing so would require enumerating and handling different cases within our simplified implementation, leading to code clutter. In my opinion, if we want to account for more complex factors, a more comprehensive timer framework would be needed. For example, adopting a cycle-accurate simulation approach with a dedicated thread to update timer values and signaling this thread to obtain monotonic time would be a viable solution. However, I don't think this is very necessary currently. |
Agree. If there is no case where the accuracy of boot time is important, it would be more attractive if the emulator could boot quickly. For example, when we use it as a CI/CD test platform. So I still prefer the counter-based method. My experience may be too one-sided, so feel free to make any exceptions.
Right, I may overthink. We can ignore the extended features as of now. |
Sorry I'd like to revise my previous opinion. I initially preferred using a "scaled frequency" approach based on your analysis. I agree with @RinHizakura and your proposed Method 2—simply incrementing the counter during mode switching.
I agree that precise timekeeping during the boot process is relatively less critical, and frequent I’ve learned a lot from this discussion—thank you! |
TL;DR
After implementing the method of incrementing
Detailed Explanation
Initial Attempt: Incrementing
SMP | scale freq by 10 | increment nsec | increment ticks |
---|---|---|---|
1 | 223,992,364 | 239,936,190 | 239,937,385 |
2 | 382,486,686 | 410,350,413 | 410,377,969 |
3 | 577,491,593 | 597,403,029 | 599,377,578 |
4 | 774,125,110 | 800,859,292 | 825,078,230 |
5 | 973,274,729 | 1,006,497,972 | 1,005,718,212 |
6 | 1,174,038,398 | 1,211,971,435 | 1,213,769,036 |
7 | 1,377,244,622 | 1,417,808,155 | 1,422,112,653 |
8 | 1,605,001,986 | 1,622,327,309 | 1,627,353,803 |
9 | 1,793,136,295 | 1,832,897,356 | 1,841,933,374 |
10 | 2,005,988,752 | 2,053,838,405 | 2,056,838,991 |
11 | 2,220,126,569 | 2,258,511,835 | 2,266,359,688 |
12 | 2,440,897,255 | 2,545,025,764 | 2,483,505,990 |
13 | 2,651,860,790 | 2,732,577,418 | 2,703,809,852 |
14 | 2,882,701,067 | 2,898,655,823 | 2,917,853,568 |
15 | 3,103,978,838 | 3,122,944,650 | 3,141,405,055 |
16 | 3,343,030,072 | 3,355,869,282 | 3,450,041,050 |
17 | 3,566,365,881 | 3,534,461,625 | 3,587,717,644 |
18 | 3,800,214,669 | 3,770,840,706 | 3,819,287,621 |
19 | 4,031,961,176 | 3,997,057,349 | 4,047,732,234 |
20 | 4,280,331,336 | 4,240,098,183 | 4,277,291,251 |
21 | 4,516,731,902 | 4,470,467,091 | 4,506,903,787 |
22 | 4,883,959,327 | 4,822,792,362 | 4,731,616,330 |
23 | 5,143,022,258 | 4,954,544,816 | 4,975,643,903 |
24 | 5,260,058,753 | 5,198,275,343 | 5,208,915,523 |
25 | 5,526,277,854 | 5,592,128,390 | 5,601,777,728 |
26 | 5,790,681,086 | 5,566,017,777 | 5,696,829,697 |
27 | 6,044,658,240 | 5,813,936,910 | 5,932,303,468 |
28 | 6,328,119,424 | 6,059,320,278 | 6,189,134,088 |
29 | 6,598,156,499 | 6,304,558,565 | 6,425,892,515 |
30 | 6,868,480,625 | 6,553,948,238 | 6,682,215,118 |
31 | 7,129,979,196 | 6,808,544,491 | 6,935,364,394 |
32 | 7,410,129,712 | 7,067,744,976 | 7,180,650,711 |
Boot Process Time (seconds):
SMP | scale freq by 10 | increment nsec | increment ticks |
---|---|---|---|
1 | 3.40001 | 3.43001 | 3.37001 |
2 | 8.01002 | 7.82002 | 7.60002 |
3 | 13.44003 | 12.97003 | 13.00003 |
4 | 17.85185 | 17.59004 | 16.74004 |
5 | 22.94007 | 21.54004 | 21.20004 |
6 | 27.11009 | 26.15005 | 25.74005 |
7 | 31.80010 | 30.50006 | 30.52006 |
8 | 37.52011 | 34.66007 | 34.33017 |
9 | 41.41014 | 39.05009 | 38.61007 |
10 | 45.53015 | 42.95009 | 42.74009 |
11 | 51.66018 | 47.63010 | 46.40010 |
12 | 56.13018 | 50.37010 | 51.27010 |
13 | 60.71019 | 53.60011 | 55.89012 |
14 | 65.92020 | 60.54012 | 60.16012 |
15 | 70.30022 | 65.15013 | 65.81013 |
16 | 76.31025 | 70.09014 | 67.95014 |
17 | 80.24026 | 73.90014 | 73.29030 |
18 | 86.59028 | 78.92016 | 78.91032 |
19 | 92.00030 | 82.91016 | 84.04035 |
20 | 94.47030 | 88.05019 | 88.06036 |
21 | 101.68033 | 92.53019 | 92.11038 |
22 | 104.95035 | 94.37019 | 96.46040 |
23 | 110.69036 | 103.57021 | 101.64042 |
24 | 118.59098 | 107.63022 | 106.39043 |
25 | 125.30041 | 109.88022 | 108.33045 |
26 | 132.98045 | 114.10023 | 116.97026 |
27 | 140.04046 | 120.04024 | 121.35049 |
28 | 146.18047 | 124.85025 | 128.63053 |
29 | 154.15050 | 129.38026 | 131.42054 |
30 | 159.83052 | 135.39027 | 136.87057 |
31 | 163.82054 | 140.07028 | 142.77042 |
32 | 170.80054 | 146.00028 | 147.76042 |
Emulator Time:
SMP | emulator time (nsec) | emulator time (ticks) |
---|---|---|
1 | 11.086913 | 11.087134 |
2 | 9.344152 | 9.334738 |
3 | 8.656941 | 9.047707 |
4 | 8.694160 | 9.359388 |
5 | 9.095570 | 9.093272 |
6 | 8.750582 | 9.135809 |
7 | 8.944581 | 9.148663 |
8 | 8.759224 | 9.156270 |
9 | 8.237356 | 9.208044 |
10 | 9.205089 | 9.225684 |
11 | 8.083883 | 9.235738 |
12 | 9.128268 | 9.241278 |
13 | 7.331282 | 9.288196 |
14 | 7.742981 | 9.305096 |
15 | 8.340038 | 9.334116 |
16 | 8.955129 | 9.622751 |
17 | 6.269885 | 9.388393 |
18 | 6.666563 | 9.411124 |
19 | 7.066534 | 9.423658 |
20 | 7.491480 | 9.434454 |
21 | 7.888990 | 9.449546 |
22 | 8.529552 | 9.467739 |
23 | 8.726807 | 9.551230 |
24 | 9.146896 | 9.573380 |
25 | 9.866130 | 9.875079 |
26 | 4.871893 | 9.620185 |
27 | 5.083991 | 9.613555 |
28 | 5.293480 | 9.695915 |
29 | 5.501661 | 9.714826 |
30 | 5.713393 | 9.741353 |
31 | 5.929234 | 9.735223 |
32 | 6.150587 | 9.786437 |
Due to the slower environment of my workstation, I tested only SMP=4 and SMP=32 there:
SMP | Scale Freq by 10 | Increment Ticks |
---|---|---|
4 | 70.87429 | 56.76912 |
32 | 634.91923 | 578.99135 |
This shows that the approach saved nearly one minute in the slower environment.
Remarks
Maintaining ticks resolved the inconsistency issues encountered with nsec increments. When using the nsec increment method, each boot process execution yields slight differences in the number of semu_timer_total_ticks calls and the emulator's internal time. These variations result in inconsistent internal emulator time across runs.
However, the ticks increment method consistently produces identical results for every run. I think this is because the value added to the ticks during each call to semu_timer_total_ticks is fixed, resulting in the emulator time being the same during each boot process.
But there is a fluctuation when SMP is set to 4, 16 and 26. I am uncertain whether this is entirely expected. Aside from this, no operational problems have been observed in the emulator's functionality so far.
Currently, the number of SMPs is hardcoded in the code as a macro. Therefore, when testing, adjustments must be made according to the SMP count specified at compile time.
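A counter that adds a fixed amount per call is deterministic by construction, which fits the run-to-run consistency reported for the ticks method. The sketch below is illustrative only: it assumes the 650,000,000-tick figure quoted elsewhere in this thread for a 10-second boot and $2 \times 10^8$ calls per hart, and it uses a fractional accumulator because the resulting step can be smaller than one tick. `boot_ticks_t` and its functions are hypothetical names, not the semu API.

```c
#include <assert.h>
#include <stdint.h>

#define TARGET_TICKS 650000000.0   /* ticks for a 10-second boot (from this thread) */
#define CALLS_PER_HART 200000000.0 /* assumed timer reads added per hart */

/* Hypothetical boot-time tick counter. */
typedef struct {
    double ticks; /* fractional accumulator: the step may be below 1 tick */
    double step;  /* fixed amount added on every call */
} boot_ticks_t;

/* smp would come from the compile-time hart count in an emulator where
 * that count is a macro. */
static void boot_ticks_init(boot_ticks_t *t, unsigned smp)
{
    assert(smp > 0);
    t->ticks = 0.0;
    t->step = TARGET_TICKS / (CALLS_PER_HART * smp);
}

/* Advance by the fixed step and return the whole-tick value. The result
 * depends only on how many times this was called, never on host time. */
static uint64_t boot_ticks_next(boot_ticks_t *t)
{
    t->ticks += t->step;
    return (uint64_t) t->ticks;
}
```

Because the output is a pure function of the call count, any run-to-run variation in emulated boot time under this scheme would have to come from a different number of calls rather than from the timer itself.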
Comparatively, the scaling method appears more stable (as it only scales time). Also, there are still unexplained results when using the ticks increment method. Therefore, I am unsure which method is better. Based on these results, I’d like to ask for your opinions. Currently, the advantages and disadvantages of the scaling method are as follows:
Advantages of the Ticks Increment Method:
Disadvantages:
As everything is confirmed, I think I can start to submit a PR. BTW, In the sample code above, I’ve added a debug message showing the boot process duration at startup for easier debugging and performance testing. Should we include this message in the final PR submission? |
Don't do this. |
I noticed that the results (the times call
I am unsure why the results remain consistent over short intervals but fluctuate when retested after a day. Additionally, I am not certain whether these results are fully analyzable. I speculate that this behavior is closely related to the state of the Host OS. At the very least, the consistency of the short-term results is good. The table below shows the values I just measured. Some values are identical to those from yesterday, while most are different. However, they still exhibit a similar distribution.
This table helps explain why there is oscillation in the results. We assume that for every increment of 1 in SMP, the number of times Additionally, it is noteworthy that the total ticks are always greater than the ticks corresponding to 10 seconds (650,000,000). However, the time displayed during the boot process is almost always less than 10 seconds. Dividing the total ticks by 650,000,000 yields the target boot time, and subtracting this from the displayed time results in a value that generally falls between 0.9 and 1.5 seconds. In summary, there are still several unexplained aspects of this method's results:
Nonetheless, this method reduces the execution time of the boot process. Although the resulting values are less interpretable, the deviations remain within an acceptable range. Therefore, I think the second method is still preferable. |
After several days of analysis and discussion, I think the acceleration effect of the Therefore, I revisited the first Below is an explanation of the sample code: Using
The logic inside seconds. And for example, if this calculation results in 200 seconds, we can use the target time to set the scale factor. Assuming the target time is 10 seconds, the scaling factor would be Below is the sample code:
Currently, only the |
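The sample code referred to in this comment did not survive, so the following is only a rough sketch of how a scale factor could be derived from a measured clock_gettime cost. It assumes the predicted boot time is the average per-call cost multiplied by an expected call count, and that the factor is the target time divided by that prediction (for example, 10 s / 200 s). All function names are hypothetical, and `CLOCK_MONOTONIC` is used here as a stand-in for whatever clock ID the emulator is configured with.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Average cost in nanoseconds of one clock_gettime call, measured over
 * n back-to-back calls. */
static double clock_gettime_cost_ns(unsigned n)
{
    struct timespec start, end, tmp;
    assert(n > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (unsigned i = 0; i < n; i++)
        clock_gettime(CLOCK_MONOTONIC, &tmp);
    clock_gettime(CLOCK_MONOTONIC, &end);
    double elapsed = (double) (end.tv_sec - start.tv_sec) * 1e9 +
                     (double) (end.tv_nsec - start.tv_nsec);
    return elapsed / n;
}

/* Predicted boot duration if every expected call costs per_call_ns. */
static double predicted_boot_sec(double per_call_ns, uint64_t calls)
{
    return per_call_ns * (double) calls / 1e9;
}

/* Factor to multiply the timer frequency by so that the predicted boot
 * time appears as target_sec to the guest. */
static double boot_scale_factor(double target_sec, double predicted_sec)
{
    assert(predicted_sec > 0.0);
    return target_sec / predicted_sec;
}
```

A predicted 200-second boot with a 10-second target gives a factor of 0.05, so the guest would see time advance at one twentieth of the host rate until boot completes.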
Execute semu with multi-core system simulation
The RCU CPU stall warning, as discussed in #49, is accompanied by an increase in timer interrupts and clock_gettime system calls to produce a real-time timer. This causes the CPU to wait longer than a typical grace period, which is usually 21 seconds.
Implementing multi-thread support for semu, as discussed in PR #49, might improve the performance of the booting process.