primecount-7.0
In primecout-6.x the AC algorithm (which is part of Gourdon's algorithm) scales very badly on PCs & servers with a large number of CPU cores. The two root causes of this scaling issue are cache misses and thread synchronization overhead. In primecount-7.0 the AC algorithm has been completely rewritten, now all threads are independent from each other and require only minimal synchronization. Furthermore each thread operates on its own tiny chunk of memory that conveniently fits into the CPU's cache. On my dual socket AMD EPYC server this improves performance by more than 2x for large AC computations with x ≥ 1e23. The P2 & B algorithms have also been rewritten so that the threads are independent from each other.
The Easy-Special-Leaves.md document contains an in-depth description of the new AC algorithm.
AC.cpp
: New algorithm with improved scaling.AC_libdivide.cpp
: New algorithm with improved scaling.B.cpp
: Improved scaling due to independent threads.P2.cpp
: Improved scaling due to independent threads.LoadBalancerAC.cpp
: New thread scheduler for AC algorithm.LoadBalancerS2.cpp
: Improve load balancing of S2_hard & D.LoadBalancerP2.cpp
: Rewritten, now similar to other load balancers.SegmentedPiTable.cpp
: Decrease size to x^(1/4).util.cpp
: Improve scaling using larger default alpha_z = 2.imath.hpp
: Optimizeilog2()
&next_power_of_2()
using__builtin_clzll()
.- Remove MPI support: No users, too much work to maintain.