Message boards : Sierpinski/Riesel Base 5 Problem : Optimal CPU count?
Has anyone experimented with assigning 4+ cores to a single WU? Does SR5 have a thread limit beyond which extra cores won't help crunching? Trying to crank these out as fast as possible. | |
|
Rafael Volunteer tester
Joined: 22 Oct 14 Posts: 888 ID: 370496 Credit: 346,354,112 RAC: 546,356
                   
|
Has anyone experimented with assigning 4+ cores to a single WU? Does SR5 have a thread limit beyond which extra cores won't help crunching? Trying to crank these out as fast as possible.
As with most things in life... it depends, mainly on your processor architecture, core count, RAM and what you do with your PC.
For instance, take my 4-core Haswell. It has relatively low clocks (3.5 GHz) and dual-channel, dual-rank 2133 MHz RAM. I found that running 4 cores would normally be best, but because it is also my daily driver, regular usage steals cycles and it ends up being SLOWER than running with only 3 cores, even though 4 cores is faster in a vacuum.
So the best advice is to test for yourself. Turn off Hyperthreading (if you can) and run a couple of benchmarks. Prime95 has an intuitive tool that helps you work out performance numbers. | |
|
Monkeydee Volunteer tester
Joined: 8 Dec 13 Posts: 440 ID: 284516 Credit: 429,422,985 RAC: 667,707
                       
|
As Rafael says, test with various core counts and work unit counts.
And "fast" here can be described in two ways. One is throughput, or how many units you can do in a set amount of time. The other is the fastest individual unit. They are not always the same, so it is best to test and strike whatever balance you want.
Also, there is no limit on how many cores you can throw at a single task. The more cores, the faster the task, but you might get more tasks done in the same time span by running more than one task at a time.
____________
My Primes
Badge Score: 2*1 + 4*2 + 6*4 + 7*10 + 9*1 + 10*2 = 133
| |
|
mikey
Joined: 17 Mar 09 Posts: 1243 ID: 37043 Credit: 519,835,681 RAC: 129,910
                    
|
My AMD 1920X was doing them using 4 cores per wu, 5 wu at a time in about 14 hours for each wu. I switched to 5 cores per wu, 4 wu at a time and the time is closer to 4 hours for each wu. I have 32gb of ddr4 quad channel ram in the pc. | |
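To put those two setups on the same footing, here is a rough sketch that converts "tasks at once" and "hours per task" into tasks per day. The function name is made up for illustration, and the inputs are just the approximate figures quoted in the post.

```python
def tasks_per_day(concurrent_tasks: int, hours_per_task: float) -> float:
    """Completed tasks per 24 hours when `concurrent_tasks` run side by side."""
    return concurrent_tasks * 24.0 / hours_per_task


# Approximate figures quoted in the post above.
before = tasks_per_day(concurrent_tasks=5, hours_per_task=14.0)  # 4 cores per wu
after = tasks_per_day(concurrent_tasks=4, hours_per_task=4.0)    # 5 cores per wu

print(f"5 wu x 4 cores: {before:.1f} tasks/day")
print(f"4 wu x 5 cores: {after:.1f} tasks/day")
```

With these numbers the second setup wins on both counts: each task finishes sooner and more tasks finish per day.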
|
mackerel Volunteer tester
Joined: 2 Oct 08 Posts: 2468 ID: 29980 Credit: 449,457,152 RAC: 300,933
                           
|
So the best advice is to test for yourself. Turn off Hyperthreading (if you can) and run a couple of benchmarks. Prime95 has an intuitive tool that helps you work out performance numbers.
A rough guide to using Prime95 (assuming the Windows version below):
1, download it, extract it, run it
2, click "just stress testing", then cancel the next window that appears since we're not actually stress testing
3, go to Options > Benchmark window
4, It should be on Throughput benchmark. For SR5, enter 768 under both minimum and maximum FFT size, as that is the current maximum size I've seen in use. For other projects, you'd need to find out one way or another what size they use instead.
5, check "Benchmark all-complex FFTs" - apparently this makes it work in a way more representative of LLR (as commonly used here) although it doesn't seem to make that much difference overall
6, Set "Number of CPU cores to benchmark" to the number of real cores you have. It should detect this by itself.
7, Uncheck "Benchmark hyperthreading" - unless you really want to try it, it doesn't usually help.
8, You may want to edit "Number of workers to benchmark". A worker is what we would call a task, so this is how many tasks to run at once, as a comma separated list. It tries to pick some sensible combinations, but you might want to add more. I'd run all factors of the real core count, including non-prime ones and including 1. For example, if you have a 12 core CPU, I'd run "1, 2, 3, 4, 6, 12". If you have an 8 core CPU, try "1, 2, 4, 8". For a 6 core CPU, try "1, 2, 3, 6"...
9, personally I set "Time to run each benchmark" to 5 seconds which is the minimum it allows. I'd do a couple of repeats in case something else was happening to change the results e.g. background tasks.
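The worker list from step 8 is just the divisors of the real core count, so it can be generated rather than worked out by hand. A minimal sketch, with a hypothetical helper name, that includes the core count itself (one core per worker):

```python
def worker_counts(real_cores: int) -> list[int]:
    """All worker counts that split `real_cores` evenly, from 1 up to the core count."""
    return [n for n in range(1, real_cores + 1) if real_cores % n == 0]


for cores in (6, 8, 12):
    # Prints a comma separated list suitable for the "Number of workers" field.
    print(f"{cores} cores: " + ", ".join(str(n) for n in worker_counts(cores)))
```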
Look at the results for the highest throughput value. This is the combination that will get you the most overall throughput - tasks completed in a given time.
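If you have several runs to compare, picking the best line by eye gets tedious. The sketch below parses benchmark lines in the format shown in this thread and reports the worker count with the highest throughput. The regular expression is an assumption based only on the lines quoted here, so other Prime95 versions may need adjustments. The sample text is copied from the i9-7920X results further down.

```python
import re

# Matches e.g. "(12 cores, 4 workers): 0.93, 0.95 ms. Throughput: 4280.10 iter/sec."
LINE_RE = re.compile(
    r"\((\d+) cores?, (\d+) workers?\):\s*([\d., ]+?) ms\.\s*"
    r"Throughput: ([\d.]+) iter/sec"
)


def parse_benchmark(text: str) -> list[dict]:
    """Extract cores, workers, per-worker timings (ms) and throughput from each line."""
    rows = []
    for m in LINE_RE.finditer(text):
        rows.append({
            "cores": int(m.group(1)),
            "workers": int(m.group(2)),
            "timings_ms": [float(t) for t in m.group(3).split(",")],
            "throughput": float(m.group(4)),
        })
    return rows


sample = """\
Timings for 768K all-complex FFT length (12 cores, 1 worker): 0.54 ms. Throughput: 1835.54 iter/sec.
Timings for 768K all-complex FFT length (12 cores, 2 workers): 0.55, 0.55 ms. Throughput: 3659.43 iter/sec.
Timings for 768K all-complex FFT length (12 cores, 3 workers): 0.73, 0.72, 0.71 ms. Throughput: 4158.77 iter/sec.
Timings for 768K all-complex FFT length (12 cores, 4 workers): 0.93, 0.95, 0.95, 0.91 ms. Throughput: 4280.10 iter/sec.
Timings for 768K all-complex FFT length (12 cores, 6 workers): 1.57, 1.58, 1.58, 1.51, 1.59, 1.58 ms. Throughput: 3825.98 iter/sec.
"""

rows = parse_benchmark(sample)
best = max(rows, key=lambda r: r["throughput"])
print(f"Best throughput: {best['workers']} workers "
      f"({best['cores'] // best['workers']} cores each), "
      f"{best['throughput']:.2f} iter/sec")
```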
Example results for a i9-7920X with turbo disabled:
Timings for 768K all-complex FFT length (12 cores, 1 worker): 0.54 ms. Throughput: 1835.54 iter/sec.
Timings for 768K all-complex FFT length (12 cores, 2 workers): 0.55, 0.55 ms. Throughput: 3659.43 iter/sec.
Timings for 768K all-complex FFT length (12 cores, 3 workers): 0.73, 0.72, 0.71 ms. Throughput: 4158.77 iter/sec.
Timings for 768K all-complex FFT length (12 cores, 4 workers): 0.93, 0.95, 0.95, 0.91 ms. Throughput: 4280.10 iter/sec.
Timings for 768K all-complex FFT length (12 cores, 6 workers): 1.57, 1.58, 1.58, 1.51, 1.59, 1.58 ms. Throughput: 3825.98 iter/sec.
Timings for 768K all-complex FFT length (12 cores, 12 workers): 3.99, 4.01, 4.00, 4.08, 4.05, 4.09, 4.04, 4.02, 4.14, 3.99, 4.03, 4.06 ms. Throughput: 2968.99 iter/sec.
We see that the highest throughput comes from running 4 workers (tasks) at once, implicitly with 3 cores each. But what about the relative speed of each task? This may interest those aiming to be "1st" more often, who are willing to trade some throughput for a shorter time per task. One way to judge it is to look at the timings shown before the throughput. Because this is the time taken for each step of the calculation, lower is better.
It is clear that as we assign more cores to fewer simultaneous tasks, each task gets faster, up to a point. 1 task with 12 cores is barely faster than 1 task with 6 cores, and you can run two of the latter at the same time, so the 12-core option makes no sense to use. Of interest is 3 workers (of 4 cores): its overall throughput is only slightly lower than 4 workers (of 3 cores), but each unit is somewhere over 20% faster. That is the configuration I'm running. 2 workers (of 6 cores) is faster per unit again, but with a bigger hit to overall throughput.
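The trade-off described above can be put into numbers. This is a sketch only: it uses the i9-7920X figures from this post, averages the per-worker timings, and compares each configuration against the one with the highest throughput. The function and key names are made up for illustration.

```python
# workers -> (per-worker timings in ms, throughput in iter/sec), i9-7920X figures from above
results = {
    1: ([0.54], 1835.54),
    2: ([0.55, 0.55], 3659.43),
    3: ([0.73, 0.72, 0.71], 4158.77),
    4: ([0.93, 0.95, 0.95, 0.91], 4280.10),
    6: ([1.57, 1.58, 1.58, 1.51, 1.59, 1.58], 3825.98),
}


def tradeoff(results):
    """Compare every configuration with the highest-throughput one."""
    best_w = max(results, key=lambda w: results[w][1])
    best_timings, best_tp = results[best_w]
    best_ms = sum(best_timings) / len(best_timings)
    table = {}
    for w, (timings, tp) in results.items():
        ms = sum(timings) / len(timings)
        table[w] = {
            # How much total throughput is given up relative to the best setup.
            "throughput_loss_pct": 100.0 * (1.0 - tp / best_tp),
            # How much less time each iteration (and so each task) takes.
            "time_saved_pct": 100.0 * (1.0 - ms / best_ms),
        }
    return best_w, table


best_w, table = tradeoff(results)
for w in sorted(table):
    row = table[w]
    print(f"{w} workers: {row['throughput_loss_pct']:5.1f}% less throughput, "
          f"{row['time_saved_pct']:6.1f}% less time per iteration")
```

For 3 workers this works out to roughly 23% less time per iteration for about 3% less throughput, which matches the "somewhere over 20% faster" estimate.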
Example results for an i7-8086k with turbo disabled:
Timings for 768K all-complex FFT length (6 cores, 1 worker): 0.49 ms. Throughput: 2033.05 iter/sec.
Timings for 768K all-complex FFT length (6 cores, 2 workers): 0.96, 0.97 ms. Throughput: 2070.05 iter/sec.
Timings for 768K all-complex FFT length (6 cores, 3 workers): 2.22, 2.22, 2.23 ms. Throughput: 1349.15 iter/sec.
Timings for 768K all-complex FFT length (6 cores, 6 workers): 5.49, 5.47, 5.50, 5.49, 5.48, 5.48 ms. Throughput: 1094.29 iter/sec.
Here we see that 2 workers has the highest throughput, but 1 worker is only slightly lower and will turn around a task in half the time. So that's what I'm using.
Example result for a Ryzen 7 3700X (stock operation):
Timings for 768K all-complex FFT length (8 cores, 1 worker): 0.43 ms. Throughput: 2319.20 iter/sec.
Timings for 768K all-complex FFT length (8 cores, 2 workers): 0.68, 0.68 ms. Throughput: 2940.09 iter/sec.
Timings for 768K all-complex FFT length (8 cores, 4 workers): 1.34, 1.34, 1.34, 1.33 ms. Throughput: 2987.78 iter/sec.
Timings for 768K all-complex FFT length (8 cores, 8 workers): 6.43, 6.42, 6.41, 6.43, 6.34, 6.30, 6.42, 6.49 ms. Throughput: 1249.39 iter/sec.
Here 4 workers (of 2 cores) has the highest throughput, but 2 workers (of 4 cores) is only slightly behind. So I might as well run 2 tasks of 4 cores, with about half the turnaround time.
The throughput numbers can be used as a way to compare the speeds of different systems.
My AMD 1920X was doing them using 4 cores per wu, 5 wu at a time in about 14 hours for each wu. I switched to 5 cores per wu, 4 wu at a time and the time is closer to 4 hours for each wu. I have 32gb of ddr4 quad channel ram in the pc.
This is a 12 core CPU, so those combinations of tasks/threads are not what I'd have considered. Intuitively, 4 tasks of 3 cores each would be a safe choice, as it keeps each task's data on a single CCX. I hope that when you are running 4 tasks of 5 threads each, your OS is smart enough to keep them on the same CCX. Another interesting combination to try might be 2 tasks of 6 cores each, but as this involves crossing a CCX, memory bandwidth comes into play. Thankfully, at only 6 Zen 1 cores per die, there isn't too much demand on the RAM bandwidth. I'm not sure about running 1 task on all 12 cores, since this particular CPU has NUMA nodes. | |
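If you don't want to rely on the OS scheduler, one option is to pin each task to the cores of one CCX. The sketch below only shows the grouping idea. It assumes cores are numbered contiguously per CCX with 3 cores each, which is an assumption, not something confirmed for any particular system (with SMT enabled the logical CPU numbering is often different, so check your own topology first). `ccx_groups` and `pin_current_process` are hypothetical helpers; `os.sched_setaffinity` is a real call but is only available on Linux.

```python
import os


def ccx_groups(total_cores: int, ccx_size: int) -> list[list[int]]:
    """Split core ids 0..total_cores-1 into contiguous groups of `ccx_size`."""
    if total_cores % ccx_size != 0:
        raise ValueError("core count must be a multiple of the CCX size")
    return [list(range(start, start + ccx_size))
            for start in range(0, total_cores, ccx_size)]


def pin_current_process(cpu_ids: list[int]) -> bool:
    """Restrict this process to `cpu_ids`. Returns False where unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return False  # e.g. Windows or macOS
    os.sched_setaffinity(0, set(cpu_ids))
    return True


# Assumed layout: 12 cores, 3 per CCX, giving one group per 3-core task.
groups = ccx_groups(total_cores=12, ccx_size=3)
for task, cores in enumerate(groups):
    print(f"task {task}: cores {cores}")
```

How the affinity actually gets applied to a BOINC task is not covered here; this only illustrates which cores would belong together.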
|