Message boards :
Seventeen or Bust :
AMD 1950X - SOB WU's
Chooka
Joined: 15 May 18 Posts: 335 ID: 1014486 Credit: 1,312,804,386 RAC: 3,638,274
|
Before I go off and do a whole bunch of testing which could be for nothing, is anyone running a 16-core processor who has already completed Michael's testing to find the best thread/task combination?
Someone may have done all the hard work and can say that 4*4 works best.
I'm not up for reinventing the wheel :D
The current TDP challenge reignites my liking of PG :D Especially now that I have the settings worked out (Sun, Mercury etc)
____________
Слава Україні! | |
|
Monkeydee Volunteer tester
Joined: 8 Dec 13 Posts: 548 ID: 284516 Credit: 1,722,200,477 RAC: 3,336,716
|
Someone with a 1950X specifically would have to chime in here.
No two 16-core CPUs are alike. Different cache sizes, different architectures, etc. all affect where the "optimal" performance settings are.
My advice is to test and see where the throughput starts to drop off.
1x16 threads
2x8 threads
etc
Once you hit a point of lower throughput you can stop testing.
____________
My Primes
Badge Score: 4*2 + 6*2 + 7*1 + 8*11 + 9*1 + 11*3 + 12*1 = 169
| |
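The testing order suggested above (1x16, 2x8, and so on, stopping once throughput drops) can be sketched roughly as follows. This is only an illustration: `splits`, `search` and `lookup` are made-up names, and instead of a real measurement the sketch reads the 3200K iter/sec figures that Chooka posts for the 1950X later in this thread.

```python
from typing import Callable, List, Optional, Tuple


def splits(cores: int) -> List[Tuple[int, int]]:
    """Every (tasks, threads_per_task) pair that uses each core exactly once."""
    return [(t, cores // t) for t in range(1, cores + 1) if cores % t == 0]


def search(cores: int, measure: Callable[[int, int], float]) -> Optional[Tuple[int, int]]:
    """Walk 1xN, 2x(N/2), ... and stop at the first drop in throughput."""
    best, best_rate = None, float("-inf")
    for tasks, threads in splits(cores):
        rate = measure(tasks, threads)
        if rate < best_rate:
            break  # throughput started to fall, no need to test further
        best, best_rate = (tasks, threads), rate
    return best


# Stand-in for a real measurement (assumption for this sketch): the 3200K
# iter/sec figures reported later in this thread, keyed by number of tasks.
REPORTED = {1: 457.20, 2: 469.28, 4: 436.33, 8: 420.13, 16: 413.68}


def lookup(tasks: int, threads: int) -> float:
    return REPORTED[tasks]


print(splits(16))          # [(1, 16), (2, 8), (4, 4), (8, 2), (16, 1)]
print(search(16, lookup))  # (2, 8)
```

Stopping at the first drop saves testing time, but it would miss a later peak if throughput dipped and then recovered, so on an unfamiliar CPU it may be worth measuring every split once.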
|
Chooka
Joined: 15 May 18 Posts: 335 ID: 1014486 Credit: 1,312,804,386 RAC: 3,638,274
|
Thanks Keith.
Let's see if someone comes along. SoB is a pretty specific thread.
____________
Слава Україні! | |
|
mackerel Volunteer tester
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 5,621
|
One possibly easier method to test it, other than to manually run each configuration, is to use the Prime95 built in benchmark.
Download the latest version of Prime95 and go to the benchmark tab. For the min and max FFT size, enter the FFT size of a SoB unit. You'll have to find that elsewhere, although from a quick poke, 3072K is one that was used recently. Hopefully it'll detect 16 cores. You can uncheck "use hyperthreading". I think it is more correct in this case to check "use complex FFTs", but it doesn't make much difference. The "number of workers to benchmark" setting is how many simultaneous tasks to run; it'll divide the cores between them. I think 1, 2, 4, 8, 16 are the meaningful options here, so that would run 1 task with 16 cores, 2 tasks with 8 cores each, and so on. You want the configuration with the highest reported throughput.
Example output for a quad core running 1, 2, 4 workers:
Timings for 3072K all-complex FFT length (4 cores, 1 worker): 3.13 ms. Throughput: 318.98 iter/sec.
Timings for 3072K all-complex FFT length (4 cores, 2 workers): 6.27, 6.22 ms. Throughput: 320.33 iter/sec.
Timings for 3072K all-complex FFT length (4 cores, 4 workers): 12.52, 12.39, 12.38, 12.41 ms. Throughput: 321.89 iter/sec.
It seems 4 workers (4 tasks using 1 core each) is fastest, but only about 1% faster than 1 task using 4 cores. So in this case I'd use the latter: the difference isn't significant, and returning units faster is more helpful than 1% extra possible throughput.
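For anyone who would rather not eyeball the numbers, here is a small sketch that reads benchmark lines in the format quoted above and compares them. The regular expression and the names `LINE`, `SAMPLE` and `parse` are assumptions made for this example, not part of Prime95. It also checks one plausible reading of the throughput figure: the sum over workers of 1000 / (ms per iteration), which matches the reported values to within rounding.

```python
import re

# Pattern for result lines in the format quoted above (assumed, not official).
LINE = re.compile(
    r"Timings for (?P<fft>\d+)K .*?\((?P<cores>\d+) cores?, (?P<workers>\d+) workers?\): "
    r"(?P<times>[\d., ]+) ms\.\s+Throughput: (?P<rate>[\d.]+) iter/sec"
)

SAMPLE = """\
Timings for 3072K all-complex FFT length (4 cores, 1 worker): 3.13 ms. Throughput: 318.98 iter/sec.
Timings for 3072K all-complex FFT length (4 cores, 2 workers): 6.27, 6.22 ms. Throughput: 320.33 iter/sec.
Timings for 3072K all-complex FFT length (4 cores, 4 workers): 12.52, 12.39, 12.38, 12.41 ms. Throughput: 321.89 iter/sec.
"""


def parse(text):
    """Return one dict per benchmark line found in `text`."""
    rows = []
    for m in LINE.finditer(text):
        times = [float(t) for t in m.group("times").split(",")]
        rows.append({
            "workers": int(m.group("workers")),
            "times_ms": times,
            "reported": float(m.group("rate")),
            # Each worker completes 1000 / ms iterations per second.
            "estimate": sum(1000.0 / t for t in times),
        })
    return rows


rows = parse(SAMPLE)
best = max(rows, key=lambda r: r["reported"])
single = next(r for r in rows if r["workers"] == 1)
gain_pct = (best["reported"] / single["reported"] - 1) * 100

for r in rows:
    print(f'{r["workers"]:>2} workers: reported {r["reported"]:.2f}, '
          f'estimate {r["estimate"]:.2f} iter/sec')
print(f'best: {best["workers"]} workers, {gain_pct:.2f}% above 1 worker')
```

On the quad-core sample this picks 4 workers with a gain of a little under 1% over 1 worker, which is the "about 1%" figure mentioned above.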
|
Chooka
Joined: 15 May 18 Posts: 335 ID: 1014486 Credit: 1,312,804,386 RAC: 3,638,274
|
Hi Mackerel,
I checked and saw a max FFT of 3200K.
Should I tick the "Run FFT's in place" checkbox?
Also, how long should I run the test for (the "Time to run each FFT size" setting)?
Thank you
____________
Слава Україні! | |
|
Azmodes Volunteer tester
Joined: 30 Dec 16 Posts: 184 ID: 479275 Credit: 2,203,435,344 RAC: 1,751
|
http://www.primegrid.com/forum_thread.php?id=8266
EDIT: It may be noteworthy that back then I did not set affinities to physical cores only. May be worth another try.
____________
Long live the sievers.
+ Encyclopaedia Metallum: The Metal Archives + | |
|
mackerel Volunteer tester
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 5,621
|
That sounds like the stress test window; cancel that. The benchmark option is in the menu somewhere.
|
|
It's under Options, right under torture test
____________
My lucky number is 6219*2^3374198+1
| |
|
Chooka
Joined: 15 May 18 Posts: 335 ID: 1014486 Credit: 1,312,804,386 RAC: 3,638,274
|
Oops. Thank you. And thank you Azmodes.
Can you please explain these results?
http://www.primegrid.com/forum_thread.php?id=8266&nowrap=true#123869
I understand the 1 w/u across 16 cores etc right down to 16 wu's with 1 core each but which is the most productive?
Somewhere in the middle I guess.
____________
Слава Україні! | |
|
Chooka
Joined: 15 May 18 Posts: 335 ID: 1014486 Credit: 1,312,804,386 RAC: 3,638,274
|
[Feb 7 23:20] Benchmarking multiple workers to measure the impact of memory bandwidth
[Feb 7 23:20] Timing 16K all-complex FFT, 16 cores, 1 worker. Average times: 0.14 ms. Total throughput: 7306.25 iter/sec.
[Feb 7 23:21] Timing 16K all-complex FFT, 16 cores, 4 workers. Average times: 0.05, 0.06, 0.05, 0.06 ms. Total throughput: 75110.17 iter/sec.
[Feb 7 23:22] Timing 16K all-complex FFT, 16 cores, 8 workers. Average times: 0.06, 0.06, 0.08, 0.07, 0.06, 0.06, 0.06, 0.06 ms. Total throughput: 122487.72 iter/sec.
[Feb 7 23:23] Timing 16K all-complex FFT, 16 cores, 16 workers. Average times: 0.09, 0.09, 0.09, 0.09, 0.09, 0.10, 0.09, 0.10, 0.10, 0.10, 0.09, 0.09, 0.10, 0.10, 0.09, 0.10 ms. Tot
[Feb 7 23:24] Timing 18K all-complex FFT, 16 cores, 1 worker. Average times: 0.15 ms. Total throughput: 6796.79 iter/sec.
[Feb 7 23:25] Timing 18K all-complex FFT, 16 cores, 4 workers. Average times: 0.06, 0.07, 0.07, 0.06 ms. Total throughput: 59685.85 iter/sec.
[Feb 7 23:26] Timing 18K all-complex FFT, 16 cores, 8 workers. Average times: 0.07, 0.09, 0.08, 0.08, 0.08, 0.10, 0.08, 0.08 ms. Total throughput: 97323.04 iter/sec.
[Feb 7 23:27] Timing 18K all-complex FFT, 16 cores, 16 workers. Average times: 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.12, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11 ms. Tot
stopped.
So, the answer is here but I don't know what I'm looking at lol.
Which one is best?
Thanks guys
____________
Слава Україні!
| |
|
|
[Feb 7 23:20] Benchmarking multiple workers to measure the impact of memory bandwidth
[Feb 7 23:20] Timing 16K all-complex FFT, 16 cores, 1 worker. Average times: 0.14 ms. Total throughput: 7306.25 iter/sec.
[Feb 7 23:21] Timing 16K all-complex FFT, 16 cores, 4 workers. Average times: 0.05, 0.06, 0.05, 0.06 ms. Total throughput: 75110.17 iter/sec.
[Feb 7 23:22] Timing 16K all-complex FFT, 16 cores, 8 workers. Average times: 0.06, 0.06, 0.08, 0.07, 0.06, 0.06, 0.06, 0.06 ms. Total throughput: 122487.72 iter/sec.
[Feb 7 23:23] Timing 16K all-complex FFT, 16 cores, 16 workers. Average times: 0.09, 0.09, 0.09, 0.09, 0.09, 0.10, 0.09, 0.10, 0.10, 0.10, 0.09, 0.09, 0.10, 0.10, 0.09, 0.10 ms. Tot
[Feb 7 23:24] Timing 18K all-complex FFT, 16 cores, 1 worker. Average times: 0.15 ms. Total throughput: 6796.79 iter/sec.
[Feb 7 23:25] Timing 18K all-complex FFT, 16 cores, 4 workers. Average times: 0.06, 0.07, 0.07, 0.06 ms. Total throughput: 59685.85 iter/sec.
[Feb 7 23:26] Timing 18K all-complex FFT, 16 cores, 8 workers. Average times: 0.07, 0.09, 0.08, 0.08, 0.08, 0.10, 0.08, 0.08 ms. Total throughput: 97323.04 iter/sec
[Feb 7 23:27] Timing 18K all-complex FFT, 16 cores, 16 workers. Average times: 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.12, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11 ms. Tot
stopped.
It seems you didn't paste your results correctly with the bolded parts. From your current results, the underlined lines may be the best throughput option.
____________
My lucky number is 6219*2^3374198+1
| |
|
Azmodes Volunteer tester
Joined: 30 Dec 16 Posts: 184 ID: 479275 Credit: 2,203,435,344 RAC: 1,751
|
Oops. Thank you. And thank you Azmodes.
Can you please explain these results?
http://www.primegrid.com/forum_thread.php?id=8266&nowrap=true#123869
I understand the 1 w/u across 16 cores etc right down to 16 wu's with 1 core each but which is the most productive?
Somewhere in the middle I guess.
Well, no. Check the tasks per day column. The issue with my findings was that multithreading made individual tasks faster but always hurt throughput, i.e. 16 single-core tasks were always the most productive.
Moi wrote: EDIT: It may be noteworthy that back then I did not set affinities to physical cores only. May be worth another try.
Although I just noticed that I tried it with SMT turned off too and got the same results, so I doubt core affinities with it on and 50% usage would make any difference.
____________
Long live the sievers.
+ Encyclopaedia Metallum: The Metal Archives + | |
|
Chooka
Joined: 15 May 18 Posts: 335 ID: 1014486 Credit: 1,312,804,386 RAC: 3,638,274
|
Whoops.
Timings for 16K all-complex FFT length (16 cores, 1 worker): 0.14 ms. Throughput: 7306.25 iter/sec.
Timings for 16K all-complex FFT length (16 cores, 4 workers): 0.05, 0.06, 0.05, 0.06 ms. Throughput: 75110.17 iter/sec.
Timings for 16K all-complex FFT length (16 cores, 8 workers): 0.06, 0.06, 0.08, 0.07, 0.06, 0.06, 0.06, 0.06 ms. Throughput: 122487.72 iter/sec.
Timings for 16K all-complex FFT length (16 cores, 16 workers): 0.09, 0.09, 0.09, 0.09, 0.09, 0.10, 0.09, 0.10, 0.10, 0.10, 0.09, 0.09, 0.10, 0.10, 0.09, 0.10 ms. Throughput: 170700.31 iter/sec.
[Fri Feb 07 23:25:19 2020]
Timings for 18K all-complex FFT length (16 cores, 1 worker): 0.15 ms. Throughput: 6796.79 iter/sec.
Timings for 18K all-complex FFT length (16 cores, 4 workers): 0.06, 0.07, 0.07, 0.06 ms. Throughput: 59685.85 iter/sec.
Timings for 18K all-complex FFT length (16 cores, 8 workers): 0.07, 0.09, 0.08, 0.08, 0.08, 0.10, 0.08, 0.08 ms. Throughput: 97323.04 iter/sec.
Timings for 18K all-complex FFT length (16 cores, 16 workers): 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.12, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11 ms. Throughput: 142896.52 iter/sec.
This is always just a trade-off, isn't it? 1 core per WU might have a larger throughput number, but each result will be slower to complete, and therefore fewer 1sts?
____________
Слава Україні! | |
|
mackerel Volunteer tester
Joined: 2 Oct 08 Posts: 2652 ID: 29980 Credit: 570,442,335 RAC: 5,621
|
Whoops.
Those results aren't helpful for SoB; you forgot to change the FFT size it was testing. 16K and 18K are really tiny. Set it to 3200K for min and max, and see what results that gives. You might also add 2 workers to the list of workers it tests, so it would be 1, 2, 4, 8, 16.
What you are looking for from a productivity perspective is higher iter/sec. However, if you also want to be "first", you might sacrifice some throughput for faster units by using a higher number of cores per worker/task.
|
Chooka
Joined: 15 May 18 Posts: 335 ID: 1014486 Credit: 1,312,804,386 RAC: 3,638,274
|
Like this?
Prime95 64-bit version 29.8, RdtscTiming=1
Timings for 3200K all-complex FFT length (16 cores, 1 worker): 2.19 ms. Throughput: 457.20 iter/sec.
Timings for 3200K all-complex FFT length (16 cores, 2 workers): 4.15, 4.38 ms. Throughput: 469.28 iter/sec.
Timings for 3200K all-complex FFT length (16 cores, 4 workers): 8.90, 8.81, 9.51, 9.50 ms. Throughput: 436.33 iter/sec.
Timings for 3200K all-complex FFT length (16 cores, 8 workers): 18.14, 18.51, 18.54, 18.02, 19.92, 19.99, 19.61, 19.88 ms. Throughput: 420.13 iter/sec.
Timings for 3200K all-complex FFT length (16 cores, 16 workers): 36.61, 39.55, 36.76, 37.01, 37.72, 37.72, 36.99, 37.00, 40.83, 40.14, 41.33, 39.23, 39.75, 39.64, 39.85, 39.68 ms. Throughput: 413.68 iter/sec.
So it looks to me like 2 WUs split across 16 cores is optimal?
Some tweaking may be required if I want more 1sts (as I've done with my 6-core CPUs for the TDP challenge: 3*2 didn't give me enough 1sts for my liking, so I changed to 2*3).
____________
Слава Україні! | |
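The throughput versus "firsts" trade-off in the 3200K results above can be put into numbers with a short sketch. It assumes every task needs the same number of iterations, so per-iteration time stands in for time per task; `RESULTS` and `pick` are names invented here, and the 3% tolerance is only an illustrative value, not a recommendation from the thread.

```python
# (workers, slowest per-iteration time in ms, throughput in iter/sec),
# copied from the 3200K results posted above.
RESULTS = [
    (1, 2.19, 457.20),
    (2, 4.38, 469.28),
    (4, 9.51, 436.33),
    (8, 19.99, 420.13),
    (16, 41.33, 413.68),
]


def pick(results, tolerance):
    """Fastest-per-task configuration whose throughput is within
    `tolerance` (a fraction, e.g. 0.03) of the best throughput."""
    best_rate = max(rate for _, _, rate in results)
    close = [r for r in results if r[2] >= best_rate * (1 - tolerance)]
    return min(close, key=lambda r: r[1])


base_ms = RESULTS[0][1]
best_rate = max(r[2] for r in RESULTS)
for workers, ms, rate in RESULTS:
    print(f"{workers:>2} workers: {rate / best_rate:6.1%} of best throughput, "
          f"{ms / base_ms:4.1f}x the per-iteration time of 1 worker")

print(pick(RESULTS, 0.00)[0])  # pure throughput: 2 workers
print(pick(RESULTS, 0.03)[0])  # accept up to 3% less throughput: 1 worker
```

On these figures, 2 workers deliver about 2.6% more throughput than 1 worker, but each task takes roughly twice as long, so anyone chasing 1sts might reasonably give up that 2.6%.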
|
|
I think so, try it with actual SoB WU's after TdP.
____________
My lucky number is 6219*2^3374198+1
| |
|
|
I've just got the 16-core (32-thread) AMD 1950X and am wondering the same. I've not really done any testing, other than briefly running 1 x 32 threads, 2 x 16 threads and 2 x 8 threads to see how my temps looked. 2 x 8 threads ran the hottest (~68C), so I'm taking that configuration to be doing the most computation and am using it for now. I've only just built my PC and have yet to install decent RAM and do the XMP thing. Since the CPU has two chiplets, I suppose either 4 x 4 threads or 2 x 8 threads would be the best bet?
|
Chooka
Joined: 15 May 18 Posts: 335 ID: 1014486 Credit: 1,312,804,386 RAC: 3,638,274
|
I’m thinking the same. I’ll need to do some testing but I think 2 * 8 will be the go. 4*4 will work but I’m still looking for a few 1sts and I don’t think I’ll get many with 4*4.
You suffer the same fate as me... thermal throttling at 68 degrees. My system never stays below 67.
____________
Слава Україні! | |
|
|
As I don't have my machine on all the time, I'm going to stick with 2x8 so tasks finish as quickly as reasonably possible while hopefully still running efficiently, and I should get a few firsts too :)
I've not overclocked anything. While SOB is running, the CPU runs at 4000MHz (40x100MHz); my board is an X570 Aorus Master.
|
Monkeydee Volunteer tester
Joined: 8 Dec 13 Posts: 548 ID: 284516 Credit: 1,722,200,477 RAC: 3,336,716
|
gratrix, you have a 3950X as opposed to Chooka's 1950X. Both are 16-core/32-thread CPUs, but due to architectural differences they will have vastly different results.
For SoB on the 3950X I would recommend trying one 16-threaded task.
gratrix, you mentioned you did 1x32 and 2x16. That uses all the threads, which may or may not have a performance impact on these AMD chips. So give 1x16 a try and see what happens.
I have a 3900X (12-core/24-thread) and found that the best throughput was one 12-threaded task instead of two 6-threaded tasks.
Otherwise it would be best to see the other posts earlier in the thread for how you can test your 3950X for best throughput or best speed per task.
____________
My Primes
Badge Score: 4*2 + 6*2 + 7*1 + 8*11 + 9*1 + 11*3 + 12*1 = 169
| |
|
|
Indeed I do have a 3950X :) Sorry if I caused any confusion. The best configuration for my own CPU was on my mind when I saw the thread pop up, and I failed to read the title properly.
Thanks for the advice, I will look into it.
I'll be quiet now :)
|