## Other


Message boards : Generalized Cullen/Woodall prime search : How to speed up (some) GCW units by 10%+

Message 107873 - Posted: 13 May 2017 | 2:09:26 UTC

Nothing of what I am about to write is new. Or difficult.
And it has already been discussed multiple times.

Observe that some of the chosen bases for this particular sub-project happen to be squares.
25 = 5^2, 49 = 7^2, 121 = 11^2. Some people even know that this was a deliberate choice. I've been waiting for a workunit to arrive for one of these bases for a while, and now I have one.

The candidate is 754806*121^754806+1 which can also be easily regrouped as 754806*11^1509612+1.

Let's compare:

/home/serge/NumTheory/GCW> llr -d d.npg
Base prime factor(s) taken : 11
Starting N-1 prime test of 754806*121^754806+1
Using zero-padded AVX FFT length 720K, Pass1=320, Pass2=2304, a = 3
754806*121^754806+1, bit: 70000 / 5222413 [1.34%]. Time per bit: 5.654 ms.

/home/serge/NumTheory/GCW/2> llr -d d2.npg
Base prime factor(s) taken : 11
Starting N-1 prime test of 754806*11^1509612+1
Using zero-padded AVX FFT length 640K, Pass1=640, Pass2=1K, a = 3
754806*11^1509612+1, bit: 40000 / 5222416 [0.76%]. Time per bit: 4.698 ms.

That's 20% faster. (Results are similar for AVX2.)
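As a rough check of the figures above, the per-bit timings can be turned into whole-test estimates. This is only a sketch: it assumes the time per bit stays constant for the entire test, which the log excerpts do not guarantee, and the function name is made up for illustration.

```python
def estimated_hours(total_bits: int, ms_per_bit: float) -> float:
    """Projected run time in hours, assuming a constant time per bit."""
    return total_bits * ms_per_bit / 1000.0 / 3600.0


# Figures copied from the two LLR runs above.
hours_b121 = estimated_hours(5222413, 5.654)  # 754806*121^754806+1
hours_b11 = estimated_hours(5222416, 4.698)   # 754806*11^1509612+1

speed_ratio = 5.654 / 4.698                   # throughput gain of the base-11 form
time_saved = 1 - hours_b11 / hours_b121       # fraction of run time saved

print(f"b=121: {hours_b121:.1f} h, b=11: {hours_b11:.1f} h")
print(f"throughput x{speed_ratio:.3f}, run time reduced by {time_saved:.1%}")
```

Under that assumption, "20% faster" describes the throughput ratio (about 1.20), while the run time itself shrinks by roughly 17%.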

Why is the server sending this workunit task as 754806*121^754806+1 ?
Isn't it trivial, server-side, to fetch the candidate from the database and if/when b=x^2, send it to the client not as
100000000000000:P:1:b:1 n n

but as
100000000000000:P:1:x:1 n 2n

In this particular case:
100000000000000:P:1:11:1 754806 1509612

Too hard to implement?
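For illustration, the rewrite asked for here could look roughly like the sketch below. It assumes the candidate arrives as a single string laid out exactly as in the lines above (colon-separated header with the base in the fourth field, then k and n); the function name is hypothetical, and this is not PrimeGrid's actual server code.

```python
from math import isqrt


def rewrite_square_base(line: str) -> str:
    """If the base in the header is a perfect square x^2, emit base x and exponent 2n."""
    header, k, n = line.split()
    fields = header.split(":")
    b = int(fields[3])              # assumed position of the base in the header
    x = isqrt(b)
    if x * x != b:
        return line                 # not a square: leave the line untouched
    fields[3] = str(x)
    return f"{':'.join(fields)} {k} {2 * int(n)}"


print(rewrite_square_base("100000000000000:P:1:121:1 754806 754806"))
```

Only one square root is taken, so a base such as 16 would come out as 4, not 2; repeating the step would be needed for higher powers.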

Message 107874 - Posted: 13 May 2017 | 2:28:00 UTC

Thanks, Serge.

Do you know why LLR chooses different FFT sizes for the same number? I was not aware it would do that.
____________
My lucky number is 75898524288+1

Message 107876 - Posted: 13 May 2017 | 4:32:55 UTC - in response to Message 107874.

For LLR it is not the number that matters, but the (k,b,n,c) form, and b is taken by it verbatim, as given.

I am not ready to go into a very deep explanation, but I will try to give an approximate one (not to be taken as exact). With b=11, it is possible to form an array of length 640K where each element is a quasi-digit ("limb") in an unusual representation: some digits' weights are perhaps 11^6 and some digits' weights are 11^7, or something like that. (Off the top of my head, what I remember is that each limb on average is limited to holding ~30 bits of information, or something like that.) Only powers of b can be used as limb weights.

In contrast, if the number is entered with b=121, the program can only work with limbs of, say, 121^3 and 121^2. (The larger the b, the fewer opportunities it has to pack.) For that reason it ponders the array of length 640K and thinks, "nah, some elements will be too large; gotta go for the next FFT size", and so it does.

Long story short, using the simplest possible (k,b,n,c) (with b as low as possible) will lend more possibilities for more dense FFT arrays. If not only b=121, but also k is divisible by 11, the FFT size may be even smaller if (k,121,n,c) is transformed into (k/11,11,2*n+1,c).

As an aside, yes, it would be nice if LLR did it all itself, but as the timing test (shown earlier) demonstrates, it doesn't. But here's where we can help LLR and do the transformation externally, server-side. The transformation logic is quite straightforward.
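A minimal sketch of the (k,121,n,c) to (k/11,11,2*n+1,c) step, generalized to any base of the form p^2. The function name is hypothetical, and it assumes k >= 1 and p >= 2; unlike the single division described above, it keeps dividing for as long as p divides k.

```python
def fold_k_into_exponent(k: int, p: int, n: int) -> tuple[int, int, int]:
    """Rewrite k * (p*p)^n as k' * p^n', moving every factor p out of k."""
    n_new = 2 * n                   # (p^2)^n == p^(2n)
    while k % p == 0:
        k //= p                     # each factor p removed from k ...
        n_new += 1                  # ... is added to the exponent
    return k, p, n_new


k2, p2, n2 = fold_k_into_exponent(22, 11, 5)
# Both forms denote the same integer, so the +1 candidate is unchanged.
assert 22 * 121**5 + 1 == k2 * p2**n2 + 1
print(k2, p2, n2)
```

For the workunit discussed in this thread, 754806 is not divisible by 11, so only the exponent doubles.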

Message 107877 - Posted: 13 May 2017 | 4:53:35 UTC - in response to Message 107876.

Got it, thanks for the explanation. That makes sense.
____________
My lucky number is 75898524288+1

Message 107879 - Posted: 13 May 2017 | 7:50:09 UTC

Maybe propose this optimisation as a feature request to go in future LLR if not already done?

Message 107882 - Posted: 13 May 2017 | 9:00:33 UTC - in response to Message 107873.

Observe that some of the chosen bases for this particular sub-project happen to be squares.
25 = 5^2, 49 = 7^2, 121 = 11^2. Some people even know that this was a deliberate choice.

25, 49, 121; that is all the prime squares in the range 13 ≤ b ≤ 121. I wonder if there is some easy reason why n*b^n + 1 is more often composite when b is a perfect square. Does sieving remove a larger fraction, so that the expected occurrence of primes is lower for these b values?

"Deliberate choice"? I thought these b values were chosen simply because they were the smallest b for which no known n with n>b-2 gives a prime n*b^n + 1?

/JeppeSN

Addition: I checked on Steven Harvey's page on GC, and for all prime square b among 121, 169, 289, 361, 529, 841, 961, 1369, 1681, 1849, 2209, 2809, 3481, 3721, 4489, 5041, 5329, 6241, 6889, 7921, 9409, the only time an n is known that satisfies n>b-2 is for b=5041 where:

8398*5041^8398 + 1 = 8398*71^16796 + 1

is a prime.

Message 107885 - Posted: 13 May 2017 | 12:29:07 UTC

Serge, thanks for the optimization tip. It's appreciated!

By the way, we (and by "we", I mean "Jim") have recomputed many of the FFT sizes for the candidates. You don't always get a reduced FFT size when you use the square root of b, but you do for the vast majority of candidates. For the other bases, dividing k by b one or more times hasn't reduced the FFT size once yet.
____________
My lucky number is 75898524288+1

Message 107890 - Posted: 13 May 2017 | 16:15:55 UTC - in response to Message 107885.

It seems that in your client-server setup the ideal place to put the reformatter would be the primegrid_llr_wrapper. It would keep the initial task parameters, reformat them for (c)llr, get the result back from (c)llr, and report back to the server in the form initially requested. Then the database, the server and the accounting code would be unchanged.

primegrid_llr_wrapper for now can do only:
▪ square simplification,
▪ k simplification

Later, it can be extended to recognize b being any power. See here -

Curiously, these numbers may be hard to recognize when written in standard form (emphasis mine).

For example, they may be like
18740*3^168662-1
which could be written
168660*3^168660-1.

More difficult to spot are those like the following:

9750*7^29250-1 = 9750*7^(3*9750)-1 = 9750*343^9750-1
8511*2^374486-1 = (8511*2^2)*2^(11*8511*4)-1 = 34044*2048^34044-1.

This is in fact how the GCWs for 25, 49, 121 will end up showing in UTM lists. (And this is how GW for b=4 looks, indeed.)
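The examples above can be checked mechanically. The following is a sketch only (the function name is made up): it looks for a way to write k*b^n as m*B^m by moving j factors of b from the exponent into the multiplier, so that m = k*b^j and the remaining exponent n - j is a multiple of m.

```python
def as_gcw_form(k: int, b: int, n: int):
    """Try to rewrite k*b^n as m*B^m, where B is a power of b.

    Returns (m, B) on success, or None if no such rewriting is found.
    Assumes k >= 1, b >= 2, n >= 1.
    """
    m, j = k, 0                     # m = k * b^j, with b^j borrowed from b^n
    while m <= n - j:
        if (n - j) % m == 0:
            e = (n - j) // m        # remaining exponent n - j equals e * m
            return m, b ** e
        m *= b
        j += 1
    return None


print(as_gcw_form(18740, 3, 168662))
print(as_gcw_form(9750, 7, 29250))
print(as_gcw_form(8511, 2, 374486))
```

The sign of the trailing +1 or -1 plays no role in the regrouping, so it is left out. The search stops at the first match, and k = 1 always yields the trivial answer m = 1.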

Message 107891 - Posted: 13 May 2017 | 17:06:52 UTC - in response to Message 107890.

It seems that in your client-server setup the ideal place to put the reformatter would be the primegrid_llr_wrapper.

That's not the preferred place for the change, but we're still evaluating options.
____________
My lucky number is 75898524288+1

Message 107907 - Posted: 14 May 2017 | 16:11:42 UTC

I could've sworn that LLR itself does the normalizing of the bases (perhaps only for power of 2?). This feature needs to be in LLR itself, tbh.

1. Normalize b if it is a power.
2. Normalize k if b divides k.

I would guess that it is a trivial change in LLR (except for printing output -- where it is arguably important to use the unnormalized values).
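The two steps could be sketched as follows. This is not LLR's code; the function names are hypothetical, and the floating-point root guess is only assumed to be adequate for small bases like the ones discussed in this thread (k >= 1 and b >= 2 are also assumed).

```python
def smallest_root(b: int) -> tuple[int, int]:
    """Return (r, e) with r**e == b and e as large as possible (e == 1 if b is no perfect power)."""
    for e in range(b.bit_length(), 1, -1):
        guess = round(b ** (1.0 / e))
        # Check the neighbours too, in case the float root is slightly off.
        for r in (guess - 1, guess, guess + 1):
            if r >= 2 and r ** e == b:
                return r, e
    return b, 1


def normalize(k: int, b: int, n: int) -> tuple[int, int, int]:
    """Step 1: replace b by its smallest root. Step 2: move factors of that root out of k."""
    r, e = smallest_root(b)
    n *= e
    while k % r == 0:
        k //= r
        n += 1
    return k, r, n


print(normalize(27, 1024, 10007))
print(normalize(754806, 121, 754806))
```

The unnormalized (k, b, n) would still have to be kept around for printing and reporting, as noted above.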

Message 107909 - Posted: 14 May 2017 | 17:11:18 UTC - in response to Message 107907.

Maybe LLR's philosophy is "the client is always right!"
I.e.: If the input file calls for a test of a specific FFT or a "specific arrangement of bits", then that's what it will run (even if slower, because "this is the test that was ordered").

But it indeed doesn't follow this rule for powers of 2.

-bash-4.2$ llr -d -q"27*1024^10007+1"
Starting Proth prime test of 27*2^100070+1
Using all-complex FMA3 FFT length 10K, Pass1=128, Pass2=80, a = 11
27*2^100070+1 is not prime. Proth RES64: C14E6737D2E78E5E Time : 5.261 sec.

-bash-4.2$ llr -d -q"28*729^10007+1"
Base prime factor(s) taken : 3
Starting N-1 prime test of 28*729^10007+1
Using all-complex FMA3 FFT length 10K, Pass1=128, Pass2=80, a = 3
28*729^10007+1 is not prime. RES64: E83080E955E9B281. OLD64: B89182BC01BD1780 Time : 4.888 sec.

-bash-4.2$ llr -d -q"28*10000^10007+1"
Base factorized as : 2^4*5^4
Base prime factor(s) taken : 5
Starting N-1 prime test of 28*10000^10007+1
Using all-complex FMA3 FFT length 18K, Pass1=384, Pass2=48, a = 3
28*10000^10007+1 is not prime. RES64: 59CCA66A39ED54C4. OLD64: 0D65F33EADC7FE48 Time : 13.645 sec.

(and of course it is fully equipped to normalize the base, as a side effect of factoring the base for the purposes of the N-1 mechanics.)

PFGW does what it is ordered by the input file, too.

Message 107910 - Posted: 14 May 2017 | 17:53:52 UTC - in response to Message 107909.

And GeneFer seems to do different things, not normalizing or de-normalizing:

.\genefer_windows64.exe -q "6^8388608+1"
.\genefer_windows64.exe -q "36^4194304+1"
.\genefer_windows64.exe -q "1296^2097152+1"
.\genefer_windows64.exe -q "1679616^1048576+1"

Even though the first form (where b=6 is not a square) is "canonical" and the one you would expect to see on Top 5000, it is not clear which form would actually be fastest.

Testing 6^8388608+1... 21684224 steps to go (1849:28:44 remaining)

Testing 36^4194304+1... 21684224 steps to go (747:08:06 remaining)

Testing 1296^2097152+1... 21684224 steps to go (367:41:08 remaining)

Testing 1679616^1048576+1... 21684224 steps to go (160:41:45 remaining)
Estimated time remaining for 1679616^1048576+1 is 1716:50:53

(the last one switches to the x87 (80-bit) transform).

/JeppeSN

Message 107912 - Posted: 14 May 2017 | 19:20:37 UTC

Genefer and LLR/PFGW are completely different. Genefer doesn't normalize anything (although I suppose it could.)

With regards to LLR doing some normalizations but not others, does anyone know if that's LLR's code or gwnum's code?
____________
My lucky number is 75898524288+1

Message 107936 - Posted: 16 May 2017 | 5:09:50 UTC

If you let LLR do the normalization, it will be impossible to rerun serge's benchmark comparison. But once is enough to prove a point.

Message 107998 - Posted: 18 May 2017 | 15:39:39 UTC - in response to Message 107907.

I could've sworn that LLR itself does the normalizing of the bases (perhaps only for power of 2?). This feature needs to be in LLR itself, tbh.

This also appears not to be the case. So far I've seen reductions in overall testing time ranging from 9% to 33%, dependent on FFT length and whether I'm on my Sandy Bridge or Haswell. So it appears that LLR is also not normalizing bases that are powers of 2, but in fact still tests k*16^n+/-1 as a base-16 number and not as k*2^(n*4)+/-1, even though the screen shows that k*2^(n*4)+/-1 is being tested.

To sum up, at least on my system, normalizing a test whose base is a power to the smallest possible base can reduce the testing time per k*b^n+/-1 test by up to 33%.

Just my 2 cents, take care :)

Regards

KEP

Message 109348 - Posted: 14 Aug 2017 | 13:29:21 UTC - in response to Message 107873.

The candidate is 754806*121^754806+1 which can also be easily regrouped as 754806*11^1509612+1.

Let's compare:
/home/serge/NumTheory/GCW> llr -d d.npg
Base prime factor(s) taken : 11
Starting N-1 prime test of 754806*121^754806+1
Using zero-padded AVX FFT length 720K, Pass1=320, Pass2=2304, a = 3
754806*121^754806+1, bit: 70000 / 5222413 [1.34%]. Time per bit: 5.654 ms.

/home/serge/NumTheory/GCW/2> llr -d d2.npg
Base prime factor(s) taken : 11
Starting N-1 prime test of 754806*11^1509612+1
Using zero-padded AVX FFT length 640K, Pass1=640, Pass2=1K, a = 3
754806*11^1509612+1, bit: 40000 / 5222416 [0.76%]. Time per bit: 4.698 ms.

That's 20% faster. (Results are similar for AVX2.)

Why is the server sending this workunit task as 754806*121^754806+1 ?
Isn't it trivial[?]

You would be surprised at how utterly non-trivial it turned out to be. But it is done. Thanks for pushing us along in the right direction.
____________
My lucky number is 75898524288+1

Message 109353 - Posted: 14 Aug 2017 | 20:08:30 UTC

So, 3 of 14 bases will be 20% faster?
About 4% overall speed-up for GCW LLR?
____________
My stats
Badge score: 1*1 + 5*1 + 8*3 + 9*11 + 10*1 + 11*1 + 12*3 = 186

Message 109355 - Posted: 14 Aug 2017 | 20:52:19 UTC - in response to Message 109353.

So, 3 of 14 bases will be 20% faster?
About 4% overall speed-up for GCW LLR?

Something like that, yes.
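The back-of-the-envelope figure can be reproduced as below, under the assumption (not stated in the thread) that each of the 14 bases accounts for an equal share of the work.

```python
bases_total = 14        # bases in the sub-project, per the post above
bases_square = 3        # 25, 49 and 121
gain_on_square = 0.20   # speed-up quoted for the square bases

# Equal weighting of bases is an assumption made only for this estimate.
overall = bases_square / bases_total * gain_on_square
print(f"overall speed-up: {overall:.1%}")
```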
____________
My lucky number is 75898524288+1

Message 109381 - Posted: 16 Aug 2017 | 7:53:38 UTC - in response to Message 109348.

You would be surprised at how utterly non-trivial it turned out to be. But it is done. Thanks for pushing us along in the right direction.

Is the base the only thing normalized or do you normalize k as well (the latter is applicable for all the bases, not just the square ones)?

Message 109382 - Posted: 16 Aug 2017 | 11:02:29 UTC - in response to Message 109381.

You would be surprised at how utterly non-trivial it turned out to be. But it is done. Thanks for pushing us along in the right direction.

Is the base the only thing normalized or do you normalize k as well (the latter is applicable for all the bases, not just the square ones)?

Just the base. In our tests there was no advantage to normalizing k.
____________
My lucky number is 75898524288+1

Message 109387 - Posted: 16 Aug 2017 | 14:39:55 UTC - in response to Message 109382.

Just the base. In our tests there was no advantage to normalizing k.

Hmmm... That was ... unexpected! Can you give me the set of (n,b) numbers used to test this? I am assuming that you used LLR's setup feature to get the FFTs?

Message 109390 - Posted: 16 Aug 2017 | 23:46:46 UTC

Speaking as the person who made the code changes, we are in fact reducing k for all bases. I was supposed to remove that code, but chose to leave it in. I neglected to tell Mike about it until now. My real life is a bit busy at the moment, so sometimes I'm forgetting things like that.

while ($k % $b == 0) { $k /= $b; $n++; }

Message 109403 - Posted: 18 Aug 2017 | 2:54:20 UTC

Maximizing the return the challenge will have. Excellent!
____________

Message 109405 - Posted: 18 Aug 2017 | 4:35:18 UTC - in response to Message 109390.

Speaking as the person who made the code changes, we are in fact reducing k for all bases. I was supposed to remove that code, but chose to leave it in. I neglected to tell Mike about it until now. My real life is a bit busy at the moment, so sometimes I'm forgetting things like that.

while ($k % $b == 0) { $k /= $b; $n++; }

LOL! Well it doesn't hurt. But I replicated the result, and Mike's right -- there is no need to normalize the k, since apparently LLR (or perhaps gwnum library) is doing it. I can see that when k is a multiple of base, it chooses a lower FFT (compared to adjacent k's), even without explicit normalizing.
Sorry about that -- I should've done my homework before posting about it.

Message 109411 - Posted: 18 Aug 2017 | 23:22:01 UTC - in response to Message 109405.

Hmm, a process akin to normalization could be responsible for the WTF effect, which is using a timing sidechannel during sieving to "discover" small primes in the blocking factor. So far there is no other explanation for that weirdness.

Message 109548 - Posted: 24 Aug 2017 | 7:31:14 UTC

While the discussed and implemented trick with b=25,49,121 brings about a 4% speed-up, the recently found prime makes GCW yet another 7% faster on top of that. Nice!
____________
My stats
Badge score: 1*1 + 5*1 + 8*3 + 9*11 + 10*1 + 11*1 + 12*3 = 186
