Calibration of Algorithm Thresholds

This document describes the approach to calibration of algorithmic thresholds in math/big, implemented in calibrate_test.go.

Basic operations like multiplication and division have many possible implementations. Most algorithms that are better asymptotically have overheads that make them run slower for small inputs. When presented with an operation to run, math/big must decide which algorithm to use.

For example, for small inputs, multiplication using the “grade school algorithm” is fastest. Given multi-digit x, y and a target z: clear z, and then for each digit y[i], z[i:] += x*y[i]. That last operation, adding a vector times a digit to another vector (including carrying up the vector during the multiplication and addition), can be implemented in a tight assembly loop. The overall speed is O(N**2) where N is the number of digits in x and y (assume they match), but the tight inner loop performs well for small inputs.
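
As a concrete (if unoptimized) sketch, here is that loop in plain Go on base-2^64 digit slices, least significant digit first. The real math/big code operates on its internal Word slices and keeps the inner add-multiply loop in assembly, so this version is only illustrative:

    package sketch

    import "math/bits"

    // basicMul sets z = x*y with the grade school loop described above:
    // clear z, then for each digit y[i], z[i:] += x*y[i], carrying as it goes.
    // Digits are base-2^64, least significant first; z must have length
    // len(x)+len(y). Illustrative only, not the math/big implementation.
    func basicMul(z, x, y []uint64) {
        for i := range z {
            z[i] = 0
        }
        for i, d := range y {
            if d == 0 {
                continue
            }
            var carry uint64
            for j, xj := range x {
                // z[i+j] += xj*d + carry, using the full 128-bit product.
                hi, lo := bits.Mul64(xj, d)
                s, c1 := bits.Add64(z[i+j], lo, 0)
                s, c2 := bits.Add64(s, carry, 0)
                z[i+j] = s
                carry = hi + c1 + c2
            }
            z[i+len(x)] = carry // the word above this row is still zero
        }
    }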

Karatsuba’s algorithm multiplies two N-digit numbers by splitting them in half, computing three N/2-digit products, and then reconstructing the final product using a few more additions and subtractions. It runs in O(N**log₂ 3) ≈ O(N**1.58) time. The grade school loop runs faster for small inputs, but eventually Karatsuba’s smaller asymptotic run time wins.

The multiplication implementation must decide which to use. Under the assumption that once Karatsuba is faster for some N, it will remain faster for all larger N as well, the rule is to use Karatsuba’s algorithm when the input length N ≥ karatsubaThreshold.
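
Here is a sketch of both the recurrence and the threshold rule, written on big.Int values so it stays self-contained. It is not the math/big implementation, which recurses on digit slices and measures N in words; the thresholdBits parameter below is a stand-in measured in bits:

    package sketch

    import "math/big"

    // karatsubaMul returns x*y for x, y >= 0 using Karatsuba's recurrence:
    // split x = x1<<m + x0 and y = y1<<m + y0, form three half-sized products,
    // and recombine. Inputs below the threshold use a plain multiply, mirroring
    // the "use Karatsuba when N >= karatsubaThreshold" rule described above.
    func karatsubaMul(x, y *big.Int, thresholdBits int) *big.Int {
        n := x.BitLen()
        if y.BitLen() > n {
            n = y.BitLen()
        }
        if n < thresholdBits {
            return new(big.Int).Mul(x, y) // base case: quadratic multiply
        }
        m := uint(n / 2)
        mask := new(big.Int).Sub(new(big.Int).Lsh(big.NewInt(1), m), big.NewInt(1))
        x0, x1 := new(big.Int).And(x, mask), new(big.Int).Rsh(x, m)
        y0, y1 := new(big.Int).And(y, mask), new(big.Int).Rsh(y, m)

        // Three half-sized products.
        p0 := karatsubaMul(x0, y0, thresholdBits)
        p2 := karatsubaMul(x1, y1, thresholdBits)
        p1 := karatsubaMul(new(big.Int).Add(x0, x1), new(big.Int).Add(y0, y1), thresholdBits)

        // x*y = p2<<(2m) + (p1-p0-p2)<<m + p0.
        mid := new(big.Int).Sub(p1, p0)
        mid.Sub(mid, p2)
        z := new(big.Int).Lsh(p2, 2*m)
        z.Add(z, new(big.Int).Lsh(mid, m))
        return z.Add(z, p0)
    }

Any realistic threshold (tens of words, so thousands of bits) makes the recursion terminate; the interesting question is exactly where to put it.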

Calibration is the process of determining what karatsubaThreshold should be set to. It doesn’t sound like it should be that hard, but it is: the fastest threshold differs from chip to chip, different input sizes favor different thresholds, and the thresholds depend on one another.

The best we can do is measure the performance of the overall multiplication algorithm across a variety of inputs and thresholds and look for a threshold that balances all these concerns reasonably well, setting thresholds in dependency order (for example, multiplication before division).

The code in calibrate_test.go does this measurement of a variety of input sizes and threshold values and prints the timing results as a CSV file. The code in calibrate_graph.go reads the CSV and writes out an SVG file plotting the data. For example:

    go test -run=Calibrate/KaratsubaMul -timeout=1h -calibrate >kmul.csv
    go run calibrate_graph.go kmul.csv >kmul.svg

Any particular input is sensitive to only a few transitions in threshold. For example, an input of size 320 recurses on inputs of size 160, which recurses on inputs of size 80, which recurses on inputs of size 40, and so on, until falling below the Karatsuba threshold. Here is what the timing looks like for an input of size 320, normalized so that 1.0 is the fastest timing observed:

[Figure: KaratsubaThreshold on an Apple M3 Pro, N=320 only]

For this input, all thresholds from 21 to 40 perform optimally and identically: they all mean “recurse at N=40 but not at N=20”. From the single input of size N=320, we cannot decide which of these 20 thresholds is best.
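
A tiny helper makes the equivalence concrete. Assuming the rule “recurse while N ≥ threshold, splitting in half” (the helper below is illustrative, not part of calibrate_test.go):

    package sketch

    // karatsubaSizes lists the sizes at which a Karatsuba-style recursion runs,
    // halving the input until it falls below the threshold.
    func karatsubaSizes(n, threshold int) []int {
        var sizes []int
        for n >= threshold {
            sizes = append(sizes, n)
            n /= 2
        }
        return sizes
    }

karatsubaSizes(320, t) returns [320 160 80 40] for every t from 21 to 40: each of those thresholds recurses at N=40 and stops at N=20, so this input cannot tell them apart.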

Other inputs exercise other decision points. For example, here is the timing for N=240:

[Figure: KaratsubaThreshold on an Apple M3 Pro, N=240 only]

In this case, all the thresholds from 31 to 60 perform optimally and identically, recursing at N=60 but not N=30.

If we combine these two into a single graph and then plot the geometric mean of the two lines in blue, the optimal range becomes a little clearer:

[Figure: KaratsubaThreshold on an Apple M3 Pro]
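
The blue line is the geometric mean of the normalized timings at each candidate threshold. A minimal sketch of that statistic (the actual combining and plotting happen in the calibration tooling; this function is only illustrative):

    package sketch

    import "math"

    // geomean returns the geometric mean of a slice of normalized timings,
    // the statistic used to combine the per-input-size curves into one line.
    func geomean(ratios []float64) float64 {
        sum := 0.0
        for _, r := range ratios {
            sum += math.Log(r)
        }
        return math.Exp(sum / float64(len(ratios)))
    }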

The actual calibration runs all possible inputs from size N=200 to N=400, in increments of 8, plotting all 26 lines in a faded gray (note the changed y-axis scale, zooming in near 1.0).

[Figure: KaratsubaThreshold on an Apple M3 Pro]

Now the optimal value is clear: the best threshold on this chip, with these algorithmic implementations, is 40.

Unfortunately, other chips are different. Here is an Intel Xeon server chip:

[Figure: KaratsubaThreshold on an Intel Xeon server chip]

On this chip, the best threshold is closer to 60. Luckily, 40 is not a terrible choice either: it is only about 2% slower on average.

The rest of this document presents the timings measured for the math/big thresholds on a variety of machines and justifies the final thresholds. The machines are identified in the graph titles below.

In general, we break ties in favor of the newer c3h88 x86 perf gomote, then the c4as16 arm64 perf gomote, and then the others.

Karatsuba Multiplication

Here are the full results for the Karatsuba multiplication threshold.

[Figures: KaratsubaThreshold on an Intel Xeon Platinum 8481C, an Intel Xeon Gold 6253CL, an AMD Ryzen 9 7950X, an Axiom Arm, and an Apple M3 Pro]

The majority of systems have optimum thresholds near 40, so we chose karatsubaThreshold = 40.

Basic Squaring

For squaring a number (z.Mul(x, x)), math/big uses grade school multiplication up to basicSqrThreshold, where it switches to a customized algorithm that is still quadratic but avoids half the word-by-word multiplies since the two arguments are identical. That algorithm’s inner loops are not as tight as the grade school multiplication, so it is slower for small inputs. How small?
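
The saving comes from the identity that squaring needs each cross product x[i]·x[j] with i < j only once (doubled), plus the diagonal squares. Here is an illustrative sketch on base-2^64 digit slices; math/big’s own basic squaring routine is structured differently around its assembly kernels:

    package sketch

    import "math/bits"

    // basicSqr sets z = x*x using roughly half the word-by-word multiplies of
    // basicMul: each cross product x[i]*x[j] (i < j) is computed once and then
    // doubled, and the diagonal squares x[i]*x[i] are added afterwards.
    // z must have length 2*len(x). Illustrative only.
    func basicSqr(z, x []uint64) {
        for i := range z {
            z[i] = 0
        }
        // Accumulate the off-diagonal products, each computed once.
        for i := 0; i < len(x); i++ {
            for j := i + 1; j < len(x); j++ {
                hi, lo := bits.Mul64(x[i], x[j])
                addAt(z, i+j, lo, hi)
            }
        }
        // Double the off-diagonal sum: z <<= 1.
        var carry uint64
        for i := range z {
            next := z[i] >> 63
            z[i] = z[i]<<1 | carry
            carry = next
        }
        // Add the diagonal squares x[i]*x[i] at position 2i.
        for i := 0; i < len(x); i++ {
            hi, lo := bits.Mul64(x[i], x[i])
            addAt(z, 2*i, lo, hi)
        }
    }

    // addAt adds the 128-bit value hi<<64 | lo into z at word position p,
    // propagating any carry upward.
    func addAt(z []uint64, p int, lo, hi uint64) {
        var c uint64
        z[p], c = bits.Add64(z[p], lo, 0)
        z[p+1], c = bits.Add64(z[p+1], hi, c)
        for k := p + 2; c != 0 && k < len(z); k++ {
            z[k], c = bits.Add64(z[k], 0, c)
        }
    }

This does n(n+1)/2 word multiplies instead of n², but the extra bookkeeping (the separate doubling pass and carry propagation) is exactly the overhead that makes it lose to the tight basicMul loop for very small n.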

Here are the timings:

[Figures: BasicSqrThreshold on an Intel Xeon Platinum 8481C, an Intel Xeon Gold 6253CL, an AMD Ryzen 9 7950X, an Axiom Arm, and an Apple M3 Pro]

These inputs are so small that the calibration times batches of 100 instead of individual operations. There is no one best threshold, even on a single system, because some of the sizes seem to run the grade school algorithm faster than others. For example, on the AMD CPU, for N=14, basic squaring is 4% faster than basic multiplication, suggesting the threshold has been crossed, but for N=16, basic multiplication is 9% faster than basic squaring, probably because the tight assembly can use larger chunks.

It is unclear why the Axiom Arm timings are so incredibly noisy.

We chose basicSqrThreshold = 12.

Karatsuba Squaring

Beyond the basic squaring threshold, at some point a customized Karatsuba can take over. It uses three half-sized squarings instead of three half-sized multiplies, as sketched below.
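
The sketch mirrors the earlier Karatsuba multiplication example, again on big.Int values with a bit-length threshold purely for illustration; the version inside math/big works on digit slices and differs in details:

    package sketch

    import "math/big"

    // karatsubaSqr returns x*x for x >= 0 using three half-sized squarings:
    // with x = x1<<m + x0,
    //     x*x = x1²<<(2m) + ((x0+x1)² - x1² - x0²)<<m + x0²
    // because (x0+x1)² - x1² - x0² = 2*x0*x1.
    func karatsubaSqr(x *big.Int, thresholdBits int) *big.Int {
        n := x.BitLen()
        if n < thresholdBits {
            return new(big.Int).Mul(x, x) // base case: fall back to plain squaring
        }
        m := uint(n / 2)
        mask := new(big.Int).Sub(new(big.Int).Lsh(big.NewInt(1), m), big.NewInt(1))
        x0, x1 := new(big.Int).And(x, mask), new(big.Int).Rsh(x, m)

        // Three half-sized squarings.
        p0 := karatsubaSqr(x0, thresholdBits)
        p2 := karatsubaSqr(x1, thresholdBits)
        p1 := karatsubaSqr(new(big.Int).Add(x0, x1), thresholdBits)

        mid := new(big.Int).Sub(p1, p0)
        mid.Sub(mid, p2)
        z := new(big.Int).Lsh(p2, 2*m)
        z.Add(z, new(big.Int).Lsh(mid, m))
        return z.Add(z, p0)
    }

Here are the timings: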

[Figures: KaratsubaSqrThreshold on an Intel Xeon Platinum 8481C, an Intel Xeon Gold 6253CL, an AMD Ryzen 9 7950X, an Axiom Arm, and an Apple M3 Pro]

The majority of chips preferred a lower threshold, around 60-70, but the older Intel Xeon and the AMD prefer a threshold around 100-120.

We chose karatsubaSqrThreshold = 80, which is within 2% of optimal on all the chips.

Recursive Division

Division uses a recursive divide-and-conquer algorithm for large inputs, eventually falling back to a more traditional grade-school whole-input trial-and-error division. Here are the timings for the threshold between the two:

[Figures: DivRecursiveThreshold on an Intel Xeon Platinum 8481C, an Intel Xeon Gold 6253CL, an AMD Ryzen 9 7950X, an Axiom Arm, and an Apple M3 Pro]

We chose divRecursiveThreshold = 40.