Calibration of Algorithm Thresholds
This document describes the approach to calibration of algorithmic thresholds in
math/big
, implemented in calibrate_test.go.
Basic operations like multiplication and division have many possible implementations.
Most algorithms that are better asymptotically have overheads that make them
run slower for small inputs. When presented with an operation to run, math/big
must decide which algorithm to use.
For example, for small inputs, multiplication using the “grade school algorithm” is fastest. Given multi-digit x, y and a target z: clear z, and then for each digit y[i], z[i:] += x*y[i]. That last operation, adding a vector times a digit to another vector (including carrying up the vector during the multiplication and addition), can be implemented in a tight assembly loop. The overall speed is O(N**2) where N is the number of digits in x and y (assume they match), but the tight inner loop performs well for small inputs.
Karatsuba’s algorithm multiplies two N-digit numbers by splitting them in half, computing three N/2-digit products, and then reconstructing the final product using a few more additions and subtractions. It runs in O(N**log₂ 3) = O(N**1.58) time. The grade school loop runs faster for small inputs, but eventually Karatsuba’s smaller asymptotic run time wins.
The multiplication implementation must decide which to use. Under the assumption that once Karatsuba is faster for some N, it will be larger for all larger N as well, the rule is to use Karatsuba’s algorithm when the input length N ≥ karatsubaThreshold.
Calibration is the process of determining what karatsubaThreshold should be set to. It doesn’t sound like it should be that hard, but it is:
- Theoretical analysis does not help: the answer depends on the actual machines and the actual constant factors in the two implementations.
- We are picking a single karatsubaThreshold for all systems, despite them having different relative execution speeds for the operations in the two algorithms. (We could in theory pick different thresholds for different architectures, but there can still be significant variation within a given architecture.)
- The assumption that there is a single N where an asymptotically better algorithm becomes faster and stays faster is not true in general.
- Recursive algorithms like Karatsuba’s may have different optimal thresholds for different large input sizes.
- Thresholds can interfere. For example, changing the karatsubaThreshold makes multiplication faster or slower, which in turn affects the best divRecursiveThreshold (because divisions use multiplication).
The best we can do is measure the performance of the overall multiplication algorithm across a variety of inputs and thresholds and look for a threshold that balances all these concerns reasonably well, setting thresholds in dependency order (for example, multiplication before division).
The code in calibrate_test.go
does this measurement of a variety of input sizes
and threshold values and prints the timing results as a CSV file.
The code in calibrate_graph.go
reads the CSV and writes out an SVG file plotting the data.
For example:
go test -run=Calibrate/KaratsubaMul -timeout=1h -calibrate >kmul.csv
go run calibrate_graph.go kmul.csv >kmul.svg
Any particular input is sensitive to only a few transitions in threshold. For example, an input of size 320 recurses on inputs of size 160, which recurses on inputs of size 80, which recurses on inputs of size 40, and so on, until falling below the Karatsuba threshold. Here is what the timing looks like for an input of size 320, normalized so that 1.0 is the fastest timing observed:
For this input, all thresholds from 21 to 40 perform optimally and identically: they all mean “recurse at N=40 but not at N=20”. From the single input of size N=320, we cannot decide which of these 20 thresholds is best.
Other inputs exercise other decision points. For example, here is the timing for N=240:
In this case, all the thresholds from 31 to 60 perform optimally and identically, recursing at N=60 but not N=30.
If we combine these two into a single graph and then plot the geometric mean of the two lines in blue, the optimal range becomes a little clearer:
The actual calibration runs all possible inputs from size N=200 to N=400, in increments of 8, plotting all 26 lines in a faded gray (note the changed y-axis scale, zooming in near 1.0).
Now the optimal value is clear: the best threshold on this chip, with these algorithmic implementations, is 40.
Unfortunately, other chips are different. Here is an Intel Xeon server chip:
On this chip, the best threshold is closer to 60. Luckily, 40 is not a terrible choice either: it is only about 2% slower on average.
The rest of this document presents the timings measured for the math/big
thresholds on a variety of machines
and justifies the final thresholds. The timings used these machines:
- The
gotip-linux-amd64_c3h88-perf_vs_release
gomote, a Google Cloud c3-high-88 machine using an Intel Xeon Platinum 8481C CPU (Emerald Rapids). - The
gotip-linux-amd64_c2s16-perf_vs_release
gomote, a Google Cloud c2-standard-16 machine using an Intel Xeon Gold 6253CL CPU (Cascade Lake). - A home server built with an AMD Ryzen 9 7950X CPU.
- The
gotip-linux-arm64_c4as16-perf_vs_release
gomote, a Google Cloud c4a-standard-16 machine using Google’s Axiom Arm CPU. - An Apple MacBook Pro with an Apple M3 Pro CPU.
In general, we break ties in favor of the newer c3h88 x86 perf gomote, then the c4as16 arm64 perf gomote, and then the others.
Karatsuba Multiplication
Here are the full results for the Karatsuba multiplication threshold.
The majority of systems have optimum thresholds near 40, so we chose karatsubaThreshold = 40.
Basic Squaring
For squaring a number (z.Mul(x, x)
), math/big uses grade school multiplication
up to basicSqrThreshold, where it switches to a customized algorithm that is
still quadratic but avoids half the word-by-word multiplies
since the two arguments are identical.
That algorithm’s inner loops are not as tight as the grade school multiplication,
so it is slower for small inputs. How small?
Here are the timings:
These inputs are so small that the calibration times batches of 100 instead of individual operations. There is no one best threshold, even on a single system, because some of the sizes seem to run the grade school algorithm faster than others. For example, on the AMD CPU, for N=14, basic squaring is 4% faster than basic multiplication, suggesting the threshold has been crossed, but for N=16, basic multiplication is 9% faster than basic squaring, probably because the tight assembly can use larger chunks.
It is unclear why the Axiom Arm timings are so incredibly noisy.
We chose basicSqrThreshold = 12.
Karatsuba Squaring
Beyond the basic squaring threshold, at some point a customized Karatsuba can take over. It uses three half-sized squarings instead of three half-sized multiplies. Here are the timings:
The majority of chips preferred a lower threshold, around 60-70, but the older Intel Xeon and the AMD prefer a threshold around 100-120.
We chose karatsubaSqrThreshold = 80, which is within 2% of optimal on all the chips.
Recursive Division
Division uses a recursive divide-and-conquer algorithm for large inputs, eventually falling back to a more traditional grade-school whole-input trial-and-error division. Here are the timings for the threshold between the two:
We chose divRecursiveThreshold = 40.