# Calibration of Algorithm Thresholds

This document describes the approach to calibration of algorithmic thresholds in
`math/big`, implemented in [calibrate_test.go](calibrate_test.go).

Basic operations like multiplication and division have many possible implementations.
Most algorithms that are better asymptotically have overheads that make them
run slower for small inputs. When presented with an operation to run, `math/big`
must decide which algorithm to use.

For example, for small inputs, multiplication using the “grade school algorithm” is fastest.
Given multi-digit x, y and a target z: clear z, and then for each digit y[i], z[i:] += x\*y[i].
That last operation, adding a vector times a digit to another vector (including carrying up
the vector during the multiplication and addition), can be implemented in a tight assembly loop.
The overall speed is O(N\*\*2) where N is the number of digits in x and y (assume they match),
but the tight inner loop performs well for small inputs.
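
To make that loop concrete, here is a rough sketch in ordinary Go, operating on little-endian 64-bit digits.
It is illustrative only: the names mirror math/big's internal `basicMul` and `addMulVVW`,
but the real routines work on the library's `Word`/`nat` types and implement the inner loop in assembly.

    package sketch

    import "math/bits"

    // basicMul returns x*y: allocate a cleared z, then for each digit y[i]
    // add x*y[i] into z[i:], storing the final carry at z[i+len(x)].
    func basicMul(x, y []uint64) []uint64 {
        z := make([]uint64, len(x)+len(y))
        for i, d := range y {
            if d != 0 {
                z[i+len(x)] = addMulVVW(z[i:i+len(x)], x, d)
            }
        }
        return z
    }

    // addMulVVW adds x*d into z in place and returns the outgoing carry digit.
    func addMulVVW(z, x []uint64, d uint64) (carry uint64) {
        for i := range z {
            hi, lo := bits.Mul64(x[i], d)
            lo, c := bits.Add64(lo, z[i], 0)
            hi += c
            lo, c = bits.Add64(lo, carry, 0)
            hi += c
            z[i] = lo
            carry = hi
        }
        return carry
    }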

[Karatsuba's algorithm](https://en.wikipedia.org/wiki/Karatsuba_algorithm)
multiplies two N-digit numbers by splitting them in half, computing
three N/2-digit products, and then reconstructing the final product using a few more
additions and subtractions. It runs in O(N\*\*log₂ 3) = O(N\*\*1.58) time.
The grade school loop runs faster for small inputs,
but eventually Karatsuba's smaller asymptotic run time wins.

The multiplication implementation must decide which to use.
Under the assumption that once Karatsuba is faster for some N,
it will be faster for all larger N as well,
the rule is to use Karatsuba's algorithm when the input length N ≥ karatsubaThreshold.
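
In code, the decision and a single Karatsuba step look roughly like the sketch below.
This is not how `math/big` implements it: the sketch uses `big.Int`
(and `big.Int.Mul` as a stand-in for the grade school loop) purely for readability,
assumes non-negative inputs, and the function and variable names are illustrative.
The real code works in place on digit slices.

    package sketch

    import (
        "math/big"
        "math/bits"
    )

    // The value being calibrated: below it, use the grade school loop.
    const karatsubaThreshold = 40 // digits (Words)

    // mul multiplies two non-negative numbers, recursing with Karatsuba
    // once both inputs have at least karatsubaThreshold digits.
    func mul(x, y *big.Int) *big.Int {
        n := len(x.Bits()) // digit count of x
        if n < karatsubaThreshold || len(y.Bits()) < karatsubaThreshold {
            return new(big.Int).Mul(x, y) // stand-in for the grade school loop
        }
        // Split x = x1<<s + x0 and y = y1<<s + y0 at half of x's digits.
        s := uint(n / 2 * bits.UintSize)
        x1 := new(big.Int).Rsh(x, s)
        x0 := new(big.Int).Sub(x, new(big.Int).Lsh(x1, s))
        y1 := new(big.Int).Rsh(y, s)
        y0 := new(big.Int).Sub(y, new(big.Int).Lsh(y1, s))
        // Three half-sized products instead of four.
        p0 := mul(x0, y0)
        p2 := mul(x1, y1)
        p1 := mul(new(big.Int).Add(x1, x0), new(big.Int).Add(y1, y0))
        p1.Sub(p1, p0).Sub(p1, p2) // now p1 = x1*y0 + x0*y1
        // Reassemble: p2<<2s + p1<<s + p0.
        z := new(big.Int).Lsh(p2, 2*s)
        z.Add(z, new(big.Int).Lsh(p1, s))
        return z.Add(z, p0)
    }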

Calibration is the process of determining what karatsubaThreshold should be set to.
It doesn't sound like it should be that hard, but it is:

- Theoretical analysis does not help: the answer depends on the actual machines
  and the actual constant factors in the two implementations.
- We are picking a single karatsubaThreshold for all systems,
  despite them having different relative execution speeds for the operations
  in the two algorithms.
  (We could in theory pick different thresholds for different architectures,
  but there can still be significant variation within a given architecture.)
- The assumption that there is a single N where
  an asymptotically better algorithm becomes faster and stays faster
  is not true in general.
- Recursive algorithms like Karatsuba's may have different optimal
  thresholds for different large input sizes.
- Thresholds can interfere. For example, changing the karatsubaThreshold makes
  multiplication faster or slower, which in turn affects the best divRecursiveThreshold
  (because divisions use multiplication).

The best we can do is measure the performance of the overall multiplication
algorithm across a variety of inputs and thresholds and look for a threshold
that balances all these concerns reasonably well,
setting thresholds in dependency order (for example, multiplication before division).

The code in `calibrate_test.go` runs this measurement over a variety of input sizes
and threshold values and prints the timing results as a CSV file.
The code in `calibrate_graph.go` reads the CSV and writes out an SVG file plotting the data.
For example:

    go test -run=Calibrate/KaratsubaMul -timeout=1h -calibrate >kmul.csv
    go run calibrate_graph.go kmul.csv >kmul.svg

Any particular input is sensitive to only a few transitions in threshold.
For example, an input of size 320 recurses on inputs of size 160,
which recurse on inputs of size 80,
which recurse on inputs of size 40,
and so on, until falling below the Karatsuba threshold.
Here is what the timing looks like for an input of size 320,
normalized so that 1.0 is the fastest timing observed:

For this input, all thresholds from 21 to 40 perform optimally and identically: they all mean “recurse at N=40 but not at N=20”.
From the single input of size N=320, we cannot decide which of these 20 thresholds is best.

Other inputs exercise other decision points. For example, here is the timing for N=240:

In this case, all the thresholds from 31 to 60 perform optimally and identically, recursing at N=60 but not N=30.

If we combine these two into a single graph and then plot the geometric mean of the two lines in blue,
the optimal range becomes a little clearer:

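The combined blue line is just the geometric mean of the normalized curves, taken separately at each candidate threshold.
Here is a minimal sketch of that computation; it is not the code in `calibrate_graph.go`,
and the `curves[inputSize][thresholdIndex]` layout is assumed only for illustration.

    package sketch

    import "math"

    // geomean returns the geometric mean of xs.
    func geomean(xs []float64) float64 {
        sum := 0.0
        for _, x := range xs {
            sum += math.Log(x)
        }
        return math.Exp(sum / float64(len(xs)))
    }

    // combine reduces curves[inputSize][thresholdIndex], each normalized so
    // its fastest point is 1.0, to a single curve: the geometric mean across
    // input sizes at each threshold index.
    func combine(curves [][]float64) []float64 {
        out := make([]float64, len(curves[0]))
        col := make([]float64, len(curves))
        for j := range out {
            for i, curve := range curves {
                col[i] = curve[j]
            }
            out[j] = geomean(col)
        }
        return out
    }
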
The actual calibration runs all possible inputs from size N=200 to N=400, in increments of 8,
plotting all 26 lines in a faded gray (note the changed y-axis scale, zooming in near 1.0).

Now the optimal value is clear: the best threshold on this chip, with these algorithmic implementations, is 40.

Unfortunately, other chips are different. Here is an Intel Xeon server chip:

On this chip, the best threshold is closer to 60. Luckily, 40 is not a terrible choice either: it is only about 2% slower on average.

The rest of this document presents the timings measured for the `math/big` thresholds on a variety of machines
and justifies the final thresholds. The timings used these machines:

- The `gotip-linux-amd64_c3h88-perf_vs_release` gomote, a Google Cloud c3-high-88 machine using an Intel Xeon Platinum 8481C CPU (Sapphire Rapids).
- The `gotip-linux-amd64_c2s16-perf_vs_release` gomote, a Google Cloud c2-standard-16 machine using an Intel Xeon Gold 6253CL CPU (Cascade Lake).
- A home server built with an AMD Ryzen 9 7950X CPU.
- The `gotip-linux-arm64_c4as16-perf_vs_release` gomote, a Google Cloud c4a-standard-16 machine using Google's Axion Arm CPU.
- An Apple MacBook Pro with an Apple M3 Pro CPU.

In general, we break ties in favor of the newer c3h88 x86 perf gomote, then the c4as16 arm64 perf gomote, and then the others.

## Karatsuba Multiplication

Here are the full results for the Karatsuba multiplication threshold.

The majority of systems have optimum thresholds near 40, so we chose karatsubaThreshold = 40.

## Basic Squaring

For squaring a number (`z.Mul(x, x)`), math/big uses grade school multiplication
up to basicSqrThreshold, where it switches to a customized algorithm that is
still quadratic but avoids half the word-by-word multiplies,
since the two arguments are identical.
That algorithm's inner loops are not as tight as grade school multiplication's,
so it is slower for small inputs. How small?
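
To see where the savings come from, here is a sketch of the squaring identity at the digit level.
It is illustrative only: the accumulation uses `big.Int` to sidestep carry handling,
while the real `basicSqr` works on digit slices with explicit carries, and `sqr` is just a name used here.

    package sketch

    import "math/big"

    // sqr squares the number whose little-endian 64-bit digits are x,
    // computing each off-diagonal product x[i]*x[j] (i < j) once and
    // doubling it, plus the N diagonal squares: roughly N*N/2 word
    // multiplies instead of the N*N a general multiplication needs.
    func sqr(x []uint64) *big.Int {
        z := new(big.Int)
        for i := range x {
            xi := new(big.Int).SetUint64(x[i])
            for j := i + 1; j < len(x); j++ {
                t := new(big.Int).SetUint64(x[j])
                t.Mul(t, xi)
                t.Lsh(t, uint(64*(i+j)+1)) // position the product; the extra <<1 doubles it
                z.Add(z, t)
            }
            t := new(big.Int).Mul(xi, xi)   // diagonal term x[i]*x[i]
            z.Add(z, t.Lsh(t, uint(128*i))) // appears only once
        }
        return z
    }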

Here are the timings:

These inputs are so small that the calibration times batches of 100 instead of individual operations.
There is no one best threshold, even on a single system, because some of the sizes seem to run
the grade school algorithm faster than others.
For example, on the AMD CPU,
for N=14, basic squaring is 4% faster than basic multiplication,
suggesting the threshold has been crossed,
but for N=16, basic multiplication is 9% faster than basic squaring,
probably because the tight assembly can use larger chunks.

It is unclear why the Axion Arm timings are so incredibly noisy.

We chose basicSqrThreshold = 12.

## Karatsuba Squaring

Beyond the basic squaring threshold, at some point a customized Karatsuba can take over.
It uses three half-sized squarings instead of three half-sized multiplies.
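
Concretely, if x is split into a high half x1 and a low half x0, the cross term of the square
can be recovered from the square of their sum, so all three recursive calls are squarings.
Here is a sketch of one step, using `big.Int` only for readability and assuming a non-negative x;
the real code works in place on digit slices, and the names are illustrative.

    package sketch

    import "math/big"

    // karatsubaSqr squares a non-negative x using three half-sized
    // squarings. With x = x1<<s + x0:
    //   x² = (x1²)<<2s + ((x1+x0)² - x1² - x0²)<<s + x0²
    // so the cross term 2*x1*x0 costs a squaring, not a multiplication.
    func karatsubaSqr(x *big.Int, s uint) *big.Int {
        x1 := new(big.Int).Rsh(x, s)                       // high half
        x0 := new(big.Int).Sub(x, new(big.Int).Lsh(x1, s)) // low half
        p2 := new(big.Int).Mul(x1, x1) // x1², recursively another squaring
        p0 := new(big.Int).Mul(x0, x0) // x0²
        p1 := new(big.Int).Add(x1, x0)
        p1.Mul(p1, p1)             // (x1+x0)²
        p1.Sub(p1, p2).Sub(p1, p0) // now p1 = 2*x1*x0
        z := new(big.Int).Lsh(p2, 2*s)
        z.Add(z, new(big.Int).Lsh(p1, s))
        return z.Add(z, p0)
    }
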
Here are the timings:

The majority of chips prefer a lower threshold, around 60-70,
but the older Intel Xeon and the AMD prefer a threshold around 100-120.

We chose karatsubaSqrThreshold = 80, which is within 2% of optimal on all the chips.

## Recursive Division

Division uses a recursive divide-and-conquer algorithm for large inputs,
eventually falling back to a more traditional grade-school whole-input trial-and-error division.
Here are the timings for the threshold between the two:

We chose divRecursiveThreshold = 40.