# Calibration of Algorithm Thresholds

This document describes the approach to calibration of algorithmic thresholds in
`math/big`, implemented in [calibrate_test.go](calibrate_test.go).

Basic operations like multiplication and division have many possible implementations.
Most algorithms that are better asymptotically have overheads that make them
run slower for small inputs. When presented with an operation to run, `math/big`
must decide which algorithm to use.

For example, for small inputs, multiplication using the “grade school algorithm” is fastest.
Given multi-digit x, y and a target z: clear z, and then for each digit y[i], z[i:] += x\*y[i].
That last operation, adding a vector times a digit to another vector (including carrying up
the vector during the multiplication and addition), can be implemented in a tight assembly loop.
The overall speed is O(N\*\*2) where N is the number of digits in x and y (assume they match),
but the tight inner loop performs well for small inputs.
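
To make that loop concrete, here is a rough sketch in ordinary Go, operating on little-endian 64-bit digits.
It is illustrative only: the names mirror math/big's internal `basicMul` and `addMulVVW`,
but the real routines work on the library's `Word`/`nat` types and implement the inner loop in assembly.

    package sketch

    import "math/bits"

    // basicMul returns x*y: allocate a cleared z, then for each digit y[i]
    // add x*y[i] into z[i:], storing the final carry at z[i+len(x)].
    func basicMul(x, y []uint64) []uint64 {
        z := make([]uint64, len(x)+len(y))
        for i, d := range y {
            if d != 0 {
                z[i+len(x)] = addMulVVW(z[i:i+len(x)], x, d)
            }
        }
        return z
    }

    // addMulVVW adds x*d into z in place and returns the outgoing carry digit.
    func addMulVVW(z, x []uint64, d uint64) (carry uint64) {
        for i := range z {
            hi, lo := bits.Mul64(x[i], d)
            lo, c := bits.Add64(lo, z[i], 0)
            hi += c
            lo, c = bits.Add64(lo, carry, 0)
            hi += c
            z[i] = lo
            carry = hi
        }
        return carry
    }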

[Karatsuba's algorithm](https://en.wikipedia.org/wiki/Karatsuba_algorithm)
multiplies two N-digit numbers by splitting them in half, computing
three N/2-digit products, and then reconstructing the final product using a few more
additions and subtractions. It runs in O(N\*\*log₂ 3) = O(N\*\*1.58) time.
The grade school loop runs faster for small inputs,
but eventually Karatsuba's smaller asymptotic run time wins.

The multiplication implementation must decide which to use.
Under the assumption that once Karatsuba is faster for some N,
it will be faster for all larger N as well,
the rule is to use Karatsuba's algorithm when the input length N ≥ karatsubaThreshold.
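
In code, the decision and a single Karatsuba step look roughly like the sketch below.
This is not how `math/big` implements it: the sketch uses `big.Int`
(and `big.Int.Mul` as a stand-in for the grade school loop) purely for readability,
assumes non-negative inputs, and the function and variable names are illustrative.
The real code works in place on digit slices.

    package sketch

    import (
        "math/big"
        "math/bits"
    )

    // The value being calibrated: below it, use the grade school loop.
    const karatsubaThreshold = 40 // digits (Words)

    // mul multiplies two non-negative numbers, recursing with Karatsuba
    // once both inputs have at least karatsubaThreshold digits.
    func mul(x, y *big.Int) *big.Int {
        n := len(x.Bits()) // digit count of x
        if n < karatsubaThreshold || len(y.Bits()) < karatsubaThreshold {
            return new(big.Int).Mul(x, y) // stand-in for the grade school loop
        }
        // Split x = x1<<s + x0 and y = y1<<s + y0 at half of x's digits.
        s := uint(n / 2 * bits.UintSize)
        x1 := new(big.Int).Rsh(x, s)
        x0 := new(big.Int).Sub(x, new(big.Int).Lsh(x1, s))
        y1 := new(big.Int).Rsh(y, s)
        y0 := new(big.Int).Sub(y, new(big.Int).Lsh(y1, s))
        // Three half-sized products instead of four.
        p0 := mul(x0, y0)
        p2 := mul(x1, y1)
        p1 := mul(new(big.Int).Add(x1, x0), new(big.Int).Add(y1, y0))
        p1.Sub(p1, p0).Sub(p1, p2) // now p1 = x1*y0 + x0*y1
        // Reassemble: p2<<2s + p1<<s + p0.
        z := new(big.Int).Lsh(p2, 2*s)
        z.Add(z, new(big.Int).Lsh(p1, s))
        return z.Add(z, p0)
    }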

Calibration is the process of determining what karatsubaThreshold should be set to.
It doesn't sound like it should be that hard, but it is:

- Theoretical analysis does not help: the answer depends on the actual machines
  and the actual constant factors in the two implementations.
- We are picking a single karatsubaThreshold for all systems,
  despite them having different relative execution speeds for the operations
  in the two algorithms.
  (We could in theory pick different thresholds for different architectures,
  but there can still be significant variation within a given architecture.)
- The assumption that there is a single N where
  an asymptotically better algorithm becomes faster and stays faster
  is not true in general.
- Recursive algorithms like Karatsuba's may have different optimal
  thresholds for different large input sizes.
- Thresholds can interfere. For example, changing the karatsubaThreshold makes
  multiplication faster or slower, which in turn affects the best divRecursiveThreshold
  (because divisions use multiplication).

The best we can do is measure the performance of the overall multiplication
algorithm across a variety of inputs and thresholds and look for a threshold
that balances all these concerns reasonably well,
setting thresholds in dependency order (for example, multiplication before division).

The code in `calibrate_test.go` runs this measurement over a variety of input sizes
and threshold values and prints the timing results as a CSV file.
The code in `calibrate_graph.go` reads the CSV and writes out an SVG file plotting the data.
For example:

    go test -run=Calibrate/KaratsubaMul -timeout=1h -calibrate >kmul.csv
    go run calibrate_graph.go kmul.csv >kmul.svg

Any particular input is sensitive to only a few transitions in threshold.
For example, an input of size 320 recurses on inputs of size 160,
which recurse on inputs of size 80,
which recurse on inputs of size 40,
and so on, until falling below the Karatsuba threshold.
Here is what the timing looks like for an input of size 320,
normalized so that 1.0 is the fastest timing observed:

For this input, all thresholds from 21 to 40 perform optimally and identically: they all mean “recurse at N=40 but not at N=20”.
From the single input of size N=320, we cannot decide which of these 20 thresholds is best.

Other inputs exercise other decision points. For example, here is the timing for N=240:

In this case, all the thresholds from 31 to 60 perform optimally and identically, recursing at N=60 but not N=30.

If we combine these two into a single graph and then plot the geometric mean of the two lines in blue,
the optimal range becomes a little clearer:

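The combined blue line is just the geometric mean of the normalized curves, taken separately at each candidate threshold.
Here is a minimal sketch of that computation; it is not the code in `calibrate_graph.go`,
and the `curves[inputSize][thresholdIndex]` layout is assumed only for illustration.

    package sketch

    import "math"

    // geomean returns the geometric mean of xs.
    func geomean(xs []float64) float64 {
        sum := 0.0
        for _, x := range xs {
            sum += math.Log(x)
        }
        return math.Exp(sum / float64(len(xs)))
    }

    // combine reduces curves[inputSize][thresholdIndex], each normalized so
    // its fastest point is 1.0, to a single curve: the geometric mean across
    // input sizes at each threshold index.
    func combine(curves [][]float64) []float64 {
        out := make([]float64, len(curves[0]))
        col := make([]float64, len(curves))
        for j := range out {
            for i, curve := range curves {
                col[i] = curve[j]
            }
            out[j] = geomean(col)
        }
        return out
    }
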
The actual calibration runs all possible inputs from size N=200 to N=400, in increments of 8,
plotting all 26 lines in a faded gray (note the changed y-axis scale, zooming in near 1.0).

Now the optimal value is clear: the best threshold on this chip, with these algorithmic implementations, is 40.

Unfortunately, other chips are different. Here is an Intel Xeon server chip:

On this chip, the best threshold is closer to 60. Luckily, 40 is not a terrible choice either: it is only about 2% slower on average.

The rest of this document presents the timings measured for the `math/big` thresholds on a variety of machines
and justifies the final thresholds. The timings used these machines:

- The `gotip-linux-amd64_c3h88-perf_vs_release` gomote, a Google Cloud c3-high-88 machine using an Intel Xeon Platinum 8481C CPU (Sapphire Rapids).
- The `gotip-linux-amd64_c2s16-perf_vs_release` gomote, a Google Cloud c2-standard-16 machine using an Intel Xeon Gold 6253CL CPU (Cascade Lake).
- A home server built with an AMD Ryzen 9 7950X CPU.
- The `gotip-linux-arm64_c4as16-perf_vs_release` gomote, a Google Cloud c4a-standard-16 machine using Google's Axion Arm CPU.
- An Apple MacBook Pro with an Apple M3 Pro CPU.

In general, we break ties in favor of the newer c3h88 x86 perf gomote, then the c4as16 arm64 perf gomote, and then the others.

## Karatsuba Multiplication

Here are the full results for the Karatsuba multiplication threshold.

The majority of systems have optimum thresholds near 40, so we chose karatsubaThreshold = 40.

## Basic Squaring

For squaring a number (`z.Mul(x, x)`), math/big uses grade school multiplication
up to basicSqrThreshold, where it switches to a customized algorithm that is
still quadratic but avoids half the word-by-word multiplies,
since the two arguments are identical.
That algorithm's inner loops are not as tight as grade school multiplication's,
so it is slower for small inputs. How small?
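
To see where the savings come from, here is a sketch of the squaring identity at the digit level.
It is illustrative only: the accumulation uses `big.Int` to sidestep carry handling,
while the real `basicSqr` works on digit slices with explicit carries, and `sqr` is just a name used here.

    package sketch

    import "math/big"

    // sqr squares the number whose little-endian 64-bit digits are x,
    // computing each off-diagonal product x[i]*x[j] (i < j) once and
    // doubling it, plus the N diagonal squares: roughly N*N/2 word
    // multiplies instead of the N*N a general multiplication needs.
    func sqr(x []uint64) *big.Int {
        z := new(big.Int)
        for i := range x {
            xi := new(big.Int).SetUint64(x[i])
            for j := i + 1; j < len(x); j++ {
                t := new(big.Int).SetUint64(x[j])
                t.Mul(t, xi)
                t.Lsh(t, uint(64*(i+j)+1)) // position the product; the extra <<1 doubles it
                z.Add(z, t)
            }
            t := new(big.Int).Mul(xi, xi)   // diagonal term x[i]*x[i]
            z.Add(z, t.Lsh(t, uint(128*i))) // appears only once
        }
        return z
    }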

Here are the timings:

These inputs are so small that the calibration times batches of 100 instead of individual operations.
There is no one best threshold, even on a single system, because some of the sizes seem to run
the grade school algorithm faster than others.
For example, on the AMD CPU,
for N=14, basic squaring is 4% faster than basic multiplication,
suggesting the threshold has been crossed,
but for N=16, basic multiplication is 9% faster than basic squaring,
probably because the tight assembly can use larger chunks.

It is unclear why the Axion Arm timings are so incredibly noisy.

We chose basicSqrThreshold = 12.

## Karatsuba Squaring

Beyond the basic squaring threshold, at some point a customized Karatsuba can take over.
It uses three half-sized squarings instead of three half-sized multiplies.
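
Concretely, if x is split into a high half x1 and a low half x0, the cross term of the square
can be recovered from the square of their sum, so all three recursive calls are squarings.
Here is a sketch of one step, using `big.Int` only for readability and assuming a non-negative x;
the real code works in place on digit slices, and the names are illustrative.

    package sketch

    import "math/big"

    // karatsubaSqr squares a non-negative x using three half-sized
    // squarings. With x = x1<<s + x0:
    //   x² = (x1²)<<2s + ((x1+x0)² - x1² - x0²)<<s + x0²
    // so the cross term 2*x1*x0 costs a squaring, not a multiplication.
    func karatsubaSqr(x *big.Int, s uint) *big.Int {
        x1 := new(big.Int).Rsh(x, s)                       // high half
        x0 := new(big.Int).Sub(x, new(big.Int).Lsh(x1, s)) // low half
        p2 := new(big.Int).Mul(x1, x1) // x1², recursively another squaring
        p0 := new(big.Int).Mul(x0, x0) // x0²
        p1 := new(big.Int).Add(x1, x0)
        p1.Mul(p1, p1)             // (x1+x0)²
        p1.Sub(p1, p2).Sub(p1, p0) // now p1 = 2*x1*x0
        z := new(big.Int).Lsh(p2, 2*s)
        z.Add(z, new(big.Int).Lsh(p1, s))
        return z.Add(z, p0)
    }
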
Here are the timings:

The majority of chips prefer a lower threshold, around 60-70,
but the older Intel Xeon and the AMD prefer a threshold around 100-120.

We chose karatsubaSqrThreshold = 80, which is within 2% of optimal on all the chips.

## Recursive Division

Division uses a recursive divide-and-conquer algorithm for large inputs,
eventually falling back to a more traditional grade-school whole-input trial-and-error division.
Here are the timings for the threshold between the two:

We chose divRecursiveThreshold = 40.