Discovering faster matrix multiplication algorithms with reinforcement learning

Improving the efficiency of algorithms for fundamental computations can have a widespread impact, as it can affect the overall speed of a large number of computations. Matrix multiplication is one such primitive task, occurring in many systems, from neural networks to scientific computing routines. The automatic discovery of algorithms using machine learning offers the prospect of reaching beyond human intuition and outperforming the current best human-designed algorithms. However, automating the algorithm discovery procedure is intricate, as the space of possible algorithms is enormous. Here we report a deep reinforcement learning approach based on AlphaZero [1] for discovering efficient and provably correct algorithms for the multiplication of arbitrary matrices. Our agent, AlphaTensor, is trained to play a single-player game where the objective is finding tensor decompositions within a finite factor space. AlphaTensor discovered algorithms that outperform the state-of-the-art complexity for many matrix sizes. Particularly relevant is the case of 4 × 4 matrices in a finite field, where AlphaTensor's algorithm improves on Strassen's two-level algorithm for the first time, to our knowledge, since its discovery 50 years ago [2]. We further showcase the flexibility of AlphaTensor through different use-cases: algorithms with state-of-the-art complexity for structured matrix multiplication, and improved practical efficiency by optimizing matrix multiplication for runtime on specific hardware. Our results highlight AlphaTensor's ability to accelerate the process of algorithmic discovery on a range of problems, and to optimize for different criteria.

PolicyHead_training(e ∈ R^{m×c}, g ∈ {1, . . . , N_logits}^{N_steps})  ▷ Training by teacher-forcing.
1: o, z = predict_action_logits(onehot(shifted(g), N_logits), e)
2: return o, z_1  ▷ Only z_1 is returned, because it is the only embedding not conditioned on the ground truth.

Algorithm A.6: Policy head behaviour at inference time. It returns the sampled action and its estimated probability (and the embedding of the first step).
Algorithm A.12: Generation of synthetic demonstrations.
1: for n = 1 to N do
2:   Sample the rank R of the synthetic demonstration.
3:   for r = 1 to R do
4:     repeat
5:       for s = 1 to S do
6:         [entry sampling steps not recovered]
...
11:  Reconstruct the tensor corresponding to the random factors.

Algorithm A.12 shows the procedure we used for generating the synthetic demonstrations. The N data points are sampled i.i.d., corresponding to the outer loop on Line 1. When generating demonstrations in a ring E, factor entries are sampled from the set E and the arithmetic (multiplications and additions on Line 10) is performed in this ring; e.g., in Z_2 this corresponds to applying an additional modulo operation.
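A minimal Python sketch of this demonstration-generation procedure (the function and parameter names are ours, and the entry distribution is simplified to uniform over Z_2):

```python
import numpy as np

def synthetic_demonstration(S, R_max, rng, modulus=2):
    """Sketch of synthetic-demonstration generation in the spirit of
    Algorithm A.12 (names and the uniform entry distribution are ours).
    Samples a rank R, then R random factor triplets over Z_modulus, and
    reconstructs the tensor they decompose."""
    R = int(rng.integers(1, R_max + 1))  # sample the demonstration rank
    factors = []
    T = np.zeros((S, S, S), dtype=np.int64)
    for _ in range(R):
        u, v, w = (rng.integers(0, modulus, size=S) for _ in range(3))
        factors.append((u, v, w))
        # accumulate the rank-one term u (x) v (x) w, with arithmetic in the ring
        T = (T + np.einsum('i,j,k->ijk', u, v, w)) % modulus
    return T, factors

rng = np.random.default_rng(0)
T, factors = synthetic_demonstration(S=4, R_max=8, rng=rng)
```

The pair `(T, factors)` is then a supervised training example: the network sees `T` and is trained to output the factors.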

A.3 Algorithm for generating changes of basis
Algorithm A.13: Generation of basis change matrices
Input: Tensor size S, matrix entry distribution p_cob-entry, desired number of basis changes N_cob, random seed.
Output: N_cob i.i.d. invertible unimodular S × S matrices.
1: for n = 1 to N_cob do
2:   P ← zero S × S matrix  ▷ Will be upper-triangular.
3:   L ← zero S × S matrix  ▷ Will be lower-triangular.
4:   for i = 1 to S do
5:     P_ii ∼ Uniform{−1, +1}
6:     L_ii ∼ Uniform{−1, +1}
7:   for i = 1 to S do
8:     for j = 1 to i − 1 do
9:       L_ij ∼ p_cob-entry
10:    for j = i + 1 to S do
11:      P_ij ∼ p_cob-entry
12:  yield PL

Algorithm A.13 shows the procedure for generating basis change matrices. We chose to generate unimodular changes of basis for numerical stability of the obtained factorizations after converting them back into the canonical basis. The matrices generated by Algorithm A.13 are unimodular because the determinant of a triangular matrix equals the product of its diagonal elements (−1 or +1 in this case), so det(PL) = det(P) det(L) ∈ {−1, +1} by construction. In our experiments we used p_cob-entry(0) = 0.985 and p_cob-entry(−1) = p_cob-entry(1) = 0.0075.
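The procedure can be sketched in Python as follows (a hedged reimplementation; the helper name `basis_changes` and the uniform ±1 diagonal sampling reflect our reading of the algorithm):

```python
import numpy as np

def basis_changes(S, n_cob, rng, p_zero=0.985):
    """Sketch of Algorithm A.13 (our reading): yields unimodular S x S
    matrices P @ L, with P upper- and L lower-triangular, +-1 diagonals,
    and sparse off-diagonal entries in {-1, 0, 1}."""
    entries = np.array([-1, 0, 1])
    probs = np.array([(1 - p_zero) / 2, p_zero, (1 - p_zero) / 2])
    for _ in range(n_cob):
        P = np.zeros((S, S), dtype=np.int64)
        L = np.zeros((S, S), dtype=np.int64)
        for i in range(S):
            # +-1 diagonals give |det P| = |det L| = 1, hence P @ L is unimodular
            P[i, i] = rng.choice([-1, 1])
            L[i, i] = rng.choice([-1, 1])
            L[i, :i] = rng.choice(entries, size=i, p=probs)
            P[i, i + 1:] = rng.choice(entries, size=S - i - 1, p=probs)
        yield P @ L

rng = np.random.default_rng(0)
mats = list(basis_changes(S=4, n_cob=10, rng=rng))
```

Each yielded matrix has determinant ±1 by construction, so its inverse is again an integer matrix, which keeps entries integral when converting factorizations back to the canonical basis.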
In acting, an actor is given a basis in which it should try to decompose the tensor (see Figure 2 in the main paper). With probability p_canonical the given basis is the original (canonical) basis (i.e., no basis change is applied), and with probability 1 − p_canonical one of the N_cob generated basis changes is chosen uniformly at random. In our experiments, N_cob = 100,000 and p_canonical = 250/100,000 = 0.0025.

A.4 Converting modular arithmetic algorithms into general ones
As AlphaTensor discovered algorithms of lower rank for the targets T_4 and T_5 in modular arithmetic (Z_2) than in standard arithmetic, we attempted to convert these modular arithmetic algorithms into general ones using a SAT solver. Specifically, given a factorization with entries F = {0, 1} valid in the ring E = Z_2, we set up the problem of finding a factorization with entries F = {−1, 0, 1} valid in standard arithmetic (E = Z) that has the same sparsity pattern, but where each 1 entry in the Z_2 factorization is turned into either a −1 or a +1 entry in the Z factorization. This follows the approach of [1]. Unfortunately, the SAT solver was able to prove that it is not possible to convert our rank-47 algorithms for T_4 or our rank-96 algorithms for T_5 from Z_2 into standard arithmetic in this way. This, of course, does not preclude other algorithms of these ranks from existing for these targets.

B Symmetries in matrix multiplication algorithms
In Section B.1, we define the equivalence relation between matrix multiplication algorithms. Section B.2 details the algorithm used for determining whether two algorithms are equivalent. Section B.3 provides details on the different non-equivalent factorizations found using AlphaTensor.
Notation. Recall that ⊗ denotes the tensor product operation: if T_1 and T_2 are two tensors with n and m dimensions, respectively, then T_1 ⊗ T_2 is the tensor with n + m dimensions whose entries are (T_1 ⊗ T_2)[i_1, . . . , i_n, j_1, . . . , j_m] = T_1[i_1, . . . , i_n] · T_2[j_1, . . . , j_m]. A slight variation of the tensor product is the Kronecker product ⊗_K, which we define between two tensors with the same number of dimensions: if T_1 and T_2 have sizes [s_1, . . . , s_n] and [t_1, . . . , t_n], respectively, then T_1 ⊗_K T_2 has size [s_1 t_1, . . . , s_n t_n], and is obtained by merging each pair of axes (i_k, j_k) of T_1 ⊗ T_2 into a single axis of size s_k t_k. For convenience, in Sections B.1-B.3 we will view factor vectors of T_n as n × n matrices, rather than vectors of length n², and hence denote them with capital letters U, V, W.
If {(u_r, v_r, w_r)}_{r=1}^{R} is a factorization of T, then the following provides a rank R² factorization of T^{⊗2} = T ⊗_K T:

{(u_r ⊗_K u_s, v_r ⊗_K v_s, w_r ⊗_K w_s)}_{r,s=1}^{R}.    (1)

Finally, we denote by S_n the group of permutations of n elements.

B.1 Equivalence of factorizations
Since the matrix multiplication tensors have a product structure (that is, T_{mn} = T_m ⊗_K T_n), one can obtain using Eq. 1 a rank 49 factorization of T_4 from a rank 7 factorization (Strassen) of T_2. Such product-structure factorizations were the only known schemes decomposing T_4 into 49 factors, and are referred to in the main paper as Strassen².¹ Our goal here is to show that the factorizations obtained using AlphaTensor are not equivalent to the known factorizations of product structure, and that these factorizations are pairwise non-equivalent. To do so, we first define the notion of equivalence.
When seen as a trilinear map, the (transposed) matrix multiplication operation is (X_1, X_2, X_3) → Trace(X_1 X_2 X_3). By the invariance of the trace under cyclic permutations, Trace(X_1 X_2 X_3) = Trace(X_2 X_3 X_1) = Trace(X_3 X_1 X_2). Since transposition does not change the trace either, we also have Trace(X_1 X_2 X_3) = Trace(X_3^T X_2^T X_1^T). Similarly, for any invertible matrices A, B, C, Trace(X_1 X_2 X_3) = Trace((A X_1 B^{-1})(B X_2 C^{-1})(C X_3 A^{-1})). These invariances imply that, given a factorization {(U_r, V_r, W_r)}_{r=1}^{R} of T_n, we can generate many equivalent factorizations of T_n. Such symmetries have been heavily studied for n = 2 and n = 3 [3,4,5,6,7,1]. We use the group of symmetries introduced in [3], which proved that Strassen's algorithm for multiplying 2 × 2 matrices is unique up to the action of this symmetry group.
A permutation π ∈ S_3 acts on a rank-one tensor U ⊗ V ⊗ W by permuting the factors U, V, W, with an additional transposition when the signature of π is −1. For example, for the cyclic permutation (1, 2, 3) we have (1, 2, 3) · (U ⊗ V ⊗ W) = V ⊗ W ⊗ U, and for the transposition (1, 2) we have (1, 2) · (U ⊗ V ⊗ W) = V^T ⊗ U^T ⊗ W^T. Given three invertible matrices (A, B, C), their action on a rank-one tensor is given by (A, B, C) · (U ⊗ V ⊗ W) = A U B^{-1} ⊗ B V C^{-1} ⊗ C W A^{-1}. The two actions are combined as follows: given (A, B, C, π), the permutation π is applied first, followed by the (A, B, C) action. For example, if π = (1, 2, 3), then (A, B, C, π) · (U ⊗ V ⊗ W) = A V B^{-1} ⊗ B W C^{-1} ⊗ C U A^{-1}. Two factorizations {(U_r, V_r, W_r)}_{r=1}^{R} and {(U'_r, V'_r, W'_r)}_{r=1}^{R} are said to be equivalent if there exists (A, B, C, π) that maps one set of factors to the other through the action described above: that is, there exists a permutation σ ∈ S_R such that for all r ∈ [R] we have (A, B, C, π) · (U_r ⊗ V_r ⊗ W_r) = U'_{σ(r)} ⊗ V'_{σ(r)} ⊗ W'_{σ(r)}. We denote this equivalence by {(U_r, V_r, W_r)}_r ∼ {(U'_r, V'_r, W'_r)}_r. For more details on symmetries of factorizations, see e.g. [5].
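To make the action concrete, the following sketch verifies numerically, using Strassen's rank-7 factors for T_2, that applying a cyclic permutation followed by an (A, B, C) sandwiching maps one decomposition of the (transposed matmul) trace tensor to another. The specific matrices A, B, C and the trace-tensor convention are our choices for illustration:

```python
import numpy as np

# Trace tensor for n = 2: <T, X1 (x) X2 (x) X3> = Trace(X1 X2 X3).
n = 2
T = np.zeros((4, 4, 4))
for i in range(n):
    for j in range(n):
        for l in range(n):
            T[i * n + j, j * n + l, l * n + i] = 1.0

# Strassen's rank-7 factors, written as 2x2 matrices.
U = [np.array(m) for m in ([[1, 0], [0, 1]], [[0, 0], [1, 1]], [[1, 0], [0, 0]],
                           [[0, 0], [0, 1]], [[1, 1], [0, 0]], [[-1, 0], [1, 0]],
                           [[0, 1], [0, -1]])]
V = [np.array(m) for m in ([[1, 0], [0, 1]], [[1, 0], [0, 0]], [[0, 1], [0, -1]],
                           [[-1, 0], [1, 0]], [[0, 0], [0, 1]], [[1, 1], [0, 0]],
                           [[0, 0], [1, 1]])]
# The W_r are transposed to match the trace-tensor (transposed matmul) convention.
W = [np.array(m).T for m in ([[1, 0], [0, 1]], [[0, 0], [1, -1]], [[0, 1], [0, 1]],
                             [[1, 0], [1, 0]], [[-1, 1], [0, 0]], [[0, 0], [0, 1]],
                             [[1, 0], [0, 0]])]

def rebuild(facs):
    """Sum of rank-one terms vec(U_r) (x) vec(V_r) (x) vec(W_r)."""
    return sum(np.einsum('i,j,k->ijk', u.ravel(), v.ravel(), w.ravel())
               for u, v, w in facs)

strassen = list(zip(U, V, W))

# Apply pi = (1,2,3) (cyclic shift of the factors), then (A, B, C).
A, B, C = (np.array(m, dtype=float) for m in
           ([[1, 1], [0, 1]], [[1, 0], [2, 1]], [[1, -1], [0, 1]]))
acted = [(A @ v @ np.linalg.inv(B), B @ w @ np.linalg.inv(C),
          C @ u @ np.linalg.inv(A)) for u, v, w in strassen]
```

Both `rebuild(strassen)` and `rebuild(acted)` reproduce the tensor `T`, illustrating that the orbit of a single factorization under this group is a large family of equivalent factorizations.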
Using this notion of equivalence, [3] showed that all rank 7 factorizations of T_2 are equivalent. We show that this is not the case for T_4, and we exhibit thousands of non-equivalent rank 49 factorizations of T_4. The procedure for certifying non-equivalence of factorizations is provided in Section B.2.

¹ Note that Winograd's algorithm [2] involves 48 multiplications, but it only applies to commutative rings and does not give bounds on the tensor rank. Winograd's algorithm cannot be applied recursively to larger matrices.

B.2 Algorithm for certifying nonequivalence
We now explain the procedure we used to certify the non-equivalence of factorizations; the result is proved in Theorem B.1. To determine whether two factorizations are equivalent, we define invariants under the transformations introduced in Section B.1. The first invariant we use is the matrix rank invariant, given by

R({(U_r, V_r, W_r)}_{r=1}^{R}) = {(rank U_r, rank V_r, rank W_r)}_{r=1}^{R},

where {·} denotes an unordered tuple. R clearly satisfies the invariance under the transformations in Section B.1, as permutations, transpositions and multiplications by invertible matrices do not change matrix ranks. Using this invariant, we obtained 38 distinct factorizations with different invariants R (amongst more than 50,000 factorizations found by AlphaTensor). This invariant has been used extensively in prior work to check for equivalent factorizations (see e.g. [5]).
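As an illustration, the invariant R can be computed for Strassen's rank 7 factorization of T_2; representing the unordered tuple as a multiset (`Counter`) is our concretization:

```python
import numpy as np
from collections import Counter

# Strassen's rank-7 factors for T_2, each viewed as a 2x2 matrix.
U = [[[1, 0], [0, 1]], [[0, 0], [1, 1]], [[1, 0], [0, 0]], [[0, 0], [0, 1]],
     [[1, 1], [0, 0]], [[-1, 0], [1, 0]], [[0, 1], [0, -1]]]
V = [[[1, 0], [0, 1]], [[1, 0], [0, 0]], [[0, 1], [0, -1]], [[-1, 0], [1, 0]],
     [[0, 0], [0, 1]], [[1, 1], [0, 0]], [[0, 0], [1, 1]]]
W = [[[1, 0], [0, 1]], [[0, 0], [1, -1]], [[0, 1], [0, 1]], [[1, 0], [1, 0]],
     [[-1, 1], [0, 0]], [[0, 0], [0, 1]], [[1, 0], [0, 0]]]

def rank_invariant(Us, Vs, Ws):
    """Multiset of per-factor (rank U_r, rank V_r, rank W_r) triples."""
    return Counter(tuple(int(np.linalg.matrix_rank(np.array(m)))
                         for m in (u, v, w))
                   for u, v, w in zip(Us, Vs, Ws))

inv = rank_invariant(U, V, W)
```

For Strassen's factorization this yields one all-rank-2 factor and six all-rank-1 factors, consistent with the type polynomial discussed in Section B.3 (where, for T_4, the exponents go up to 4).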
To further distinguish between non-equivalent factorizations, we developed a new invariant K. This invariant is only applicable to a specific subset of factorizations, but on this subset it is much more granular than R. More specifically, the new invariant is applicable to those factorizations that have exactly one factor such that the product of its matrices is the identity, i.e., without loss of generality, U_1 V_1 W_1 = I. (Note that this property is itself invariant under the transformations in Section B.1.) This property was satisfied by many, but not all, factorizations found by AlphaTensor. For factorizations satisfying this property, we define the invariant by applying an (A, B, C) action to each rank-one tensor in the factorization so that the first factor becomes Ũ_1 = Ṽ_1 = W̃_1 = I, and computing a quantity of the remaining transformed factors that does not depend on the choice of this action. The following theorem states that K is an invariant. We further check that the factorizations obtained using AlphaTensor are not of product structure. To do so, we rely on the following result, which shows that it suffices to check that {(U_r, V_r, W_r)}_{r=1}^{49} is non-equivalent to one factorization of product form to infer that it is non-equivalent to all factorizations of product form.
Let {(U'_r, V'_r, W'_r)}_{r=1}^{7} be a rank 7 factorization of T_2 (i.e., a Strassen scheme for multiplying 2 × 2 matrices). Then the following two statements are equivalent: (i) {(U_r, V_r, W_r)}_{r=1}^{49} is equivalent to the product-form factorization built from {(U'_r, V'_r, W'_r)} via Eq. 1; (ii) {(U_r, V_r, W_r)}_{r=1}^{49} is equivalent to some factorization of product form. Here T_π denotes the transpose operator when the signature of π is −1, and the identity otherwise.
See Section I.2 for the proof.

B.3 Diversity of factorizations
The multitude of non-equivalent factorizations discovered by AlphaTensor demonstrates that this space is richer than previously thought. Without further manipulation or optimization for different objectives, the rank-49 factorizations of T_4 discovered by AlphaTensor have the following properties:
• Entries in {−4, −3, −2, −1, 0, 1, 2}. We were able to discover factorizations with entries outside of the set F = {−2, −1, 0, 1, 2} that the network can predict because, after converting an F-valued factorization obtained in a different basis back into the canonical basis, its entries can end up outside of F. The entries remain integral because we use unimodular basis change matrices.
• Matrix ranks. The matrix ranks of the factor vectors (when seen as 4 × 4 matrices) in the factorizations form 38 different types [8,5,1,9]. Represented as a three-variable polynomial, the type of Strassen² is p(x, y, z) = x⁴y⁴z⁴ + 12x²y²z² + 36xyz (there is 1 factor with all three vectors of matrix rank 4, 12 factors with all three vectors of matrix rank 2, and 36 factors with all three vectors of matrix rank 1). AlphaTensor discovers many factorizations of this type, but it also discovers factorizations realizing 37 new types.
• Sparsity. The number of nonzeros in our discovered factorizations ranges from 455 to 636.
• Cyclicity. Around 0.2% of the discovered factorizations are cyclic (closed under permuting (U, V, W) with a cyclic element of S_3); the vast majority are noncyclic.
• Numerical stability. [10] identified two principal quantities that govern the numerical error bounds of matrix multiplication algorithms: a prefactor Q and a stability factor E; lower values correspond to tighter bounds. For the Strassen² algorithm, Q = 24 and E = 144, while for the discovered factorizations Q ∈ [23, 39] and E ∈ [136, 1046]. In particular, AlphaTensor discovered an algorithm with Q = 23 and E = 136 without using a reward function that optimizes for numerical stability. Finally, the growth factor [11] of the factorizations ranges from 220.14 to 547.26.

C Structured matrix multiplication
We apply AlphaTensor to more general operations beyond matrix multiplication. Just like standard matrix multiplication, any bilinear operation can be represented by a tensor, and its low-rank decompositions correspond to efficient algorithms. Bilinear operations encompass a large set of computational problems: structured matrix multiplication (e.g., symmetric matrix multiplication), polynomial multiplication, and more customized bilinear operations commonly used in machine learning models, such as graph networks [12,13]. Here, we explore two use-cases: the circulant matrix-vector product (or, equivalently, circular convolution) in finite fields, and the skew-symmetric matrix-vector product.
For the former use-case, our goal is to demonstrate the generality and power of AlphaTensor by discovering the Fourier basis, a complex and foundational transformation in mathematics, in a finite field, where arithmetic rules are non-intuitive. For the latter, the optimal algorithm is unknown [14], and we use AlphaTensor to discover an algorithm that improves over the state-of-the-art algorithm. We employ a common methodology for both use-cases: using AlphaTensor, we compute low-rank decompositions for tensors corresponding to small instances; then, following human inspection, we generalize these algorithms to arbitrarily sized matrices/vectors.

C.1 Circular matrix-vector product in finite fields
The first problem we consider is the product between an n × n circulant matrix and a vector of length n. This operation is equivalent to computing the circular convolution between two vectors of length n: the first column of the matrix and the input vector. A naive implementation of this operation has bilinear complexity O(n²); in contrast, under certain conditions, there exists an optimal algorithm that achieves bilinear complexity O(n). This optimal algorithm consists of three steps: (i) computing the discrete Fourier transform (DFT) of each input vector (in a finite field, the cyclotomic DFT is used instead), (ii) multiplying the results element-wise, and (iii) computing the inverse DFT of the result.
Here we aim at rediscovering this optimal algorithm using AlphaTensor.
The tensor T^circ_n that represents the circular matrix-vector product has size n × n × n, and its entries are given by

T^circ_n[i, j, k] = δ{i + j ≡ k (mod n)},

where δ{·} is the indicator function (we use zero-based indexing here). The tensor T^circ_n therefore has n² nonzero elements, but its rank is n (provided the length-n DFT exists in the corresponding ring E of interest, which is always the case if E = C).
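The definition, and the rank-n DFT decomposition discussed below, can be checked numerically in F_17 (the concrete choice n = 4, z = 4 is ours; z = 4 is a primitive 4-th root of unity modulo 17):

```python
import numpy as np

p, n, z = 17, 4, 4        # F_17; z = 4 is a primitive 4-th root of unity mod 17
n_inv = pow(n, p - 2, p)  # Fermat inverse: 4^{-1} = 13 in F_17

# T_circ[i, j, k] = 1 iff i + j = k (mod n)
T = np.array([[[int((i + j) % n == k) for k in range(n)]
               for j in range(n)] for i in range(n)])

# Rank-n decomposition: u^(r)_k = v^(r)_k = z^{kr}, w^(r)_k = z^{-kr} / n
U = np.array([[pow(z, k * r, p) for r in range(n)] for k in range(n)])
W = np.array([[pow(z, (-k * r) % (p - 1), p) * n_inv % p for r in range(n)]
              for k in range(n)])
recon = np.einsum('ir,jr,kr->ijk', U, U, W) % p
```

The sum over r of z^{(i+j−k)r} is n when i + j ≡ k (mod n) and 0 otherwise (geometric series, since z has order n), so `recon` equals `T` entrywise in F_17.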
We focus on computations in finite fields. In particular, we set E to be the finite field of order ρ, where ρ is a prime power. This is challenging for two reasons. First, the action space becomes huge even for moderate values of n and ρ, e.g., for n = 8 and ρ = 17 (which are values that we consider), the action space is of the order of ρ 3n ≈ 10 29 . Second, multiplications and additions are performed according to the arithmetic in the finite field, and therefore interpreting the factor and tensor entries as integers may be a poor choice (e.g., for ρ = 17, the values 1 and 16 are both 1 unit away from 0, since 0 + 1 = 1 and 16 + 1 = 0). To account for these challenges, we adapt AlphaTensor as described next.
Methodology. We now describe the modifications of AlphaTensor that we use for this case.

• Vector-per-move. In the standard TensorGame, each move (action) corresponds to a triplet (u^(t), v^(t), w^(t)). Here, we modify TensorGame so that the factor triplet is split into three separate moves: the first move of the game corresponds to u^(1), the second one to v^(1), the third one to w^(1), the fourth one to u^(2), and so on. Consequently, the state of the game is augmented with the vectors already written down for the current factor triplet. The tensor S_t in the state is only updated after every third move, when a w^(t) vector is written down; the update is then analogous to the standard TensorGame. Thus, the main difference of this variant is that there are three times as many moves to reach the same residual tensor. While this has no impact on the score achieved in the game (as long as the maximum number of moves is increased threefold), there is a computational advantage: the move size is 3× smaller (which implies an exponential reduction in the size of the set of possible moves), albeit at the expense of 3× longer games.

• Network input. To help the network learn the arithmetic in finite fields, we pre-process the input as follows. For every input entry x, we compute all the products {x, 2x, . . . , (ρ − 1)x}. We concatenate the ρ − 1 resulting entries, pre-process them using four additional dense blocks, and expose the result as an additional input to both the torso and the policy head.

• Action canonicalization reward. Action canonicalization is particularly important in finite fields, where a given rank-1 tensor can be expressed using (ρ − 1)² equivalent triplets (λ_1 u, λ_2 v, λ_3 w) with λ_1 λ_2 λ_3 = 1 (mod ρ); e.g., there are 256 equivalent triplets for ρ = 17. In a finite field, we define the canonical form as having the first nonzero element of u and the first nonzero element of v equal to 1. Instead of imposing the canonicalization, we add a penalty to the reward whenever the agent outputs a factor triplet that is not in canonical form. Specifically, the penalty is −0.5 for each individual factor (u or v) that is not in canonical form.

• Permutation symmetry reward. Two actions (u, v, w) and (u′, v′, w′) can be applied in either order due to commutativity. Therefore, to reduce the number of equivalent factorizations, we use another type of canonicalization, defined over the whole factorization instead of over each individual factor. Specifically, we define the canonical representation of a factorization as the factorization in which the factors are sorted lexicographically. When the agent fails to output a factorization in its canonical form, we add a penalty reward equal to −1/R_limit for each pair of consecutive actions that are not played in canonical order. This reward is added in addition to the factor canonicalization reward. All synthetic demonstrations are also converted to their canonical form (considering both factor canonicalization and permutation symmetry).

• Synthetic demonstrations. When generating demonstrations, we ensure that the factor entries can take all possible values in the finite field. We set p_entry(0) = 0.5 and p_entry(v) = 1/(2(ρ − 1)) for v = 1, . . . , ρ − 1.

• Hyper-parameters. We set R_limit = 64 and allow all elements of E as possible factor entries: F = {0, 1, . . . , ρ − 1}. The policy head uses N_steps = 2, N_logits = ρ² (i.e., 289 for ρ = 17) and N_layers = 4, and at inference time (in MCTS) we sample N_samples = 64 actions. The value head is an implicit quantile network (IQN) head [15] with 8 samples, and it ignores the policy embedding input z_1. Signed permutations are not used; instead, we inject more diversity by setting N_cob = 300,000.
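The factor canonicalization described above can be sketched as follows (a minimal illustration in F_17; the function name and the use of Fermat inverses are ours):

```python
import numpy as np

p = 17

def canonicalize(u, v, w):
    """Rescale (u, v, w) -> (l1*u, l2*v, l3*w) with l1*l2*l3 = 1 (mod p), so
    that the first nonzero entries of u and v become 1; the rank-one tensor
    u (x) v (x) w is unchanged."""
    inv = lambda x: pow(int(x), p - 2, p)   # Fermat inverse in F_p
    l1 = inv(u[np.flatnonzero(u)[0]])
    l2 = inv(v[np.flatnonzero(v)[0]])
    l3 = inv(l1 * l2 % p)                   # l3 = (l1 * l2)^{-1} mod p
    return u * l1 % p, v * l2 % p, w * l3 % p

rng = np.random.default_rng(0)
u, v, w = (rng.integers(1, p, size=4) for _ in range(3))
u2, v2, w2 = canonicalize(u, v, w)
```

Since λ_1 λ_2 λ_3 = 1 (mod p), the rescaled triplet represents exactly the same rank-one tensor, which is why the penalty can steer the agent toward this unique representative without changing the game's semantics.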

Results.
We apply AlphaTensor to the tensor T^circ_n, setting ρ = 17 and n = 2, 4, 8, in a single-target setting (i.e., a separate experiment for each n). Extended Data Table 2 shows the obtained solutions. Despite the difficulty of the task and the enormous action space, AlphaTensor finds the optimal decomposition for the considered values of n. These decompositions correspond precisely to the Fourier basis, in which patterns can be detected through visual inspection; we color-coded such patterns for clarity in Extended Data Table 2. From these examples, one can generalize the solution to arbitrary n and field by setting u^(r)_k = v^(r)_k = z^{kr} and w^(r)_k = z^{−kr}/n, where 0 ≤ k, r < n and z is an n-th primitive root of unity (see below for details on this step). That is, U and V are the (cyclotomic) DFT matrices, while W corresponds to the inverse DFT. As the Fourier transform is known to be the optimal algorithm, this shows the capability of AlphaTensor to discover optimal algorithms in huge algorithm spaces.
Generalizing the factorizations found by AlphaTensor. Here we exemplify the process of generalizing the factorizations found by AlphaTensor on small tensors in the finite field of order 17 to arbitrarily sized tensors and arbitrary finite fields; this is the only step in the process that requires human intervention.
We start by showing an example for n = 4; the procedure is analogous for the sizes n = 2 and n = 8. For n = 4, consider the raw factorization (U, V, W) found by AlphaTensor. First, we note that these matrices can be made symmetric by permuting the factors (i.e., columns). This leads to a factorization (U′, V′, W′), where we have additionally re-written W′ by taking out the factor 4^{−1} = 13 in order to make w′^(1) similar to u′^(1).
Second, we note that the entries in these matrices involve only a subset of the elements of E; in particular, only the elements that are powers of 4 (since 1 = 4⁰, 4 = 4¹, 16 = 4², and 13 = 4³). Taking this into account, and applying the identity 4^x = 4^{x+4} (which holds in the finite field of order ρ = 17, where 4 is a primitive 4-th root of unity), we rewrite the factors in a form that is more convenient for generalization, in which z is an n-th primitive root of unity in the finite field; for example, an extrapolation to n = 16 in the same finite field can be built using z = 3. Note that this generalization is valid not only for other values of n; it is also not limited to ρ = 17: it applies to any finite field and any n such that an n-th primitive root of unity exists in the field. Furthermore, the generalization is valid beyond finite fields, as it also applies to complex-valued inputs by setting z to an n-th primitive root of unity in C, i.e., z = exp(2π√−1/n).
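The generalization beyond finite fields can be checked directly: with z = exp(2π√−1/n), the DFT factors decompose the circular-convolution tensor over C (a sketch; function names are ours):

```python
import numpy as np

def circular_tensor(n):
    # T[i, j, k] = 1 iff i + j = k (mod n): the circular-convolution tensor
    return np.array([[[float((i + j) % n == k) for k in range(n)]
                      for j in range(n)] for i in range(n)])

def dft_factors(n):
    """U = V = DFT matrix, W = inverse DFT, with z = exp(2*pi*i/n)."""
    z = np.exp(2j * np.pi / n)
    U = np.array([[z ** (k * r) for r in range(n)] for k in range(n)])
    W = np.array([[z ** (-k * r) / n for r in range(n)] for k in range(n)])
    return U, W

n = 8
T = circular_tensor(n)
U, W = dft_factors(n)
recon = np.einsum('ir,jr,kr->ijk', U, U, W)
```

The same geometric-series argument as in the finite-field case shows the reconstruction is exact (up to floating-point error) for any n, since a primitive n-th root of unity always exists in C.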

C.2 Skew-symmetric matrix-vector product
Here, we describe the second use-case of structured matrix multiplication: we use AlphaTensor to discover a new algorithm for computing the product between an n × n skew-symmetric matrix A and a vector b that outperforms state-of-the-art approaches.
The skew-symmetric matrix-vector product is a bilinear operation that can be represented by a tensor T^skew_n of size n(n−1)/2 × n × n. To build this tensor, we start from the general matrix-vector multiplication tensor T_{n,n,1} of size n² × n × n. We first compute an auxiliary tensor T^aux_n[i, j, k] = T_{n,n,1}[i, j, k] − T_{n,n,1}[i, k, j], and then remove all the slices of the auxiliary tensor that are linearly dependent, obtaining the n(n−1)/2 × n × n tensor T^skew_n. An explicit algorithm to build this tensor is presented in Algorithm C.14.
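The construction just described can be sketched as follows (a hedged reimplementation of the recipe; Algorithm C.14 gives the authors' exact procedure, and the indexing convention here is our choice):

```python
import numpy as np

def skew_tensor(n):
    """Build T_skew by the recipe in the text: antisymmetrize the last two
    modes of the matrix-vector tensor T_{n,n,1}, then keep the independent
    slices (pairs i < j)."""
    # T_{n,n,1}[(i, j), k, l] = d{j = k} d{i = l}: encodes c_l = sum_k A[l, k] b[k]
    T = np.zeros((n * n, n, n))
    for i in range(n):
        for j in range(n):
            T[i * n + j, j, i] = 1.0
    aux = np.zeros_like(T)
    for idx in range(n * n):
        aux[idx] = T[idx] - T[idx].T  # T_aux[i, j, k] = T[i, j, k] - T[i, k, j]
    # slice (j, i) equals minus slice (i, j), and diagonal slices vanish
    keep = [i * n + j for i in range(n) for j in range(n) if i < j]
    return aux[keep]

n = 4
Ts = skew_tensor(n)
```

Contracting `Ts` with the strictly-upper-triangular entries of a skew-symmetric A and a vector b reproduces A @ b, confirming that the retained n(n−1)/2 slices fully represent the bilinear map.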
Algorithm C.14: Builds the tensor representing the skew-symmetric matrix-vector multiplication of size n. Input: n (the size of both the matrix and the vector).

Results. Figure 4a in the main paper shows the decompositions obtained by AlphaTensor. Similarly to the previous use-case of structured matrix multiplication (Appendix C.1), we detect a pattern for small sizes n (color-coded in the illustration), which we generalize and prove for arbitrary n, yielding a general algorithm for the skew-symmetric matrix-vector product (Figure 4b in the main paper).
Theorem C.4. The number of multiplications required to multiply an n × n skew-symmetric matrix by a vector of dimension n is at most (n − 1)(n + 2)/2.

Proof. See Section I.3.
This algorithm is strictly better in terms of bilinear complexity than previously known algorithms [14]: it uses ∼½n² multiplications instead of ∼n². We also note that the discovered algorithm is asymptotically optimal, as ½n² matches the asymptotic lower bound obtained by counting the number of distinct elements of a skew-symmetric matrix. The discovered algorithm, which improves upon state-of-the-art methods, demonstrates the viability of AlphaTensor for discovering new and more efficient algorithms. We believe this methodology can be applied to other bilinear operations, yielding efficient algorithms that take the structure of the problem into account.

Generalizing the factorizations found by AlphaTensor.
Unlike the circulant matrix-vector case, in skew-symmetric matrix-vector multiplication there exist many decompositions of the same rank, even after factoring out permutation and factor symmetries. AlphaTensor indeed found many such valid decompositions beyond the ones presented in Figure 4a. Nevertheless, we noted that many of these decompositions present a peculiar structure: for each factor, either u or v is a one-hot vector. Even more interestingly, when u is a one-hot vector whose nonzero entry corresponds to element (i, j) of the matrix A, the corresponding vector v is nonzero only in elements i and j. Conversely, when v is a one-hot vector whose nonzero entry corresponds to the i-th element of the vector b, the corresponding u is nonzero only in the elements corresponding to (j, i) of matrix A for j in {1, . . . , n} (i.e., the elements of the matrix that get multiplied by the i-th element of the vector b). This insight led us to the hypothesis that the n × n skew-symmetric matrix-vector multiplication tensor can always be decomposed using (n + 2)(n − 1)/2 terms of the special forms described above. This is the strategy followed in the proof in Section I.3, which shows that this is indeed the case.

D Rapid tailored algorithm discovery

D.1 Benchmarking details
We use AlphaTensor to find provably correct matrix multiplication algorithms that are tailored to optimize the running time in a particular setting, where the setting includes the hardware (e.g., a V100 GPU or a TPU), the matrix size (e.g., multiplying a pair of 8,192 × 8,192 matrices), and the arithmetic precision (e.g., single-precision arithmetic). In these experiments we benchmark algorithms discovered by AlphaTensor on the fly, and use this benchmarking result in the reward function. That is, we set r′_t = r_t + λ b_t, where r_t is the rank reward described in the paper, b_t is the benchmarking reward (nonzero only at the terminal state), and λ is a user-specified coefficient. After every game that ends with a correct factorization of the matrix multiplication tensor, an AlphaTensor actor sends a remote request to a benchmarking server equipped with the target hardware (e.g., a V100 GPU). The benchmarking server builds the matrix multiplication algorithm defined by the incoming factorization and then applies just-in-time (JIT) compilation using JAX [16], a Python library for high-performance machine learning research. JAX JIT analyzes the given Python code and optimizes it for the target hardware. For example, it can decide to fuse a matrix multiplication followed by a matrix addition into a single CUDA GEMM call. JAX also controls the order of execution of operations and can schedule multiple ops to run in parallel. In many cases it is possible to outperform JAX JIT by writing custom CUDA code, but JAX JIT proved to be good enough in our experiments, with the additional benefit of being fully automatic. This allowed us to automatically optimize and benchmark every matrix multiplication algorithm proposed by AlphaTensor. Every actor has a dedicated benchmarking server, but also occasionally uses other benchmarking servers (indicated by dashed lines) to increase the precision for the most promising algorithms.
To implement the benchmarking, we were guided by the information in [17], as well as the Python timeit module documentation [18]. To reduce benchmarking noise, timeit employs three tricks: it runs the function being benchmarked many times (such that the overall running time of the loop is at least 0.2 seconds); it disables Python garbage collection for the duration of benchmarking; and it repeats the loop three times, reporting the minimal timing (to be more robust to spikes in the timing due to interruptions from the OS and other users of the machine).
We followed all three practices with a single modification. A JAX JIT-compiled function returns a future instead of the immediate result, scheduling the computation on the GPU. As a result, running the matrix multiplication function multiple times in a loop can schedule the function multiple times and potentially run the calls in parallel, which can skew the benchmarking results. We therefore opted to compute, in the inner loop, a sequence of matrix multiplications that all depend on each other, namely six sequential matrix multiplications (A(A(A(A(A(AB)))))), and used the block_until_ready() method on the final output. In this way, the matrix multiplications in the inner loop run sequentially, but the GPU is never blocked waiting for the CPU to schedule the next multiplication, since scheduling the next multiplication on the CPU can be done in parallel with computing the current multiplication on the GPU.
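The timing scheme described above can be sketched as follows; this is an illustrative stand-in using NumPy (with JAX one would JIT-compile the function and call block_until_ready() on the final output), and all names are ours:

```python
import gc
import time

import numpy as np

def bench_min_of_repeats(fn, repeats=3, min_time=0.2):
    """timeit-style measurement (sketch of the three practices in the text):
    loop until at least min_time has elapsed, repeat, report the minimum
    per-call time, with garbage collection disabled."""
    gc.disable()
    try:
        timings = []
        for _ in range(repeats):
            loops, elapsed = 0, 0.0
            start = time.perf_counter()
            while elapsed < min_time:
                fn()
                loops += 1
                elapsed = time.perf_counter() - start
            timings.append(elapsed / loops)
        return min(timings)
    finally:
        gc.enable()

A = np.random.default_rng(0).standard_normal((256, 256))
B = np.random.default_rng(1).standard_normal((256, 256))

def chained_matmuls():
    # Six sequential, mutually dependent multiplications A(A(A(A(A(AB)))));
    # with JAX one would call .block_until_ready() on the final output.
    out = B
    for _ in range(6):
        out = A @ out
    return out

t = bench_min_of_repeats(chained_matmuls)
```

Because each multiplication consumes the previous result, an asynchronous runtime cannot overlap the six calls, which is the point of the chaining trick.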
When implementing this approach, we found that the variance of the benchmarking was still large. For example, when using this approach to benchmark an algorithm discovered by AlphaTensor, the estimated speed-up against the baseline Strassen² algorithm varied from 4% to 11.6% (see Figure D.2). We identified the main source of this noise to be GPU frequency scaling (the GPU driver can change the clock frequency on the fly). We implemented two additional techniques to improve the stability of the measurements.
First, in addition to measuring the running time of the algorithm of interest, we also measure the running time of the baseline Strassen² algorithm and report the ratio between the two. The idea is that, since the two measurements are done one after another, the clock frequency is likely to be similar when benchmarking the two algorithms. This makes our rewards comparable even for two factorizations benchmarked hours apart with very different clock frequencies, since each ratio represents the speed-up with respect to the baseline computed at an (almost) fixed frequency. We repeat this process of benchmarking the target algorithm and the baseline several times (4 in the experiments) and report the median ratio. As seen from numerical experiments, this additional trick indeed significantly reduces the variance of the measurements (see Figure D.2).
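The interleaved-baseline scheme can be sketched like this (illustrative only; in the real setup both callables would be JIT-compiled matrix multiplication algorithms running on the target hardware):

```python
import statistics
import time

import numpy as np

def bench(fn, loops=20):
    """Average per-call time over a fixed number of back-to-back calls."""
    start = time.perf_counter()
    for _ in range(loops):
        fn()
    return (time.perf_counter() - start) / loops

def speedup_vs_baseline(candidate, baseline, n_pairs=4):
    """Median of interleaved candidate/baseline timing ratios: measuring the
    two back-to-back means both see a similar clock frequency (sketch)."""
    ratios = []
    for _ in range(n_pairs):
        t_candidate = bench(candidate)
        t_baseline = bench(baseline)  # measured immediately after
        ratios.append(t_baseline / t_candidate)
    return statistics.median(ratios)

A = np.random.default_rng(0).standard_normal((128, 128))
ratio = speedup_vs_baseline(lambda: A @ A, lambda: A @ A)
```

Reporting a ratio rather than an absolute time is what makes rewards comparable across benchmarking sessions run under different clock frequencies.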
Second, to be even more agnostic to the state of the benchmarking server, we benchmark the most promising algorithms on multiple benchmarking servers and report the median across those. Namely, for algorithms that are better than 75% of the other algorithms seen so far we use the median ratio across 2 benchmarking servers, for algorithms that are better than 90% of the algorithms seen so far we use the median across 5 servers, and for algorithms that are better than 95% of the other algorithms we use the median across 17 servers.
When measuring the running time of an algorithm, we preload the input data (the two block matrices) into GPU memory and perform a series of 3 warm-up runs to make sure the GPU is ready for benchmarking. We do not count the time to transfer the data to and from GPU memory, as in most cases the GPU performs more than one matrix multiplication in a row and the data transfer time is amortized. We also do not count the time to split the input matrices into blocks and to assemble the resulting matrix from the blocks, as any of the fast matrix multiplication algorithms can be implemented in CUDA C using memory addressing that makes the block splitting and concatenation essentially free. See the benchmarking pseudocode in Figure D.

Figure D.2: The two plots correspond to the two benchmarking schemes discussed in the text: a "simple" scheme based on the Python best practices, and one "with interleaving baseline" that includes the additional trick discussed above to circumvent the frequency-scaling issue. The "exact" speed-up is computed as the median of all the speed-ups estimated by both methods.
In the AlphaTensor runs, we use Strassen 2 as the baseline algorithm for benchmarking. However, to report the results, our discovered algorithms are re-benchmarked against standard matrix multiplication at a higher precision: we use the median ratio across 200 repeats. To further increase the accuracy, we fixed the GPU clock frequency to the maximum possible value (1530 MHz for a V100 GPU). For example, when repeatedly estimating the speed-up of Strassen 2 against the standard baseline using this approach, we obtained estimates from 4.15% to 4.3%, with a standard deviation of 0.06 percentage points.
For TPUs we follow exactly the same benchmarking procedure, with two exceptions: (a) we do not fix the clock frequency (on TPUs the clock is always constant); (b) we use matrices of data type bfloat16 (in contrast to float32 used for GPUs), as this is the more common setting among TPU users.

D.2 Experimental setup
We adapt AlphaTensor to work with the benchmarking reward as follows. The immediate reward is set to $r'_t = r_t + \lambda b_t$, where $r_t$ is the reward described in the paper and $b_t$ is the benchmarking reward, equal to zero in all intermediate states and, for a terminal state with $\mathcal{S}_t = 0$, equal to
$$b_t = \frac{\text{time taken to execute the baseline algorithm}}{\text{time taken to execute the found algorithm}} - \gamma,$$
where $\gamma$ is a hyper-parameter. As the algorithms found initially are of high rank (and hence not efficient), we initially set $\lambda = 0$ and activate the benchmarking reward only in late stages of training (i.e., once AlphaTensor discovers sufficiently small ranks). In our experiments we used $\lambda = 5.0$, $\gamma = 0.8$ and Strassen 2 as the baseline algorithm. For synthetic demonstrations, the benchmarking reward is set to 0, as such tensors do not correspond to a matrix multiplication algorithm. Moreover, we only benchmark games played in the canonical basis, as we did not find the use of other bases helpful.
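The reward shaping above can be sketched in a few lines. The constants follow the text ($\lambda = 5.0$, $\gamma = 0.8$); the function names and the example numbers are illustrative.

```python
# Shaped reward r'_t = r_t + lambda * b_t, where the benchmarking term
# b_t = (baseline time / found-algorithm time) - gamma is paid only at a
# terminal state with S_t = 0 and is zero otherwise.
LAMBDA = 5.0
GAMMA = 0.8

def benchmarking_reward(baseline_time, found_time, is_terminal_zero_state):
    if not is_terminal_zero_state:
        return 0.0  # b_t is zero in all intermediate states
    return baseline_time / found_time - GAMMA

def shaped_reward(r_t, baseline_time, found_time, is_terminal_zero_state):
    b_t = benchmarking_reward(baseline_time, found_time,
                              is_terminal_zero_state)
    return r_t + LAMBDA * b_t

# A found algorithm 4% faster than the baseline, at the terminal state:
print(shaped_reward(-1.0, 1.04, 1.0, True))
```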
In addition to the value head that predicts the rank of the tensor, we train an auxiliary value head and an auxiliary reward head. The auxiliary reward head is trained to predict $b_t$, while the auxiliary value head is trained to predict the auxiliary return (future auxiliary reward) at any state. The value used in MCTS is a linear combination of the main value and the auxiliary value, $v = v_{\text{main}} + \lambda v_{\text{auxiliary}}$. We use the rewards predicted by the reward head in MCTS simulations, as benchmarking an algorithm to compute the true environment reward $b_t$ is time-consuming and would significantly slow down MCTS. These auxiliary heads are trained only on self-play games, i.e., we do not train them on demonstrations.
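The value combination used in MCTS is a one-liner; we spell it out with $\lambda = 5.0$ as in the reward shaping. The head outputs here are placeholder numbers, not real network predictions.

```python
# Combined value used at MCTS leaves: v = v_main + lambda * v_auxiliary,
# mixing the rank-based value with the predicted benchmarking return.
LAMBDA = 5.0

def mcts_value(v_main, v_auxiliary):
    return v_main + LAMBDA * v_auxiliary

print(mcts_value(-40.0, 0.1))  # placeholder head outputs
```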
In addition to the usual inputs (described in the Methods section), we provide an extra input to the neural network consisting of all previous actions (factors) played to reach the current state (tensor). Unlike the residual tensor, which does not contain information about the matrix multiplication algorithm, the factors encode the algorithm and hence provide important information to the auxiliary heads.

D.3 Additional results
We now show that by optimizing algorithms for larger sizes, one can obtain further speed-ups. Our experiments are performed on a TPU v2. We observe a substantial speed-up when optimizing the algorithm for the intended size: for example, when multiplying matrices of size 16,384, the algorithm optimized for size 8,192 yields an 11.2% speed-up over the standard algorithm, whereas the AlphaTensor-discovered algorithm optimized for matrices of size 16,384 achieves a 13.5% speed-up. The baseline Strassen 2 algorithm is also shown. We report the median over 1,000 runs; the standard deviation over runs is < 0.4 percentage points.

E Finding border rank with AlphaTensor
AlphaTensor is a general method that supports any tensor decomposition environment and any reward function. One example is AlphaTensor's ability to operate in finite fields, where the factor and tensor entries live in a finite set and the multiplication and addition arithmetic follows the rules of the finite field. Another example is AlphaTensor's ability to find efficient algorithms tailored to specific hardware.
To demonstrate the ability of AlphaTensor to decompose tensors in different settings, we show a use-case that departs from standard low-rank tensor decomposition. Specifically, we aim at finding the border rank of a tensor. The border rank of a tensor $\mathcal{T}$ is the minimum integer $R$ such that the decomposition $\sum_{r=1}^{R} u^{(r)}(\varepsilon) \otimes v^{(r)}(\varepsilon) \otimes w^{(r)}(\varepsilon)$ converges to $\mathcal{T}$ as the scalar $\varepsilon \to 0$, for some factors $\{u^{(r)}(\varepsilon), v^{(r)}(\varepsilon), w^{(r)}(\varepsilon)\}$ that depend on $\varepsilon$. We emphasize that the goal of this section is to showcase the flexibility of AlphaTensor, as opposed to discovering new results.
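A classic worked example of this definition (independent of AlphaTensor's search) is the tensor $W = e_1 \otimes e_1 \otimes e_2 + e_1 \otimes e_2 \otimes e_1 + e_2 \otimes e_1 \otimes e_1$, which has rank 3 but border rank 2 via the $\varepsilon$-dependent rank-2 decomposition $\frac{1}{\varepsilon}\left[(e_1 + \varepsilon e_2)^{\otimes 3} - e_1^{\otimes 3}\right] = W + O(\varepsilon)$. The sketch below verifies numerically that the approximation error shrinks with $\varepsilon$.

```python
# Border rank demo: a rank-2 eps-dependent decomposition converging to
# the rank-3 tensor W as eps -> 0.
def outer3(u, v, w):
    return [[[u[i] * v[j] * w[k] for k in range(2)] for j in range(2)]
            for i in range(2)]

def add(t, s, scale=1.0):
    return [[[t[i][j][k] + scale * s[i][j][k] for k in range(2)]
             for j in range(2)] for i in range(2)]

e1, e2 = [1.0, 0.0], [0.0, 1.0]
W = add(add(outer3(e1, e1, e2), outer3(e1, e2, e1)), outer3(e2, e1, e1))

def approximation_error(eps):
    x = [1.0, eps]  # e1 + eps * e2
    approx = add(outer3(x, x, x), outer3(e1, e1, e1), scale=-1.0)
    approx = [[[a / eps for a in row] for row in mat] for mat in approx]
    return max(abs(approx[i][j][k] - W[i][j][k])
               for i in range(2) for j in range(2) for k in range(2))

print(approximation_error(1e-1) > approximation_error(1e-3))  # error -> 0
```

The residual is $\varepsilon(e_1 e_2 e_2 + e_2 e_1 e_2 + e_2 e_2 e_1) + \varepsilon^2 e_2 e_2 e_2$, so the maximum error equals $\varepsilon$ exactly.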

Methodology.
To find the border rank of a tensor, we modify AlphaTensor and TensorGame as follows.
• Ring. We set the ring $E$ to be the set of Laurent polynomials in $\varepsilon$. Thus, the factor entries are Laurent polynomials in the variable $\varepsilon$, which we constrain to lie in a finite set $F$ with $|F| = 51$ possible entries. The addition and multiplication operations follow standard polynomial arithmetic; in particular, the entries of the state tensor $\mathcal{S}_t$ are Laurent polynomials that are not generally in the set $F$.
• Disallowed actions. We disallow actions (factor triplets) $u^{(r)}(\varepsilon) \otimes v^{(r)}(\varepsilon) \otimes w^{(r)}(\varepsilon)$ that lead to either the all-zero tensor or a rank-one tensor all of whose entries are $O(\varepsilon)$, since such actions make no progress towards a border-rank decomposition. Moreover, for simplicity, we also disallow actions that lead to a rank-one tensor containing any Laurent polynomial with a term of degree strictly less than $-3$.
• Game end and reward. We terminate the game at step $R$ if all the terms of the residual tensor $\mathcal{T} - \sum_{r=1}^{R} u^{(r)}(\varepsilon) \otimes v^{(r)}(\varepsilon) \otimes w^{(r)}(\varepsilon)$ are $O(\varepsilon)$, or when the step limit is reached. The reward is $-1$ for each step taken during the game, plus an additional negative reward equal to the number of terms of the residual tensor that are not $O(\varepsilon)$.
• Network input. To form the input of the network at step $t$, we concatenate 4 tensors obtained from $\mathcal{S}_t$. These 4 tensors contain integer-valued entries instead of Laurent polynomials: we retain the coefficients of $\mathcal{S}_t$ corresponding to $\varepsilon^d$ for a fixed value of $d$, and by letting the exponent $d$ take values in $\{-3, -2, -1, 0\}$ we obtain the 4 tensors that we concatenate together (the coefficients corresponding to positive degrees $d$ have no relevance as $\varepsilon \to 0$, so we ignore them).
• Action canonicalization. We canonicalize the factor triplets similarly to standard AlphaTensor. We define the canonical form of $u(\varepsilon)$ as the factor $u'(\varepsilon) = \lambda u(\varepsilon)$ (with $\lambda \in \{-1, +1\}$) such that the Laurent polynomial corresponding to the first nonzero entry of $u'(\varepsilon)$ has a positive coefficient for the term with smallest degree (e.g., the first nonzero entry is $\varepsilon^{-1} - 1$ instead of $-\varepsilon^{-1} + 1$). Moreover, whenever all the entries of a factor triplet $(u(\varepsilon), v(\varepsilon), w(\varepsilon))$ are polynomials with non-negative exponents of $\varepsilon$, we canonicalize the triplet by setting $\varepsilon = 0$.
• Synthetic demonstrations. To generate synthetic demonstrations, we follow the same approach as when optimizing the standard rank: we sample random factors $(u^{(r)}(\varepsilon), v^{(r)}(\varepsilon), w^{(r)}(\varepsilon))$ and create the tensor $\mathcal{T}(\varepsilon) = \sum_{r=1}^{R} u^{(r)}(\varepsilon) \otimes v^{(r)}(\varepsilon) \otimes w^{(r)}(\varepsilon)$, so that the demonstration is $(\{(u^{(r)}(\varepsilon), v^{(r)}(\varepsilon), w^{(r)}(\varepsilon))\}_{r=1}^{R}, \mathcal{T}(\varepsilon))$. To generate each factor, we sample its entries independently. For each entry, we generate a Laurent polynomial with either one or two non-zero coefficients, with probabilities 0.7 and 0.3, respectively (each coefficient is $1$ or $-1$ with equal probability). We assign each coefficient to a term $\varepsilon^d$ of the polynomial, where $d$ is itself a random variable with $p(d=0)=0.7$, $p(d=1)=p(d=-1)=0.1$, and $p(d=2)=p(d=-2)=0.05$. We reject factor triplets that do not constitute valid actions.
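The Laurent-polynomial arithmetic and the sign canonicalization above can be sketched with a dictionary representation. This is a hypothetical illustration of the ring operations, not AlphaTensor's implementation: each polynomial is a dict mapping degree to integer coefficient, so $\varepsilon^{-1} - 1$ is `{-1: 1, 0: -1}`.

```python
# Laurent polynomials as degree -> coefficient dicts, with addition,
# multiplication, and the sign canonicalization rule from the text
# (flip the sign so the lowest-degree term has a positive coefficient).
def poly_add(p, q):
    out = dict(p)
    for d, c in q.items():
        out[d] = out.get(d, 0) + c
        if out[d] == 0:
            del out[d]          # drop cancelled terms
    return out

def poly_mul(p, q):
    out = {}
    for d1, c1 in p.items():
        for d2, c2 in q.items():
            out[d1 + d2] = out.get(d1 + d2, 0) + c1 * c2
    return {d: c for d, c in out.items() if c != 0}

def canonical_sign(p):
    """Return +1 or -1 so that the lowest-degree term of sign*p is positive."""
    if not p:
        return 1
    return 1 if p[min(p)] > 0 else -1

# (eps**-1 - 1) * (eps + 1) = eps**-1 - eps  (the constant terms cancel)
print(poly_mul({-1: 1, 0: -1}, {1: 1, 0: 1}))
```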
For completeness, we show next one of the border-rank-5 decompositions of $\mathcal{T}_{\text{Bini}}$ found by AlphaTensor (where the factors are stacked vertically). Similarly, we show a border-rank-10 decomposition of $\mathcal{T}_{2,2,3}$ found by AlphaTensor.

Table 1: Results of repeating the experiment that discovered rank-49 factorizations of the general matrix multiplication tensor $\mathcal{T}_4$ (over standard arithmetic), each time disabling one of the components of AlphaTensor. AlphaTensor (no ablation): rank 49. In the experiment without self-play, we trained AlphaTensor on synthetic demonstrations only, and used the resulting neural network in MCTS to seek a decomposition of the target tensor $\mathcal{T}_4$.

F Ablations
Apart from the components ablated in Table 1, another important feature of AlphaTensor is the ability to train a single agent for decomposing multiple target tensors. When working over modular arithmetic Z 2 , an agent trained on all matrix multiplication tensors T n,m,p with n, m, p ≤ 5 discovers a rank-47 factorization of T 4 , whereas an agent trained solely on T 4 only discovers rank-49 factorizations. We performed an analysis on one of the experiments that discovered a rank-47 factorization of T 4 and found that the agent is able to transfer knowledge between the three target tensors T 3,3,4 , T 3,4,4 , and T 4 . Specifically, a new agent trained from scratch but initialized with the discovered solutions to both T 3,3,4 and T 3,4,4 in its demonstrations buffer is very quickly able to rediscover rank-47 factorizations of T 4 . This is not the case when the agent is initialized with solutions to T 3,3,4 only, or with solutions to T 3,4,4 only.

G Hyperparameters
The hyper-parameters for specific sub-components are listed where those components are described: the configuration of the network architecture in Appendix A.1, the synthetic demonstration generation procedure in Appendix A.2, and the distribution and sampling of basis changes in Appendix A.3. Table 2 lists the remaining general hyper-parameters.

H Combining smaller factorizations into bigger ones
In this section, we describe and extend the divide-and-conquer technique [25,26] that allows one to combine factorizations of smaller matrix multiplication tensors into factorizations of larger matrix multiplication tensors. For example, this technique allows us to obtain a rank-287 factorization of $\mathcal{T}_{5,8,10}$ (improving over the previously known rank 291 [25]) by combining a known rank-33 factorization of $\mathcal{T}_{2,4,5}$ with a rank-47 factorization of $\mathcal{T}_{3,4,5}$ found by AlphaTensor. We refer to the supplementary data for all such decompositions.

H.1 Illustrative example
Let us describe the divide-and-conquer approach using $\mathcal{T}_9$ as an illustrative example. Specifically, let us find an algorithm for multiplying two $9 \times 9$ matrices $A = (a_{ij})$ and $B = (b_{ij})$. To apply the divide-and-conquer approach, we need a high-level matrix multiplication algorithm, i.e., a factorization of some matrix multiplication tensor. The choice of the tensor and its factorization affects the resulting number of factors in the decomposition of $\mathcal{T}_9$; hence, in practice, it is beneficial to consider all available choices (i.e., all available sizes and algorithms) and pick the one leading to the best result. To make the exposition concrete, let us consider $\mathcal{T}_{2,3,3}$. (This is a generalization of the scheme described in [25], which only considered factorizations of $\mathcal{T}_2$ as the high-level matrix multiplication algorithm.) Consider a high-level algorithm representing a rank-15 decomposition of $\mathcal{T}_{2,3,3}$ (this particular decomposition was discovered by AlphaTensor). We can apply this high-level algorithm to multiply block matrices, filling the blocks with the elements of the original matrices $A$ and $B$ (padding with zeroes to match the sizes).
For example, one can use the scheme shown in Eq. (3). There are multiple ways in which the elements of $A$, $B$, and $C$ can be placed into the block matrices, depending on the specific structure of the zero padding. Again, this choice affects the resulting number of factors in the decomposition of $\mathcal{T}_9$, as described below. In practice, we try all combinations and pick the best one.
We now analyze the steps of the high-level algorithm when applied to the blocks of the matrices in Eq. 3.
Step 1, $h_1 = A_{23}(B_{21} - B_{23} - B_{31})$, involves multiplying a $6 \times 3$ matrix by a $3 \times 3$ matrix, and thus can be done using an algorithm for $\mathcal{T}_{6,3,3}$ (which involves 40 multiplications). Steps 2, 3, and 4 are similar. Step 5, however, is different: it involves $A_{12}$, which is of size $6 \times 3$, but half of its rows are zero, so we can use $\mathcal{T}_3$ (for which we have a rank-23 decomposition) for the computations in this step. The sparsity pattern of the block matrix $C$ also affects the resulting number of terms in the decomposition. For example, step 10 of the high-level algorithm involves only dense matrix multiplications, and thus one might think it can only be done via $\mathcal{T}_{6,3,3}$. However, $h_{10}$ is only used later for computing $C_{11}$, whose last three rows are zero, so there is no need to compute the last three rows of $h_{10}$, and it can therefore be done via $\mathcal{T}_3$.
Exploiting the sparsity patterns in all 15 steps, we obtain a final decomposition of rank 498 (this decomposition involves 6 factorizations of $\mathcal{T}_3$ and 9 factorizations of $\mathcal{T}_{6,3,3}$). We refer to the accompanying supplementary data for all the decompositions.
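The rank count follows directly from the step analysis above: 15 high-level steps, of which 6 can exploit sparsity and use the rank-23 factorization of $\mathcal{T}_3$, while the remaining 9 require the rank-40 factorization of $\mathcal{T}_{6,3,3}$.

```python
# Arithmetic check of the rank quoted in the text for the T_9 example.
RANK_T3 = 23    # rank-23 decomposition of T_3 (3x3 by 3x3)
RANK_T633 = 40  # rank-40 decomposition of T_{6,3,3} (6x3 by 3x3)

total_rank = 6 * RANK_T3 + 9 * RANK_T633
print(total_rank)  # -> 498
```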

H.2 General approach
For each target $\mathcal{T}_{n,m,p}$ with $n, m, p \in \{3, \ldots, 12\}$, we try all the factorizations discovered by AlphaTensor as the high-level factorization, and all non-equivalent ways of placing the original matrix elements into the blocks of the high-level matrices (i.e., the sparsity patterns). Since the results depend on the particular high-level factorization, we are able to leverage the diversity of the decompositions found by AlphaTensor.
This generalized divide-and-conquer approach subsumes both adding and multiplying factorizations. Indeed, by using the trivial decomposition of $\mathcal{T}_{1,1,2}$ as the high-level algorithm, one can sum factorizations of $\mathcal{T}_{n,m,p}$ and $\mathcal{T}_{n,m,p'}$ into a factorization of $\mathcal{T}_{n,m,p+p'}$. Similarly, by dividing the original matrices $A$, $B$ and $C$ into blocks without introducing any zero padding, one recovers the standard multiplication of factorizations.

Proof. By assumption we have
$$X_p \otimes Y_p \otimes Z_p = (A\, T_\pi(F^{\pi(1)}_p)\, B^{-1}) \otimes (B\, T_\pi(F^{\pi(2)}_p)\, C^{-1}) \otimes (C\, T_\pi(F^{\pi(3)}_p)\, A^{-1}).$$
Multiplying these two identities, we obtain the claimed equality, where we used the identities $(AC) \otimes_K (BD) = (A \otimes_K B)(C \otimes_K D)$ and $(B \otimes_K B')^{-1} = B^{-1} \otimes_K B'^{-1}$.
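The mixed-product identity used in the proof, $(AC) \otimes_K (BD) = (A \otimes_K B)(C \otimes_K D)$, can be spot-checked numerically. The pure-Python helpers below stand in for a linear-algebra library and are only illustrative.

```python
# Numerical spot-check of the Kronecker mixed-product identity on small
# random integer matrices: kron(A@C, B@D) == kron(A, B) @ kron(C, D).
import random

def matmul(x, y):
    return [[sum(x[i][t] * y[t][j] for t in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def kron(a, b):
    n, m = len(a), len(a[0])
    p, q = len(b), len(b[0])
    return [[a[i // p][j // q] * b[i % p][j % q]
             for j in range(m * q)] for i in range(n * p)]

random.seed(0)
rand = lambda: [[random.randint(-2, 2) for _ in range(2)] for _ in range(2)]
A, B, C, D = rand(), rand(), rand(), rand()
lhs = kron(matmul(A, C), matmul(B, D))
rhs = matmul(kron(A, B), kron(C, D))
print(lhs == rhs)  # -> True
```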

Lemma I.2.
Let $\mathcal{T}$ be a tensor with nonnegative entries and of rank $R > 0$. Note that the denominator is nonzero, as otherwise the tensor $\mathcal{T}$ would be equal to zero. Similarly, we have .