Introduction

In the last decade, applications of neural networks and machine learning have grown in number. With their success in real-world settings, the complexity of the underlying mathematical models has also increased. These models are represented by parameters stored in large matrices, such as the weight matrices of a neural network. Reducing the memory size of such models is an emerging issue for deploying these applications in the real world, especially in edge computing.

One of the authors proposed a lossy decomposition scheme for such matrices into a product of an integer matrix and a real matrix (hereinafter referred to as integer decomposition)1. The idea behind this matrix decomposition is that the integer matrix has a smaller memory footprint than a real matrix in floating-point representation. An integer variable uses only one or two bits for binary or ternary representation, whereas a floating-point variable uses 32 or 64 bits. Thus, the compression rate is approximated by the ratio of the sizes of the original matrix and the real matrix. For a specific machine learning task, the memory footprint is reduced to 1/3, and computation is 36.9 times faster than the original, at the cost of a 1.5% loss in accuracy.

This matrix compression optimises the integer and real matrices simultaneously. The problem is a mixed-integer non-linear programming (MINLP) problem, which is NP-hard. The original proposal therefore includes a greedy strategy to reduce the optimisation complexity; however, the greedy method does not, in general, reach the best solution.

In this paper, we improve the procedure of the integer decomposition through recently developed black-box optimisation (BBO) techniques for binary variables2,3,4. These methods use a data set of binary inputs and real-valued outputs to produce a surrogate model, i.e. a pseudo-Boolean function, and optimise the model using Ising solvers. The tentative optimal solution from the Ising solver can then be evaluated by the black-box function, e.g. by conducting experiments, simulations or other procedures depending on the problem. For the matrix decomposition, we transform our MINLP problem into a non-linear integer programming (NLIP) problem. Although we still have an explicit form of the cost function in integer variables, we can treat this function as a black-box function. This paper demonstrates a binary-variable problem, but it is straightforward to encode an integer variable with binary variables.

Data acquisition is not a hard task thanks to the explicit form of the function; thus, the number of iterations and evaluations of the cost function is not a limiting factor in this case. We conduct \(O(n^2)\) iterations of BBO, which distinguishes this study from the typical setting with an expensive-to-evaluate function.

Lossy data compression is commonly used in image and audio processing, for example in JPEG and MP3. Those algorithms exploit the limits of human perception in the frequency and time domains. General-purpose algorithms, such as low-rank approximation, non-negative matrix factorisation and non-negative/binary matrix factorisation, have also been proposed5,6,7. These methods are not applicable to the integer decomposition because they decompose the target into real or non-negative matrices.

Recently, a general matrix decomposition method has been proposed8. That method optimises the integer matrix and the real matrix separately, whereas our algorithm optimises them simultaneously. In addition, Thompson sampling of an MINLP surrogate model has been studied with commercial mixed-integer programming (MIP) solvers9. Our algorithm provides solutions to MIP problems using Ising solvers, whereas such commercial solvers do not disclose the details of their algorithms.

A number of studies on the application of BBO of binary variables with Ising solvers have been carried out, including an aero-structural problem2, nanomaterials4, spin glasses10, commonality optimisation11 and chemical structure optimisation12,13. This paper not only introduces another application but also proposes a novel algorithm for solving MINLP problems by BBO using an Ising solver.

This paper is organised as follows. The “Integer decomposition” section defines the integer decomposition and derives a black-box formulation of the problem. Variations of BBO algorithms and the Ising solvers tested in this study are described in the “Black-box optimisation” section. The “Results” section is dedicated to the results of these algorithms and solvers. The final section summarises and discusses the results.

Integer decomposition

Original formulation of the integer decomposition

The integer decomposition approximates a target matrix \(\varvec{W}\) by a product of a binary (or ternary) matrix \(\varvec{M}\) and a real matrix \(\varvec{C}\),

$$\begin{aligned} \varvec{W} \sim \varvec{V} = \varvec{MC}. \end{aligned}$$
(1)

The matrix sizes of \(\varvec{W}\), \(\varvec{M}\) and \(\varvec{C}\) are \(N \times D\), \(N \times K\) and \(K \times D\), respectively. This decomposition is parameterised by K, which controls the approximation accuracy.

If \(K = N\) (no compression), \(\varvec{M}\) and \(\varvec{C}\) can be \(\varvec{I}\) and \(\varvec{W}\),

$$\begin{aligned} \varvec{V} = \varvec{M}\varvec{C} = \varvec{I}\varvec{W} = \varvec{W}. \end{aligned}$$
(2)

In matrix compression, K is smaller than N; thus, we cannot reproduce \(\varvec{W}\) from \(\varvec{M}\) and \(\varvec{C}\). The optimal binary and real matrices are then obtained by minimising the cost function expressed by the \(L_2\) matrix norm,

$$\begin{aligned} \mathop {\mathrm{arg\, min}}\limits _{\begin{array}{c} \varvec{M} \in \{-1,1\}^{N \times K}\\ \varvec{C} \in \mathbb {R}^{K \times D} \end{array}} \left\| \varvec{W}-\varvec{MC}\right\| _2^2. \end{aligned}$$
(3)

This is an MINLP problem; therefore, there is no general algorithm to efficiently find the exact solution.

The original algorithm finds the decomposition as a series of products,

$$\begin{aligned} \varvec{V} = \sum _i^K {\varvec{m}_i\varvec{c}_i^\mathsf {T}}, \end{aligned}$$
(4)

where \(\varvec{m}_i\) and \(\varvec{c}_i\) are N- and D-dimensional vectors, respectively, calculated step by step from \(i=1\) to \(i=K\). The i-th vectors \(\varvec{m}_i\) and \(\varvec{c}_i\) are optimised using the previously optimised vectors \(\varvec{m}_j\) and \(\varvec{c}_j\) \((j=1,\ldots ,i-1)\),

$$\begin{aligned} \mathop {\mathrm{arg\, min}}\limits _{\begin{array}{c} \varvec{m}_i \in \{-1,1\}^N\\ \varvec{c}_i \in \mathbb {R}^D \end{array}} \left\| \left( \varvec{W}-\sum _j^{i-1}{\varvec{m}_j\varvec{c}_j^\mathsf {T}}\right) -\varvec{m}_i\varvec{c}_i^\mathsf {T}\right\| _2^2. \end{aligned}$$
(5)

The search space is drastically reduced from NK to N dimensions, although convergence to the exact solution is not guaranteed. In each step, this algorithm finds the best rank-one approximation of the residual from the previous approximation. Therefore, previously fixed variables cannot be updated, i.e. the algorithm cannot escape from local minima.
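As an illustration of Eqs. (4)–(5), the following is a minimal NumPy sketch of the greedy procedure (our own code, not the original implementation); here the binary column is found by exhaustive enumeration over \(\{-1,1\}^N\), which is only feasible for small N such as the \(N=8\) used later.

```python
import itertools
import numpy as np

def greedy_integer_decomposition(W, K):
    """Greedy rank-one decomposition (Eq. 5), a minimal sketch.

    Each step fits one binary column m_i in {-1, 1}^N and one real row c_i
    to the current residual; previously fixed pairs are never revisited.
    The binary column is found by exhaustive enumeration, feasible only for
    small N.
    """
    N, D = W.shape
    R = W.copy()                          # residual to be approximated
    M = np.zeros((N, K))
    C = np.zeros((K, D))
    candidates = np.array(list(itertools.product([-1, 1], repeat=N)))
    for i in range(K):
        best_cost, best_m, best_c = np.inf, None, None
        for m in candidates:
            c = R.T @ m / N               # least-squares row for a fixed m (m^T m = N)
            cost = np.linalg.norm(R - np.outer(m, c)) ** 2
            if cost < best_cost:
                best_cost, best_m, best_c = cost, m, c
        M[:, i], C[i, :] = best_m, best_c
        R = R - np.outer(best_m, best_c)  # fix the pair and update the residual
    return M, C
```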

As shown in Eq. (4), the matrix decomposition has two kinds of arbitrariness. The first is the order of the columns of the binary matrix \(\varvec{M}\) (rows of the real matrix \(\varvec{C}\)) indexed by i, that is, the order of the sum in Eq. (4). The second is the sign of each column of \(\varvec{M}\) (i.e. \(\varvec{m}_i, \varvec{c}_i \rightarrow -\varvec{m}_i, -\varvec{c}_i\)). Consequently, the total number of equivalent matrices is \(K! \times 2^K\). In the case of \(K=3\), there are 48 exact solutions of Eq. (3). Figure 5 in the appendix shows the 48 solutions for a specific instance.
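As an illustration (our own sketch, not part of the original algorithm), the \(K! \times 2^K\) equivalent pairs can be enumerated from any one solution by permuting the columns of \(\varvec{M}\) (rows of \(\varvec{C}\)) and flipping their signs; every pair yields the same product \(\varvec{M}\varvec{C}\).

```python
import itertools
import numpy as np

def equivalent_solutions(M, C):
    """Enumerate all K! * 2^K pairs (M', C') with M'C' = MC (a sketch)."""
    K = M.shape[1]
    pairs = []
    for perm in itertools.permutations(range(K)):            # column order
        p = list(perm)
        for signs in itertools.product([-1, 1], repeat=K):   # sign of each column
            s = np.array(signs)
            pairs.append((M[:, p] * s, C[p, :] * s[:, None]))
    return pairs
```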

Black-box formulation of the integer decomposition

As shown in the previous subsection, the integer decomposition is an MINLP problem. To solve it with BBO algorithms, we convert it into an NLIP problem. If a given \(\varvec{M}\) has linearly independent columns, we can calculate \(\varvec{C}\) by the least-squares method in matrix form,

$$\begin{aligned} \varvec{C} = \left( \varvec{M}^\mathsf {T}\varvec{M}\right) ^{-1}\varvec{M}^\mathsf {T}\varvec{W}, \end{aligned}$$
(6)

where \(\left( \varvec{M}^\mathsf {T}\varvec{M}\right) ^{-1}\varvec{M}^\mathsf {T}\) is the pseudoinverse of \(\varvec{M}\), and thus the approximated matrix \(\varvec{V}\) is a function of \(\varvec{M}\) alone

$$\begin{aligned} \varvec{V}(\varvec{M}, \varvec{C}) = \varvec{M}\varvec{C} = \varvec{M}\left( \varvec{M}^\mathsf {T}\varvec{M}\right) ^{-1}\varvec{M}^\mathsf {T}\varvec{W} = \varvec{V}(\varvec{M}). \end{aligned}$$
(7)

Substituting this expression into Eq. (3), we have

$$\begin{aligned} \mathop {\mathrm{arg\, min}}\limits _{\varvec{M} \in \{-1,1\}^{NK}} \left\| f(\varvec{M}) \right\| _2^2, \end{aligned}$$
(8)

where

$$\begin{aligned} f(\varvec{M}) = \varvec{W} - \varvec{M}\varvec{C} = \varvec{W} - \varvec{M}\left( \varvec{M}^\mathsf {T}\varvec{M}\right) ^{-1}\varvec{M}^\mathsf {T}\varvec{W}. \end{aligned}$$
(9)

We have now eliminated the real-valued parameters and obtained an NLIP problem. Note that the Taylor series of this cost function has infinitely many terms, so Ising solvers, including quantum annealers, cannot solve this optimisation problem directly. On the other hand, BBO can solve it if we treat the input-output relationship of the cost function \(\left\| f(\varvec{M}) \right\| _2^2\) as a black-box function. The optimisation algorithm does not use the explicit form of the function but only a data set of input-output pairs calculated from it.
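For concreteness, the cost of Eqs. (8)–(9) can be evaluated directly from a flattened binary vector; the sketch below is our own and uses NumPy's pseudoinverse, which equals \(\left(\varvec{M}^\mathsf{T}\varvec{M}\right)^{-1}\varvec{M}^\mathsf{T}\) when \(\varvec{M}\) has linearly independent columns.

```python
import numpy as np

def blackbox_cost(x, W, K):
    """Cost ||W - M (M^T M)^{-1} M^T W||_2^2 of Eq. (9) for a flattened M."""
    N = W.shape[0]
    M = x.reshape(N, K)            # entries in {-1, +1}
    P = M @ np.linalg.pinv(M)      # projector M (M^T M)^{-1} M^T
    return np.linalg.norm(W - P @ W) ** 2
```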

Black-box optimisation

BBO algorithms

Ising solvers find the minimiser of a quadratic function, so we approximate the data set \((\varvec{x}, y)\) by the quadratic function \(\hat{y}(\varvec{x}) = \varvec{x}^\mathsf {T}\varvec{A}\varvec{x} + \varvec{b}^\mathsf {T}\varvec{x} + c\), where \(\varvec{x} \in \{-1,1\}^{NK}\) and \(\varvec{A}\), \(\varvec{b}\) and c are model parameters. Note that we use \(\varvec{x}\) in this surrogate model instead of \(\varvec{M}\), y as the cost associated with a given \(\varvec{x}\) and \(\hat{y}\) as the cost approximated by the surrogate model. This function can be expressed in the simplified quadratic form \(\varvec{x}^{\mathsf {T}}\varvec{A}\varvec{x}\) if we append an additional dimension to the vector \(\varvec{x}\) as \((x_1, \ldots , x_{NK}, 1)\). Different algorithms handle the surrogate model differently.
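The reduction to a single quadratic form can be written out explicitly; the sketch below (our own) appends the constant-one entry as the last dimension so that \(\varvec{x}'^\mathsf{T}\varvec{A}'\varvec{x}'\) reproduces \(\varvec{x}^\mathsf{T}\varvec{A}\varvec{x} + \varvec{b}^\mathsf{T}\varvec{x} + c\).

```python
import numpy as np

def augmented_quadratic(A, b, c):
    """Embed x^T A x + b^T x + c into x'^T A' x' with x' = (x_1, ..., x_n, 1)."""
    n = len(b)
    A_aug = np.zeros((n + 1, n + 1))
    A_aug[:n, :n] = A
    A_aug[:n, n] = b          # the linear term couples x to the constant entry
    A_aug[n, n] = c           # the constant sits on the last diagonal element
    return A_aug
```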

Bayesian optimisation of combinatorial structures (BOCS)2 treats the parameter \(\varvec{A}\) by Bayesian linear regression. The authors proposed to use the horseshoe prior14,

$$\begin{aligned} \alpha _k \mid \beta ^2_k, \tau ^2, \sigma ^2&\sim N(0,\beta ^2_k\tau ^2\sigma ^2) \nonumber \\ \tau , \beta _k&\sim C^+(0,1) \nonumber \\ P(\sigma ^2)&= \sigma ^{-2}, \end{aligned}$$
(10)

where \(\alpha _k\) stands for the coefficient of the k-th variable in the linear regression and \(C^+(0,1)\) is a half-Cauchy distribution. Note that as the surrogate model is linear, the second-order terms \(x_i x_j\) are treated as independent explanatory variables, i.e. \((x_1, \ldots , x_n, x_1x_2, x_1x_3, \ldots , x_{n-1}x_n)\), where \(n = NK\). Thus, the index k runs from 1 to \(n+n(n-1)/2\). As the parameter \(\varvec{A}\) of the surrogate model is a distribution in BOCS, a specific value of \(\varvec{A}\) is drawn from that distribution, inspired by Thompson sampling15. In addition to the horseshoe prior, a normal prior \(\alpha _k \sim N(0, \sigma ^2)\) and a normal-gamma prior \(\alpha _k, \sigma ^{-2} \sim \text {NormalGamma}(0, 1, 1, \beta )\) are tested.

Sampling from the horseshoe posterior is performed by Monte Carlo sampling14, which requires a longer execution time than sampling from the normal and normal-gamma posteriors. Sampling from these distributions is accelerated using fast Gaussian samplers16,17.
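As a rough illustration of the Thompson-sampling step with the normal prior (loosely corresponding to nBOCS), the sketch below (our own) draws one coefficient vector from the Gaussian posterior of a Bayesian linear regression; the exact prior parameterisation and the fast Gaussian samplers16,17 used in the study are not reproduced, so the noise variance and all names here are our assumptions.

```python
import numpy as np

def sample_coefficients(X, y, sigma2_prior, sigma2_noise=1.0, rng=None):
    """Draw one coefficient vector alpha from the Gaussian posterior of a
    Bayesian linear regression with prior alpha_k ~ N(0, sigma2_prior).

    X holds the expanded features (x_1, ..., x_n, x_1 x_2, ..., x_{n-1} x_n)
    of the acquired data; y holds the observed costs.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    precision = X.T @ X / sigma2_noise + np.eye(d) / sigma2_prior
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / sigma2_noise
    return rng.multivariate_normal(mean, cov)   # Thompson sample of alpha
```

The sampled coefficients are then mapped back to \(\varvec{A}\) and \(\varvec{b}\) of the quadratic surrogate before being passed to the Ising solver.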

Factorisation machine with quantum annealing (FMQA)4 utilises the factorisation machine (FM)18 as the surrogate model. The surrogate model of degree \(d = 2\) is defined as

$$\begin{aligned} \hat{y}(\varvec{x}) := w_0 + \sum _{i=1}^nw_i x_i + \sum _{i=1}^n\sum _{j=i+1}^n\langle \varvec{v}_i,\varvec{v}_j\rangle x_i x_j, \end{aligned}$$
(11)

where \(\langle \cdot ,\cdot \rangle\) represents the dot product of two vectors of size \(k_\text {FM}\),

$$\begin{aligned} \langle \varvec{v}_i,\varvec{v}_j\rangle := \sum _{l=1}^{k_\text {FM}} v_{i,l} v_{j,l}. \end{aligned}$$
(12)
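Once the FM is trained, its parameters map directly onto the quadratic surrogate passed to the Ising solver: the pairwise weight of \(x_i x_j\) is the dot product \(\langle \varvec{v}_i,\varvec{v}_j\rangle\) of Eq. (12). A minimal sketch of this mapping (our own code):

```python
import numpy as np

def fm_to_quadratic(w0, w, V):
    """Map FM parameters (w0, w_i, v_i) of Eq. (11) to the quadratic surrogate.

    V has shape (n, k_FM); the pairwise weights are the dot products <v_i, v_j>.
    """
    A = np.triu(V @ V.T, k=1)        # upper-triangular part holds <v_i, v_j>, i < j
    return A, np.asarray(w), w0      # quadratic, linear and constant terms
```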

The horseshoe prior and the FM introduce sparsity into the surrogate model, whereas the normal and normal-gamma priors do not. The results section shows the effect of this sparsity on the performance of the algorithms. Each algorithm has hyperparameter(s) to be fixed before conducting the BBO, i.e. the variance \(\sigma ^2\) of the normal prior, the shape parameter \(\alpha (=1)\) and inverse scale parameter \(\beta\) of the normal-gamma prior and the size parameter \(k_\text {FM}\) of the FM, while the horseshoe prior has no hyperparameters. The hyperparameters \(\sigma ^2\) and \(\beta\) are optimised for a specific instance, and the optimal values are then applied to the other instances. We do not optimise the size parameter \(k_\text {FM}\) but use eight, as proposed for FMQA, as well as 12 to provide enough degrees of freedom to represent \(\alpha _k\).

We refer to the vanilla BOCS as vBOCS, the normal-prior BOCS as nBOCS and the normal-gamma BOCS as gBOCS. The two choices of \(k_\text {FM}\) are denoted FMQA08 and FMQA12. In addition, a random search algorithm is referred to as RS, in which each vector \(\varvec{x}\) is sampled at random without using the previously obtained data set.

Ising solvers

BOCS and FMQA use an Ising solver to find the minimiser of the surrogate model represented in quadratic form. We evaluate three Ising solvers: simulated annealing, QA and simulated quenching. Simulated annealing (SA) introduces thermal fluctuation into the exploration of the cost function19. It is implemented as a Monte Carlo simulation. The temperature in the simulation is a parameter that controls the probability of flipping a variable even when the cost function increases. Initially, the temperature is high to enable a global search of the solution space and to find the candidate basin harbouring the global minimum. Later, the temperature is lowered to find the lowest solution within that basin. If the cooling schedule is slow enough, i.e. \(\propto 1/\log (t)\), SA finds the global minimum20.

Quantum annealing (QA) replaces the thermal fluctuation of SA with quantum fluctuation introduced through a scheduled transverse field. The quantum system starts from a trivial state (a superposition of all solutions) and finds the ground state of the cost function at the end of the annealing21. If the transverse field is scheduled as \(\propto t^{-1/(2N-1)}\), the system converges to the ground state at the end22.

We refer to simulated quenching (SQ) as a variation of SA in which the temperature is quenched extremely rapidly from high to zero. Although this simplifies and accelerates the Monte Carlo calculation, it eliminates the ability to search the solution space globally in the early stage of annealing. Thus, this algorithm tends to be trapped in local minima more often than SA and QA.

We use the SA and QA solvers in the D-Wave Ocean SDK with default parameters. The default initial and final temperatures for SA are determined from roughly estimated maximum and minimum effective fields with scaling factors of 2.9 and 0.4, respectively. The default annealing time for QA is 20 microseconds. In SQ, the temperature is not annealed but kept constant at 0.1. For all solvers, we optimise the surrogate model 10 times in each iteration step to find a better solution.
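For concreteness, minimising the surrogate with the Ocean SDK's simulated annealing sampler can look like the sketch below; it assumes the surrogate has already been converted to Ising biases h and couplings J, and the API shown follows the current Ocean documentation, which may differ from the versions used in this study.

```python
import neal  # D-Wave Ocean simulated annealing sampler (dwave-neal)

def minimise_surrogate(h, J, num_reads=10):
    """Minimise an Ising-form surrogate sum_i h_i x_i + sum_{i<j} J_ij x_i x_j
    with x_i in {-1, +1}, keeping the best of `num_reads` restarts."""
    sampler = neal.SimulatedAnnealingSampler()
    sampleset = sampler.sample_ising(h, J, num_reads=num_reads)
    best = sampleset.first                  # lowest-energy sample found
    return best.sample, best.energy
```

For QA, the same call would go through dwave.system (e.g. EmbeddingComposite(DWaveSampler())); for SQ, the temperature schedule would be held at a single low value (e.g. via the sampler's beta range, if available in the installed version).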

Results

Integer decomposition was conducted for an (\(N \times D =\)) \(8 \times 100\) matrix \(\varvec{W}\), constructed by shrinking the final fully connected layer of the VGG16 convolutional neural network23. We choose the decomposition parameter \(K = 3\); thus, the matrix \(\varvec{W}\) is decomposed into an \(8 \times 3\) binary matrix \(\varvec{M}\) and a \(3 \times 100\) real matrix \(\varvec{C}\). As discussed in the “Original formulation of the integer decomposition” subsection, the size n of the problem is determined by the size of the binary matrix; in this case, \(n = 8 \times 3 = 24\). The dimension of the model parameter \(\alpha _k\) is \(O(n^2)\), so we start from an initial data set of size n and then add \(2n^2\) data points iteratively to reach a size sufficient for estimating the model parameters. In the current analysis (\(n = 24\)), we generate 24 initial data points followed by 1152 iterations (1176 in total). This number is remarkably small compared to the size of the solution space, \(2^{24} \sim\) \(1.7\times 10^7\). All the algorithms tested in this paper are randomised and/or start from a randomly selected data set, so we conduct 25 runs (or 100 runs for RS) to estimate the average performance. The performance also depends on the matrix \(\varvec{W}\); therefore, we generate ten individual problem matrices (instances) to evaluate the performance across instances.

Figure 1 shows the results of the different algorithms for the first instance. In the BBO process, SA is used as the Ising solver. The x-axis represents the iteration step, i.e. the size of the acquired data set (including the initial data), in linear scale, and the y-axis represents the residual error from the exact solution in log scale. Owing to the limited problem size n we tested, the compression quality is insufficient for typical applications; in this situation, the absolute error is not an appropriate measure for comparing algorithms. Thus, we employ the residual error \(\left( \left\| f(\varvec{M}) \right\| _2 - \left\| f(\varvec{M}^*) \right\| _2\right) / \left\| W \right\| _2\), where \(\varvec{M}^*\) is the exact solution of Eq. (8). With this measure, we can evaluate how close the solutions are to the exact solution. The exact and second-best solutions were obtained separately by brute-force search. The absolute error of the exact solution, \(\left\| f(\varvec{M}^*) \right\| _2 / \left\| W \right\| _2\), subtracted in the plot, is 0.461. All algorithms outperform the original approximate solution shown by the red dotted line. The FMQA algorithm improves its solution faster than the other algorithms in the early stage; however, the improvement does not continue in the later stage. vBOCS and nBOCS improve slowly, but the improvement continues throughout the process. They go below the line of the second-best solution (grey dotted line), indicating that some individual runs find the exact solution.

Figure 1
figure 1

The residual error of lossy compression of a matrix (instance) as a function of the iteration step for various algorithms, with 95% confidence intervals. The red and grey dotted lines indicate the residual errors of the original algorithm and of the second-best solution found by brute-force search, respectively.

Figure 2 shows the differences between the Ising solvers (SA, QA and SQ) applied to nBOCS, labelled nBOCS, nBOCSqa and nBOCSsq, respectively. Although SQ generally performs poorly in finding the global minimum of a cost function with a complex landscape, there are no clear differences among the Ising solvers. This finding suggests that the landscape of the surrogate model is simple, and thus such a simple Ising solver suffices for the current BBO task.

Figure 2
figure 2

The residual error of lossy compression of a matrix as a function of iteration step among the Ising solvers (SA, QA and SQ).

As discussed in the “Original formulation of the integer decomposition” subsection, there are \(K! \times 2^K\) exact solutions of the optimisation problem. In other words, any \(\varvec{x}\) has \(K! \times 2^K - 1\) equivalent vectors, which give the same value of the cost function. These equivalent vectors can be added to the data set and may accelerate the BBO process. Figure 3 shows the results of this data augmentation. The data augmentation (nBOCSa) does not require additional evaluations of the cost function; thus, we do not change the scale of the x-axis. The results clearly show that the data augmentation degrades the performance in the later stage, while it offers a slight advantage in the early stage.

Figure 3
figure 3

The residual error of lossy compression of a matrix for RS and nBOCS variants with and without data augmentation (nBOCSa and nBOCS).

Table 1 shows the total count of exact solutions found across all ten instances for all algorithms and their variations. nBOCS with SA, QA and SQ performs better than the others. RS and nBOCSa cannot find the exact solution in most instances.

Table 1 Counts of finding the exact solution per 25 runs.

To investigate the behaviour of the algorithms, we perform a cluster analysis of the candidate solutions at each step. If an algorithm gradually focuses on a subspace harbouring a specific exact solution, the sampling of candidate solutions will also become gradually biased. We divide the solution space into four domains based on hierarchical clustering of the exact solutions. Every other solution in the solution space is assigned to its closest exact solution, measured by Hamming distance, and thereby grouped into one of the four domains. The population of the four domains reflects the sampling bias. Figure 4 shows the population of the candidate solutions among the four domains for five individual runs. The plots are convolved for smoothing (window size 100). RS and nBOCSa show no trend, while the FMQA variants start focusing on a domain from an early stage. The other BOCS variants select a domain in the middle of the analysis but continue to explore other domains.
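The domain assignment used for Fig. 4 can be sketched as follows (our own illustration): each candidate is mapped to the nearest exact solution in Hamming distance and inherits that solution's cluster label.

```python
import numpy as np

def assign_domain(x, exact_solutions, domain_labels):
    """Assign a candidate x in {-1, 1}^n to the domain of its nearest exact
    solution (Hamming distance); domain_labels[k] is the cluster label of the
    k-th exact solution from the hierarchical clustering."""
    hamming = np.count_nonzero(exact_solutions != x, axis=1)
    return domain_labels[np.argmin(hamming)]
```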

Figure 4
figure 4

The population of the four domains for five individual runs for various algorithms. The x-axis represents the iteration step.

Execution times for all algorithms are shown in Table 2. nBOCSqa takes longer than nBOCS due to the overhead of preparing the data matrix to be uploaded to the quantum annealer; the communication time between the CPU and QPU is not taken into account. nBOCS is 129 and 67 times faster than vBOCS and FMQA08, respectively. The execution times of the original algorithm and the brute-force search are 0.00096 and 5553.51 s, respectively.

Table 2 Average execution time (s) per run.

Discussion

We conducted BBO of the integer decomposition, a lossy compression of a matrix. We demonstrated that this MINLP problem can be transformed into an NLIP problem. The cost function is pseudo-Boolean because the decision variables are binary. We applied BBO algorithms such as BOCS and FMQA to this cost function. These algorithms optimise a surrogate model constructed from the input-output data of the cost function to obtain the next candidate for data acquisition and then update the model. This transformation can be generalised to solve MIP problems whenever the cost function is linear in the real variables.

Among the variations of the BOCS and FMQA algorithms, BOCS with the normal prior showed the best performance, with an execution time one to two orders of magnitude shorter than FMQA or BOCS with the horseshoe prior. FMQA improves the solution rapidly in the initial stage, while BOCS improves slowly but continuously and obtains a superior solution. The differences in how these algorithms explore the solution space were analysed through clustering. FMQA tends to focus on a subspace earlier than BOCS; however, once trapped in a local minimum, it cannot escape because FMQA is deterministic. As BOCS is a randomised algorithm, it takes more steps but explores a larger space and finds a better solution. In the integer decomposition, the cost function is not expensive to evaluate; thus, we can conduct enough iterations. The strength of BOCS in the late stage of the iteration is therefore favourable compared with FMQA. A randomised version of FMQA24 should be studied in the future.

In the current formulation, the solution space is divided into \(K! \times 2^K\) equivalent subspaces due to the nature of the problem, where each subspace harbours an exact solution. We confirmed that data augmentation over the subspaces did not improve the performance. BOCS and FMQA approximate the cost function by a quadratic function, which means the surrogate model fits well locally but not globally. Therefore, these algorithms try to focus their sampling on a subspace that harbours one of the exact solutions. However, data augmentation treats all the subspaces equally, making it impossible to bias the sampling towards one subspace and improve the model locally around an exact solution. Such bias arises randomly from the selection of the initial data and the subsequent biased candidate selections.

With regard to the choice of Ising solver as the back-end of the black-box optimiser, there is no significant difference between SA, QA and SQ. This finding is non-trivial because SQ performs poorly in general; it does not explore the solution space globally because it only accepts a next solution that lowers the cost. This fact leads to the following hypothesis. In the early stage of exploration, low-quality solutions from the Ising solver are enough to construct a more accurate surrogate model. Then, in the late stage, the surrogate model approximates the cost function around one of the exact solutions because the sampling is biased. Optimising the surrogate model with the Ising solver might be relatively easy compared with optimising the explicit form of the original cost function. Therefore, the advantage of QA in BBO remains an open question. Further studies are needed to reveal how the approximated landscape of the cost function, learned from the acquired data set, changes as the data set grows.

While the original algorithm cannot find the exact solution, the proposed BBO-based algorithm can find it with a certain probability at the tested problem size (\(n=24\)). However, the execution time of the proposed algorithm is five orders of magnitude longer than that of the original (brute-force search needs another two orders of magnitude). As the surrogate model is quadratic, BBO needs \(O(n^2)\) iterations to estimate the model parameters. If we employ nBOCSsq (normal-prior BOCS with SQ), the most expensive calculation in each iteration is the \(O(n^3)\) matrix inversion for building the surrogate model. Therefore, the algorithm takes \(O(n^5)\) calculation time. This scaling may worsen if we choose other Ising solvers (SA and QA). The proposed algorithm has an advantage in solution quality over the original algorithm and in calculation time over the brute-force search. However, with the current scaling, the typical use case of matrix compression, such as a weight matrix in a machine learning task, is out of reach. Figure 4 of reference4 shows that FMQA finds meaningful solutions for \(n = 12 \sim 50\). Although they fixed the number of iterations at 2000, more iterations seem to be needed for larger problems. Further investigation is required to accelerate the calculation so that matrices of typical size can be handled.

A limitation of the proposed algorithm is that the surrogate model is restricted to a quadratic form. The COMBO algorithm considers higher-order terms via a diffusion kernel on a graph representation25. Kernel-based algorithms that relax the binary variables to continuous ones have also been proposed26,27. They may work better for problems in which the kernel removes the essential difficulties of the combinatorial optimisation. If the gradient can be calculated, construction and optimisation of the surrogate model can be accelerated28. Comparison with these algorithms is left as future work.

Methods

Shrunk VGG matrix

To test BBO for lossy matrix compression, we prepare matrices by shrinking the VGG matrix, the weight matrix of a convolutional neural network for image recognition. We choose the matrix of the final fully connected layer (a \(4096 \times 1000\) matrix). As this matrix is too large for conducting BBO, we reduce its size while keeping its structural information as follows: the weight matrix \(\varvec{W}_0\) is decomposed by singular value decomposition,

$$\begin{aligned} \varvec{W}_0 = \varvec{U} \varvec{\Sigma } \varvec{V}^\mathsf {T}, \end{aligned}$$
(13)

where \(\varvec{U}\), \(\varvec{\Sigma }\) and \(\varvec{V}\) are \(4096 \times 4096\), \(4096 \times 1000\) and \(1000 \times 1000\) matrices, respectively. Then, we choose eight and one hundred rows/columns from \(\varvec{U}\) and \(\varvec{V}\), respectively, and eight singular values from \(\varvec{\Sigma }\) to construct the shrunk \(8 \times 100\) matrix.
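A sketch of the shrinking step is shown below (our own code); the selection rule for the rows/columns is not specified here beyond "choosing", so random selection, which would also generate the different instances, is an assumption, as is keeping the eight leading singular values.

```python
import numpy as np

def shrink_matrix(W0, n=8, d=100, rng=None):
    """Shrink W0 (e.g. 4096 x 1000) to an n x d instance via its SVD (Eq. 13).

    Rows/columns are selected at random and the n leading singular values are
    kept; both choices are assumptions made for this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    U, s, Vt = np.linalg.svd(W0, full_matrices=False)
    rows = rng.choice(U.shape[0], size=n, replace=False)    # n rows of U
    cols = rng.choice(Vt.shape[1], size=d, replace=False)   # d columns of V^T
    return U[rows][:, :n] @ np.diag(s[:n]) @ Vt[:n][:, cols]
```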

Exact solutions

The exact solutions in Fig. 5a are obtained by solving Eq. (8). For the current problem size, we can perform a brute-force search over all \(2^{24}\) candidates in the solution space. Each solution is presented as a box of \(8 \times 3\) pixels; black and white pixels represent 1 and \(-1\), respectively. Figure 5b shows the clustering of the 48 exact solutions by the Ward method. Similar solutions are grouped; e.g. the second and fourth boxes (labelled 1 and 3 in the cluster) are one of the closest pairs. We can form four groups by choosing an appropriate cut-off value (the height of the tree). These four groups are used for colouring in Fig. 4.
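The brute-force search can be sketched as follows (our own illustration, reusing the cost of Eq. (9)); it enumerates all \(2^{NK}\) binary matrices and collects every \(\varvec{M}\) attaining the minimum of Eq. (8), which is how the 48 equivalent solutions appear.

```python
import itertools
import numpy as np

def brute_force_exact_solutions(W, K, tol=1e-9):
    """Enumerate all M in {-1, 1}^{N x K} and return those attaining the
    minimum of Eq. (8). Feasible only for small N*K (here 24)."""
    N = W.shape[0]
    best_cost, best = np.inf, []
    for bits in itertools.product([-1, 1], repeat=N * K):
        M = np.array(bits).reshape(N, K)
        cost = np.linalg.norm(W - M @ np.linalg.pinv(M) @ W) ** 2
        if cost < best_cost - tol:
            best_cost, best = cost, [M]       # strictly better: reset the list
        elif abs(cost - best_cost) <= tol:
            best.append(M)                    # tie within tolerance: keep it
    return best_cost, best
```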

Figure 5
figure 5

(a) The 48 exact solutions for the first instance. Each box represents a \(3 \times 8\) matrix \(\varvec{M}\); black and white pixels represent 1 and \(-1\), respectively. (b) Clustering of the 48 exact solutions by the Ward method. Each solution is labelled from 0 to 47, assigned left-to-right and top-to-bottom in (a).

Hyperparameter optimisation

The hyperparameters \(\sigma ^2\) (variance) for nBOCS and \(\beta\) (inverse scale) for gBOCS are optimised for the first instance. Grid searches over 0.0001, 0.001, 0.01, 0.1, 1, 10 for \(\sigma ^2\) and 0.0001, 0.001, 0.01, 0.1, 1, 10, 100 for \(\beta\) are conducted (Fig. 6). For nBOCS, we select \(\sigma ^2 = 0.1\), which gives the lowest cost. For gBOCS, on the other hand, we select \(\beta = 0.001\) rather than \(\beta = 10\), as a broader prior distribution is preferable for exploring an accurate model and there is little variation in the cost among the different hyperparameter values.

Figure 6
figure 6

Dependence of the cost on the hyperparameters.

Results of the other nine instances

Figure 7 shows the results for the other nine instances used in Table 1.

Figure 7
figure 7

The residual error of lossy compression of the other nine matrices (instances) as a function of the iteration step among various algorithms with 95% confidence intervals. The errors of the nine exact solutions (baseline of the plots) are 0.535, 0.388, 0.509, 0.487, 0.488, 0.379, 0.367, 0.422, and 0.520, respectively.