Data-driven discovery of dimensionless numbers and governing laws from scarce measurements

Dimensionless numbers and scaling laws provide elegant insights into the characteristic properties of physical systems. Classical dimensional analysis and similitude theory fail to identify a set of unique dimensionless numbers for a highly multi-variable system with incomplete governing equations. This paper introduces a mechanistic data-driven approach that embeds the principle of dimensional invariance into a two-level machine learning scheme to automatically discover dominant dimensionless numbers and governing laws (including scaling laws and differential equations) from scarce measurement data. The proposed methodology, called dimensionless learning, is a physics-based dimension reduction technique. It can reduce high-dimensional parameter spaces to descriptions involving only a few physically interpretable dimensionless parameters, greatly simplifying complex process design and system optimization. We demonstrate the algorithm by solving several challenging engineering problems with noisy experimental measurements (not synthetic data) collected from the literature. Examples include turbulent Rayleigh-Bénard convection, vapor depression dynamics in laser melting of metals, and porosity formation in 3D printing. Lastly, we show that the proposed approach can identify dimensionally homogeneous differential equations with dimensionless number(s) by leveraging sparsity-promoting techniques.

We can construct a dimension matrix D for n independent variables:

D = [v(p_1), v(p_2), ..., v(p_n)],

where v(p_i) is the dimension vector of the i-th independent variable, with length r. The shape of the dimension matrix D is r × n. For example, if the independent variables of a system are related to four dimensions (length, mass, time, and temperature) and the i-th independent variable p_i is a velocity (its dimension is length per time), then the corresponding dimension vector v(p_i) is [1, 0, −1, 0]^T.
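As a concrete illustration (this is a sketch, not the paper's code), the dimension matrix can be assembled by stacking dimension vectors column-wise; the variable names and dimension vectors below are illustrative assumptions, with the row order (length, mass, time, temperature) as in the text:

```python
import numpy as np

# Illustrative dimension vectors; row order: (length, mass, time, temperature).
DIMS = {
    "velocity":      [1, 0, -1, 0],   # length / time
    "density":       [-3, 1, 0, 0],   # mass / length^3
    "length":        [1, 0, 0, 0],
    "heat_capacity": [2, 0, -2, -1],  # length^2 / (time^2 * temperature)
    "temp_diff":     [0, 0, 0, 1],
}

def dimension_matrix(variables):
    """D has one column v(p_i) per variable, so its shape is r x n."""
    return np.array([DIMS[v] for v in variables]).T

D = dimension_matrix(["velocity", "density", "length", "heat_capacity", "temp_diff"])
print(D.shape)  # (4, 5): r = 4 dimensions, n = 5 variables
```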
Based on classical dimensional analysis, Eqn. 1 can be rearranged in a dimensionless form:

Π = F(Π_1, Π_2, ..., Π_k).

According to Buckingham's Pi theorem [1], the number of dimensionless numbers k equals the number of variables n minus the number of dimensions r if the dimension matrix D has full rank. Although Buckingham's Pi theorem states a sufficient number of dimensionless numbers to describe a system, it does not state the necessary number. Therefore, the proposed dimensionless learning aims to find m dominant dimensionless numbers from experimental data:

Π = f(Π_1, Π_2, ..., Π_m),

where m is less than or equal to k, f(·) is the best representation (i.e., scaling law) of the system, Π is the output dimensionless number, and {Π_1, Π_2, ..., Π_m} are the input dimensionless numbers. Specifically, to find the j-th dimensionless number Π_j, we need to find a vector w_j such that

Π_j = p_1^{w_j1} p_2^{w_j2} ··· p_n^{w_jn} = exp(w_j^T log(p)).

Therefore, finding m input dimensionless numbers is equivalent to finding {w_j}_{j=1}^m. Dimensionless learning aims to find the {w_j}_{j=1}^m for the input dimensionless numbers and the w for the output dimensionless number that satisfy two physical constraints. The first constraint is the principle of dimensional invariance:

D w_j = 0, j = 1, 2, ..., m. (Eqn. 6)

The other constraint is that the scaling law Eqn. 4 should best fit the data points in the collected dataset.

Construct dimensionless numbers
The number of dimensions r is typically smaller than the number of independent variables n because there are only 7 fundamental dimensions [2] but a large number of variables of interest. Therefore, Eqn. 6 typically has infinitely many solutions w_j because the dimension matrix D is underdetermined [3].
To solve Eqn. 6, we can first find the K basis vectors w_bk of the solution w:

w = Σ_{k=1}^{K} γ_k w_bk,

where {w_b1, w_b2, ..., w_bK} are linearly independent and K, the number of basis vectors, equals n minus rank(D).
To identify the basis vectors {w_bi}_{i=1}^{K} satisfying Dw = D Σ_{i=1}^{K} γ_i w_bi = 0, we first rearrange the dimension matrix D into reduced row echelon form using Gaussian elimination. Then we set each free variable to 1 and all the others to 0 and substitute them into the system to obtain a basis vector. The number of free variables equals K.
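The procedure above (reduced row echelon form, then one basis vector per free variable) can be sketched in pure Python with exact rational arithmetic. The dimension matrix below corresponds to the keyhole example discussed later in this SI, with entries derived from the variables' physical dimensions; treat the code and the matrix as an illustrative sketch rather than the paper's implementation:

```python
from fractions import Fraction

def rref(M):
    """Reduced row echelon form over exact rationals; returns (R, pivot_columns)."""
    R = [[Fraction(x) for x in row] for row in M]
    n_rows, n_cols = len(R), len(R[0])
    pivots, r = [], 0
    for c in range(n_cols):
        pr = next((i for i in range(r, n_rows) if R[i][c] != 0), None)
        if pr is None:
            continue  # no pivot in this column: it belongs to a free variable
        R[r], R[pr] = R[pr], R[r]
        R[r] = [x / R[r][c] for x in R[r]]
        for i in range(n_rows):
            if i != r and R[i][c] != 0:
                R[i] = [a - R[i][c] * b for a, b in zip(R[i], R[r])]
        pivots.append(c)
        r += 1
        if r == n_rows:
            break
    return R, pivots

def null_space_basis(M):
    """One basis vector per free variable: set it to 1 and all others to 0."""
    R, pivots = rref(M)
    n_cols = len(M[0])
    basis = []
    for f in (c for c in range(n_cols) if c not in pivots):
        w = [Fraction(0)] * n_cols
        w[f] = Fraction(1)
        for row, p in zip(R, pivots):
            w[p] = -row[f]  # back-substitute the pivot variables
        basis.append(w)
    return basis

# Keyhole example; columns: eta*P, V_s, r_0, alpha, rho, C_p, T_l - T_0;
# rows: length, time, mass, temperature.
D = [[2, 1, 1, 2, -3, 2, 0],
     [-3, -1, 0, -1, 0, -2, 0],
     [1, 0, 0, 0, 1, 0, 0],
     [0, 0, 0, 0, 0, -1, 1]]
basis = null_space_basis(D)
print(len(basis))  # K = n - rank(D) = 7 - 4 = 3
```

Every returned vector satisfies Dw = 0 exactly, since the arithmetic is done with `Fraction` rather than floats.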
We can represent w_j as a linear combination of the basis vectors {w_bk}_{k=1}^{K}, so Eqn. 6 can be written as:

D w_j = D Σ_{k=1}^{K} γ_jk w_bk = 0.

Once the basis coefficients γ_j have been found, the j-th dimensionless number Π_j for the input variables is obtained.
Similarly, the output dimensionless number Π for the dependent variable can be obtained. The only difference is that when constructing the output dimension matrix D′, we must account for the dimension vector v(p) of the dependent variable p:

D′ = [v(p), v(p_1), v(p_2), ..., v(p_n)].

To find the output dimensionless number Π, we need to solve the following equation:

D′ w′ = D′ Σ_{i=1}^{K′} γ′_i w′_bi = 0,

where γ′_i is the i-th output basis coefficient, w′_bi is the i-th output basis vector, K′ is the number of output basis vectors, and w′ is a linear combination of {w′_bi}_{i=1}^{K′}. Once the basis coefficients γ′ for the output dependent variable p are obtained, we have the corresponding output dimensionless number Π.

Determining the number of dimensionless numbers
Prior to training, we calculate all possible sparse output dimensionless numbers based on an L1 norm constraint on w′, i.e., the sum of the absolute values of all elements of w′. Typically, the L1 norm threshold for the output dimensionless numbers is set to 5 (values of at most 5 are acceptable). The motivation is that we assume the output dimensionless number should have a simple expression, even though the input dimensionless numbers can be relatively complex. The vapor depression dynamics problem in laser-metal interaction is a typical example: the output dimensionless number is a dimensionless length, the keyhole aspect ratio, which is the ratio of the keyhole depth to the laser radius. This output dimensionless number has an L1 norm of only 2, whereas the (relatively more complex) input dimensionless number has an L1 norm of 6.5.
Due to the low L1 norm threshold for the output, there are only a few candidate output dimensionless numbers. For example, in the problem of keyhole dynamics (Section 2.2 of the main manuscript), there are seven candidate output dimensionless numbers, whereas in the case of turbulent Rayleigh-Bénard convection (Section 2.1 of the main manuscript), there are only two. We then go through all of the possible output dimensionless numbers: each time, we choose and fix one of them, and then use a two-level optimization method to find the best input dimensionless numbers via grid search, gradient-based optimization, or pattern search-based optimization, as discussed in Section 4 of the Supplementary Information.
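A minimal sketch of this enumeration step is shown below; the two output basis vectors and the coefficient grid are made up for illustration, while the L1 threshold of 5 follows the text:

```python
import itertools
import numpy as np

# Hypothetical output basis vectors (one row per basis vector) and a coarse
# grid of candidate basis coefficients; both are illustrative assumptions.
W_OUT = np.array([[1.0, 0.0, -1.0, 0.0, 0.0],
                  [0.0, 1.0, -2.0, 1.0, 0.0]])
GRID = np.arange(-2.0, 2.0 + 1e-9, 0.5)
L1_MAX = 5.0                                 # L1 norm threshold for the output

candidates = []
for gammas in itertools.product(GRID, repeat=W_OUT.shape[0]):
    w = np.asarray(gammas) @ W_OUT           # w' = sum_i gamma'_i * w'_bi
    if any(gammas) and np.abs(w).sum() <= L1_MAX:
        candidates.append(w)
```

Each surviving `w` defines one candidate output dimensionless number; the all-zero combination is excluded because it would give the trivial Π = 1.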
To determine the number of input dimensionless numbers, we perform the two-level optimization with one input dimensionless number and increase the number of input dimensionless numbers if the trained model underperforms, until we find the best representation, i.e., a model with a high R² score. This incremental learning scheme helps us avoid unnecessary dimensionless numbers: if a small number of dimensionless numbers can represent the system, we will not increase the number of input dimensionless numbers. A flowchart of dimensionless learning with the incremental learning scheme is shown in Fig. 1. It is worth mentioning that, while cross-validation is not shown in Fig. 1, we recommend using it to tune the hyperparameters of the fitting function; for example, it can be used to determine the polynomial degree. Examples applying 5-fold cross-validation appear in Sections 3.4 and 5.3 of the SI.

Representation learning of scaling law
Another important task is to find the best representation f(·), also known as the scaling law. Depending on the problem, different fitting functions such as polynomial regression, neural network-based regression, symbolic regression, or tree-based regression can be used. In this section, we demonstrate dimensionless learning with polynomial functions for simplicity.
When we use a polynomial with terms up to order 3, the causal relationship can be written as:

Π = β_0 + Σ_i β_i Π_i + Σ_{i,j} β_ij Π_i Π_j + Σ_{i,j,k} β_ijk Π_i Π_j Π_k,

where (Π_1, ..., Π_m) is the vector of m input dimensionless numbers based on the independent variables, and β denotes the polynomial coefficients.
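For m = 2 input dimensionless numbers, the order-3 polynomial fit reduces to ordinary least squares on monomial features. The synthetic data and ground-truth coefficients below are invented for illustration:

```python
import numpy as np

# Synthetic data: two input dimensionless numbers and an invented ground truth
# that lies inside the order-3 polynomial family.
rng = np.random.default_rng(0)
Pi1 = rng.uniform(0.5, 2.0, 200)
Pi2 = rng.uniform(0.5, 2.0, 200)
Pi_out = 1.0 + 2.0 * Pi1 - 0.5 * Pi2 ** 2 + 0.3 * Pi1 * Pi2

def poly_features(x1, x2, degree=3):
    """All monomials x1^a * x2^b with a + b <= degree."""
    return np.column_stack([x1 ** a * x2 ** b
                            for a in range(degree + 1)
                            for b in range(degree + 1 - a)])

X = poly_features(Pi1, Pi2)                        # 10 features for degree 3
beta, *_ = np.linalg.lstsq(X, Pi_out, rcond=None)  # the polynomial coefficients beta
pred = X @ beta
r2 = 1.0 - np.sum((Pi_out - pred) ** 2) / np.sum((Pi_out - Pi_out.mean()) ** 2)
```

Because the ground truth lies in the span of the features, the fit is essentially exact (R² ≈ 1).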
Supplementary Figure 1: A flowchart for dimensionless learning. After exploring the parameter space of a complex system, the input and output dimension matrices and basis vectors are computed separately. Embedding dimensional invariance, we select a sparse output dimensionless number with a simple form (i.e., a small L1 norm) and random basis coefficients to initialize the input dimensionless numbers. Then, using a two-level optimization method (grid search-based, gradient-based, or pattern search-based), we obtain the optimal basis coefficients and scaling law for the chosen output dimensionless number. If the model performs well (R² exceeds the threshold), the dominant dimensionless numbers and the best scaling law are determined. Otherwise, we switch to another possible output dimensionless number; if the model continues to underperform after all forms of output dimensionless numbers have been explored, we increase the number of input dimensionless numbers and re-optimize the model.
In summary, two sets of coefficients must be identified in order to automatically find m dominant input dimensionless numbers and one output dimensionless number: 1. the basis coefficients {γ_j}_{j=1}^{m+1} for all dimensionless numbers; 2. the scaling law coefficients, such as the polynomial coefficients β.
Following the flowchart in Fig. 1, a few possible dimensionless numbers and scaling laws can be obtained. Then, we select the optimal model based on the fitting performance (R² score) of the scaling law and the L1 norm of the dimensionless numbers: we encourage a higher R² score and a lower L1 norm. In a few special situations, when different scaling laws have the same R² score and L1 norm, we choose the model with fewer variables. These criteria ensure that the scaling law is parsimonious and fits the dataset well.

Summary of the demonstrated case studies
We summarize the details of the data and approaches to all the case studies in this paper in Supplementary Table 1 to avoid confusion for the readers.

Dimension matrix and basis vectors used in the main text
Vapor depression dynamics in laser-metal interaction. The dimension matrix D of this example is:

D =
[  2   1   1   2  -3   2   0
  -3  -1   0  -1   0  -2   0
   1   0   0   0   1   0   0
   0   0   0   0   0  -1   1 ]

Supplementary Table 1: Summary of the demonstrated case studies using the proposed method in this paper. It provides information regarding the data, noise level, and approaches for the different problems demonstrated in different sections of the main manuscript and SI.

The dimensions of the effective laser power ηP, the laser scan speed V_s, the laser beam radius r_0, the thermal diffusivity α, the material density ρ, the heat capacity C_p, and the difference between melting and ambient temperatures T_l − T_0 are represented from the first to the last column of the matrix, respectively. From the first to the last row, the matrix represents the dimensions of length, time, mass, and temperature, respectively.

In this example, the dimension vector of the dependent variable, i.e., the keyhole depth e, is:

v(e) = [1, 0, 0, 0]^T.

The computed input basis vectors w_b1, w_b2, and w_b3 for this example are:

w_b1 = [0, -1, -1, 1, 0, 0, 0]^T,  w_b2 = [-1, 3, 2, 0, 1, 0, 0]^T,  w_b3 = [0, -2, 0, 0, 0, 1, 1]^T.

Porosity formation in 3D printing of metals. The dimension matrix D for this example is:

D =
[  2   1   1  -3   2   2   0   1   1
  -3  -1   0   0  -2  -1   0   0   0
   1   0   0   1   0   0   0   0   0
   0   0   0   0  -1   0   1   0   0 ]

The columns of the matrix, from left to right, represent the dimensions of the effective laser power input η_m P, the laser scan speed V_s, the laser beam diameter d, the material density ρ, the material heat capacity C_p, the thermal diffusivity α, the difference between melting and ambient temperatures T_l − T_0, the hatch spacing between two adjacent laser scans H, and the layer thickness of the metallic powders L, respectively. The rows, from top to bottom, represent the dimensions of length, time, mass, and temperature, respectively.

Supplementary Figure 2: Two-level optimization method for dimensionless learning. The blue blocks indicate that the coefficients are fixed, while the red blocks indicate that the coefficients are trainable.

Optimization schemes for dimensionless learning
We developed three different optimization methods for dimensionless learning: grid search-based, gradient-based, and pattern search-based two-level optimization. The basic idea behind these methods is to optimize the basis coefficients {γ_j}_{j=1}^m and the polynomial coefficients β sequentially at each iteration while the output dimensionless number is fixed, as shown in Supplementary Figure 2. Each iteration consists of two levels of training: 1. Level 1: train the polynomial coefficients β while fixing the basis coefficients {γ_j}_{j=1}^m; 2. Level 2: train the basis coefficients {γ_j}_{j=1}^m while fixing the polynomial coefficients β to the values newly obtained in level 1.

Grid search-based two-level optimization
For dimensionless learning, the key task is to identify the basis coefficients γ and the polynomial coefficients β. If a problem involves only a few basis coefficients (three or fewer) and a few input dimensionless numbers (one or two), grid search works well to find the best coefficients, as described in Algorithm 1. In this scenario, the grid interval can be relatively small, such as 0.01, 0.02, or 0.04. For example, there are three basis coefficients and one input dimensionless number for the turbulent Rayleigh-Bénard convection (Section 2.1 in the main manuscript) and the vapor depression dynamics in laser-metal interaction (Section 2.2 in the main manuscript); a fine grid interval of 0.04 is used in both cases. However, when there are multiple input dimensionless numbers or the number of basis coefficients is large, a large grid interval such as 0.5 is necessary to speed up the optimization, and the other two proposed two-level optimization methods are recommended to avoid the exponential growth of the number of grid combinations.
    Measure the fitting performance with the R² score.
end for
Choose the best scaling law Π_h = f(Π_1, Π_2, ..., Π_M) with a high R² score and a low L1 norm of the dimensionless numbers.
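The grid-search loop of Algorithm 1 can be sketched on a synthetic one-coefficient problem. The basis vector w_b = [1, −1] and the ground truth y = 3 p_1/p_2 are invented for illustration; level 1 is a degree-1 polynomial fit and level 2 is the grid over the basis coefficient:

```python
import numpy as np

# Invented synthetic data: two variables whose ratio drives the output.
rng = np.random.default_rng(1)
p1, p2 = rng.uniform(0.5, 3.0, 100), rng.uniform(0.5, 3.0, 100)
y = 3.0 * (p1 / p2)
log_pi = np.log(p1) - np.log(p2)                 # w_b^T log(p) for w_b = [1, -1]

best_r2, best_gamma = -np.inf, None
for gamma in np.arange(-2.0, 2.0 + 1e-9, 0.5):   # level 2: grid over the basis coefficient
    if gamma == 0.0:
        continue                                 # Pi would be constant (degenerate fit)
    Pi = np.exp(gamma * log_pi)                  # candidate input dimensionless number
    X = np.column_stack([np.ones_like(Pi), Pi])  # level 1: least-squares polynomial fit
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    if r2 > best_r2:
        best_r2, best_gamma = r2, gamma
print(best_gamma)  # gamma = 1.0 recovers Pi = p1/p2 exactly
```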

Gradient-based two-level optimization
The gradient-based two-level optimization is shown in Algorithm 2.
    Initialize the other basis vector coefficients.
end for
for n ← 1 to N do
    for i ← 1 to M do
        Obtain all temporary input dimensionless numbers.
        Obtain the temporary solution w_i of D w_i = 0, where w_i = Σ_{j=1}^{K} γ_ij w_bj.
        Construct the i-th input dimensionless number Π_i^n = exp(w_i^T log(p)) at the n-th iteration.
    end for
    Fit Π_h = f(Π_1^n, Π_2^n, ..., Π_M^n) with a polynomial of degree Q and obtain the polynomial coefficients β^n at the n-th iteration. ▷ Level 1.
    Fix the obtained β^n and update {γ_i1^n, γ_i2^n, ..., γ_iK^n}_{i=1}^M with the BFGS algorithm [4]. ▷ Level 2.
end for
end for

It is worth noting that if one input dimensionless number does not guarantee a high R² score, we change the number of input dimensionless numbers to two and optimize again.
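A sketch of the level-2 BFGS update on the same invented one-coefficient problem as before (basis vector w_b = [1, −1], ground truth y = 3 p_1/p_2); the level-1 polynomial refit is nested inside the objective:

```python
import numpy as np
from scipy.optimize import minimize

# Invented synthetic data for the sketch.
rng = np.random.default_rng(2)
p1, p2 = rng.uniform(0.5, 3.0, 100), rng.uniform(0.5, 3.0, 100)
y = 3.0 * (p1 / p2)
z = np.log(p1) - np.log(p2)                        # w_b^T log(p)

def objective(gamma):
    """1 - R^2 of the level-1 polynomial fit, as a function of the level-2 coefficient."""
    Pi = np.exp(gamma[0] * z)
    X = np.column_stack([np.ones_like(Pi), Pi])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # level 1: refit the polynomial
    resid = y - X @ beta
    return resid @ resid / ((y - y.mean()) @ (y - y.mean()))

res = minimize(objective, x0=[0.5], method="BFGS")  # gradient approximated numerically
```

As noted in the text, BFGS performs its own line search, so no learning rate has to be chosen.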
Supplementary Figures 3 and 4 show the gradient descent algorithm for the turbulent Rayleigh-Bénard convection and keyhole dynamics cases, respectively. From these two figures, we notice that with enough iterations, most randomly initialized points are optimized to the global optimum within 6000 iterations. The global optimal point is the same in both cases, i.e., (1, 1). This demonstrates that gradient descent is an efficient method for the two-level optimization in dimensionless learning. Note that these two figures only show points with a positive R² score at the last iteration.

Pattern search-based two-level optimization
A pattern search-based two-level optimization for a problem with two dimensionless numbers is shown in Algorithm 3. Compared with Supplementary Figures 3 and 4, pattern search-based optimization requires only 0.5% of the iterations that gradient descent optimization requires, and the red initialized points converge to the global optimal point (1, 1). By selecting the highest R² score, we can easily find the best solution, i.e., the basis coefficients, from pattern search-based optimization. Therefore, pattern search-based optimization outperforms gradient-based optimization. One reason for its effectiveness is that we aim to find simple dimensionless numbers, and a coarse grid for pattern search is adequate for locating the global optimal point.
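The pattern search move of Algorithm 3 can be sketched on a one-coefficient version of the same invented synthetic problem (basis vector w_b = [1, −1], ground truth y = 3 p_1/p_2); a coarser interval than the 0.05 used in the figures keeps the sketch short:

```python
import itertools
import numpy as np

# Invented synthetic data for the sketch.
rng = np.random.default_rng(3)
p1, p2 = rng.uniform(0.5, 3.0, 100), rng.uniform(0.5, 3.0, 100)
y = 3.0 * (p1 / p2)
z = np.log(p1) - np.log(p2)              # w_b^T log(p)

def r2_of(gamma):
    Pi = np.exp(gamma * z)
    if Pi.std() < 1e-12:                 # constant Pi: degenerate fit
        return -np.inf
    X = np.column_stack([np.ones_like(Pi), Pi])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

h, K = 0.25, 1                           # coarse grid interval, number of coefficients
center = np.array([0.25])                # initial center point on the grid
for _ in range(100):
    neighbors = [center + h * np.array(d)
                 for d in itertools.product((-1, 0, 1), repeat=K) if any(d)]
    best = max(neighbors, key=lambda g: r2_of(g[0]))
    if r2_of(best[0]) <= r2_of(center[0]):
        break                            # no neighbor improves: a local optimum
    center = best
```

With K coefficients, each iteration evaluates the 3^K − 1 neighboring grid points, as in the algorithm.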

Hyperparameter settings
The rationale behind the hyperparameter selections in this study can be summarized as follows: physical insights and experience, the tradeoff between accuracy and efficiency, and trial-and-error selection or cross-validation.
First, based on physical insights and experience, dimensionless numbers should have simple forms. Thus, the exploration range of the basis coefficients is set to [-2, 2], and the grid interval can be set to 0.01, 0.1, 0.25, or 0.5. These settings help limit the power exponents of the parameters in the dimensionless numbers and avoid identifying complex dimensionless numbers with many decimals.

Supplementary Figure 5: Pattern search optimization for turbulent Rayleigh-Bénard convection. Red points indicate the points that can be optimized to global optimal points. Blue points indicate the points that can only be optimized to local optimal points. Green points indicate the points that stayed at the initial position due to a low initial R² score. Colored curves indicate the optimization trajectories. Black contours indicate the ranges of the R² scores obtained from grid search two-level optimization. The grid range is [-2, 2] and the interval is 0.05.

Supplementary Figure 6: Pattern search optimization for keyhole dynamics. Red points indicate points that can be optimized to global optimal points. Blue points are those that can only be optimized to local optimal points. Green points represent points that remained at the initial position due to a low initial R² score. The optimization trajectories are represented by colored curves. The black contours indicate the range of R² scores obtained from grid search two-level optimization. The grid range is [-2, 2] and the interval is 0.05, but only the range [0, 2] is shown in the figure because this area has a higher R² score.

Algorithm 3 Pattern search-based two-level optimization for dimensionless learning
Require: Input variables p, dimension matrix D, all possible sparse output dimensionless numbers {Π}_H, basis vectors {w_bi}_K, the number of input dimensionless numbers M, the maximum number of iterations N, the polynomial function degree Q, and an R² threshold r for model selection.
for h ← 1 to H do ▷ Loop over all possible output dimensionless numbers.
    for i ← 1 to M do
        Fix the first coefficient of the i-th input dimensionless number.
        Divide the domain for γ_i2, γ_i3, ..., γ_iK over [-2, 2] with an interval h = 0.05.
        Randomly initialize the basis coefficients γ_i2, γ_i3, ..., γ_iK on the grid; this is the center point.
    end for
    for n ← 1 to N do
        for i ← 1 to M do
            Obtain all temporary input dimensionless numbers.
            Obtain the temporary solution w_i of D w_i = 0.
        end for
        Use a polynomial with degree Q to fit the temporary representation based on the current input dimensionless numbers for the center point and calculate the corresponding R² score as R²_center.
        if R²_center < r then
            Randomly choose other initial basis coefficients and calculate the new R²_center.
        end if
        Calculate the R² score for the 3^K − 1 nearby points in a K-dimensional Cartesian coordinate system. ▷ Pattern search.
        Replace the current basis coefficients with those giving the highest R² score.
    end for
end for
Second, to achieve a balance between equation accuracy and algorithm efficiency, we recommend using a small grid interval (0.01 or 0.1) for problems with fewer basis coefficients (four or fewer) and a large grid interval (0.25 or 0.5) for problems with more basis coefficients (more than four). This is because small grid intervals can lead to inefficiency in the high-dimensional solution space of the basis coefficients. It is also worth mentioning that large grid intervals are usually sufficient to find the optimal dimensionless numbers because of the same assumption as in the first point: a dimensionless number should be a power law with a simple form. In addition, to avoid being trapped in local minima in the gradient-based and pattern search-based approaches, the number of initial points is set to 1,000 in this paper. However, because the computational cost of the proposed method is not high, more random initial points can be chosen.
Third, several hyperparameters should be determined through trial and error or cross-validation. We summarize them below: 1. The L1 norm threshold for the input dimensionless numbers can differ between problems. Higher L1 norm thresholds allow the discovery of more complicated dimensionless numbers, and vice versa. Specifically, the thresholds in this study are 8, 9, and 13 for the problems in Sections 2.1, 2.2, and 2.3 of the main manuscript, respectively.
2. In contrast to the input dimensionless numbers, which have a higher L1 norm threshold (greater than or equal to 8), the output dimensionless numbers in this paper are suggested to have a lower L1 norm threshold, such as 5. This is because the output dimensionless number is usually related to the quantity of interest (the dependent variable), and we prefer a simple expression for it so that the quantity of interest is easier to interpret.
3. Cross-validation is an effective technique for determining hyperparameters. We demonstrate 5-fold cross-validation for determining the polynomial degree in the vapor depression dynamics problem in laser-material interaction, which is shown in Fig. 7. The R² score of a polynomial function with degree three is 0.9702, only slightly lower than that of a polynomial function with degree four (0.9708). To obtain a simple scaling law without degrading performance too much, we set the polynomial degree to three.
4. It is worth mentioning that because the BFGS optimization algorithm is used in the gradient-based optimization scheme, there is no need to manually select a learning rate or gradient descent step. The BFGS algorithm is a quasi-Newton method that approximates second-order information and uses a line search to determine the best step size at each iteration. If readers prefer to use other gradient descent methods, they need to select an appropriate learning rate and gradient step through trial and error.

Supplementary Figure 7: Polynomial degree identification for the keyhole dynamics problem using 5-fold cross-validation. The error bar shows the 95% confidence interval.
There are two reasons to set the first basis coefficient γ_1 to a constant value such as 0.5 or 1. First, a constant γ_1 helps avoid identifying equivalent dimensionless numbers. For example, assume the best basis coefficients are {γ_i}_{i=1}^K, with corresponding dimensionless number Π. The basis coefficients {Cγ_i}_{i=1}^K, where C is a constant, form the dimensionless number Π^C; by changing the fitting function, it could also achieve a high R² score. Therefore, we specify a constant for the first basis coefficient γ_1. Specifically, we can set γ_1 to 0.5 or 1 to achieve a small L1 norm of the dimensionless number; other values to consider for γ_1 include -0.5 and -1. Second, keeping one basis coefficient constant reduces the computational complexity of all the optimization methods.
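Point 3 above (cross-validation for the polynomial degree) can be sketched with numpy only; the cubic ground truth and noise level below are invented, and the small tolerance in the last line implements the "simplest near-best model" preference described in the text:

```python
import numpy as np

# Invented synthetic data with a cubic ground truth.
rng = np.random.default_rng(4)
x = rng.uniform(0.0, 2.0, 150)
y = 1.0 + x + 0.5 * x ** 3 + rng.normal(0.0, 0.02, x.size)

def cv_r2(degree, k=5):
    """Mean out-of-fold R^2 for a polynomial of the given degree."""
    idx = np.arange(x.size)
    scores = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        coeffs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coeffs, x[fold])
        ss_res = np.sum((y[fold] - pred) ** 2)
        ss_tot = np.sum((y[fold] - y[fold].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores))

scores = {d: cv_r2(d) for d in range(1, 6)}
best_score = max(scores.values())
# Prefer the simplest degree whose CV score is within a small tolerance of the best.
degree = min(d for d, s in scores.items() if s >= best_score - 5e-4)
```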

Comparison of different optimization methods
In this section, we summarize and highlight the differences between the three proposed optimization methods to clearly demonstrate their advantages and disadvantages. The detailed comparison is shown in Supplementary Table 2. It is worth noting that because we only want simple expressions of dimensionless numbers, zero-order optimization, such as the grid search-based approach or pattern search-based approach, is more suitable for finding the optimal basis coefficients, i.e., dimensionless numbers.
Identified dimensionless numbers in the main text

Vapor depression dynamics in laser-metal interaction.
As mentioned in Section 1.3 of the Supplementary Information, we can obtain several output dimensionless numbers for a given problem. In keyhole dynamics, after going through all candidate output dimensionless numbers, we identify 7 sets of dimensionless numbers with a high R² score in both the training and test sets, as shown in Supplementary Figure 8. Most of the identified relationships are straight lines, with the exception of the last row, which is a quadratic curve. All of the dimensionless numbers perform well in both the training and test sets.
Furthermore, when we look at the best representation in the first row, i.e., e* = 0.12Ke − 0.3, we notice that the bias is negligible compared to the range of e* ([0, 14]). After removing the bias, the R² score only slightly decreases. The other identified sets of dimensionless numbers (in Supplementary Figure 8) can be regarded as variants of e* = 0.12Ke. This is because a valid equation can still be obtained by multiplying both sides of e* = 0.12Ke by a dimensionless ratio. For example, the input and output dimensionless numbers in the second row can be written as (r_0 V_s / α) Ke and (r_0 V_s / α) e*, respectively; the same factor r_0 V_s / α on the left and right sides of the equation can be canceled out.

Porosity formation in 3D printing of metals.
Since the dependent variable Φ is already dimensionless in this example, there is no other possible output dimensionless number. We only need to determine the best dimensionless numbers for the independent variables, i.e., input dimensionless numbers. We assume that there are two input dimensionless numbers. For each dimensionless number, there are five basis coefficients to be identified and we fix the first one for each dimensionless number. Therefore, in the first level of optimization, we need to determine eight basis coefficients in total.
Because determining the polynomial expression for two input dimensionless numbers is relatively difficult in this case, we use an XGBoost method [5] to capture the second-level relationships.
Here, we use a pattern search-based two-level optimization method. The identified dimensionless numbers are shown in Supplementary Table 3; the first two columns represent the two input dimensionless numbers. We notice that these four sets of dimensionless numbers are based on NED and Ke_m L*_d: multiplying NED and Ke_m L*_d by some simple dimensionless ratios can dramatically increase the R².

Robustness analysis of dimensionless learning
The choice of parameter space has a significant impact on the discovered dominant dimensionless numbers and the best representation because changes in the involved variables directly change the dimension matrix D and the corresponding solution w. However, choosing an appropriate parameter space relies heavily on researchers' experience with and understanding of the problem. In this section, we analyze the robustness of dimensionless learning in the context of keyhole dynamics by omitting a necessary parameter or adding one additional parameter to the input independent variables. The results show that the proposed dimensionless learning method can not only find the best dimensionless number when additional variables are present but can also guide us to include more variables when some important variables are missing.

Supplementary Figure 8: Identified dimensionless numbers for the keyhole dynamics problem. The first and second columns contain the dimensionless numbers used as input and output, respectively. The third and fourth columns show the corresponding model performance in the training and test sets, respectively.

Omit one necessary variable for independent parameters
Sometimes, due to a lack of knowledge or experience with a complex system, we may not take into account all necessary variables. In this section, we show how dimensionless learning can guide us to involve more variables in the analysis.
Here, we omit a necessary variable in the case of keyhole dynamics, namely the thermal diffusivity α, to see whether the system can still be expressed as a polynomial relating one input dimensionless number to the keyhole aspect ratio e*. The causal relationship can be written as:

e* = f(Π_1), with Π_1 constructed from {ηP, V_s, r_0, ρ, C_p, T_l − T_0}.

Since there are six independent variables and four fundamental dimensions, there are two basis vectors w_b1 and w_b2. Supplementary Figure 9 shows the R² score on the test set for different combinations of the basis coefficients γ_1 and γ_2. The highest R² score is less than 0.80, indicating that these six independent variables are insufficient to construct a good scaling law for the system. Therefore, we need to rethink the system to determine whether we are missing any necessary variables or whether there are no dominant dimensionless numbers in the system.

Effects of adding extra variables to independent parameters
First, we add the temperature difference between the boiling and room temperatures (T_v − T_0) as a new variable to the input independent variables, and we fix the output dimensionless number as the keyhole aspect ratio e*. The causal relationship is:

e* = f(Π_1), with Π_1 constructed from {ηP, V_s, r_0, α, ρ, C_p, T_l − T_0, T_v − T_0}.

Supplementary Table 4 shows the top five discovered dimensionless numbers with a high R² score. Comparing the first two rows, the second expression (the keyhole number Ke) does not include the variable (T_v − T_0). The R² score of the keyhole number Ke is slightly lower on the training set, but it maintains the same R² score on the test set and has a simpler form. The dimensionless number expressions in the last three rows, on the other hand, become more complex after involving (T_v − T_0) and perform worse (lower R² scores).
To quantitatively understand the effect of (T_v − T_0) on the prediction of the keyhole aspect ratio e*, we performed Sobol sensitivity analysis with the open-source Python library SALib [6,7]. We begin by calculating the range of each input variable by examining the lower and upper bounds of the experimental data. Then, we generate 2^10 model inputs using the Sobol sequence [8], a quasi-random low-discrepancy sequence method for generating uniform samples. Next, we build a predictive model using the input dimensionless number with the highest R² scores in the training and test sets and evaluate the model on the generated inputs. Lastly, we use Sobol sensitivity analysis to analyze the model predictions and compute the sensitivity indices. Fig. 10 shows the sensitivity analysis results for the different variables. We can see that the sensitivity of (T_v − T_0) is nearly zero, indicating that this variable has a negligible influence on the model prediction even though it appears in the denominator of the input dimensionless number. Therefore, we choose the keyhole number Ke as the input dimensionless number because it not only has a simpler form but also excludes the unimportant variable. This also demonstrates that adding the variable (T_v − T_0) has no effect on the dimensionless numbers discovered for this problem.

Supplementary Table 4: Training and test set performance of different discovered dimensionless numbers for the keyhole dynamics problem, used to analyze the effect of adding an extra variable (T_v − T_0) to the independent variables; the output dimensionless number is fixed as the keyhole aspect ratio e*.
Second, we conduct another experiment in which we introduce a new variable, the latent heat of melting L_m, into the input independent variables and fix the output dimensionless number as the keyhole aspect ratio e*. The system can be represented as:

e* = f(Π_1), with Π_1 constructed from {ηP, V_s, r_0, α, ρ, C_p, T_l − T_0, L_m}.

Supplementary Table 5 compares the performance of the top five dimensionless numbers discovered with a high R² score. Even when the additional variable is taken into account, the keyhole number Ke remains the best dimensionless number because its expression is relatively simple and it has a high R² score and a low MSE.
We also conduct a sensitivity analysis for this experiment. In this case, we consider the input dimensionless number containing L_m listed in Supplementary Table 5. The model associated with this input dimensionless number has high R^2 scores on the training and test sets, albeit slightly lower than those of the model associated with the keyhole number Ke. Fig. 11 shows the sensitivity analysis results after including L_m in the dimensionless number expression. The sensitivity of L_m is very close to zero, indicating that adding this variable has no effect on the model prediction. Therefore, it is reasonable to omit it. This experiment also demonstrates that the proposed dimensionless learning is robust even when an additional variable is considered.

Comparison of different machine learning algorithms with dimensionless learning
Out-of-sample error is a critical metric for machine learning algorithms, especially for problems with limited data. In this section, we compare the performance of five popular machine learning algorithms with our proposed method over two small datasets. The proposed method shows the best generalization on datasets with different materials and scales.

Supplementary Figure 10: Sensitivity analysis for the keyhole dynamics problem after involving an extra variable T_v − T_0. The Sobol first-order index measures how a single variable influences the output variable, while the Sobol total index measures how single and multiple input variables together affect the output variable. A larger difference between a variable's first-order index and total index indicates stronger interactions with the other variables.

Supplementary Table 5: Performance of different discovered dimensionless numbers for the keyhole dynamics problem, used for sensitivity analysis after including an additional variable L_m in the independent variables, where the output dimensionless number is fixed as the keyhole aspect ratio e*.

Model generalization for different materials
The first dataset is the same as the one used in Section 2.2 of the manuscript. This dataset contains three different materials: titanium alloy (Ti6Al4V), aluminum alloy (Al6061), and stainless steel (SS316). Here, instead of shuffling all the data before splitting, we use only the Ti6Al4V and Al6061 data for the training set, while the remaining data (SS316) forms the test set. 5-fold cross-validation is used to select the best parameters of each model. Fig. 12 shows the R^2 score for each method. Even though the Feed-forward Neural Network (FFNN) achieves very good performance on the training and validation sets, its R^2 score on the test set is below −0.1. Random Forest (RF) has relatively high R^2 scores for training and validation, but it still performs poorly on the test set. On the other hand, the proposed method has the highest R^2 score on the test set, indicating that it has the best generalization across different materials. This improvement is the result of ensuring geometric, kinematic, and dynamic similarities between different systems based on similitude theory.
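The evaluation protocol above (hold out one material entirely, select hyperparameters by 5-fold cross-validation within the remaining data) can be sketched with NumPy alone; the material labels and sample counts below are hypothetical:

```python
import numpy as np

def group_holdout_split(groups, test_group):
    """Hold out every sample of one group (here, one material) for testing."""
    groups = np.asarray(groups)
    test_idx = np.flatnonzero(groups == test_group)
    train_idx = np.flatnonzero(groups != test_group)
    return train_idx, test_idx

def kfold_indices(n, k=5, seed=0):
    """Yield (train, validation) index pairs for k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(n)
    for fold in np.array_split(idx, k):
        yield np.setdiff1d(idx, fold), fold

# Hypothetical material labels for a small dataset.
materials = ["Ti6Al4V"] * 40 + ["Al6061"] * 30 + ["SS316"] * 20
train_idx, test_idx = group_holdout_split(materials, "SS316")
# Hyperparameters are then selected by 5-fold CV on the training set only,
# so the held-out material never influences model selection.
folds = list(kfold_indices(len(train_idx), k=5))
```

Keeping the third material out of both training and validation is what makes the reported test score a genuine measure of cross-material generalization.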

Model generalization for different scales
In this section, we use a rough pipe flow example to demonstrate that the proposed method has the best generalization ability for data at different scales. Pipe flow is a classic example for analyzing fluid flow from the laminar to the turbulent regime [12]. The Reynolds number (Re) is widely used in fluid mechanics to distinguish between laminar and turbulent flow. In general, several measurements can be collected at the steady state of a pipe flow system: fluid velocity v, dynamic viscosity µ, pipe diameter d, fluid density ρ, and pressure p.
Two types of synthetic data are created to build the dataset: small-pipe flow data for the training and validation sets, and large-pipe flow data for the test set. All the data are generated randomly; the detailed range of each parameter is shown in Supplementary Table 6. Fig. 13 compares the performance of six regression models on the training, validation, and test sets. Because the test set consists entirely of large-pipe rather than small-pipe flows, all the standard machine learning algorithms perform very poorly on the test set (R^2 below 0). The proposed method, on the other hand, performs very well on the test set. Even though this dataset is synthetic, it shows that the proposed method can be applied to data at different scales.
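As a self-contained illustration of why a dimensionless model extrapolates across scales, the sketch below trains on synthetic small-pipe data and tests on large pipes, using the laminar friction law f = 64/Re as the hidden ground truth. This law and the parameter ranges are assumptions for this example, not the paper's dataset (which is specified in Supplementary Table 6):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_pipes(n, d_range):
    """Random laminar pipe-flow samples; friction factor f = 64/Re."""
    d = rng.uniform(*d_range, n)             # pipe diameter
    v = rng.uniform(0.001, 0.01, n)          # flow velocity
    mu = rng.uniform(1e-3, 2e-3, n)          # dynamic viscosity
    rho = rng.uniform(800.0, 1200.0, n)      # fluid density
    Re = rho * v * d / mu
    return np.column_stack([v, mu, d, rho]), Re, 64.0 / Re

def r2(y, yhat):
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)

X_tr, Re_tr, f_tr = make_pipes(200, (0.01, 0.05))   # small pipes
X_te, Re_te, f_te = make_pipes(100, (0.5, 1.0))     # large pipes

# Raw-variable linear model: fit on small pipes, extrapolate to large ones.
A_tr = np.column_stack([X_tr, np.ones(len(X_tr))])
coef, *_ = np.linalg.lstsq(A_tr, f_tr, rcond=None)
r2_raw = r2(f_te, np.column_stack([X_te, np.ones(len(X_te))]) @ coef)

# Dimensionless model: fit log f = log(beta) + gamma * log(Re).
gamma, log_beta = np.polyfit(np.log(Re_tr), np.log(f_tr), 1)
r2_dimless = r2(f_te, np.exp(log_beta) * Re_te ** gamma)
```

The raw-variable fit collapses outside the training scale, while the power law in Re carries over unchanged, because Re already encodes the dynamic similarity between small and large pipes.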
Supplementary Figure 12: Comparison of six candidate models for generalization across different materials. The training and validation sets are associated with two materials (titanium alloy Ti6Al4V and aluminum alloy Al6061), while the test set contains a third material (stainless steel SS316). The higher the R^2 score, the more accurate the model. The y-axis only shows R^2 scores greater than −0.1. The regression methods include Feed-forward Neural Network (FFNN) [9], Linear Regression (LR), Extreme Gradient Boosting (XGBoost) [5], K-Nearest Neighbors (KNN) [10], Random Forest (RF) [11], and the proposed method. The error bars show the 95% confidence interval.

Supplementary Figure 13: Comparison of six candidate models for identifying a generalized model from data at different scales. The training and validation sets are generated using small-pipe cases, while the test set is associated with large-pipe cases. The higher the R^2 score, the more accurate the model. The y-axis only shows R^2 scores greater than −0.1. The error bars show the 95% confidence interval.

Governing equation discovery from limited data
In this section, we integrate dimensionless learning with symmetric invariant SINDy (Sections 7.1 and 7.2) to identify governing equations from limited data.

Data generation and preprocessing
The vorticity equation is a vorticity form of the Navier-Stokes equation that describes fluid flow through the evolution of vorticity. To generate the original data, we conducted three sets of simulations for the Kármán vortex street problem with three circular cylinders using Flow3D [13]. The geometry parameters include the diameter of the cylinders l, the domain length in the x direction l_x, and the domain length in the y direction l_y. The operational and material parameters considered for this problem include the inlet flow velocity V, the fluid dynamic viscosity µ, the fluid density ρ, and the pressure difference between upstream and downstream p_0. Other simulation parameters include the numbers of mesh cells in the x and y directions (N_x and N_y). The detailed information on these parameters is shown in Supplementary Table 7.
To identify the governing equation from data, 100 time-step snapshots are prepared and sparsely sampled in the time and space domains. Fig. 14 shows a typical snapshot and the sampling region. In the sampling region, 4,000 data points are randomly sampled. Each data point includes values for six main variables in order to form the dataset: Fig. 15 shows a schematic of concatenating the sparse regression terms from the original and transformed data. Symmetric invariant SINDy relies on this operation to implicitly incorporate symmetric invariance into the algorithm. The detailed procedures are described below. First, we flip the original data from each simulation along y = x to obtain symmetrically transformed data, which augments the training data. Note that this symmetric transformation keeps the system properties invariant because it is applied to the entire system. Second, the sparse regression libraries are written for the original and transformed data and then concatenated to determine the sparse vector of coefficients. For example, the columns of ∂ω/∂t and ∂ω/∂x from the original data in Fig. 15 are concatenated with the corresponding columns of ∂ω/∂t and ∂ω/∂x from the transformed data, respectively. The entire concatenated dataset is used in the optimization, which means the algorithm is optimized with a single "batch" that contains all the training data rather than mini-batches that each contain only a portion of it.

Symmetric invariant SINDy
This concatenation operation yields three implicit benefits for the sparse regression. First, terms with the same absolute values in the original and transformed data must have the same coefficients. For example, the coefficients for ∂ω/∂x (the fourth column) and ∂ω/∂y (the fifth column) must be equal (ξ5 = ξ6). Second, as a consequence, the number of learnable coefficients is reduced, which shrinks the solution space and the required amount of training data accordingly. Third, the data augmentation resulting from this concatenation helps the method avoid overfitting, since data augmentation is one of the key regularization techniques for preventing it [14]. With this concatenation operation, the training data is doubled, which improves the learning accuracy of the regression coefficients.
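The shared-coefficient benefit can be checked on a toy problem: if the library of the flipped data (with the x- and y-derivative columns exchanged) is stacked under the original library, the unique least-squares solution is automatically symmetric. The columns below are random stand-ins for the actual regression terms, not CFD data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
# Synthetic library columns [f, df/dx, df/dy] with a symmetric ground
# truth f_t = 0.3*df/dx + 0.3*df/dy, observed with noise.
Theta = rng.standard_normal((n, 3))
f_t = 0.3 * Theta[:, 1] + 0.3 * Theta[:, 2] + 0.02 * rng.standard_normal(n)

# Plain least squares: the two derivative coefficients come out unequal.
xi_plain, *_ = np.linalg.lstsq(Theta, f_t, rcond=None)

# Flip about y = x: the x- and y-derivative columns trade places while
# f_t is unchanged; stack the flipped library under the original one.
Theta_aug = np.vstack([Theta, Theta[:, [0, 2, 1]]])
ft_aug = np.concatenate([f_t, f_t])
xi_sym, *_ = np.linalg.lstsq(Theta_aug, ft_aug, rcond=None)
# The augmented problem is invariant under swapping the two derivative
# columns, so its unique solution satisfies xi_sym[1] == xi_sym[2].
```

No explicit equality constraint is imposed; the symmetry of the stacked data alone forces the two coefficients to coincide, which is exactly the implicit mechanism described above.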
The library of candidate terms includes the following 29 standard terms for a second-order partial differential equation: where the dimension of ξ is 29 × 1 and the derivative terms are calculated by the local polynomial interpolation approach described in [15,16]: a second-order polynomial is fitted around each data point using its eight neighboring points, and the derivatives are then evaluated from the fitted polynomial.
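A minimal sketch of this local polynomial derivative estimation, assuming a 3 × 3 stencil of neighbors and a quadratic test field for which the fit is exact:

```python
import numpy as np

def local_derivatives(points, values, center):
    """Fit a second-order 2-D polynomial to a point and its neighbors,
    then read off the first derivatives at the center (cf. [15, 16])."""
    dx = points[:, 0] - center[0]
    dy = points[:, 1] - center[1]
    # Basis: 1, dx, dy, dx^2, dx*dy, dy^2 (6 coefficients, 9 points).
    A = np.column_stack([np.ones_like(dx), dx, dy, dx**2, dx * dy, dy**2])
    c, *_ = np.linalg.lstsq(A, values, rcond=None)
    # At the center (dx = dy = 0): df/dx = c[1], df/dy = c[2].
    return c[1], c[2]

# 3x3 stencil around (0.5, 0.5) on a field with known derivatives.
g = np.array([(x, y) for x in (0.4, 0.5, 0.6) for y in (0.4, 0.5, 0.6)])
f = g[:, 0] ** 2 + 3 * g[:, 0] * g[:, 1]          # f = x^2 + 3xy
fx, fy = local_derivatives(g, f, center=(0.5, 0.5))
```

Because the test field is itself quadratic, the least-squares fit reproduces it exactly and the derivatives df/dx = 2x + 3y and df/dy = 3x are recovered at the stencil center; second derivatives could be read off the quadratic coefficients in the same way.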
Supplementary Figure 15: Concatenating the regression library for symmetric invariant SINDy from the original and transformed data. Columns with the same color have the same absolute value.
To find the sparse regression coefficients ξ, we use a sequential threshold least-squares algorithm [15]. After optimization, each case has only four non-zero regression coefficients. Thus, we can construct a non-zero coefficient dataset, combined with the parameters (and their values) listed in Supplementary Table 7 for the three different cases, as shown in Supplementary Table 8. From Supplementary Table 8, we notice that ξ12 = ξ19 is very close to a constant value (−1), while ξ6 = ξ7 varies across cases. Therefore, the temporary governing equation can be simplified as:
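The sequential threshold least-squares step can be sketched in a few lines of NumPy; the threshold, iteration count, and synthetic system below are illustrative choices, not the paper's settings:

```python
import numpy as np

def stlsq(Theta, y, threshold=0.05, n_iter=10):
    """Sequential threshold least squares (Brunton et al. [15]):
    alternate least-squares fits with hard thresholding of small terms."""
    xi, *_ = np.linalg.lstsq(Theta, y, rcond=None)
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        big = ~small
        if big.any():
            xi[big], *_ = np.linalg.lstsq(Theta[:, big], y, rcond=None)
    return xi

# Synthetic check: y depends on only two of six candidate terms.
rng = np.random.default_rng(0)
Theta = rng.standard_normal((200, 6))
y = 1.5 * Theta[:, 0] - 0.8 * Theta[:, 3] + 0.01 * rng.standard_normal(200)
xi = stlsq(Theta, y)
```

The hard threshold drives near-zero coefficients to exactly zero, which is how each simulation case ends up with only a handful of non-zero terms in Supplementary Table 8.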

Dimensionless learning
To obtain a consistent governing equation, we apply dimensionless learning to identify the expressions of ξ6 and ξ7. The causal relationship can be expressed as: The dimension matrix D, as described in Section 1 of the SI, for Eqns. 29 is obtained as follows: The columns of the matrix, from right to left, represent the dimensions of µ, l, V, ρ, and p0, respectively. The rows, from top to bottom, represent the dimensions of length, time, and mass, respectively.
Unlike Sections 2.1 and 2.2 of the main manuscript, where a polynomial is used to fit the relationship between dimensionless numbers, here we can simplify the fitting function f(·) in Eqns. 29 to a power law with a constant coefficient β. This is because the obtained temporary equations consist of derivatives multiplied by variable coefficients, which are expected to be power-law functions of the parameters. Eqns. 29 can then be simplified as: where w = [w1, w2, w3, w4, w5]^T denotes the powers that generate the dimensionless number and is to be determined.
The powers w need to satisfy the following linear system of equations, as explained in Section 1 of the SI: Furthermore, the solutions of the linear system (Eqns. 32) can be written as a linear combination of two basis vectors w b1 and w b2 : where w b1 = [−1, 1, 1, 1, 0] T and w b2 = [−1, 1, −1, 0, 1] T .
To summarize, there are three unknown parameters in the identification of ξ 6 and ξ 7 , including two basis coefficients (γ 1 and γ 2 ) and a constant coefficient β. The training data come from the non-zero regression coefficient dataset in Supplementary Table 8. The dataset includes five input parameters (l, V, µ, ρ, and p 0 ) and one output value for ξ 6 and ξ 7 . Applying a pattern search-based optimization method with a grid search range of [-2, 2] and a grid interval of 0.5, the optimized expressions for ξ 6 and ξ 7 can be written as: This form is similar to the reciprocal of the well-known Reynolds number, indicating that the integration of dimensionless learning can directly identify the expression of sparse regression coefficients. By substituting the expressions for ξ 6 and ξ 7 (Eqns. 34) into Eqns. 28, we obtain a consistent dimensionless governing equation for all cases: which is identical to the well-known Navier-Stokes equation in a vorticity form.
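The grid search over the two basis coefficients can be sketched as follows. The synthetic parameter samples and the ground-truth coefficient ξ = µ/(ρVl) are stand-ins used only to check that the search recovers γ1 = −1, γ2 = 0, β ≈ 1; the ordering of the variables as (µ, l, V, ρ, p0) is an assumption matching the basis vectors quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: rows of (mu, l, V, rho, p0) and the observed
# coefficient xi; ground truth xi = mu / (rho * V * l) = 1/Re.
P = np.exp(rng.uniform(-1, 1, (30, 5)))
xi = P[:, 0] / (P[:, 3] * P[:, 2] * P[:, 1])

# Null-space basis vectors of the dimension matrix (powers of mu,l,V,rho,p0).
w_b1 = np.array([-1.0, 1.0, 1.0, 1.0, 0.0])   # -> Reynolds number
w_b2 = np.array([-1.0, 1.0, -1.0, 0.0, 1.0])  # -> p0*l/(mu*V)

best = (np.inf, None)
grid = np.arange(-2.0, 2.01, 0.5)             # search range [-2, 2], step 0.5
for g1 in grid:
    for g2 in grid:
        # Candidate dimensionless number from the basis combination.
        Pi = np.prod(P ** (g1 * w_b1 + g2 * w_b2), axis=1)
        beta = np.dot(Pi, xi) / np.dot(Pi, Pi)   # least-squares scale
        mse = np.mean((beta * Pi - xi) ** 2)
        if mse < best[0]:
            best = (mse, (g1, g2, beta))
mse, (g1, g2, beta) = best
```

Restricting the search to linear combinations of the null-space basis guarantees that every candidate Π is dimensionless, which is the dimensional-invariance constraint of the method.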

Navier-Stokes equation
In this section, we demonstrate that dimensionless learning in conjunction with symmetric invariant SINDy can identify the Navier-Stokes equation in a Cartesian coordinate system. We performed three series of simulations for the Kármán vortex street problem with three circular cylinders to prepare the original data using Flow3D. The simulation parameters and their values for the different cases are shown in Supplementary Table 9. Each data point includes a specific time step t, two spatial Cartesian coordinates x and y, and three time-varying measurements: the fluid velocity in the x direction u, the velocity in the y direction v, and the pressure p. Following the same procedures described in Section 7.1, the regression model is defined as:

where ξ is a vector with 17 terms, u* is the combination of u (original data) and v (transformed data), v* is the combination of v and u, and p* is the combination of p with itself. It is worth mentioning that u*, v*, and p* are designed for the flipping operation so that we can concatenate the regression libraries of the original and transformed data and train them together. This operation helps incorporate symmetric invariance into the learning scheme. After the transformation, u* and v* become v* and u*, respectively, while p* remains the same.
Following the same procedure as Section 7.1, we obtain the optimized sparse regression coefficients. There are only five terms with non-zero regression coefficients, as shown in Supplementary Table 10.
Because ξ8 and ξ13 are close to −1, it is reasonable to assign −1 to both. Next, we apply dimensionless learning to identify the expressions for the other three non-zero coefficients. Using the same pattern search optimization described in Section 7.1, the identified ξ6 and ξ7 are 1.0746 µ/(ρVl), which can be simplified to 1/Re. ξ17 is identified as −0.9729 p0/(ρV^2), which can be simplified to the negative of a well-known dimensionless number, the Euler number Eu. By substituting all non-zero regression coefficients into Eqns. 36, we obtain a consistent governing equation: which is identical to the Navier-Stokes momentum equation written for the combination of the x and y directions.
We also test the effect of noisy data on this problem by imposing 0.5% Gaussian noise on the measurements. The identified non-zero coefficients are shown in Supplementary Table 11.
Using dimensionless learning, we identify the expression for ξ6 and ξ7 as 1.0473 µ/(ρVl). Similarly, the expression for ξ17 is identified as −0.9687 Eu. Compared to the clean-data results in Supplementary Table 10, ξ8 and ξ13 have higher errors. ξ8 and ξ13 can be simplified as −Eu, yielding the same equation as Eqns. 37.
It is worth mentioning that we choose a different sampling region of the same data in this problem, while all other procedures remain the same as in the previous example. The identified non-zero regression coefficients for clean and noisy data (1% Gaussian noise) are shown in Supplementary Tables 12 and 13.

which is identical to the Euler equation. The reason for simplifying the Navier-Stokes equation to the Euler equation is that the viscous effects of the fluid far from the cylinders and walls are negligible in this problem.

Data generation and preprocessing
The schematic of a spring-mass-damping system is shown in Supplementary Figure 16.
The system is governed by the equation:

Supplementary Figure 16: A schematic of the spring-mass-damping system. A mass m is suspended from a spring with an initial displacement δ, a spring constant k, and a damping coefficient c.

The analytical solution of Eqn. 39 for a small damping coefficient c is:

where A = √(δ² + (cδ/(2mω_d))²) is the oscillation amplitude and ω_d is the damped natural frequency. The regression model is then defined with a candidate library, where ξ is a vector with five terms.
The optimized sparse regression coefficients have only two non-zero terms: x and d²x/dt². The non-zero regression coefficients are shown in Supplementary Table 15. The dimensions of ξ1 and ξ2 are [T] and [T⁻¹], respectively, which makes Eqns. 41 dimensionally homogeneous. It is worth mentioning that the coefficients in Supplementary Table 15 differ from case to case even though all the data come from the same dynamic system with different parameters. This means that if we only use SINDy, we end up with multiple different governing equations for the same dynamic system.
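A minimal sketch of how the per-case coefficients in Supplementary Table 15 arise: generate one underdamped trajectory from the analytical solution, differentiate numerically, and regress dx/dt on the two surviving library terms. The parameter values are hypothetical, and plain least squares stands in for the full thresholded SINDy fit:

```python
import numpy as np

# Underdamped oscillator m*x'' + c*x' + k*x = 0, rewritten as
# dx/dt = xi1 * x + xi2 * d2x/dt2 with xi1 = -k/c and xi2 = -m/c.
m, c, k, delta = 1.0, 0.4, 4.0, 1.0           # hypothetical parameters
wd = np.sqrt(k / m - (c / (2 * m)) ** 2)      # damped natural frequency

t = np.linspace(0.0, 10.0, 10001)
# Analytical solution with x(0) = delta and zero initial velocity.
x = delta * np.exp(-c * t / (2 * m)) * (
    np.cos(wd * t) + c / (2 * m * wd) * np.sin(wd * t)
)

dx = np.gradient(x, t)        # first derivative (central differences)
ddx = np.gradient(dx, t)      # second derivative

# Trim boundary points, where one-sided differences are less accurate,
# then regress dx/dt on the two surviving library terms.
s = slice(10, -10)
Theta = np.column_stack([x[s], ddx[s]])
xi, *_ = np.linalg.lstsq(Theta, dx[s], rcond=None)
# xi should approximate [-k/c, -m/c]
```

Running this for several (m, k, c) settings gives a different coefficient pair each time, which is exactly the case-dependent table that dimensionless learning then reduces to one parameterized equation.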

Dimensionless learning
Dimensionless learning is introduced to address the aforementioned issue: inconsistent governing equations for the same dynamic system. It is designed to find the explicit expression for each varying regression coefficient from SINDy. Then, multiple governing equations can be summarized into one consistent parameterized governing equation. The detailed procedures are described below.
First, we explore the parameter space for ξ 1 . The causal relationship for ξ 1 can be expressed as: where the fitting function f (·) is assumed to be a power law with a constant coefficient β. The power law index is defined as a vector w = [w 1 , w 2 , w 3 , w 4 ].
The dimension matrix D for Eqns. 42 is: The columns, from first to last, represent the dimensions of m, k, δ, and c. The rows, from top to bottom, represent the dimensions of mass, length, and time.
After applying the pattern search-based optimization method, the identified expression for ξ1 is −1.000 k/c. Similarly, the expression for ξ2 is identified as −1.001 m/c. By substituting these two regression coefficients into Eqns. 41, we obtain a consistent governing equation for all cases: which is identical to Eqns. 39 after rearranging the terms onto one side. Furthermore, we test the effect of noisy data on this problem by imposing 4% Gaussian noise. We follow two steps to reduce the effect of the noise. In the first step, we fit a third-order polynomial to local measurements (60 points) and use the fitted function to reconstruct the original measurements. Next, the Total Variation Regularized Numerical Differentiation (TVDiff) algorithm [17] is used to calculate the first and second derivatives.

The identified non-zero regression coefficients for the noisy data are shown in Supplementary Table 16. Following the same procedure as with the clean data, the expressions for ξ1 and ξ2 are −0.9997 k/c and −1.005 m/c, respectively. These two identified expressions are very close to the ground truth −k/c and −m/c. Therefore, even with noisy data, the proposed method can identify the governing equation of spring-mass-damping systems.

Data generation and preprocessing
The structural impact problem is critical for structures subjected to dynamic loads since large plastic deformation can lead to structural failures [18]. As a special case, the dynamic response of a rigid-plastic beam subjected to a uniform velocity impulse can be solved analytically. Supplementary Figure 18 shows the schematic of a simply supported, rigid-plastic beam with unit width under an initial impulsive loading V 0 . The physical properties of the beam include material density ρ, beam height H, beam length 2L, and yield strength σ 0 .
The dynamic response of this impulsively loaded rigid-plastic beam can be expressed by the following PDE [19]: where M is the bending moment acting on the beam and v is the velocity of beam deflection. The analytical solution of Eqn. 46 contains two separate stages and is discussed in detail in [18,20]. As shown in Supplementary Table 17, four sets of data using different values of the key parameters ρ, H, L, V0, and σ0 have been generated to discover the governing PDE. As an example, Supplementary Figure 19 shows the solution for Case 2. Before discovering the differential equation, M, v, x, and t are non-dimensionalized with respect to their characteristic values: M* = 4M/(σ0 H²) is the dimensionless moment; v* = v/V0 is the dimensionless beam velocity; x* = x/L is the dimensionless position; t* = tV0/H is the dimensionless time.

Temporary governing equation discovery
Following the same procedures described in Section 7.1, the regression library can be created as follows: where ξ is the coefficient vector with seven terms. After sparse optimization, only one coefficient is non-zero, as shown in Supplementary Table 18.

Dimensionless learning
Because the identified coefficient ξ 3 varies with the parameter settings, the proposed dimensionless learning is used to identify the analytical expression of ξ 3 . The relationship between ξ 3 and other parameters can be expressed as: where the fitting function f (·) is assumed to be a power law with a constant coefficient, as in the previous sections.
In this problem, the corresponding dimension matrix is:

If there is prior knowledge about the physical system, it can be used to reduce the regression library to a smaller one. For example, convective and diffusion terms are frequently used in the governing equations of fluid mechanics, and it is reasonable to include these terms in the library.
In addition, if there is prior knowledge about the relationships between different terms, the number of learnable coefficients can be reduced. For example, for a two-dimensional fluid mechanics problem, we can assume that the two convective terms share one coefficient and the two diffusion terms share another. It is worth noting that the proposed symmetric invariance does not explicitly impose this physical constraint but achieves it implicitly.
For a more general or unfamiliar problem, we suggest including linear terms first, such as the original measurements and their first- and second-order derivatives. If the fitting performance is satisfactory, we can stop adding terms and keep the resulting differential equation. Otherwise, we keep adding quadratic and other higher-order terms and re-testing the fitting performance. The library is updated until the best-performing model is discovered.
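This incremental strategy can be sketched as follows, assuming synthetic data whose hidden dynamics contain a quadratic term, so that a linear library falls short of an R^2 threshold and the quadratic library is accepted. The threshold value is an illustrative choice:

```python
import numpy as np

def fit_r2(Theta, y):
    """Least-squares fit of y on a candidate library; returns R^2."""
    coef, *_ = np.linalg.lstsq(Theta, y, rcond=None)
    resid = y - Theta @ coef
    return 1.0 - np.sum(resid**2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal(300)
dx = rng.standard_normal(300)
# Hidden dynamics with a quadratic term, observed with small noise.
y = 2.0 * x - 0.5 * x * dx + 0.01 * rng.standard_normal(300)

# Start with linear terms; add quadratic terms only if the fit falls short.
r2_lin = fit_r2(np.column_stack([x, dx]), y)
r2_quad = fit_r2(np.column_stack([x, dx, x * x, x * dx, dx * dx]), y)
chosen = "linear" if r2_lin > 0.999 else "quadratic"
```

Stopping at the smallest library that fits well keeps the discovered equation parsimonious, which is the same trade-off the thresholded sparse regression makes at the level of individual terms.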