Introduction

As a result of an increase in inequality after the Asian financial crisis in 1997, several studies from various perspectives on inequality have been conducted on South Korean society (Birdsall, 2000; Koo, 2007; Kang and Yun, 2008; Lee et al., 2012; An and Bosworth, 2013; Kang and Rudolf, 2016; Koh, 2019). Inequality has recently been the subject of Korean cultural content, for example, the movie Parasite and the TV series Squid Game (Jang, 2021; Jin, 2021; Shin, 2020). Among all the issues related to inequality, the one that has become a sensitive topic for young Koreans is social awareness regarding the unfair distribution of opportunities due to recent political issues (Ban and Kang, 2021; Lee, 2019; Shim, 2019).

To formulate related policies, it is necessary to analyze the origin of inequality of opportunity. This study aims to identify the roots of inequality of opportunity in South Korea by applying algorithmic approaches to survey data. Specifically, it applies a decision tree classification algorithm, light gradient boosting machine (LightGBM), and SHapley Additive exPlanations (SHAP) to estimate the importance of the studied variables and to interpret and analyze the results.

According to Rawls’ (1971) definition of justice, equality of opportunity is an ethical value that enables members of society to pursue their interests through fair and equal opportunities. According to Roemer (1998), equality of opportunity begins with a discussion of factors that individuals cannot be held responsible for, called circumstances, and factors that individuals have control over and take responsibility for, called effort. Generally, there are two approaches to measure inequality of opportunity. One is the ex-ante utilitarian perspective, in which the value of opportunity sets is indicated by the average outcome within the specific type. Individuals sharing the same circumstances are regarded as a type, and if socio-economic disparities between types arise from circumstances, then the result is considered a consequence of inequality of opportunity. This between-type inequality corresponds to a weak criterion of ex-ante inequality of opportunity (Ferreira and Peragine, 2015; Fleurbaey and Peragine, 2009). The other is the ex-post view that focuses on individual outcomes, conditional on effort exertion (Fleurbaey, 1998; Fleurbaey and Peragine, 2009). According to this perspective, equality of opportunity would be satisfied if individual outcomes are equalized within groups exerting the same effort. Individuals with equal levels of effort exertion realize the same outcomes. This study focuses on inequalities between social groups defined by the set of circumstances from an ex-ante utilitarian perspective.

Various approaches have been used in empirical studies to estimate the inequality of opportunity and measure its impact. Among these, parametric approaches, which depend on statistical assumptions about the variables, are limited by bias and model selection problems (Balcazar, 2015; Roemer and Trannoy, 2016; Brunori et al., 2019a). Nonparametric approaches, which partition the sample into types, have in turn been criticized for arbitrary segmentation (Brunori et al., 2019a). The decision tree classification algorithm, although also nonparametric, is free from the bias and model selection problems because it does not make assumptions about parameters and linearity. Furthermore, it partitions the sample using a machine learning algorithm instead of arbitrary segmentation.

The decision tree approach was mentioned in Brunori et al.’s (2018) study, which is consistent with the theory proposed by Roemer (1998). Several studies used this method to measure inequality of opportunity in sub-Saharan African countries (Brunori et al., 2018, 2019b), European countries (Brunori and Neidhofer, 2020), and India (Lefranc and Kundu, 2020), and compared the results with classical parametric and nonparametric approaches.

In contrast, this study aims to identify the roots of inequality of opportunity by estimating the importance of variables, interpreting the estimated results, and analyzing the importance of individual variables, instead of merely measuring inequality of opportunity. Moreover, unlike existing studies that use a regression tree to measure inequality of opportunity across all ages, this study uses a classification tree to group people based on a specific criterion. To identify the roots of inequality of opportunity, a specific group that has lived in a similar era and social environment must be analyzed, and a criterion necessary to define the socio-economic disparity between types must be identified. For instance, if the specific group is millennials and the criterion for the disparity is minimum wage, then the importance of circumstance variables can be estimated based on the minimum wage in the binary classification process of that specific group.

This study analyzes the data with the decision tree classification algorithm, which is known to have overfitting and instability problems; it also uses LightGBM, which overcomes these drawbacks through the boosting learning method.

While tree-based models internally calculate the importance values of variables, these values may vary depending on how they are computed. SHAP, an algorithm based on game theory, allows consistent estimation of the importance of variables (Lundberg and Lee, 2017a, 2017b). Although LightGBM behaves like a black box, in that its results are difficult to interpret without knowledge of the process between the input and output of data, SHAP overcomes this limitation by allowing interpretation of the estimated results and analysis of the importance of individual variables.

This study also utilizes country-specific circumstance variables provided by the Youth Panel Survey, which reflect circumstances and economic activities of the youth in South Korea. Python 3.7 and Scikit-learn 0.22.2 are used to analyze the data.

The remainder of this paper is organized as follows: section “Background and empirical approaches” describes the background and empirical approaches; section “Methodology” describes the methodology; section “Results and discussion” reports the results of the analysis; and section “Conclusion” concludes the paper.

Background and empirical approaches

Background

According to Rawls (1958, 1971), in an egalitarian theory, justice can be understood as an endeavor to replace equality of results with equality of opportunity. This political philosophy of favoring equality of opportunity can be explained using metaphors, such as “levelling the playing field” or “equality at the starting gate.” The just society envisioned by Rawls is a society in which members are given fair and equal opportunities to pursue their interests. Rawls’ theory of justice is not directly related to an individual’s welfare level but is about the conditions that structure it.

Other philosophical contributions to this discourse were provided by Sen (1980), Dworkin (1981a, 1981b), Arneson (1989), and Cohen (1989). Rawls’ emphasis on primary goods, Sen’s capability approach, Dworkin’s view of equitable resources, and Arneson and Cohen’s individual responsibility and equal opportunity take a slightly different view of equality. Nevertheless, they all value equality of opportunity. In other words, the equality they seek guarantees equal opportunity for each member of society to achieve their desired results.

The discussion that emerged after Rawls presented an ethical justification for equality of opportunity changed the perception of equality and contributed to the philosophical debate surrounding egalitarianism and development. Roemer (1993, 1998) and Fleurbaey (1994, 2008) systematized the measurement of inequality of opportunity through a more precise definition of equality of opportunity. According to their theory, the measurement of inequality of opportunity begins by defining the factors that fall under individual responsibility, and those that do not (Roemer, 1998).

According to Roemer’s (1998) theory, if opportunities are evenly distributed, the consequences of an individual’s choice may be outside the influence of social justice. In other words, the level of equality must be measured not only by the degree of inequality that is currently observed but also by information on the source of its results. Given the same level, this inequality may be the result of personal responsibility or the result of factors beyond personal responsibility, also known as circumstances, such as gender or parental background, which are not considered personal responsibility. Thus, the socio-economic gap resulting from differences in circumstances can be interpreted as resulting from inequality of opportunity.

The Korean context

South Korean society can be divided into several generations: people who were in extreme poverty in the 1950s and the 1960s immediately after the Korean War; people who experienced rapid economic growth in the 1970s and the 1980s; people who were affected directly by economic shocks during the Asian financial crisis in the 1990s; people who entered the labor market after the crisis; and finally, the millennials. Based on both external factors and their respective social environments, each generation is bound to be significantly different from the others (Scitovsky, 1985; Cho and Kim, 1991; Fields, 1994; Lim and Jang, 2006; Kim, 2007; Koo, 2013; Lee, 2017). Millennials share a similar social environment, and the influence of other external factors is relatively small, considering their short period of experience in the labor market.

According to Deloitte (2021), in a survey of 45 countries, 73% of the South Korean millennials who were surveyed answered that wealth was distributed “not fairly equally” or “not at all equally,” which was higher than the global level (69%). These South Korean millennials are witnessing the emergence of a new status order in their society. As a result, the terms “the dirt spoon” and “the gold spoon” are now widely used (Kim, 2017). The English idiom “born with a silver spoon in one’s mouth” has been adopted by South Korean society, which led to the spoon class theory discourse. This theory refers to the idea that an individual’s socio-economic achievement is determined by one’s parents’ income and family background, regardless of one’s efforts.

Since 2015, the spoon class theory has been primarily used by millennials in online communities in South Korea (Kim, 2017). Despite its ambiguous origin and implications, the theory clearly indicates a widespread social perception, particularly among young people, that opportunities are not equally available for everyone. Such self-ridiculing discourse reveals the depth of the younger generation’s animosity toward society. In South Korean society, aside from inadequate compensation for effort exertion, the younger generation is highly dissatisfied with extreme disparities in circumstances between social groups.

In this study, the wage is set as a socio-economic achievement, and any level below the minimum wage is considered the most adverse socio-economic condition. Minimum wage refers to the minimum remuneration paid to wage earners to sustain themselves in society (Starr, 1981; Neumark and Wascher, 2008). It is directly linked to constitutional values, such as human freedom and quality of life. Therefore, living below the minimum wage implies living under the most unfavorable socio-economic conditions. By analyzing the circumstances of this adverse condition, the roots of inequality of opportunity can be identified, thereby enabling an understanding of the circumstances that produce the highest inequalities.

Empirical approaches

This study considers that the wage level of an individual is determined by circumstances, effort, social policy, and luck (Arneson, 1989; Cohen, 1989; Fleurbaey, 1994; Roemer, 1998, 2008; Lefranc et al., 2009; Ferreira and Peragine, 2015; Roemer and Trannoy, 2016). According to Roemer (1998), an individual’s achievement is determined by effort, for which individuals are responsible, and by circumstances, which are beyond their control. This study analyzes circumstances that create such socio-economic disparities and assumes the following theoretical conditions.

The ultimate wage level of an individual, \(y_i\), is generated as a function of circumstances \(C_i\), effort \(e_i\), social policy \(\varphi_i\), and luck \(l_i\), as in Eq. (1). Here, the wage, as a socio-economic achievement, is \(y_i \in Y\); the vector of circumstances, \(C_i \in C\), includes a finite number of elements, \(c_i \in C_i\); the responsibility characteristics, denoted as effort, are \(e_i \in e\); the social policy related to the individual wage level is \(\varphi_i \in \phi\); and luck is \(l\).

$$y_i = f\left( {C_i,e_i,\varphi _i,l_i} \right).$$
(1)

If equality of opportunity is achieved, circumstances do not affect the wage level, according to the condition presented in Eq. (2):

$$\frac{{\partial f\left( {C_i,e_i,\varphi _i,l_i} \right)}}{{\partial C_i}} = 0,\forall C_i.$$
(2)

According to the condition presented in Eq. (3), the effort is distributed independently from the circumstances:

$$G\left( {e_i\left| {C_i} \right.} \right) = G\left( {e_i} \right),\forall e_i,\forall C_i.$$
(3)

In the case of luck, suppose that \(y_i = f(C_i, e_i, l_i)\) is the function of an individual’s wage-generating process. When circumstances and effort are given, the distribution of wages, \(H(y|C_i, e_i)\), can be identified, where \(F_{C_i,e_i}\) is the distribution of luck, as shown in Eq. (4). If the condition presented in Eq. (2) holds, individuals face similar prospects according to their efforts, regardless of circumstances (Lefranc et al., 2009; Roemer and Trannoy, 2016). In other words, the distribution of luck is even-handed, irrespective of circumstances. This allows the distribution of luck to be dependent on effort and independent of circumstances, as shown in Eq. (5).

$$H\left( {y\left| {C_i} \right.,e_i} \right) = F_{C_i,e_i}\left( {f^{ - 1}\left( {y,C_i,e_i} \right)} \right),\forall C_i,\forall e_i.$$
(4)
$$H\left( {.\left| {C_i} \right.,e_i} \right) = H\left( {.\left| {C_j} \right.,e_i} \right) = K\left( {.\left| {e_i} \right.} \right),\forall e_i,\forall \left( {C_i,C_j} \right).$$
(5)

Under the condition presented in Eq. (6), policy \(\varphi_i\) is independent of \(C_i\). Assuming that \(C_{j-1}\) is the most disadvantaged circumstance in society and \(C_j\) is one level higher, the two do not have the same distribution of wages under policy \(\varphi_i\), as presented in Eq. (7).

$$P\left( {\varphi _i\left| {C_i} \right.} \right) = P\left( {\varphi _i} \right).$$
(6)
$$J\left( {.\left| {C_j} \right.,\varphi _i} \right)\, \ne \,J\left( {.\left| {C_{j - 1}} \right.,\varphi _i} \right).$$
(7)

If the above conditions are established and the outcome generating function of circumstances and effort is \(f: C_i \times e_j \to {\mathbb{R}}_{+}\), the ultimate wage level of an individual can be rewritten as presented in Eq. (8):

$$y_i = f\left( {C_i,e_i} \right).$$
(8)

This results in a sample of individuals, each of whom is characterized by effort ei and the vector of circumstances Ci. Assuming n circumstances and m efforts, the n-by-m matrix can be expressed as shown in Table 1.

Table 1 n circumstance-by-m efforts matrix.

In the matrix, the sample can be divided into types Ti, which share the same level of circumstances, and tranches Ti, which share the same degree of effort. Types and tranches are at the core of the two approaches for measuring the inequality of opportunity: the first is the ex-ante approach, which focuses on inequality between types categorized by the conditions of circumstances, and the second is the ex-post approach, which focuses on inequality in the outcomes of individuals based on effort exertion (Fleurbaey and Peragine, 2009). In other words, the ex-ante approach focuses on inequalities between social groups defined by the set of circumstances shared by members of each group. Conversely, the ex-post approach focuses on outcome inequalities among individuals who exert the same effort. This study aims to examine the differences between a specific group in the most adverse social condition and others, and to analyze the circumstances that structurally produce inequality in society. Accordingly, this study restricts itself to an ex-ante utilitarian perspective, in which individuals \(x_i \in \{1,...,N\}\) who share the same circumstances constitute a type, and society is classified into finite types, \(T_i = \{t_1, t_2, ..., t_n\}\). These types are mutually exclusive, and each type \(t\) has a distribution of \(p_t\) in the sample.

The problem with an empirical approach that only considers observable circumstances is that the residual domain is not included in the model (Lefranc et al., 2009). If the condition presented in Eq. (3), \(G(e_i|C_i) = G(e_i), \forall e_i, \forall C_i\), is valid, the condition presented in Eq. (9) is also established. Therefore, a consistent approach is possible using only the observable circumstances.

$$F\left( {.\left| {\widehat C_i} \right.} \right) = F\left( {.\left| {C_i} \right.} \right),\widehat C_i \subseteq C_i.$$
(9)

Observable circumstance variables are generally a subset of the actual circumstance variables (\(\widehat C \subseteq C\)) that affect an individual’s socio-economic achievement. Hence, an estimation can be made with only the observed variables, even if not all the circumstance variables are considered. This serves as an advantage when using the nonparametric decision tree classification algorithm, which will be explained in more detail in the section “Methodology”.

Classification of types

This study uses the between-type inequality approach (Bourguignon et al., 2007; Ferreira and Gignoux, 2011; Lefranc et al., 2009; Kanbur and Snell, 2017). Assuming that there are n circumstances, \(i \in \{1,...,n\}\), and m efforts, \(j \in \{1,...,m\}\), the between-type inequality approach first calculates the means \(Y_\mu = \{\mu_1, \mu_2, ..., \mu_n\}\) of the values of each type \(t\), as shown in Eq. (10). This eliminates inequality within each type and maintains the inequality between types.

$$\mu _i = \frac{1}{{N_m}}\mathop {\sum}\nolimits_{x_i \in t_i} {y_{ij},\forall x_i,\forall t_i} .$$
(10)

If there is a gap between types, there is an inequality of opportunity because of the difference in circumstances (\(c_i \in C_i\), \(C_i \in C\)). Applying this to the approach of this study, every μ is classified as a binary value based on the minimum wage.

$$k_i = \mu _i\, < \,{{{\mathrm{min}}}}\_wage \, {{{\mathrm{or}}}}\,\mu _i \ge {{{\mathrm{min}}}}\_wage,\forall k_i \in \left\{ {0,1} \right\}$$
(11)

Every μ has a value of 0 or 1, depending on the condition presented above in Eq. (11). In other words, every type is classified as 0 or 1, which can be considered the result of the difference in circumstances.
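As an illustration of Eqs. (10) and (11), the following Python sketch computes the mean wage of each type and assigns each type a binary value by comparing that mean with a threshold. The records, the circumstance labels, the threshold of 9,000, and the coding in which 1 marks a type below the minimum wage are hypothetical assumptions made for this example; they are not taken from the Youth Panel Survey.

```python
from collections import defaultdict

# Hypothetical records: (circumstance profile, hourly wage).
# Individuals sharing the same circumstance profile form one type.
records = [
    (("female", "parent_low_edu"), 7000.0),
    (("female", "parent_low_edu"), 8000.0),
    (("male", "parent_low_edu"), 9500.0),
    (("male", "parent_low_edu"), 8500.0),
    (("male", "parent_high_edu"), 12000.0),
    (("male", "parent_high_edu"), 15000.0),
]
MIN_WAGE = 9000.0  # illustrative threshold, not an official figure


def type_means(rows):
    """Eq. (10): average outcome within each type."""
    wages_by_type = defaultdict(list)
    for circumstances, wage in rows:
        wages_by_type[circumstances].append(wage)
    return {t: sum(w) / len(w) for t, w in wages_by_type.items()}


def classify_types(means, threshold):
    """Eq. (11): 1 if the type mean is below the threshold, else 0 (assumed coding)."""
    return {t: int(mu < threshold) for t, mu in means.items()}


means = type_means(records)
labels = classify_types(means, MIN_WAGE)
for t in means:
    print(t, means[t], labels[t])
```

Averaging within each type removes within-type variation, so any remaining difference in the binary values reflects differences between circumstance profiles only.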

Methodology

Methodological background

The empirical literature on measuring the inequality of opportunity chooses either parametric or nonparametric tests. Equation (12) shows the reduced form of the regression model proposed by Bourguignon et al. (2007), Trannoy et al. (2010), Ferreira and Gignoux (2011), and Singh (2011), which measures the impact of observable circumstances through parametric tests. Let yi be the socio-economic achievement of an individual i and Ci be the vector of circumstances.

$$y_i = \beta C_i + \varepsilon .$$
(12)

In this model, the regression coefficients capture both the direct effects of circumstances on yi and their indirect effects through effort (Ferreira and Peragine, 2015). The problem is that such a model cannot include every circumstance variable. Thus, the model has a downward bias, which can lead to an underestimation of the inequality of opportunity (Balcazar, 2015; Brunori et al., 2018). Therefore, the coefficient cannot be considered causal.

To solve this problem, researchers generally attempt to reduce downward bias by including more variables, such as interaction variables and higher-order polynomials, into the equation. However, this increases variance and causes an upward bias (Ferreira and Peragine, 2015; Brunori et al., 2019a). Moreover, researchers have to choose a model to deal with the above issues, which can be an important factor in determining the outcome when testing and measuring inequality of opportunity.

In contrast, nonparametric models are free from these issues. For nonparametric tests, inequality of opportunity can be estimated by considering only the observed circumstance variables (Checchi and Peragine, 2010). Researchers can divide the sample into mutually exclusive types based on all the variables being considered. Therefore, the advantage here is that a study does not have to make assumptions about the interaction of variables when analyzing the results.

However, nonparametric tests reach their limits when a small sample is divided into mutually exclusive types, which causes an overestimation of results. To ensure the reliability of the estimates, each type must therefore contain enough observations (Brunori et al., 2018, 2019a). In reality, however, individuals are not evenly distributed among the types. Consequently, when the sample is divided into types arbitrarily, the number of circumstance variables must be kept extremely limited, with attention to the balance between variables.

The decision tree analysis has been proposed as a data-driven method to overcome the limitations described above (Brunori et al., 2018; Brunori and Neidhofer, 2020). It is classified as a nonparametric machine learning method because it is not based on a probability density function (Murthy, 1998). Moreover, it does not make statistical assumptions about parameters and does not imply that the underlying relationship between variables is linear (Hastie et al., 2009; Murphy, 2012; Hegelich, 2016). These advantages enable the use of only observed circumstance variables in the analysis (Brunori et al., 2018; Brunori and Neidhofer, 2020). Therefore, it is free from potential endogeneity in the model. In addition, it is not necessary to limit the number of variables in the partitioning sample because partitioning is possible through an algorithm, rather than arbitrary judgment.

The decision tree algorithm is divided into classification and regression trees, also known as CART (Hastie et al., 2009). Identifying which method is more appropriate depends on the purpose and empirical approach of the study. The decision tree classification algorithm predicts or explains the response of a categorical dependent variable (Hastie et al., 2009). It divides the sample into mutually exclusive regions based on a specific classification condition. This study aims to estimate the importance of variables in a binary classification according to the ex-ante approach, based on the most unfavorable condition in society. To analyze the inequality of opportunity based on a particular group, a specific criterion that defines the socio-economic disparity with other groups is necessary. If the disparity is based on minimum wage, then estimating the importance of variables in the binary classification based on it is logically consistent with the purpose of this study.

Tree-based classification approaches

In this section, the process by which the decision tree classification algorithm learns the classification rules is examined. The importance of variables in partitioning the sample into mutually exclusive types can be computed. Further, LightGBM can overcome the drawbacks of the decision tree classification algorithm.

First, the process by which the decision tree classification learns classification rules is examined. It finds if/else statements in each variable area and splits it repeatedly to create rules in the entire area. The process is explained as follows (Hastie et al., 2009):

$$\widehat p_{mk} = \frac{1}{{N_m}}\mathop {\sum}\nolimits_{x_i \in R_m} {1_{\left\{ {y_i = k} \right\}}} .$$
(13)

The data consist of \(y \in {\mathbb{R}}^{n}\) and \(X \in {\mathbb{R}}^{n \times p}\), and each observation is \((x_i, y_i) \in {\mathbb{R}}^{p+1}\), \(i = 1,...,n\). When the target has two classification values, the decision tree classification method expresses the predicted value of the dependent variable y as a function of the explanatory variables, \(I = \{I_1,...,I_p\}\). The method partitions the sample into mutually exclusive regions \(R_1, R_2, ..., R_m\) using the explanatory variables I. In node m, representing a region \(R_m\) with \(N_m\) observations, Eq. (13) above indicates the proportion of class k. Depending on the proportion of k, node m has a classification value of 0 or 1, as shown in Eq. (14).

$$k\left( m \right) = {arg}\, {max}_k {\widehat{p}}_{mk}$$
(14)
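The two quantities in Eqs. (13) and (14) can be sketched in a few lines of Python. The list of labels below is a hypothetical set of observations falling into one region; the function names are chosen for this example only.

```python
from collections import Counter


def class_proportions(labels, classes=(0, 1)):
    """Eq. (13): share of each class k among the N_m observations in node m."""
    n = len(labels)
    counts = Counter(labels)
    return {k: counts[k] / n for k in classes}


def node_class(labels, classes=(0, 1)):
    """Eq. (14): the class with the largest proportion in the node.
    In a tie, the first class listed in `classes` is returned."""
    p = class_proportions(labels, classes)
    return max(classes, key=lambda k: p[k])


node_labels = [1, 1, 0, 1, 0, 1, 1, 1]  # hypothetical labels in region R_m
proportions = class_proportions(node_labels)
majority = node_class(node_labels)
print(proportions, majority)
```

Here six of the eight observations belong to class 1, so the node is assigned the classification value 1.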

Considering the partitioning process, based on the growing-splitting rules, the decision tree algorithm repeats the growing and splitting of categories into two regions. The rules are created based on the splitting criteria of the data. The quality of a split is judged by how well it separates the distribution of the target variable (dependent variable), measured by the purity or impurity of the data. In the two child nodes, the variable that maximizes the sum of purity or minimizes the sum of impurity is selected as the splitting criterion (Hastie et al., 2009).

Specifically, the model finds the condition for partitioning the dataset that yields either the largest sum of purity or the smallest sum of impurity. Impurity is measured by the cross-entropy, shown in Eq. (15), or by the Gini index, shown in Eq. (16); both are at their lowest when all observations in a node belong to the same class and increase as the classes become more evenly mixed. After splitting iteratively across child nodes, if all data in a node belong to a specific classification, the partitioning stops, and the classification is determined (Hastie et al., 2009).

$${\mathrm {Cross-entropy}}: - \mathop {\sum}\nolimits_{k = 1}^K {\widehat p_{mk}{\mathrm {log}}\widehat p_{mk}} .$$
(15)
$${\mathrm {Gini}}\,{\mathrm {index}}:\mathop {\sum}\nolimits_{k\, \ne \,k^\prime } {{\widehat{p}}_{mk}{\widehat{p}}_{mk^\prime }} = \mathop {\sum}\nolimits_{k = 1}^K {{\widehat{p}}_{mk}\left( {1 - {\widehat {p}}_{mk}} \right)} .$$
(16)
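A minimal Python sketch of Eqs. (15) and (16), together with a threshold search on a single variable, is given here for illustration. The data are hypothetical, and weighting each child node's impurity by its share of observations is an assumption that follows common CART practice rather than a detail stated in this paper.

```python
import math


def proportions(labels):
    """Class shares among the observations of one node."""
    n = len(labels)
    return [labels.count(k) / n for k in sorted(set(labels))]


def cross_entropy(labels):
    """Eq. (15): -sum_k p_k log p_k (classes absent from the node contribute 0)."""
    return -sum(p * math.log(p) for p in proportions(labels))


def gini(labels):
    """Eq. (16): sum_k p_k (1 - p_k)."""
    return sum(p * (1.0 - p) for p in proportions(labels))


def best_threshold(x, y, impurity=gini):
    """Scan candidate thresholds on one variable and return the one that
    minimises the size-weighted impurity of the two child nodes."""
    best = (None, float("inf"))
    for t in sorted(set(x))[:-1]:  # splitting at the maximum would leave one child empty
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(y)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical data: x is one circumstance variable, y the binary class.
x = [1, 2, 3, 4, 5, 6]
y = [1, 1, 1, 0, 0, 0]
print(gini(y), cross_entropy(y))  # impurity of the parent node
print(best_threshold(x, y))       # threshold that separates the two classes
```

In this example the parent node is evenly mixed, so both measures are at their maximum for two classes, and the split at x <= 3 produces two pure child nodes with zero impurity.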

According to the classification principle, this algorithm performs internal variable selection, which is an integral part of the procedure. Further, this algorithm splits the entire region into mutually exclusive regions using the explanatory variables of I, which are the splitting criteria. In other words, the decision tree classification algorithm can compute the importance values of variables in the process of dividing the entire region into mutually exclusive regions.

When the condition presented in Eq. (3), \(G(e_i|C_i) = G(e_i), \forall e_i, \forall C_i\), is valid, \(F\left( {.\left| {\widehat C_i} \right.} \right) = F\left( {.\left| {C_i} \right.} \right),\,\widehat C_i \subseteq C_i\) is established, \(F\left( {.\left| I \right.} \right) = F\left( {.\left| {\widehat C} \right.} \right) = F\left( {.\left| C \right.} \right),\,I \subseteq \widehat C \subseteq C\) is established, and the classification value k in Eq. (11) is linked to the value k classified in Eq. (14). When these conditions are met, the finite \(T_i = \{t_1, t_2, ..., t_n\}\), which is partitioned into mutually exclusive types by circumstances, has an analogous meaning to the finite \(R = \{R_1, R_2, ..., R_m\}\), which is split into mutually exclusive regions by I using the decision tree classification algorithm. In the analysis, the explanatory variables represent circumstances, and the region represents the type. Types that are divided into finite small spaces (regions) have a classification value of 1 or 0, and this algorithm calculates the importance values of circumstance variables I, which is “a subset of C,” in the process of classification.

There are several advantages of the decision tree classification algorithm. In the analysis, it was not necessary to convert categorical variables into dummy variables. Thus, the decision tree classification algorithm can handle continuous and categorical explanatory variables simultaneously without this conversion (Hastie et al., 2009; Murphy, 2012). Besides, this algorithm has the advantage of being insensitive to the monotone transformation of variables (Timofeev, 2004; Murphy, 2012).

Meanwhile, the biggest drawback of the decision tree algorithm is overfitting, which makes it unstable (Li and Belford, 2002; Murphy, 2012; James et al., 2013). This is partly due to the greedy nature of tree splitting. In this regard, small changes to the input data can greatly affect the structure of the tree because of the hierarchical nature of the tree growth process. Therefore, an error at the top can affect the rest of the tree (Li and Belford, 2002; Murphy, 2012).

Ensemble learning is a process designed to overcome shortcomings such as overfitting (Hastie et al., 2009). It refers to the process of generating several decision trees, which are multiple weak learners, and combining them to derive a more accurate and stable final prediction (Kuncheva and Whitaker, 2003).

This study proposes LightGBM as a model to overcome the shortcomings of the decision tree classification algorithm. LightGBM learns using a boosting method, a method of learning that reduces errors by assigning weights to incorrectly predicted observations so that multiple weak learners can predict more accurately while learning sequentially (Hastie et al., 2009; Chen and Guestrin, 2016). Equation (17) expresses the key point of forward stagewise additive modeling, which is the fundamental approach of the boosting algorithm.

$$G\left( x \right) = \mathop {\sum}\nolimits_{m = 1}^M {\alpha _mG_m\left( x \right).}$$
(17)

The boosting algorithm sequentially produces weak classifiers \(G_m(x)\), \(m = 1, 2, ..., M\), and \(G(x)\) gives the final classification value. The weight of each weak classifier, \(\alpha_m\), \(m = 1, 2, ..., M\), is constantly updated to allow better classification in the next step. In other words, successive classifiers are sequentially created and updated from \(G_1(x)\) to \(G_M(x)\) by minimizing the loss function to obtain the final classification. Among all boosting models, the models that minimize the loss function of the entire system through the gradient descent method are called gradient boosting decision trees (GBDT). LightGBM belongs to this family of boosting models (Chen and Guestrin, 2016; Ke et al., 2017).
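To make Eq. (17) concrete, the sketch below combines one-variable decision stumps with AdaBoost-style reweighting of misclassified observations. It is a stand-in chosen for brevity and is not LightGBM's gradient-based procedure; the data are hypothetical, and the classes are coded as -1 and +1 (an assumption made so that the sign of the weighted sum gives the final class).

```python
import math


def fit_stump(x, y, w):
    """Return (weighted error, threshold, polarity) of the best stump.
    The stump predicts `polarity` when x <= threshold, otherwise `-polarity`."""
    best = None
    for t in sorted(set(x)):
        for polarity in (1, -1):
            err = sum(wi for xi, yi, wi in zip(x, y, w)
                      if (polarity if xi <= t else -polarity) != yi)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best


def adaboost(x, y, rounds):
    n = len(x)
    w = [1.0 / n] * n  # start with equal observation weights
    ensemble = []      # list of (alpha_m, threshold, polarity)
    for _ in range(rounds):
        err, t, polarity = fit_stump(x, y, w)
        err = min(max(err, 1e-12), 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, polarity))
        # Raise the weight of misclassified observations, lower the rest.
        preds = [polarity if xi <= t else -polarity for xi in x]
        w = [wi * math.exp(-alpha * yi * pi) for wi, yi, pi in zip(w, y, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, xi):
    """Eq. (17): sign of the weighted sum of the weak classifiers."""
    score = sum(a * (p if xi <= t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical 1-D data that no single stump can classify perfectly.
x = [1, 2, 3, 4, 5, 6]
y = [1, 1, -1, -1, 1, 1]
model = adaboost(x, y, rounds=3)
print([predict(model, xi) for xi in x])
```

Each stump alone misclassifies at least two of the six observations, whereas the weighted combination of three stumps reproduces all six labels, which is the sense in which sequential weak learners yield a more accurate final classifier.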

To describe LightGBM in more detail, its sampling method is called Gradient-based One-Side Sampling (GOSS). The basic approach of GOSS is analogous to Eq. (17) (Ke et al., 2017). LightGBM is designed to inherit the advantages of existing boosting models and to compensate for their shortcomings (Ke et al., 2017). Its biggest advantage is that, unlike other GBDT models, it uses the leaf-wise tree growth method. While other GBDT models use the level-wise tree growth method to reduce the depth of the tree, LightGBM does not balance the tree but keeps splitting the leaf node with the maximum delta loss, thereby producing an asymmetric tree. As learning is repeated, the tree generated by continuously dividing the leaf node with the maximum delta loss reduces the loss more than a tree grown level-wise.

SHapley Additive exPlanations

When tree-based models internally compute the importance values of the variables, the values may vary depending on the method of calculation. The importance values of variables can be calculated for a single prediction (individualized) or for an entire dataset to explain a model’s overall behavior (global). Global importance values can be calculated for an entire dataset in different ways, thereby resulting in inconsistent results (Lundberg and Lee, 2017a, 2017b). If the importance values of the variables differ depending on the method of calculation, the reliability of the results inevitably decreases. Furthermore, it is difficult to make meaningful comparisons between the variables. In addition, machine learning algorithms have the limitation that the estimated results are difficult to interpret without knowledge of the process between the input and output of data, as with a black box (Burrell, 2016; Ribeiro et al., 2016). With only results and no interpretation, it is difficult to explain and understand the social phenomenon convincingly.

SHAP borrows the concept devised by Shapley (1953) and is based on game theory. As proposed by Lundberg and Lee (2017a), the SHAP value is a measure of the contribution of each variable to the output that interprets the estimated results. This approach estimates the importance of variables based on a solid theoretical foundation consistently and analyzes how each variable affects the output.

SHAP is based on an additive feature attribution method, which explains a model’s output as the sum of real values attributed to each explanatory variable. The goal of SHAP is to estimate the attribution of each variable and explain the results. This approach can be explained by the explanation model, which is a linear function of the binary variables as shown in Eq. (18).

$$g\left( {{{{\mathrm{z}}}}^\prime } \right) = \phi _0 + \mathop {\sum}\nolimits_{i = 1}^M {\phi _iz_i^\prime }$$
(18)

Here, g(z′) is the local surrogate model of the original model, which helps interpret the original model, where \(z^\prime \in \{0,1\}^M\). M is the number of explanatory variables, and \(\phi_i \in {\mathbb{R}}\) (Lundberg and Lee, 2017b). \(z_i^\prime\) equals 1 when a variable is observed; otherwise, it equals 0, and the \(\phi_i\) are the variable attribution values. Focusing on \(\phi_i\), the equation to estimate it is presented as Eq. (19) (Shapley, 1953; Lundberg and Lee, 2017a, 2017b).

$$\phi _i = \mathop {\sum }\limits_{S \subseteq N\backslash \left\{ i \right\}} \frac{{\left| S \right|!\left( {M - \left| S \right| - 1} \right)!}}{{M!}}\left( {f_x\left( {S \cup \left\{ i \right\}} \right) - f_x\left( S \right)} \right).$$
(19)

In this equation, N is the set of all explanatory variables and S is a subset of N that does not include i, \(S \subseteq N\backslash \{i\}\). \(\frac{{\left| S \right|!\left( {M - \left| S \right| - 1} \right)!}}{{M!}}\) is the weighting factor that counts the number of permutations of the subset S, and \(f_x(S)\) is the expected output given the variable subset S, which is like the marginal average of all variables other than the subset S. Since it is necessary to know global importance, the absolute SHAP values per variable are averaged across the data as shown in Eq. (20), where N denotes the number of observations and j indexes them.

$$I_i = \frac{1}{N}\mathop {\sum}\nolimits_{j = 1}^N {\left| {\phi _i^{\left( j \right)}} \right|} .$$
(20)

The importance values of variables can be determined through Eq. (20), but these values alone do not show how a variable affects the output. They represent neither the range and distribution of impacts that the variable has on the output nor the relation of the variable’s value to the output. The SHAP summary plot addresses this limitation: it uses \(\phi _i^{\left( j \right)}\) to convey all aspects of the importance of variables while remaining visually concise (Lundberg and Lee, 2017b).
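Eqs. (18) to (20) can be traced on a small example. The sketch below computes exact Shapley values by enumerating every subset S, which is feasible only for a handful of variables; it is not the tree-specific algorithm used in practice. The scoring function `model`, the three variables, and the four toy observations are hypothetical, and \(f_x(S)\) is approximated by averaging over the toy data with the variables in S fixed.

```python
from itertools import combinations
from math import factorial


def model(row):
    # Hypothetical scoring function standing in for a fitted classifier.
    region, gender, father_job = row
    return 2.0 * region + 1.0 * gender + 0.5 * region * father_job


def value(x, subset, background):
    """f_x(S): mean output with variables in S fixed to x, the others taken from the data."""
    total = 0.0
    for b in background:
        mixed = [x[i] if i in subset else b[i] for i in range(len(x))]
        total += model(mixed)
    return total / len(background)


def shapley_values(x, background):
    """Eq. (19): weighted marginal contribution of each variable over all subsets S."""
    m = len(x)
    phis = []
    for i in range(m):
        others = [j for j in range(m) if j != i]
        phi = 0.0
        for size in range(m):
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(x, s | {i}, background) - value(x, s, background))
        phis.append(phi)
    return phis


# Toy observations (assumed): (region, gender, father_job), each coded 0 or 1.
data = [(1, 0, 1), (0, 1, 0), (1, 1, 0), (0, 0, 1)]
phi0 = value(data[0], set(), data)  # base value: the average output over the data
all_phis = [shapley_values(x, data) for x in data]

# Eq. (20): global importance as the mean absolute SHAP value per variable.
importance = [sum(abs(p[i]) for p in all_phis) / len(data) for i in range(3)]
```

For every observation, `phi0` plus the sum of its Shapley values reproduces the model output, which is Eq. (18) with all \(z_i^\prime\) equal to 1. The `importance` list discards the sign of each contribution, which is why the summary plot is needed to see the direction of an effect.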

Evaluation

In the literature on machine learning algorithms, the importance of model performance varies depending on the objective. Research focuses either on accurate prediction or understanding the relationships between variables (Celiku and Kraay, 2017; Hegelich, 2016). This study estimates the importance of variables in the process of dividing the sample into mutually exclusive types and interpreting the results, instead of making accurate predictions. However, the study reports the evaluation results in the form of a comparison between the performance of the decision tree classification and LightGBM. Data are divided into training and test data (8:2), and the performance of the model is evaluated using test data (Bonaccorso, 2018). In addition, accuracy, precision, recall, F1, and ROC-AUC are utilized as evaluation metrics for predictive performance evaluation (Powers, 2011) (see Appendix A).
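The evaluation procedure can be sketched in a few lines. The definitions below are the standard ones for binary classification; the helper names, the random seed, and the toy labels and scores are assumptions, not values from this study.

```python
import random


def train_test_split(rows, test_share=0.2, seed=0):
    """Shuffle the rows and hold out a share of them as test data (8:2 by default)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


def classification_metrics(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs in which the positive scores higher; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


train, test = train_test_split(list(range(100)))

# Toy labels and predictions (assumed values).
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1, 0, 0]
metrics = classification_metrics(y_true, y_pred)
auc = roc_auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])
```

On the toy labels there are 4 true positives, 3 true negatives, 1 false positive, and 2 false negatives, giving an accuracy of 0.7, a precision of 0.8, a recall of 4/6, and an F1 of 8/11.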

Data

Since variables related to circumstances may differ depending on the socio-cultural characteristics of the society to which the individual belongs, they should be collected based on a sufficient understanding of each society (Roemer, 1993). This study analyzes the Youth Panel Survey conducted by the Korea Employment Information Service, which is affiliated with the Ministry of Employment and Labor. The Youth Panel Survey provides country-specific circumstances of the millennials of South Korea and their current wages. The population of the Youth Panel Survey included males and females from all over the nation, between the ages of 15 and 29, in 2007. The sample was extracted using the multi-stage area probability sampling method, and the survey was conducted in 2017 using a person-to-person interview method. In 2007, respondents were asked questions in a multiple-choice questionnaire about their circumstances around the age of 14, and in 2017, the same respondents were surveyed about their current wages. These data make it possible to analyze the socio-economic status of the survey respondents after 10 years. Given the characteristics of the decision tree algorithm discussed in the previous sections, this study does not convert the multiple-choice responses into dummy variables (see B2 in Appendix B).

Parameters such as parental education, jobs, and other family background are widely used in the empirical literature on the inequality of opportunity (Brunori et al., 2018; Brunori and Neidhofer, 2020; Ferreira and Gignoux, 2011; Palomino et al., 2019; Roemer and Trannoy, 2016). Family structures, such as single-parent families, or living without parents, can affect a broad set of outcomes at a particular point in a child’s life (Conway, 2012; Martin, 2006; Smock and Manning, 1997). In South Korea, Choi and Min (2015) showed through linear estimations that parents’ education and income levels affect their offspring’s educational achievement and performance in the labor market after graduation. Oh and Ju (2017) revealed through nonparametric tests that there is significant inequality of opportunity in income acquisition between advantaged and disadvantaged circumstances, such as a father’s education and occupation.

In South Korea, there are significant gaps among regions in terms of the level of economic development and public and educational services (Kim and Jeong, 2003; Byun and Kim, 2010; Jeon, 2012). While gender inequality in the labor market of South Korea has continuously improved in terms of labor force participation rates, gender wage gap, and the proportion of regular workers, inequality of opportunity remains an existing social phenomenon (Park, 2007; Kim et al., 2016).

Regarding the tenancy status of the house and housing types in South Korea, the proportion of people living in condominiums increases as economic status increases. In contrast, the proportion of people living in multiplex housing units tends to increase with a lower economic status (Ha, 2002, 2004, 2007; Kim, 1997). In addition, the tenancy status of the house is divided into “owned,” “lease on a deposit basis,” and “monthly rent,” depending on the economic status (Ha, 2002, 2004, 2007; La Grange and Jung, 2004).

Hence, the following parameters can be treated as circumstances for this analysis: the region where the respondent lived, the respondent’s gender, whether the respondent lived with their parents, the job and occupational position, level of education, and physical presence of the respondent’s parents, the number of parents working for economic activities, the number of siblings the respondent has, and the tenancy status and housing type of the respondent, all around the age of 14. The dependent variable has a binary classification value based on the minimum wage level of 6470 Korean won (KRW) per hour in 2017, which is equivalent to a monthly salary of 1,352,230 KRW for 209 h of work (see Appendix B).
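The threshold for the dependent variable follows from simple arithmetic, sketched below. The hourly wage and monthly hours are taken from the text; the function name `wage_label` and the coding direction (1 for a monthly wage at or above the threshold) are assumptions made for illustration, since the exact coding is given in Appendix B.

```python
MIN_HOURLY_WAGE_KRW = 6470   # 2017 minimum wage per hour, from the text
MONTHLY_HOURS = 209          # monthly working hours, from the text
MONTHLY_THRESHOLD_KRW = MIN_HOURLY_WAGE_KRW * MONTHLY_HOURS  # 1,352,230 KRW


def wage_label(monthly_wage_krw):
    """Binary class: 1 at or above the threshold, 0 below (assumed coding)."""
    return 1 if monthly_wage_krw >= MONTHLY_THRESHOLD_KRW else 0


# Hypothetical monthly wages in KRW.
labels = [wage_label(w) for w in (1_000_000, 1_352_230, 2_500_000)]
```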

Results and discussion

First, the estimation results of the decision tree classification and LightGBM are interpreted, and then the importance of individual variables is analyzed. The SHAP summary plot in Fig. 1 shows the estimated results of decision tree classification. Based on this, the region where the respondents lived is estimated to be the most important variable in the classification. This is followed by the respondents’ gender, father’s job, mother’s job, father’s position, housing type, number of siblings, tenancy status, father’s education, mother’s education, mother’s occupational position, the physical presence of their parents, number of working parents, and living with their parents. The summary plot shows that the influence and intensity of these parameters gradually decrease from the top to the bottom. However, it is difficult to clearly distinguish between the variables in the plot, making it difficult to interpret and analyze the results. This is due to the unstable nature of the decision tree algorithm, which limits interpretation.

Fig. 1: Decision tree classification SHAP summary plot.

Note: (1) The order from the top vertically indicates the importance of the variable. (2) Red color indicates a high value and blue color indicates a low value of the variable. (3) The horizontal axis denotes the impact of the value of the variable on the output. (4) The density achieved by the dots indicates their intensity.

The following SHAP summary plot in Fig. 2 shows the estimated results of LightGBM. Based on this, the region where the respondents lived, their gender, their father’s job, their mother’s job, and the tenancy status of their houses are the five most important variables. In the case of a region, like the decision tree classification algorithm, the region’s impact on the output and degree of intensity is much greater than that of the other variables. In other words, the region where the individual lived contributes the most in making a difference in their levels of achievement.

Fig. 2: LightGBM SHAP summary plot.

Note: (1) The order from the top vertically indicates the importance of the variable. (2) Red color indicates a high value and blue color indicates a low value of the variable. (3) The horizontal axis denotes the impact of the value of the variable on the output. (4) The density achieved by the dots indicates their intensity.

Regarding gender, the division of color in LightGBM is more uniform in both directions as compared to the decision tree classification. Generally, being male acts in the positive direction, whereas being female acts in the negative direction. The father’s job has more impact than the mother’s job, although the high-value categories of the mother’s job (housewife or retired) act in the positive direction. Regarding the parents’ education, the father’s educational background ranks relatively higher in importance than the mother’s. Considering the impact of the father’s job and educational background, and the respondent’s gender on the output, the overall impact of gender is socially significant.

Regarding the tenancy status of the house, the category with high value acts strongly in the positive direction, and the category with a low value affects the output negatively. When the status of the house is “owned,” which indirectly reflects the level of wealth, a positive output is indicated. In the case of the number of parents working, there is a mixture of high and low values, but the engagement of both parents has a slightly greater positive impact.

Considering the father’s educational background, a value slightly greater than the average (dark pink) works in the positive direction, and a very high value (red) works in the negative direction. Thus, it can be assumed that the father’s educational background works in a positive direction at the college level. In addition, it can be observed that the number of siblings, when few, acts in the positive direction. Considering that the average number of siblings is 2.3, it can be concluded that the output is positive when the number is below 2.

Table 2 lists the evaluation results according to the evaluation metrics for each model. LightGBM scores markedly higher than the decision tree classification on all evaluation metrics (see Fig. A1 in Appendix A for the ROC-AUC plot).

Table 2 Evaluation results.

The interpretation and analysis of the SHAP are summarized as follows: For the decision tree classification, although the importance of variables can be estimated through SHAP, it is difficult to interpret the results accurately because of the unstable characteristics of the algorithm; in contrast, LightGBM is more stable as it sequentially updates multiple classification learners, which is apparent in the SHAP summary plot and evaluation results. Thus, for the interpretation and analysis of the importance of individual circumstance variables, LightGBM seems to be more reliable and appropriate.

By applying tree-based models and SHAP to survey data, the roots of inequality of opportunity in South Korea can be identified. Region, gender, and father’s job are the most important circumstance variables. Among them, region ranks highest in terms of impact on the output and intensity. Evidently, the socio-economic achievement of individuals is greatly influenced by the region where they lived during childhood. Among the parents, both the impact and intensity of the father’s job on the output are stronger than those of the mother’s job. Furthermore, considering the educational background of both parents, the father has the greater influence on the younger generation. Interpreting this together with the impact of the respondents’ gender, it becomes evident that gender inequality exists in the social environment of South Korea.

Conclusion

This study identifies the roots of inequality of opportunity by applying algorithmic approaches to survey data. The combination of tree-based classification models and SHAP estimates the importance of circumstance variables consistently and analyzes which variables strongly influence the output and how the values of variables affect it.

The main factors of inequality of opportunity commonly estimated in this study, through the application of SHAP, decision tree classification, and LightGBM are as follows: (1) region, (2) gender, and (3) father’s job. Considering the SHAP summary plot and evaluation results, between the two models, LightGBM provides more stable and reliable results for interpretation and analysis.

Region, gender, and father’s job are the main factors that form the most unfavorable socio-economic conditions for millennials. Region has an enormous impact on an individual’s socio-economic achievement, and gender plays a significant role in contributing to the inequality of opportunity. The results of this study suggest that females may have fewer opportunities. Among the factors related to parents’ background, the father’s job and educational background are more important variables than the mother’s: the father’s background strongly influences an individual’s socio-economic achievements. Considering both the effects of the father’s background and respondents’ gender, the overall effect of being male is socially significant.

It is worth noting that this study shows that a large regional disparity exists in South Korean society. Phrases that represent specific spaces, such as the capital metropolitan area versus rural provinces, in-Seoul versus out-of-Seoul, and Gangnam (rich, south of the Han River) versus Gangbuk (poor, north of the Han River), reflect an individual’s identity, social status, and class (Bae and Joo, 2020; Park and Jang, 2020; Yang, 2018). Phrases that define regions in specific ways mean that the opportunities available to individuals vary depending on where they grew up. Whether an individual grew up in the Seoul metropolitan area, in a rural area, or in Gangnam within Seoul affects one’s achievements in South Korean society in many ways. Inequality can be structurally reproduced if a certain group of people living in a certain area monopolizes opportunities, or if some people are spatially excluded from opportunities provided by society (Soja, 2010). The results of this study provide partial evidence for this.

The society reflected in the analysis results of this study is different from that of Rawls (1971). This raises the question of how a society with equally distributed opportunities can be created. The answer lies in considering how an individual’s socio-economic achievement becomes the outcome of circumstances, effort, social policy, and luck. Out of these factors, circumstances are not dependent on an individual’s choice and cannot be easily changed.

According to the analysis of this study, individuals receive unequal opportunities owing to a combination of region, father’s background, and their own gender, thereby affecting their socio-economic achievements. If these factors remain influential from birth to adulthood, removing the conditions that structure them would be one way to achieve equality of opportunity. The ultimate goal of our society is to find policies that minimize the impact of circumstances and make the results more sensitive to effort.

The limitations of this study, along with suggestions for follow-up studies, are as follows: (1) while the current study applies algorithmic methods to the empirical approach of inequality of opportunity, the result is tentative and requires further discussion, particularly on the connection between theory, the empirical approach, and algorithms; (2) although this study selected wages as the criteria of the socio-economic gap, there may be other various criteria, not covered in this study, that may be considered in a follow-up study; (3) while the results suggest that region has the greatest influence on the inequality of opportunity in South Korean society, it is not possible to determine how the regions are stratified and the disparities among them. The analysis of regional disparity is beyond the scope of this study and should be considered by future research.

While this study has a few shortcomings, it still contributes to the development of the analysis of inequality of opportunity based on machine learning algorithms, analyzes the roots of inequality in South Korea, and complements previous studies with the help of a novel approach. Above all, this study contributes to the literature not only by describing social phenomena with data-driven methods, but also by trying to connect classic work, empirical approaches, and machine learning algorithms.