A universal model for the Lorenz curve with novel applications for datasets containing zeros and/or exhibiting extreme inequality

Sitthiyot, Thitithep; Holasut, Kanyarat

doi:10.1038/s41598-023-31827-x

Article
Open access
Published: 23 March 2023

Scientific Reports volume 13, Article number: 4729 (2023)

Abstract

Given that the existing parametric functional forms for the Lorenz curve do not fit all possible size distributions, a universal parametric functional form is introduced. Using empirical data from different scientific disciplines as well as hypothetical data, this study shows that the proposed model fits, practically well, not only data whose actual Lorenz plots have a typical convex segment but also data whose actual Lorenz plots have both horizontal and convex segments. It also perfectly fits data in which one observation is larger in size while the remaining observations are smaller and equal in size, as characterized by two positive-slope linear segments. In addition, the proposed model has a closed-form expression for the Gini index, making it computationally convenient to calculate. Considering that the Lorenz curve and the Gini index are widely used in various scientific disciplines, the proposed model and the closed-form expression for the Gini index could be used as alternative tools to analyze size distributions of non-negative quantities and examine their inequalities or unevennesses.

Introduction

The distributions of sizes vary significantly in both nature and society. Extreme inequalities in size distributions are also not unusual. In nature, based on the datasets used in Newman’s study¹, the share of the top 10% of earthquake intensity is equal to 16% of total share of earthquake intensity while the share of the top 10% of solar flare intensity accounts for 85% of total share of solar flare intensity. In addition, the top 10% of mammal species’ body mass has a share of 99% of total share of mammal species’ body mass². The degree of metabolic network of the bacterium Escherichia coli exhibits a similar pattern in that the share of the top 10% of metabolic network accounts for 99.9% of total share of metabolic network³. In society, according to the data from the American Federation of Labor and Congress of Industrial Organizations (AFL-CIO)⁴, the top 10% of compensation of chief executive officers (CEOs) has a share of 27% of total share of CEOs’ compensation whereas the data on salary of professional athletes⁵ show that the top 10% of professional women tennis players’ salary has a share of 76% of total salary share of professional women tennis players. Furthermore, based on the inter-state war data⁶, the share of the top 10% of war intensity as measured by the number of deaths per battle accounts for 91% of total share of the number of deaths per battle.

To analyze the distributions of sizes and examine the inequalities in size distributions, a tool that has been commonly used for more than a century is the Lorenz curve. It was originally developed by an American economist named Max O. Lorenz⁷ as a method for measuring wealth concentration. The Lorenz curve depicts a graphical relationship between the cumulative normalized rank of the population from the poorest to the richest (the abscissa) and the cumulative normalized wealth held by this population from the poorest to the richest (the ordinate). The application of the Lorenz curve is not limited to economics, however. According to Eliazar and Sokolov⁸, the use of the Lorenz curve has grown beyond economics and reached various scientific disciplines.

There are three popular methods that could be used to estimate the Lorenz curve: (1) interpolation techniques, (2) specifying a statistical distribution of size and deriving the corresponding Lorenz curve, and (3) specifying a parametric functional form for the Lorenz curve. Given that the interpolation techniques underestimate inequality unless the individual data on size are available and that no single statistical distribution has proved to be adequate for representing the entire size distribution⁹, numerous studies have proposed a variety of parametric functional forms in order to directly approximate the Lorenz curve^{9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34}.

According to Dagum³⁵, a good parametric functional form for estimating the Lorenz curve has to be able to describe the distributions of sizes via the changes in parameter values. The specified functional form should also provide a good fit for the entire range of size distribution since all observations are relevant for an accurate measurement of inequality or unevenness. Sitthiyot and Holasut³⁴ note that many popular parametric functional forms for the Lorenz curve do not have a closed-form expression for the Gini index, which makes the index computationally inconvenient to calculate since they require the evaluation(s) of the beta function^{11,12,13,19}, the beta and the gamma functions²⁴, or the confluent hypergeometric function¹⁷. A good functional form for the Lorenz curve should therefore have an explicit mathematical solution for the Gini index³⁵. In addition, Dagum³⁵ suggests that a good parametric functional form should employ the smallest possible number of parameters, each with a well-defined meaning, for adequate and meaningful representation. While a three- or four-parameter functional form implies a loss in simplicity, a functional form that fits the empirical data well with an associated inequality measure such as the Gini index usually requires more than two parameters. Furthermore, Dagum³⁵ notes that, from a viewpoint of computational cost and the acceptance of the specified functional form in applied sciences, a simple method of parameter estimation is always an advantage.

Thus, finding a parametric functional form that satisfies the aforementioned properties of a good parametric functional form for the Lorenz curve as suggested by Dagum³⁵ is a theoretical and practical challenge. In addition, given the fact that sizes of events or things that occur in nature and society could contain zeros such as the magnitude of earthquake intensity based on the Richter scale, the number of connections in metabolic network in living organisms, and the degree of war intensity as measured by the number of deaths per battle, to our knowledge, no study has proposed a parametric functional form that takes this possibility into account by allowing the Lorenz curve to have a horizontal-line segment in addition to a typical convex segment. Empirically, it is also possible that one observation has a larger size compared to the rest of observations whose sizes are smaller and more or less equal. For example, in nature, queen termites could live for 20 years whereas worker termites live only a few weeks to months³⁶. In society, one person could have a majority of income share while the income shares of the rest of population are more or less equal. Based on these two examples, a good parametric functional form for the Lorenz curve that fits extreme inequality in the distributions of age of termites and income of population should have two positive-slope linear segments. Also, to our knowledge, there is no existing parametric functional form for the Lorenz curve whose performance is up to this task.

To address the key issues with regard to the existing parametric functional forms for estimating the Lorenz curve as discussed above and to fill the gap in the literature on the Lorenz curve, this study introduces a universal parametric functional form for the Lorenz curve (proposed model) that has a closed-form expression for the Gini index. Our proposed model has four parameters. It comprises a linear function and a linear combination of two convex functions which are the exponential function and the functional form implied by Pareto distribution. According to Ogwang and Rao²¹, a linear combination is a way to circumvent an important drawback of traditional parametric functional forms for the Lorenz curve which is the lack of satisfactory fit over the entire range of a given size distribution. Note that the mixture of the exponential function and the functional form implied by Pareto distribution accounts for the convex segment of the Lorenz curve. The linear function is incorporated in order to characterize the horizontal-line segment of the Lorenz curve where a certain number of observations have a size of zero and/or to represent the positive-slope linear segments of the Lorenz curve where the size of one observation is larger than the sizes of the rest of observations which are smaller and more or less equal.

To demonstrate the performance of our proposed model for the Lorenz curve, the empirical data on sizes of events or things, occurring in nature and society, from across scientific disciplines are used. They are earthquake intensity, solar flare intensity, mammal species’ body mass, metabolic network of the bacterium Escherichia coli, CEOs’ compensation, salary of professional women tennis players, and inter-state war intensity. These datasets are publicly available and almost all of them can be accessed from the sources mentioned earlier. Note that the data on the earthquake intensity, the metabolic network of the bacterium Escherichia coli, and the intensity of inter-state war contain numerous observations whose sizes are equal to zero. In addition, a hypothetical dataset is created in order to illustrate how the proposed model could be used to fit the distribution of size where one observation has a larger size compared to the others which have smaller and equal size. Furthermore, this study compares the performance of the proposed model to that of Sarabia et al.²⁴ (SCS model) which, according to Tanak et al.³⁷, is considered the best performer among a number of different functional forms for the Lorenz curve in fitting to the data. Although the SCS model has been shown to outperform a number of different parametric functional forms for the Lorenz curve, it has three parameters. Thus, to level the playing field, a parametric functional form that contains four parameters developed by Sarabia²³ (S model) is also employed for the performance comparison. The main reason that we choose the S model is that it has been demonstrated to fit the data better than other well-known parametric functional forms for the Lorenz curve such as Chotikapanich⁹, Kakwani and Podder¹⁰, Rasche et al.¹³, and Arnold¹⁶.

Given that the Lorenz curve and the Gini index are extensively used in numerous scientific disciplines, our proposed model for the Lorenz curve with a closed-form expression for the Gini index could be used as an alternative tool for analyzing size distributions of non-negative quantities and examining inequalities or unevennesses.

Methods

Let $x$ be the cumulative normalized rank of size from the smallest to the largest and $y$ be the cumulative normalized size from the smallest to the largest, where $0\le x\le 1$ and $0\le y\le 1$. In addition, let $\delta ,\rho , \omega ,$ and $P$ denote parameters, where $0\le \delta <1$, $0\le \rho \le 1$, $0\le \omega \le 1$, and $P\ge 1$. While there is a vast family of existing and already known parametric functional forms for estimating the Lorenz curve that could be used in combination by assigning a weight between 0 and 1 to each functional form such that the sum of all weights is equal to 1³⁴, our proposed model is characterized by three functions which are as follows:

Linear function:

$$y\left(x\right) = \left(\frac{2}{P+1}\right)*\left(\frac{x-\delta }{1-\delta }\right),$$

(1)

given

(1) $y\left(x\right)=0$ when $x-\delta <0$,
(2) $y\left(x\right)=\left(\frac{2}{P+1}\right)*\left(\frac{x-\delta }{1-\delta }\right)$ when $0\le x-\delta <1$,
(3) $y(x)=1$ when $x=1$.

Exponential function:

$$y\left(x\right)={\left(\frac{x-\delta }{1-\delta }\right)}^{P},$$

(2)

given

(1) $y(x)=0$ when $x-\delta <0$,
(2) $y\left(x\right)={\left(\frac{x-\delta }{1-\delta }\right)}^{P}$ when $0\le x-\delta \le 1$.

Functional form implied by Pareto distribution:

$$y\left(x\right)=1-{\left(1-\left(\frac{x-\delta }{1-\delta }\right)\right)}^\frac{1}{P},$$

(3)

given

(1) $y(x)=0$ when $x-\delta <0$,
(2) $y\left(x\right)=1-{\left(1-\left(\frac{x-\delta }{1-\delta }\right)\right)}^\frac{1}{P}$ when $0\le x-\delta \le 1$.

Note that, if we separately take the integral of the linear function, the exponential function, and the functional form implied by Pareto distribution from 0 to 1, it can be shown that each functional form has the same area under the Lorenz curve, which equals $\frac{\left(1-\delta \right)}{\left(P+1\right)}$. We categorize these three functional forms into two components. The first component is the linear function and the second component is the weighted linear convex combination of the exponential function and the functional form implied by Pareto distribution, where the weight $(1-\omega )$ is assigned to the exponential function and the weight $\omega$ is assigned to the functional form implied by Pareto distribution. As discussed in Introduction, the weighted linear combination of the exponential function and the functional form implied by Pareto distribution represents the convex segment of the Lorenz curve, whereas the linear function characterizes the horizontal-line segment of the Lorenz curve where a certain number of observations have a size of zero and/or represents the positive-slope linear segments of the Lorenz curve where the size of one observation is larger than the sizes of the others, which are smaller and approximately equal. By assigning the weight $(1-\rho )$ to the linear function and the weight $\rho$ to the weighted average of the exponential function and the functional form implied by Pareto distribution, the proposed model for the Lorenz curve can be shown as Eq. (4).

$$y\left(x\right)=\left(1-\rho \right)*\left[\left(\frac{2}{P+1}\right)*\left(\frac{x-\delta }{1-\delta }\right)\right]+\rho *\left[(1-\omega )*{\left(\frac{x-\delta }{1-\delta }\right)}^{P}+\omega *\left(1-{\left(1-\left(\frac{x-\delta }{1-\delta }\right)\right)}^\frac{1}{P}\right)\right].$$

(4)
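As a minimal sketch, Eq. (4) together with the conditions listed under Eqs. (1) to (3) can be evaluated point by point as follows. The function name `proposed_lorenz` is hypothetical, and the treatment of the linear component at $x=1$ follows condition (3) of Eq. (1) as stated above.

```python
def proposed_lorenz(x, delta, rho, omega, P):
    """Evaluate the proposed Lorenz curve of Eq. (4) at a rank x in [0, 1].

    Assumes 0 <= delta < 1, 0 <= rho <= 1, 0 <= omega <= 1 and P >= 1.
    """
    if x <= delta:
        # Horizontal-line segment: every component is zero up to x = delta.
        return 0.0
    u = (x - delta) / (1.0 - delta)  # rescaled rank in (0, 1]
    # Linear component, Eq. (1); its condition (3) sets it to 1 at x = 1.
    linear = 1.0 if x >= 1.0 else (2.0 / (P + 1.0)) * u
    # "Exponential function" of Eq. (2), i.e. the rescaled rank raised to P.
    power = u ** P
    # Functional form implied by Pareto distribution, Eq. (3).
    pareto = 1.0 - (1.0 - u) ** (1.0 / P)
    convex = (1.0 - omega) * power + omega * pareto
    return (1.0 - rho) * linear + rho * convex


if __name__ == "__main__":
    # Illustrative parameter values only; they are not estimates from the study.
    for x in (0.0, 0.2, 0.5, 0.9, 1.0):
        print(x, proposed_lorenz(x, delta=0.2, rho=0.5, omega=0.5, P=3.0))
```

With $\rho=1$, $\omega=0$, and $\delta=0$ the function reduces to $x^{P}$, and with $\rho=0$ it reduces to the linear function alone, which is a quick way to check an implementation against Eqs. (1) and (2).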

Our proposed model satisfies all necessary and sufficient conditions for the theoretical Lorenz curve which are as follows:

(1) $y\left(0\right)=0$,
(2) $y(1)=1$,
(3) $\frac{dy}{dx}=\left(1-\rho \right)*\left[\left(\frac{2}{P+1}\right)*\left(\frac{1}{1-\delta }\right)\right]+\rho *\left[\frac{\left(1-\omega \right)*P*{\left(\frac{x-\delta }{1-\delta }\right)}^{P}}{\left(x-\delta \right)}+\omega *\left(\frac{1}{P}\right)*\frac{{\left(\frac{1-x}{1-\delta }\right)}^{\frac{1}{P}}}{\left(1-x\right)}\right]\ge 0$, given $0\le x-\delta \le 1$,
(4) $\frac{{d}^{2}y}{{dx}^{2}}=\rho *\left[\left(1-\omega \right)*\frac{\left(P-1\right)*P*{\left(\frac{x-\delta }{1-\delta }\right)}^{P}}{{\left(x-\delta \right)}^{2}}+\omega *\frac{\left(\frac{P-1}{P}\right)*\left(\frac{1}{P}\right)*{\left(\frac{1-x}{1-\delta }\right)}^{\frac{1}{P}}}{{\left(1-x\right)}^{2}}\right]\ge 0$, given $0\le x-\delta \le 1$.

Note that if all necessary and sufficient conditions for the theoretical Lorenz curve as specified above are satisfied, the Lorenz curve is convex, except when the parameters $\delta =0$ and $P=1$, in which case the Lorenz curve is linear. Figure 1 illustrates the Lorenz curve that is consistent with the proposed model.
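The monotonicity and convexity conditions can also be checked numerically with finite differences. The sketch below is only a sanity check on a grid, not a proof; the helper names and the parameter values in the usage lines are hypothetical, and the grid stops short of $x=1$, where the linear component is set to 1 by definition.

```python
def proposed_lorenz(x, delta, rho, omega, P):
    """Proposed Lorenz curve, Eq. (4), at a single rank x in [0, 1]."""
    if x <= delta:
        return 0.0
    u = (x - delta) / (1.0 - delta)
    linear = 1.0 if x >= 1.0 else (2.0 / (P + 1.0)) * u
    convex = (1.0 - omega) * u ** P + omega * (1.0 - (1.0 - u) ** (1.0 / P))
    return (1.0 - rho) * linear + rho * convex


def is_nondecreasing_and_convex(delta, rho, omega, P, n=1000, tol=1e-12):
    """Check first and second differences of the curve on a uniform grid."""
    xs = [i / n for i in range(n)]  # grid on [0, 1); x = 1 is excluded
    ys = [proposed_lorenz(x, delta, rho, omega, P) for x in xs]
    first = [b - a for a, b in zip(ys, ys[1:])]
    second = [b - a for a, b in zip(first, first[1:])]
    # A small tolerance absorbs floating-point rounding on linear stretches.
    return min(first) >= -tol and min(second) >= -tol


if __name__ == "__main__":
    print(is_nondecreasing_and_convex(0.3, 0.6, 0.4, 2.5))
    print(is_nondecreasing_and_convex(0.0, 0.0, 0.0, 1.0))  # the linear case
```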

According to Sitthiyot and Holasut³⁴, different scientific disciplines may have their own theoretical justifications when applying the parametric functional form for the Lorenz curve to examine size distributions of non-negative quantities and calculate statistical evenness measures. However, irrespective of disciplines of sciences, the parameter $\delta$ measures the distance along the horizontal-line segment of the Lorenz curve where the value of the cumulative normalized size $\left(y\right)$ is equal to zero for a given range of the cumulative normalized rank of size $\left(x\right)$ as illustrated in Fig. 1. As described above, the parameter $\rho$ is the weight given to the convex segment of the Lorenz curve while the weight $\left(1-\rho \right)$ is given to the linear segment of the Lorenz curve. The parameter $P$ represents the degree of inequality or unevenness in size distribution as measured by the Gini index. The parameter $\omega$ is the weight that controls the curvature of the Lorenz curve such that the Gini index remains unchanged since, for a particular value of parameter $P$, there are infinite values of parameter $\omega$ that could give an identical value of the Gini index. The parameter $\omega$ thus provides the information about size shares in case two or more Lorenz curves intersect. In addition, from an analytical point of view, the key advantage of using the weighted linear convex combination of the exponential function and the functional form implied by Pareto distribution is that the shape of the estimated Lorenz curve could be handily adjusted via the change in parameter $\omega$ while the value of the Gini index is held constant. This may not be easily done for linear convex combinations of other functional forms for the Lorenz curve.

To our knowledge, no study has employed a parametric functional form for estimating the Lorenz curve by combining the linear function, the exponential function, and the functional form implied by Pareto distribution before. The closest one is the model proposed by Sarabia²³ whose parametric functional form represents a linear convex combination of the egalitarian line, the power Lorenz curve, and the classical Pareto Lorenz curve. Based on our proposed model as shown in Eq. (4), the area under the Lorenz curve and the closed-form expression for the Gini index can be conveniently calculated as shown in Eqs. (5) and (6), respectively.

$${\int }_{0}^{1}y(x)dx= \frac{\left(1-\delta \right)}{\left(P+1\right)}.$$

(5)

$${\mathrm{Gini\,index}}_{\mathrm{Proposed}}=1-2*{\int }_{0}^{1}y\left(x\right)dx = 1-2*\left(\frac{1-\delta }{P+1}\right),$$

(6)

$$0\le\,{\mathrm{Gini\,index}}_{\mathrm{Proposed}}\le 1.$$

The Gini index takes values between 0 and 1. The closer the index is to 0, the more equal the distribution of size whereas the closer the index is to 1, the more unequal the size distribution. The formulae for calculating the area under the Lorenz curve and for computing the closed-form expression for the Gini index are also shown in Fig. 1.
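The closed-form expression in Eq. (6) can be compared against a direct numerical integration of Eq. (4). The sketch below uses a midpoint rule and hypothetical function names and parameter values; it also illustrates the point made about $\omega$, namely that changing $\omega$ alters the shape of the curve while leaving the Gini index unchanged.

```python
def proposed_lorenz(x, delta, rho, omega, P):
    """Proposed Lorenz curve, Eq. (4), at a single rank x in [0, 1]."""
    if x <= delta:
        return 0.0
    u = (x - delta) / (1.0 - delta)
    linear = 1.0 if x >= 1.0 else (2.0 / (P + 1.0)) * u
    convex = (1.0 - omega) * u ** P + omega * (1.0 - (1.0 - u) ** (1.0 / P))
    return (1.0 - rho) * linear + rho * convex


def gini_closed_form(delta, P):
    """Gini index from Eq. (6); it depends only on delta and P."""
    return 1.0 - 2.0 * (1.0 - delta) / (P + 1.0)


def gini_numeric(delta, rho, omega, P, n=20000):
    """Gini index as 1 - 2 * area, with the area from a midpoint rule."""
    h = 1.0 / n
    area = h * sum(
        proposed_lorenz((i + 0.5) * h, delta, rho, omega, P) for i in range(n)
    )
    return 1.0 - 2.0 * area


if __name__ == "__main__":
    delta, rho, P = 0.2, 0.7, 3.0  # illustrative values, not estimates
    print("closed form:", gini_closed_form(delta, P))
    for omega in (0.0, 0.5, 1.0):
        # Same Gini index for every omega, different curve height at x = 0.6.
        print(omega, gini_numeric(delta, rho, omega, P),
              proposed_lorenz(0.6, delta, rho, omega, P))
```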

To demonstrate the performance of the proposed model for the Lorenz curve, this study utilizes the data on sizes of events or things that occur in both nature and society from different scientific disciplines. In addition, we create a hypothetical dataset, representing a society that exhibits extreme inequality in income distribution in that 99 persons have an equal income of one unit and only one person has income of 99 units, in order to illustrate that our proposed model could be used to fit the distribution of size where one observation has a larger size than the others which have a smaller and equal size. The list of data and their sources are provided in Table 1.
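For concreteness, the actual Lorenz plot of the hypothetical dataset can be built directly from the definitions of $x$ and $y$. This is a sketch with a hypothetical helper name, `lorenz_points`, and it assumes the rank of the $i$-th smallest observation is normalized as $i/N$.

```python
def lorenz_points(sizes):
    """Return cumulative normalized ranks (xs) and sizes (ys), smallest first."""
    ordered = sorted(sizes)
    n = len(ordered)
    total = sum(ordered)
    xs, ys = [], []
    running = 0.0
    for i, size in enumerate(ordered, start=1):
        running += size
        xs.append(i / n)            # cumulative normalized rank
        ys.append(running / total)  # cumulative normalized size
    return xs, ys


if __name__ == "__main__":
    # 99 persons with one unit of income each and one person with 99 units.
    incomes = [1] * 99 + [99]
    xs, ys = lorenz_points(incomes)
    print(xs[98], ys[98])  # the bottom 99% of persons hold half of total income
    print(xs[99], ys[99])  # the curve ends at (1, 1)
```

The points lie on a straight line from the origin to $(0.99, 0.5)$ and then rise to $(1, 1)$, which is the two-segment shape described in the text.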

Table 1 The list of the data on sizes of events or things occurring in nature and society and the hypothetical data as well as their sources.

According to the proposed model as shown in Eq. (4), the parameters $\delta ,\rho , \omega ,$ and $P$ can be estimated by using the curve fitting technique based on minimizing the sum of squared errors. Let ${e}_{i}^{2}$ be the squared error, ${y}_{i}$ be the actual cumulative normalized size from the smallest to the largest, ${\widehat{y}}_{i}$ be the estimated cumulative normalized size from the smallest to the largest, and $\mathrm{N}$ be the number of observations. The minimization of the sum of squared errors $\left(min\sum_{i=1}^{\mathrm{N}}{e}_{i}^{2}\right)$ can be written as $min\sum_{i=1}^{\mathrm{N}}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}$. That is, for any given ${x}_{i}$, where ${x}_{i}$ is the cumulative normalized rank of size from the smallest to the largest, we have to find the values of parameters $\delta ,\rho , \omega ,$ and $P$ such that $\sum_{i=1}^{\mathrm{N}}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}$ is minimized. Note that, given the estimated values of parameters $\delta ,\rho , \omega ,$ and $P$, ${\widehat{y}}_{i}$ is computed from Eq. (4) by plugging in ${x}_{i}$.
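The least-squares criterion above can be sketched in code as follows. This study uses the Microsoft Excel Solver for the minimization; the sketch instead uses a simple bounded compass (pattern) search as a stand-in, so it should be read as an illustration of the objective rather than a reproduction of the estimation procedure. The starting values, the upper bounds on $\delta$ and $P$, and all function names are assumptions made for the example.

```python
def proposed_lorenz(x, delta, rho, omega, P):
    """Proposed Lorenz curve, Eq. (4), at a single rank x in [0, 1]."""
    if x <= delta:
        return 0.0
    u = (x - delta) / (1.0 - delta)
    linear = 1.0 if x >= 1.0 else (2.0 / (P + 1.0)) * u
    convex = (1.0 - omega) * u ** P + omega * (1.0 - (1.0 - u) ** (1.0 / P))
    return (1.0 - rho) * linear + rho * convex


def sse(params, xs, ys):
    """Sum of squared errors between actual ys and the fitted curve."""
    delta, rho, omega, P = params
    return sum(
        (y - proposed_lorenz(x, delta, rho, omega, P)) ** 2
        for x, y in zip(xs, ys)
    )


# Parameter order: (delta, rho, omega, P). The caps 0.99 and 100.0 are
# hypothetical numerical bounds; the model itself only requires
# 0 <= delta < 1 and P >= 1.
LOWER = (0.0, 0.0, 0.0, 1.0)
UPPER = (0.99, 1.0, 1.0, 100.0)


def fit_compass(xs, ys, start=(0.0, 0.5, 0.5, 2.0), step=0.25,
                tol=1e-6, max_sweeps=200):
    """Minimize the SSE by trying +/- step on one parameter at a time."""
    params = list(start)
    best = sse(params, xs, ys)
    sweeps = 0
    while step > tol and sweeps < max_sweeps:
        sweeps += 1
        improved = False
        for i in range(4):
            for move in (step, -step):
                trial = list(params)
                trial[i] = min(max(trial[i] + move, LOWER[i]), UPPER[i])
                value = sse(trial, xs, ys)
                if value < best:  # accept only strict improvements
                    params, best, improved = trial, value, True
        if not improved:
            step /= 2.0  # refine the search once no move helps
    return params, best


if __name__ == "__main__":
    # Hypothetical dataset: 99 incomes of one unit and one income of 99 units.
    xs = [i / 100 for i in range(1, 101)]
    ys = [min(i, 99) / 198 for i in range(1, 100)] + [1.0]
    params, best = fit_compass(xs, ys)
    print("delta, rho, omega, P =", params)
    print("SSE =", best)
```

A compass search only finds a local minimum and can be sensitive to the starting point, so in practice several starting values, or a dedicated nonlinear least-squares routine, would be a reasonable precaution.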

To evaluate how well the estimated Lorenz curves fit both the empirical and the hypothetical data, this study employs five goodness-of-fit statistics which are coefficient of determination (R²), mean squared error (MSE), mean absolute error (MAE), maximum absolute error (MAS), and information inaccuracy measure (IIM) developed by Theil³⁸ which can be computed as $\sum_{i =1}^{\mathrm{N}}{y}_{i}*{\mathit{log}}_{10}\left(\frac{{y}_{i}}{{\widehat{y}}_{i}}\right)$. The closer the value of R² is to 1 as well as the closer the values of MSE, MAE, and MAS are to 0, the better the estimated functional form. For the IIM criterion, the estimated functional specification that has a smaller absolute value of IIM is better than those with larger absolute values of IIM. In addition, we compare the performance of our proposed model to that of the SCS model which, according to Tanak et al.³⁷, has the best overall performance among a number of different functional forms employed in approximating the Lorenz curve. By using the same notations for the cumulative normalized rank of size $\left(x\right)$ and the cumulative normalized size $\left(y\right)$ as before and also letting $\gamma$, $\alpha$, and $\beta$ denote parameters, where $\gamma \ge 0, 0<\alpha \le 1,$ and $\beta \ge 1$, the SCS model is shown as Eq. (7).

$$y\left(x\right)={x}^{\gamma }*{\left(1-{\left(1-x\right)}^{\alpha }\right)}^{\beta }.$$

(7)
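The five goodness-of-fit statistics described above can be computed from the actual and estimated cumulative normalized sizes as in the following sketch. Two points are assumptions rather than statements from the text: R² is taken as one minus the ratio of the residual to the total sum of squares, and in the IIM sum a term with ${y}_{i}=0$ is treated as zero while a positive ${y}_{i}$ paired with a non-positive ${\widehat{y}}_{i}$ is treated as infinite. The function names are hypothetical.

```python
import math


def iim_term(actual, estimate):
    """One term y_i * log10(y_i / yhat_i) of the information inaccuracy measure."""
    if actual == 0:
        return 0.0  # assumed convention: 0 * log(0 / yhat) contributes nothing
    if estimate <= 0:
        return math.inf  # assumed convention: the logarithm is undefined here
    return actual * math.log10(actual / estimate)


def goodness_of_fit(y_true, y_hat):
    """Return R2, MSE, MAE, MAS and IIM for paired actual and fitted values."""
    n = len(y_true)
    residuals = [a - b for a, b in zip(y_true, y_hat)]
    sse = sum(r * r for r in residuals)
    mean_y = sum(y_true) / n
    sst = sum((a - mean_y) ** 2 for a in y_true)
    return {
        "R2": 1.0 - sse / sst,                    # assumed definition of R2
        "MSE": sse / n,                           # mean squared error
        "MAE": sum(abs(r) for r in residuals) / n,  # mean absolute error
        "MAS": max(abs(r) for r in residuals),    # maximum absolute error
        "IIM": sum(iim_term(a, b) for a, b in zip(y_true, y_hat)),
    }


if __name__ == "__main__":
    # Made-up values for illustration only.
    actual = [0.0, 0.1, 0.4, 1.0]
    fitted = [0.0, 0.2, 0.4, 1.0]
    for name, value in goodness_of_fit(actual, fitted).items():
        print(name, value)
```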

Note that the SCS model does not have an explicit mathematical solution for the Gini index. Therefore, we calculate the value of the estimated Gini index based on the SCS model $\left({\mathrm{Gini\,index}}_{\mathrm{SCS}}\right)$ by using numerical integration. In addition to the SCS model, we compare the performance of our proposed model to that of the S model since it has four parameters and is shown to outperform other well-known parametric functional forms for estimating the Lorenz curve as discussed in Introduction. For notation, let $\lambda$, $\eta$, ${a}_{1}$, and ${a}_{2}$ denote parameters, where ${a}_{1}\ge 0$, ${a}_{2}+1>0$, $\eta *{a}_{2}+\lambda \le 1$, $\lambda \ge 0$, and $\eta *{a}_{2}\ge 0$. The S model can then be shown as Eq. (8).

$$y\left(x\right)=\left(1-\lambda +\eta \right)*x+\lambda *{x}^{{a}_{1}+1}-\eta *\left[1-{\left(1-x\right)}^{{a}_{2}+1}\right].$$

(8)

According to Sarabia²³, the S model has a closed-form expression for the Gini index which is shown as Eq. (9).

$${\mathrm{Gini \; index}}_{\mathrm{S}}=\lambda *\left(1-\frac{2}{2+{a}_{1}}\right) +\eta *\left(1-\frac{2}{2+{a}_{2}}\right).$$

(9)
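The two comparison models and their Gini indices can be sketched as follows. The midpoint rule is one possible numerical integration scheme and is an assumption here, since the text does not say which scheme was used for the SCS model; the function names and the parameter values in the usage lines are hypothetical. For the S model, the numerical result can be checked against the closed form in Eq. (9).

```python
def scs_lorenz(x, gamma, alpha, beta):
    """SCS model, Eq. (7)."""
    return x ** gamma * (1.0 - (1.0 - x) ** alpha) ** beta


def s_lorenz(x, lam, eta, a1, a2):
    """S model, Eq. (8)."""
    return ((1.0 - lam + eta) * x
            + lam * x ** (a1 + 1.0)
            - eta * (1.0 - (1.0 - x) ** (a2 + 1.0)))


def gini_by_midpoint(curve, n=10000):
    """Gini index as 1 - 2 * area under a Lorenz curve (midpoint rule)."""
    h = 1.0 / n
    area = h * sum(curve((i + 0.5) * h) for i in range(n))
    return 1.0 - 2.0 * area


def s_gini_closed_form(lam, eta, a1, a2):
    """Closed-form Gini index of the S model, Eq. (9)."""
    return lam * (1.0 - 2.0 / (2.0 + a1)) + eta * (1.0 - 2.0 / (2.0 + a2))


if __name__ == "__main__":
    # Illustrative parameter values only; they are not estimates from Table 3.
    lam, eta, a1, a2 = 0.5, 0.2, 2.0, 0.5
    print(gini_by_midpoint(lambda x: s_lorenz(x, lam, eta, a1, a2)))
    print(s_gini_closed_form(lam, eta, a1, a2))
    print(gini_by_midpoint(lambda x: scs_lorenz(x, 1.0, 1.0, 1.0)))
```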

This study uses the Microsoft Excel Data Analysis program and the Microsoft Excel Solver program, which are available on most, if not all, computers, for calculating the descriptive statistics as well as estimating the parameters and calculating the values of the estimated Gini index. As suggested by Dagum³⁵, from a viewpoint of computational cost and the acceptance of the specified functional form in applied sciences, a simple method of parameter estimation is always an advantage. Table 2 reports the descriptive statistics of the data on sizes of events or things occurring in nature and society as well as the hypothetical data representing the case where one observation is larger in size (one person has income of 99 units) compared to the rest of observations which are smaller and equal in size (99 persons have an equal income of one unit).

Table 2 The descriptive statistics of the empirical data on sizes of events or things occurring in nature and society and the hypothetical data.

Results and discussion

Table 3 reports the results of the estimated parameters based on the proposed model, the SCS model, and the S model. We first compare the performance of our proposed model to that of the SCS model. As shown in Table 4, the values of R² ranging between 0.9296 and 1.000 for the proposed model and between 0.8586 and 0.9994 for the SCS model suggest that all estimated Lorenz curves fit the empirical and the hypothetical data reasonably well. While both models perform equally well on the criteria of R², MAS, and IIM, our proposed model slightly outperforms the SCS model on the basis of MSE and MAE.

Table 3 The estimated parameters based on the proposed model, the SCS model, and the S model.

Table 4 The evaluation of performance of the proposed model and that of the SCS model based on five goodness-of-fit statistics.

Considering the earthquake intensity and the inter-state war intensity whose sizes contain zeros, the results, as reported in Table 4, indicate that our proposed model outperforms the SCS model in all five goodness-of-fit statistics while it outperforms the SCS model in three out of five statistical measures of goodness-of-fit for the metabolic degree. Moreover, our proposed model fits the hypothetical data perfectly well whereas the SCS model fits the hypothetical data relatively less well on the criteria of R², MSE, MAE, MAS, and IIM.

In total, there are 22 out of 40 cases where the proposed model outperforms the SCS model while there are 17 cases in which the SCS model performs better than the proposed model. There is one case which is the size of CEOs’ compensation where both models perform equally well on the basis of MAE. Thus, on the criteria of five statistical measures of goodness-of-fit, namely, R², MSE, MAE, MAS, and IIM, we can conclude that the performance of our proposed model is slightly better than that of the SCS model. Nonetheless, the proposed model is clearly superior to the SCS model when the data contain zeros and/or exhibit extreme inequality. Figure 2 shows the actual Lorenz plots of the data on sizes and their corresponding estimated Lorenz curves based on the proposed model and the SCS model.

Next, we report the performance comparison between the proposed model and the S model. The results are shown in Table 5. On the basis of R², MSE, MAE, MAS, and IIM, our proposed model outperforms the S model in 33 out of 40 cases. Focusing on the cases where the data contain zeros, the proposed model outperforms the S model for the earthquake intensity and the inter-state war intensity as measured by all five goodness-of-fit statistics while it outperforms the S model in three out of five goodness-of-fit statistics for the metabolic degree. In addition, the values of R², MSE, MAE, MAS, and IIM indicate that our proposed model fits the hypothetical data better than the S model. Figure 3 illustrates the actual Lorenz plots of the data on sizes and their corresponding estimated Lorenz curves according to the proposed model and the S model.

Table 5 The evaluation of performance of the proposed model and that of the S model based on five goodness-of-fit statistics.

The overall performance comparison between the proposed model and the SCS model and that between the proposed model and the S model indicate that, on the basis of R², MSE, MAE, MAS, and IIM, our proposed model, by and large, is superior to the SCS model and the S model, especially when the data contain zeros and/or exhibit extreme inequality. For the data containing zeros, this can be demonstrated by the positive values of parameter $\delta$ as shown in Table 3 and also the horizontal-line segments of the estimated Lorenz curves for the earthquake intensity, the metabolic degree, and the inter-state war intensity as illustrated in Figs. 2a,d,g and 3a,d,g. In addition, for the data exhibiting extreme inequality where one observation has a larger size (one person has income of 99 units) than the others which have smaller and equal size (99 persons have an equal income of one unit), the proposed model would be able to perfectly fit the hypothetical data which can be demonstrated by the two positive-slope linear segments as shown in Figs. 2h and 3h while the SCS model and the S model fall short of this task.

Furthermore, our proposed model and the S model have a closed-form expression for the Gini index which can be conveniently computed by using Eqs. (6) and (9) as shown in “Methods”. The SCS model, however, requires the evaluations of the beta and the gamma functions or numerical integration in order to estimate the Gini index since its explicit mathematical solution for the Gini index does not exist. Table 6 reports the values of the estimated Gini index based on the proposed model, the SCS model, and the S model. Note that, for the SCS model, we calculate the estimated Gini index by using numerical integration.

Table 6 The comparison of values of the estimated Gini index calculated based on the proposed model, the SCS model, and the S model.

The results, as reported in Table 6, suggest that the values of the estimated Gini index calculated based on the proposed model do not significantly differ from those computed based on the SCS model and the S model, except for the case of extreme inequality, where the proposed model perfectly fits the hypothetical data and the estimated Gini index is therefore identical to its actual value of 0.495. The estimated Gini indices calculated based on the SCS model and the S model, however, are equal to 0.527 and 0.464, which differ from the actual Gini index by 0.032 and 0.031, respectively.
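The value 0.495 for the hypothetical data can be reproduced with exact arithmetic, as in the sketch below. Two assumptions are made that the text does not state. First, the actual Gini index is computed from the mean absolute difference with an $N(N-1)$ denominator, a convention that happens to give 0.495 here (an $N^{2}$ denominator gives 0.49). Second, the proposed model is evaluated with $\delta =0$, $\rho =0$, and the value of $P$ that makes the linear segment pass through $(0.99, 0.5)$; these are illustrative values chosen to match the two-segment shape, not the estimates reported in Table 3.

```python
from fractions import Fraction

# Hypothetical dataset: 99 incomes of one unit and one income of 99 units.
incomes = [1] * 99 + [99]
n = len(incomes)
mean = Fraction(sum(incomes), n)

# Sum of absolute differences over all ordered pairs of observations.
abs_diff = sum(abs(a - b) for a in incomes for b in incomes)

# Gini index with an N^2 denominator and with an N(N-1) denominator.
gini_n_squared = Fraction(abs_diff) / (2 * n * n * mean)
gini_n_minus_1 = Fraction(abs_diff) / (2 * n * (n - 1) * mean)

# Proposed model with delta = 0 and rho = 0 (assumed): the linear function
# has slope 2 / (P + 1), chosen so that it passes through (0.99, 0.5).
slope = Fraction(1, 2) / Fraction(99, 100)
P = 2 / slope - 1
gini_model = 1 - 2 * Fraction(1 - 0) / (P + 1)  # Eq. (6) with delta = 0

if __name__ == "__main__":
    print("N^2 convention:   ", float(gini_n_squared))
    print("N(N-1) convention:", float(gini_n_minus_1))
    print("P =", float(P), " model Gini =", float(gini_model))
```

Under these assumptions the closed-form value from Eq. (6) and the $N(N-1)$ sample value coincide exactly at $49/99\approx 0.495$.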

Even though one of the objectives for developing a model for the Lorenz curve is to calculate a Gini index that is close to the actual observation, this study would like to note that, when assessing the performance of different parametric functional forms, the goodness-of-fit statistical measures, the shape of the estimated Lorenz curve, and the estimated Gini index should be considered together since there are an infinite number of Lorenz curves that could result in the same value of the Gini index. Thus, a good model for the Lorenz curve must be able to describe the shape of the distribution of size through changes in the values of parameters, and the fact that it fits the actual data would be the main reason for its choice³⁵.

Conclusions

Given that previous studies have shown that no parametric functional form for the Lorenz curve is always optimal, different attempts therefore are still worth studying³⁹. This study introduces a universal model for the Lorenz curve with an explicit mathematical solution for the Gini index. By using the empirical datasets on sizes of events or things occurring in both nature and society from different disciplines of sciences, some of which contain zeros, as well as the hypothetical dataset created in order to represent the situation where one observation has a larger size than the rest of observations which have smaller and equal size, this study demonstrates that the proposed model fits not only the data whose actual Lorenz plots are convex but also the data whose actual Lorenz plots are both horizontal and convex practically well. It also fits the distribution of size where one observation has a larger size compared to the others which have smaller and equal size as characterized by the estimated Lorenz curve that has two positive-slope linear segments. To our knowledge, no study has proposed a parametric functional form for the Lorenz curve that could fit the data that have a typical convex segment, a horizontal-line segment and a convex segment, or two positive-slope linear segments before.

To evaluate the performance of our universal model for the Lorenz curve, this study compares the performance of the proposed model to those of the SCS model and the S model, both of which have been shown to outperform other well-known parametric functional forms for the Lorenz curve²³,³⁷. The results indicate that the proposed model, by and large, is superior to both the SCS model and the S model on the criteria of R², MSE, MAE, MAS, and IIE, especially when the datasets contain zeros and/or exhibit extreme inequality in that one observation is larger than the rest of the observations, which are smaller and equal in size. The other advantage is that the estimated Gini index based on our proposed model is more convenient to calculate than that based on the SCS model. This is because our proposed model has a closed-form expression for the Gini index, whereas an explicit mathematical solution for the Gini index based on the SCS model does not exist, so its calculation requires either the evaluation of the beta and gamma functions or numerical integration. Moreover, when the dataset contains one observation that is larger than the rest of the observations, which are smaller and equal in size, the estimated Gini index calculated based on our proposed model is identical to the actual Gini index, whereas those calculated based on the SCS model and the S model are slightly off the mark.
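As a rough illustration of how such criteria are computed, the sketch below evaluates R², MSE, and MAE for a fitted curve against observed cumulative shares. Everything in it is an assumption made for this example: the observed shares come from a made-up dataset, the fitted values come from a stand-in curve L(p) = p² rather than from the proposed model, the SCS model, or the S model, and R² is taken as one minus the ratio of the residual to the total sum of squares. MAS and IIE are omitted because their definitions are not restated in this section.

```python
def fit_measures(observed, fitted):
    """Return (R^2, MSE, MAE) for paired observed and fitted values."""
    n = len(observed)
    residuals = [o - f for o, f in zip(observed, fitted)]
    ss_res = sum(r * r for r in residuals)
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r_squared = 1.0 - ss_res / ss_tot   # assumed definition of R^2
    mse = ss_res / n                    # mean squared error
    mae = sum(abs(r) for r in residuals) / n  # mean absolute error
    return r_squared, mse, mae


# Hypothetical observed Lorenz points at p = 0.2, 0.4, 0.6, 0.8, 1.0.
p_values = [0.2, 0.4, 0.6, 0.8, 1.0]
observed_shares = [0.1, 0.2, 0.3, 0.4, 1.0]
# Stand-in fitted curve L(p) = p^2 (not one of the models compared in this study).
fitted_shares = [p ** 2 for p in p_values]

r_squared, mse, mae = fit_measures(observed_shares, fitted_shares)
print(round(r_squared, 4), round(mse, 5), round(mae, 3))
```

A smooth convex stand-in curve misses the kink in these data, which shows up mainly as a large residual at p = 0.8; a functional form that can reproduce two linear segments would drive all three measures toward their ideal values.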

Considering that the Lorenz curve and the Gini index are widely used in many scientific disciplines, we hope that our universal model for the Lorenz curve with a closed-form expression for the Gini index could be useful for analyzing the distributions of sizes and investigating their inequalities or unevennesses.

Data availability

All data analyzed during this study are publicly available and can be accessed from the sources listed in Table 1 and also in References. Note that while the original sources of data on the earthquake intensity, the solar flare intensity, and the metabolic degree are provided in Table 1 and in References, this study obtained all three datasets from Clauset et al.⁴⁰ who also use them in their study. These three datasets are available at https://aaronclauset.github.io/powerlaws/data.htm.

References

1. Newman, M. E. J. Power laws, Pareto distributions and Zipf’s law. Contemp. Phys. 46, 323–351 (2005).
2. Smith, F. A. et al. Body mass of late Quaternary mammals. Ecology 84, 3403. https://doi.org/10.1890/02-9003 (2003).
3. Huss, M. & Holme, P. Currency and commodity metabolites: Their identification and relation to the modularity of metabolic networks. IET Syst. Biol. 1, 280–285 (2007).
4. The American Federation of Labor and Congress of Industrial Organizations. Highest-paid CEOs. https://aflcio.org/executive-paywatch/highest-paid-ceos (2022).
5. Sitthiyot, T. Annual salaries of the athletes from 11 professional sports (V1). Mendeley Data https://doi.org/10.17632/6pf936739y.1 (2021).
6. Sarkees, M. R. & Wayman, F. Resort to War: 1816–2007 (CQ Press, 2010).
7. Lorenz, M. O. Methods of measuring the concentration of wealth. Pub. Am. Stat. Assoc. 9, 209–219 (1905).
8. Eliazar, I. I. & Sokolov, I. M. Measuring statistical evenness: A panoramic overview. Physica A 391, 1323–1353. https://doi.org/10.1016/j.physa.2011.09.007 (2012).
9. Chotikapanich, D. A comparison of alternative functional forms for the Lorenz curve. Econ. Lett. 41, 129–138 (1993).
10. Kakwani, N. C. & Podder, N. On the estimation of Lorenz curves from grouped observations. Int. Econ. Rev. 14, 278–292 (1973).
11. Kakwani, N. C. & Podder, N. Efficient estimation of the Lorenz curve and associated inequality measures from grouped observations. Econometrica 44, 137–148 (1976).
12. Kakwani, N. C. On a class of poverty measures. Econometrica 48, 437–446 (1980).
13. Rasche, R. H., Gaffney, J. M., Koo, A. Y. C. & Obst, N. Functional forms for estimating the Lorenz curve. Econometrica 48, 1061–1062 (1980).
14. Aggarwal, V. On optimum aggregation of income distribution data. Sankhyā B 46, 343–355 (1984).
15. Gupta, M. R. Functional form for estimating the Lorenz curve. Econometrica 52, 1313–1314 (1984).
16. Arnold, B. C. A class of hyperbolic Lorenz curves. Sankhyā B 48, 427–436 (1986).
17. Rao, U. L. G. & Tam, A. Y.-P. An empirical study of selection and estimation of alternative models of the Lorenz curve. J. Appl. Stat. 14, 275–280. https://doi.org/10.1080/02664768700000032 (1987).
18. Basmann, R. L., Hayes, K., Slottje, D. & Johnson, J. A general functional form for approximating the Lorenz curve. J. Econom. 92, 727–744 (1990).
19. Ortega, P., Martín, G., Fernández, A., Ladoux, M. & García, A. A new functional form for estimating Lorenz curves. Rev. Income Wealth 37, 447–452 (1991).
20. Ogwang, T. & Rao, U. L. G. A new functional form for approximating the Lorenz curve. Econ. Lett. 52, 21–29 (1996).
21. Ogwang, T. & Rao, U. L. G. Hybrid models of the Lorenz curve. Econ. Lett. 69, 39–44 (2000).
22. Ryu, H. & Slottje, D. Two flexible functional forms for approximating the Lorenz curve. J. Econom. 72, 251–274 (1996).
23. Sarabia, J. M. A hierarchy of Lorenz curves based on the generalized Tukey’s lambda distribution. Econom. Rev. 16, 305–320 (1997).
24. Sarabia, J. M., Castillo, E. & Slottje, D. An ordered family of Lorenz curves. J. Econom. 91, 43–60 (1999).
25. Sarabia, J. M., Castillo, E. & Slottje, D. An exponential family of Lorenz curves. S. Econ. J. 67, 748–756 (2001).
26. Sarabia, J. M. & Pascual, M. A class of Lorenz curves based on linear exponential loss functions. Commun. Stat. Theory Methods 31, 925–942 (2002).
27. Rohde, N. An alternative functional form for estimating the Lorenz curve. Econ. Lett. 105, 61–63 (2009).
28. Helene, O. Fitting Lorenz curves. Econ. Lett. 108, 153–155 (2010).
29. Sarabia, J. M., Prieto, F. & Sarabia, M. Revisiting a functional form for the Lorenz curve. Econ. Lett. 107, 249–252 (2010).
30. Sarabia, J. M., Prieto, F. & Jordá, V. About the hyperbolic Lorenz curve. Econ. Lett. 136, 42–45 (2015).
31. Wang, Z. & Smyth, R. A hybrid method for creating Lorenz curves. Econ. Lett. 133, 59–63 (2015).
32. Sarabia, J. M., Jordá, V. & Trueba, C. The Lamé class of Lorenz curves. Commun. Stat. Theory Methods 46, 5311–5326 (2017).
33. Paul, S. & Shankar, S. An alternative single parameter functional form for Lorenz curve. Empir. Econ. 59, 1393–1402. https://doi.org/10.1007/s00181-019-01715-3 (2020).
34. Sitthiyot, T. & Holasut, K. A simple method for estimating the Lorenz curve. Humanit. Soc. Sci. Commun. 8, 268. https://doi.org/10.1057/s41599-021-00948-x (2021).
35. Dagum, C. A new model of personal income distribution: Specification and estimation. In Modeling Income Distributions and Lorenz Curves. Economic Studies in Equality, Social Exclusion and Well-being Vol. 5 (ed. Chotikapanich, D.) 3–25 (Springer, 1977). https://doi.org/10.1007/978-0-387-72796-7_1.
36. Elsner, D., Meusemann, K. & Korb, J. Longevity and transposon defense, the case of termite reproductive. Proc. Natl. Acad. Sci. U.S.A. 115, 5504–5509. https://doi.org/10.1073/pnas.1804046115 (2018).
37. Tanak, A. K., Mohtashami Borzadaran, G. R. & Ahmadi, J. New functional forms of Lorenz curves by maximizing Tsallis entropy of income share function under the constraint on generalized Gini index. Physica A 511, 280–288 (2018).
38. Theil, H. Economics and Information Theory (North-Holland, 1967).
39. Fellman, J. Income inequality measures. Theor. Econ. Lett. 8, 557–574 (2018).
40. Clauset, A., Shalizi, C. R. & Newman, M. E. J. Power-law distributions in empirical data. SIAM Rev. 51, 661–703 (2009).

Acknowledgements

The authors are grateful to Dr. Suradit Holasut for guidance and comments.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

Department of Banking and Finance, Faculty of Commerce and Accountancy, Chulalongkorn University, Mahitaladhibesra Bld., 10th Fl., Phayathai Rd., Pathumwan, Bangkok, 10330, Thailand
Thitithep Sitthiyot
Department of Chemical Engineering, Faculty of Engineering, Khon Kaen University, Mittapap Rd., Muang District, Khon Kaen, 40002, Thailand
Kanyarat Holasut

Contributions

T.S. conceived the study. T.S. designed the methodology and performed the analysis. K.H. validated the results. T.S. wrote the main manuscript text. T.S. and K.H. reviewed and edited the main manuscript text. Both authors reviewed the manuscript.

Corresponding author

Correspondence to Thitithep Sitthiyot.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Sitthiyot, T., Holasut, K. A universal model for the Lorenz curve with novel applications for datasets containing zeros and/or exhibiting extreme inequality. Sci Rep 13, 4729 (2023). https://doi.org/10.1038/s41598-023-31827-x

Received: 08 July 2022
Accepted: 17 March 2023
Published: 23 March 2023
DOI: https://doi.org/10.1038/s41598-023-31827-x
