## Introduction

Menopause, a term to denote the cessation of menstruations, signs a major life transition with profound effects on a woman’s health. Apart from being a biological threshold for human reproduction, it seems to have a really tight biological-age interval in which it must occur. Women having their menopause outside that interval have an increased risk for several health problems, namely type 2 diabetes1 and some types of cancer. Later Age at Menopause (AaM) has been associated with an increased risk of breast cancer2 and an early onset of menopause associated with an increased risk of several other diseases such as: cardiovascular3, neurological, psychiatric, osteoporosis, and other sequelae4. On the other hand, longer overall survival and greater life expectancy are associated with later AaM5,6.

Menopause is influenced by biological, environmental, socio-demographic and lifestyle factors and may be delayed for a woman that had an early menarche, longer and regular menstrual cycles, use of contraceptive hormones, had multiple births and breastfed7,8,9,10. In contrast, it may be accelerated by residence in an urban community11, smoking, over weight, alcohol consumption, malnutrition during infancy or adolescence, poor educational level, living more than 3000 m above sea level and, in general, living in underdeveloped countries12,13.

In ancient age some reports talk about the onset of menopause around 40 years old. Estimates for the middle age until the nineteenth century point to an AaM of about 45 years old14. Nowadays a mean AaM lower than 48 years has been found only in a few populations (e.g. Philippines, New Guinea15 and Mexico7). According to two studies16,17 involving women aged 45 and older in the 80’s and 90’s of the 20th century, in western countries, the median AaM has been observed to vary between 50 and 52 years old. The European Prospective Investigation into Cancer & Nutrition (EPIC) study found that mean AaM between 1991 and 2007 in 26 centers from 8 of 10 EPIC countries, was 48.6 years, and only 4.8% of women had a premature menopause1, i.e. before 40 years of age. An early menopause refers to menopause that occurs at or before age 45, below the median age of natural menopause (age 51 years)4.

The Decisions at Menopause Study (DAMES)18,19,20 conducted in Lebanon, Morocco, Spain and USA, with women aged 45–55 born in the period 1940 to early 1950’s, estimated that the median AaM were, respectively, 49.3, 48.4, 51.7 and 52.6. In a cross-sectional survey across Poland in 2000–2004 with women aged between 35 and 65, the overall median AaM was found to be 51.25 years; the 25th percentile was 49 years and the 75th percentile was 54 years21.

Contrary to what happens with the age at menarche, the AaM is not so well documented. There is a lack of research on its temporal trends. Some point to an increasing age worldwide22,23. Others, however, report that there is no conclusive evidence for such an increase15,24,25. We did not find any study suggesting a negative trend in AaM. The study26 of the Chuvasha population in a rural region of the Russian Federation point to a mean value of AaM increasing from 47.0 (in women born during 1920–1925) to 49.7 (women born during 1940–1945) and 49.3 (born during 1945–1950). An increase in the average AaM was also observed for a population of women residents of New Hampshire, Massachusetts, USA, born between 1910 and 1969, from 49.1 years in the 1915–1919 cohort to 50.5 years in the 1935–1939 birth cohort27.

Official statistics show that Portugal is one of the most aged countries in the world28,29, i.e. with one of the highest percentages of older people per new born, thus the concerns with the ageing of the population and well being and clinical consequences associated with menopause highlight the importance of research designed to identify age variation in natural menopause and factors predicting women’s transition to menopause in order to have better public health policies30. The Portuguese panorama is identical to those already described above. A work31 that investigates the reproductive period and fertility in a rural Portuguese municipality (Oleiros) claims that the mean AaM for that municipality was 48.69 years for a woman born in 1880–1940. Despite being focused on a specific-municipality that work has the advantage, for our study, of analyzing a municipality that is part of our own study, during a time-period that precedes the time period analyzed in this work and is likely to have some overlapping women.

A typical regression analysis, in most contexts, focuses on explaining the expected value dependency of a response variable as a function of a set of explanatory variables. Although in some situations, if we want to understand not only the behavior of the average person, but also the behavior of those who belong to the extremes of the population, we might consider percentile regression32,33 that allows us to go beyond mean regression, enabling building regression curves for the percentiles, instead of the mean of a response variable. Since the starting point for this work was the idea of understanding the spatio-temporal evolution of menopause in the central region of Portugal and considering the potential impact of early and/or late menopause on a women’s health, this study will analyse a dataset from the Breast Cancer Screening Program described in the section Data Description below, considering a distributional regression approach. As Rigby et al.34 argue, for data with more than 1000 observations, regression models beyond the mean should be the norm, not the exception.

## Data description

The study of the AaM percentiles in the central region of Portugal was carried out considering a dataset on the Breast Cancer Screening Program provided by the Portuguese Cancer League (LPCC) in the Central region of Portugal in the period 1990–2018 which has screened 452,342 women born in the period 1900–1973. At the age of 45, or 50 after 2010, all women in each of the 89 municipalities (see Fig. 3) are invited to have a free screening mammogram and every 2 years thereafter until the age of 69. This region roughly represents 25% of the Portuguese population. More details about the screening program and the inclusion criteria are given elsewhere35,36,37. Age at menopause, relying solely on the correctness of a woman’s memory, was defined as the self-reported age at the last menstrual period and no information was given on whether menopause was naturally occurring or induced. We stress that as women are screened several times during their life about their AaM, and given that sometimes the answer does not coincide with the last ones, for purposes of analysis we consider the most frequent answer (age) given by that particular woman.

For statistical analysis purposes we decided to withdrawn women born before 1910 because they were only 139, which is considered to be a small sample size for a centile regression. We also decided not to analyze the AaM of women born between 1961 and 1973 because this cohort is still quite incomplete, with a very high percentage of women who have not yet reported having reached menopause. Furthermore, an initial analysis of the database with these women included revealed some inconsistencies in the results. Additionally, due to missing values for the AaM we will be considering only the complete cases for the period 1910–1960, i.e. we exclude all the women without an observed menopause. Obviously, we are aware of the inherent statistical problems of such truncation of this dataset, because this might lead to biased results, as we argue in the “Discussion” section at the end of the article. Namely: (i) women removed from the calculations are mostly the youngest and they might have a different life story (i.e. influencing covariates) than the oldest, what could be reflected in their, possible, different menopause timings; and (ii) women having the most recent menopause observed are young with early or even premature menopause. These two characteristics have the potential of shrinking the estimates for the AaM towards lower values. Four women were also removed because they reported a menopause age above 69, which is considered unrealistic27, and one girl due to have reported a menopause age below 15. A premature ovarian insufficiency affects 10 girls per 100,000 person-years for girls of 15–29 years of age38.

In the end we were left with a sample size of $$N=320,444$$ women who have reached the menopause. Age at menopause ($$\texttt {AaM}$$) registered in years at the first interview, was the response variable of interest. The average was 48.72 with a minimum of 15 and a maximum of 69. Figure 1 shows the histogram for the unconditional distribution of this variable with a modal class at 48–50 years old. Summarized in Table 1 are its descriptive percentiles by period since 1910 to 1960 and Table 4 depicts the mean and the median by municipality ($$\texttt {Muni}$$) and birth year ($$\texttt {Byear}$$). In Table 5 we present the percentage of women with an observed menopause per municipality since 1910 to 1960 in our data set.

## Methods

A typical regression analysis has the advantage of being very well know to the users, although it is very likely that in many scenarios other properties of the response distribution (e.g. the variance) may also depend on covariates. Besides this, one may want to comprehend, not only the behavior of the average person, but the behaviour of those that belong to the population’s extremes. To account for these features, we will consider a statistical framework developed within the context of Generalized Additive Models for Location, Scale and Shape (GAMLSS)39, also known as distributional regression models40. These models have several advantages, but for our work the principal characteristic is that it assumes a known parametric family of distributions for the response variable which allows an easy calculation of the conditional percentiles curves of the AaM given the birth year and the municipality of residence. Additionally, the effects of covariates that we are interested in can have flexible forms (e.g. smoothing functions) not being restricted to the traditional, and several times unrealistic, linear effect. At the same time it ensures that the adjusted percentile curves do not cross. A competitor method is quantile regression41, where the adjusted curves might cross, because it does not assume a distribution for the response variable and, therefore, can be considered in the realm of non-parametric methods34. Other authors have been considering other approaches. Most use simple statistical methods like ANOVA42 or linear regression26. Frommer22 considers a probit model, Reynolds and Obermeyer20 a logit regression and a survival analysis is used by Shanley and Kirkwood43 and Kaczmarek21.

### Statistical models

Based on data for $$i = 1, \ldots , N=320,444$$ women, we will assume conditional independence of the individual ages at menopause, AaM, given the covariates Byear and Muni. An initial exploratory analysis showed that the distribution of the AaM is left (negative) skewed (vide Fig. 1) and a flexible approach based on splines will allow us to capture any (possible) nonlinear effect of the predictor Byear on the AaM. This assumption do not precludes a possible simple linear relationship between those two variables, but it is rather a parsimonious assumption. Taking this into consideration we decided to analyze the data with a model within the aforementioned class—GAMLSS. This method permit an highly flexible approach, because we are no longer constrained to the traditional distributional assumptions, like normality for example, and we are free to utilize skewed distributions without the need of transforming the data, allowing us to work on the scale of the data which is a very important feature. Additionally, as already said, all the parameters of the response probability distribution can be modeled by explanatory variables and not only the location. For instance, it is possible to model the parameters related to the scale and shape (for some distributions it is also possible to model the kurtosis). All the explanatory variables are introduced into the model parameters through predictors, which can be linear functions of the explanatory variables or can take the form of structured additive predictors with non linear or smoothing functions of explanatory variables.

The generalized Akaike Information Criterion (GAIC), a models selection measure, was considered to select the best fitting distribution of the data. It was found that the Box–Cox Cole and Green distribution, $$\text {BCCG}(\mu ,\sigma ,\nu )$$, provides the lower GAIC value (i.e. the best fit) when compared to other alternative distributions. Additionally, it was found that modeling the parameters $$\mu$$ and $$\sigma$$ as functions of both the birth year and the municipality also improves the GAIC value (i.e lowering it) when compared to a model where these parameters are independent of those covariates. Taking that into account the main statistical model for analyzing the data within the GAMLSS framework that we will be dealing with is:

\begin{aligned} \texttt {AaM}&\sim \text {BCCG}(\mu ,\sigma ,\nu ), \nonumber \\ \mu&= \beta _{10} + f_{11}(\texttt {Byear}) + f_{12}(\texttt {Muni}), \nonumber \\ \log {(\sigma )}&= \beta _{20} + f_{21}(\texttt {Byear}) + f_{22}(\texttt {Muni}), \nonumber \\ \nu&= \beta _{30}. \end{aligned}
(1)

We defined an additive model for the location parameter, $$\mu$$, which for this distribution represents its median, a very important parameter within our work, as we are interested in estimate the percentiles, so the linear predictor that we defined, $$\mu = \beta _{10} + f_{11}(\texttt {Byear}) + f_{12}(\texttt {Muni})$$, allows us to directly model the median. For the scale of the response variable, $$\sigma$$, we let the parameter depend on the explanatory variables Byear and Muni. It is a multiplicative model resulting from the log-link, which in turn ensures positive values for the parameter. For this distribution the scale parameter is approximately the coefficient of variation. The shape parameter, $$\nu$$, is modelled only with an intercept term. It was found that adding covariates to its linear predictor did not improve the adjustment.

The functions $$f_{11}$$ and $$f_{21}$$, modelled as cubic splines, represent the temporal effects of the year of birth, and $$f_{12}$$ and $$f_{22}$$ are the spatially correlated effects of the residence municipalities modeled as an intrinsic autoregressive process (IAR), a limiting case of the conditional autoregressive models (CAR) of Besag et al.44, because we are considering that the spatial effect is a Markov random field (MRF), i.e. we assume that the spatial random effects (our spatial variables) have a joint distribution which is specified by considering conditional independence locally. The IAR models are a typical choice when dealing with a dependent variable observed in geographical areas sharing borders, because we are expecting neighboring areas to have more similar observed values for the AaM than areas far apart. The consequence of this approach is that the parameter estimates for neighboring locations are shrinked towards its mean.

### Ethics approval

This project has used data from the Breast Cancer Screening data from Portugal, which have ethical approval from the Faculty of Psychology of the University of Coimbra Ethics Committee and the Portuguese Cancer League.

### Consent to participate

Usage of data derived from the records is according to Portuguese and European laws and regulations. All women signed the informed consent prior to the screening procedure.

## Results

The conditional percentile values were thus obtained within the R software45, namely using the main package gamlss (version 5.1-4) and two more packages that provide a set of functions to fit models with spatial variables (gamlss.spatial and gamlss.add). Below the chosen model is written in terms of the R syntax based on the gamlss package:

where mrf means Markov Random Field and xt1 conveys the information about the spatial neighborhood structure considered for our application. Specifically, we considered the so-called first-order neighborhood structure, i.e municipalities that share a boundary are considered neighbors in the central region of Portugal.

Model diagnostics were based on the normalized quantile residuals as defined by Ref.46, p. 418. A summary of these residuals is listed in Table 2. We can see that, they have a nearly zero mean, a variance nearly one, a coefficient of skewness near zero and a coefficient of kurtosis near 3. These are all typical characteristics when the residuals are approximately normally distributed with zero-mean and variance equal to 1. From here we can conclude that the selected model is adequate.

### Age at menopause—temporal trends

A first distributional regression model without covariates was considered in order to estimate the median AaM for the overall population since 1910, resulting in a point estimate of 49.38 years old. Then we limited the model to one explanatory variable Byear for exploring the temporal trends in the central region of Portugal in AaM (Fig. 2). It should be noted that the percentiles curves displayed are not linear. This behavior could not be captured within a typical linear regression analysis and is facilitated by the non-linear approach permitted by the generalized additive models. Additionally, the estimated percentiles 5%, 10%, 25% and 50% are steadily increasing at a greater rate since 1920, than the estimated percentiles 75%, 90% and 95%, which could mean that there is no much more space for increases in later menopause ages. Globally, most of the estimated percentile curves fit quite well to the observed data. The exception is the 95% percentile curve which deviates from the observed 95% percentiles about 1.5 years.

### Age at menopause—spatio-temporal trends

Each municipality is likely to have the median and the variability of the AaM close to those of its neighbours. Bearing this in mind, spatial effects, $$f_{12}(\texttt {Muni})$$ and $$f_{22}(\texttt {Muni})$$, were added to the model in (1) aiming at capturing the influence of neighbouring locations not available by the observed covariates. Figures 3 and 4 present the spatio-temporal trends in the percentiles for the AaM, which are already accounting for the spatial dependency of the response variable (AaM). It is clear that the patterns through the decades have remained unchanged, i.e. regions with the larger values in 1910 maintain its position in 1960, despite all have being increasing, with only a few exceptions.

Figure 5 depicts the effects of each of the covariates (Byear and Muni) on the AaM location parameter, i.e. its median.

The interpretation is relatively straightforward. An increasing effect of the Byear was found. Additionally, the effect is positive only for those woman born in the period 1942–1960, meaning that a woman born in this period tends to have an AaM above the median for the overall population and accounting all the years, i.e. 49.38. For the spatial effects, a municipality with a negative effect means that women residing there, and controlling for the year of birth, have an AaM below the median when comparing to the overall population. On the other side, areas with a positive spatial effect will tend to have AaM above the median of the overall population. Looking to Figs. 3 and 4 and comparing it to Fig. 5 it is clear that larger values of the median AaM are associated with a positive spatial effect and vice-versa.

Another interesting feature of the temporal effect for the year of birth is obtained considering the getPEF function (Partial Effect function) which allows us to calculate the slope of the curve at each year (vide Table 3). For example for the year 1920, in gamlss syntax this can be done by

and the result is the slope of the curve’s tangent for the year 1920, which is not very different from the remaining years. The result is approximately 0.044, meaning that for that year the median AaM was increasing at a rate of $$0.044 \times 365\approx 16.06$$ days, or 5.35 months per decade.

The log-scale parameter, $$\log (\sigma )$$, shows a decreasing trend as a function of the Byear (Fig. 6), thus larger values of this covariate imply a reduction of the variability in the AaM estimates.

## Discussion

The dataset on menopause here analyzed is based on the largest sample size reported until the date in Portugal, although, and for obvious reasons, it lacks women born in the last decades, and because of that we can not ascertain the percentiles for those youngest women. Yet, and to the best of our knowledge, this work is the first in analyzing and producing percentiles curves for the women residing in the center of Portugal since 1910 and in using a distributional regression approach. Indeed, the time of follow-up of each woman, which can goes up to 24 years, and the time-period, nearly 30 years (1990–2018) are also noticeable. As pointed in Ref.23 to study a secular trend in AaM one must have a large population followed over a period of 30 or more years.

Menopause is an imperfect measure of what is, essentially, a long transition noted in retrospective, of something familiar, until the cessation. As mentioned by Sievert47, perhaps it is best to only trust the responses of women aged 60 years or younger, because assessment of AaM is based on self-report, which is prone to recall bias, particularly in older women. However, other studies have shown that the validity and reproducibility of self-reported AaM are fairly good48. Our study fulfills this point of view, because all women start to be asked whether or not they already reached the menopause by the age of 45. Sievert47 also argues that median AaM, as computed by probit analysis, is the only measure of central tendency that does not rely on memory. In this regard we consider that our distributional regression approach has several advantages and improvements over that more traditional and somehow outdated technique.

We resorted in a distributional regression model to estimate several spatio-temporal percentiles for the AaM in the central region of Portugal considering a long period of time and using a representative sample with a huge number of observations. Results show that, within the period 1910–1960, most percentiles obtained for the AaM have an increasing trend. For a girl born in 1920 we would expect a median AaM around 47.53 years, and for a girl born in 1960 that value is greater by about 2 years (Fig. 2), excluding any effects of other covariates. In the middle of the 20th century the increasing rate achieved its maximum of around 20 days per year (Table 3). Interesting, is the pattern of the AaM distribution by municipality. Despite all its percentiles are increasing, the regions with the highest percentile values (respectively, lowest) for AaM in 1920 are the same for 1960 (Figs. 34), with only a few exceptions. Early and premature menopause (below percentile 5%) occur in the interior north (north-eastern) and in a vertical (north to south) band close to the coast, but not containing the municipalities with a sea border (western) of Portugal’s central region for all the decades considered. Later menopause (above percentile 95%) occur predominantly in the central-north and central-south areas.

If we compare our estimate for the municipality of Oleiros with that obtained by Ref.31, it becomes clear that AaM is increasing over time. From their estimate of 48.69 for a women born between 1880 to 1920, our estimate for the 1950’s shows an increase from about 1 year to about 49.6. (vide Fig. 4 where this area is highlighted.)

The work of Duarte et al.36 have already analyzed this data set, but only concerning women who had a mammogram until 2007. In this work we have 11 years of subsequent observations, so a good comparison between both periods is possible. Those authors, in a scenario of complete case analysis, stated that women born after the first world war, are having their menopause at lower ages. With this additional years of data that we have, it is definitely clear that we are facing an increasing trend. In this point our findings agree with Dratva et al.49, which claim that AaM is shifting towards higher ages.

We must stress that some of the spatial effects may not being correctly captured, because the spatial information that the dataset enclosures is about the place of living at the time of the participation in the screening program, i.e. when women are adults. Although, we assume that these women have been living there always. This assumption is more likely to be true for the interior areas (east), which have at the same time, lower levels of wealth, than for the areas closest to the coast (highest levels of wealth).

The probability distribution that we chose for describing the AaM, $$\text {BCCG}(\mu ,\sigma ,\nu )$$, has several advantages for have being considered in this work. First its location parameter, $$\mu$$, is interpreted as the median of the distribution, which in a work looking for describing the percentiles is a benefit. Additionally, we are free to control the shape parameter, $$\nu$$, representing the amount of skewness of the distribution and from Fig. 1 we know that our distribution is left skewed. Estimation of the percentiles curves for a response variable is widely used in medicine for checking whether an individual have an abnormally low or high value of the response variable (given the covariates of interest), and hence whether she/he is potentially at risk. For example, a specific country’s region may be the target of differentiated public policies if it is known that girls living there tend to have an early or late menopause. For example, the North American Menopause Society, the British Menopause Society, and the International Menopause Society recommend oestrogen replacement therapy for women with premature or early menopause until at least around the median age of natural menopause (approximately age 51 years)50,51,52.

This work does not consider the possible impact of an adnexectomy in the AaM simply because in the screening program women are not asked about that. Although, and considering that the possible effects of an adnexectomy include an early menopause, the potential impact of those women in the estimated AaM in this work would possibly be in lowering the estimates obtained for the AaM.

Education is known to be, to a certain extent, a surrogate for wealth and lifestyle behaviors, and consequently, has we already said in the introduction, older AaM have been reported to be potentially associated with a higher level of education. Unfortunately, this work did not targeted this relation. During the analyzed period 1910–1960 women in Portugal had mainly domestic and agriculture related works and the population were living mostly in rural areas. There were only 3 years of mandatory schooling but many families, mainly in the countryside, just sent their children to school long enough to be able to read. In 1930 68% of the Portuguese were illiterate. In 1950 only 6.3% of the kids aged 10–11 completed the mandatory 3 years of primary school53,54. In any case, the difference between being able to read or not was sufficient to have a job outside the subsistence farming and everything that could come of it. The number of newborns per woman was 3.2 children on average.

In 1960 economic conditions in Portugal were obviously better than in 1910 but not yet similar to the rest of Europe. After world war II the individual income start to grow but slowly, and after 1960, a period not covered in this work, the country experienced a rapid increase of the individual income until the early 70’s55. In the coast of Portugal (municipalities in the left of the maps) where the largest cities were concentrated and consequently regions with higher economic conditions, we can see from Figs. 3 and 4 that the AaM is higher than in the interior (municipalities in the right of the maps) where the population mostly lived from the agriculture. Across the country, serious food insecurity problems were faced. In particular, among those who lived in areas whose economic conditions were more unfavorable, as is the case of the central area of Portugal presented here and mainly in its eastern municipalities. This also led to migration to coastal municipalities and emigration, mainly to Brazil, France and Germany (Tables 4, 5).