Introduction

A sufficient supply of clean water is necessary for life and is also important for the overall wellness of ecosystems, communities, and the economy1,2,3,4. A healthy society is built on a reliable supply of clean water5. Access to clean water is a fundamental human right, and everyone should be able to meet their indoor and outdoor water demands6. The provision of clean water is widely regarded as being at the heart of all sustainable development goals7,8. However, progress in providing clean water is lagging behind the sustainable development goals, particularly in developing nations9. The growing water challenges have various causes: reliance on remote water supply systems rather than a search for flexible and alternative sources10; the conception of water as an unlimited natural resource and the “only one option” supply system11; unwise water conservation practices12; neglect of internal alternative water source harvesting, which follows from the linearity of urban water flows13; and a lack of proactive, water-centric planning and management practices14,15. According to projections, by 2050, domestic water consumption will rise by 130%, and manufacturing will grow by 400%, causing a 55% increase in water demand16.

The provision of clean water in developing nations is a major concern for several reasons. First, a lack of access to clean water is one of the main factors contributing to illness and death in developing countries17. Waterborne infections including cholera, typhoid, and diarrhea are major causes of death and illness in developing countries18. Second, women and children in many developing countries must walk for hours each day to fetch water, giving up opportunities for employment and income19. Third, rapid and unplanned urban growth in developing countries has increased the demand for water, imposing additional stress on water infrastructure20. Finally, sectoral competition for clean water in developing nations is steadily expanding, necessitating dedicated water-conscious strategies21.

One of the water consumption sectors that demands emphasis is residential water consumption. Residential water consumption, often referred to as domestic water consumption, is the total amount of clean water used for indoor and outdoor activities such as drinking, cooking, bathing, washing dishes, flushing toilets, watering plants, and utilizing swimming pools22. There is mounting evidence that actions emphasizing wise urban residential water consumption are necessary to minimize the worsening drinking water supply problem23,24,25. As a result, there is currently a growing understanding that water sensitive intervention frameworks can be used to address the expanding urban water problems13,26.

According to ref. 27, “water sensitive” is “a vision for urban water management that calls for shifting urban water systems away from a focus on water supply and wastewater disposal and towards more flexible systems that integrate various water sources, operate through both centralized and decentralized systems, deliver a wider range of services to communities, and integrate into urban design.” Ref. 27 named three fundamental principles that guide a “water sensitive city”. The first principle is known as “city as water supply catchments”. It recognizes the significance of diverse water sources in urban areas and advocates for the adoption of appropriate water usage practices to avoid relying solely on one source of water. The second principle is known as “cities providing ecosystem services”. It focuses on delivering multiple services to the ecosystem. The third principle is “water sensitive communities”, which strives to create a water-mindful society.

Water sensitive intervention can be regarded as a strategic tool to achieve the vision of a “water sensitive city.” Ref. 13 defined “water sensitive intervention” as a means of enhancing water use efficiency and diversifying water sources through water sensitive city principles. Water sensitive intervention is explained by two ideas: supply internalization and supply diversification. Supply internalization, which lessens dependency on water obtained from a remote source, is the capacity to supply water within one’s urban boundary system. Supply diversification, on the other hand, focuses on offering alternative water sources, such as surface water, groundwater, roof water harvesting, and recycling26. In the simplest terms, water sensitive intervention can be linked with securing a reliable, accessible, good-quality water supply, fixing major leaks, controlling and enforcing restrictions on unauthorized connections, and looking for alternative water sources28,29.

To create efficient water conservation strategies and influence future water sensitive planning, water authorities and policymakers should pay special attention to where, how much, and how households consume water23,30. In line with this premise, tracking and tracing residential water consumption is essential to address the underlying issues, since it generates the baseline data needed for strategic decision support13. Identifying the variables that influence residential water consumption also makes it feasible to gain a deeper understanding of water consumption24,31. Studies of residential water consumption patterns and their influencing variables have therefore been conducted. Notwithstanding these advancements, much of the current state of the art has shortcomings. For instance, refs. 23,32 claim that earlier studies had only taken into account one or a few categories of variables without integrating them all into a broader context. Empirical results also indicate an imbalance in water consumption research between developed and developing nations. Moreover, in developing nations, there is a lack of knowledge about the factors that influence residential water consumption25,33. Scholars have suggested that measuring residential water consumption and identifying the factors that affect it are crucial inputs for building water-wise urban centers23,34,35.

There are numerous physical, socio-demographic, and climatic variables that influence residential water consumption23,36,37. The factors that influence residential water consumption are complex and multifaceted, with variations based on a person’s socioeconomic background, local climate, lifestyle preferences, and psychological characteristics38,39,40. Interest in the specific cultural, technological, social, and climatic factors that explain water consumption has grown in recent decades23,32,41.

Ref. 42 examined the connections between urbanization and residential water consumption, using the metropolitan area of Barcelona as a case study. According to the researchers’ findings, income, housing type, household size, type of outdoor services, and the plant types in the garden are statistically significant in influencing residential water consumption. Ref. 43 also examined per capita water consumption in Sirte, Libya, using stepwise multiple linear regression. They found that, interestingly, per capita water consumption decreased as family size increased and that household income had no impact on water consumption.

Ref. 44 examined Johannesburg’s residential water consumption trends and the adoption of water efficient technology using probit regression models. According to this research, older, male and low-income households are more likely to consume water efficiently. Ref. 45 used descriptive statistics and inferential statistics to analyze the variables affecting residential water consumption. According to their findings, socioeconomic and demographic variables have the greatest influence on residential water consumption. They found that age, gender, and households headed by women all significantly influence residential water consumption. Ref. 46 employed artificial neural networks and principal component analysis to identify factors that influence residential water consumption in a semi-arid environment. Their result indicated that residential water consumption is influenced by family size, monthly income, housing unit type, housing size and number of rooms.

Ref. 47 looked at the water consumption trend in public pipes in Ghana. They discovered that people consume less water during the rainy season, suggesting that rainwater may be used as a substitute water source. Furthermore, their research shows that the presence of streams and hand dug wells causes even larger reductions in consumption. Ref. 32 reviewed factors that influence household water consumption and concluded that the factors can be categorized as economic, socio-demographic, physical, technological, climatic, and spatial. They also acknowledged that the determinants have not produced consistent results across studies. For instance, ref. 25 discovered that households with large families need more water even though per capita water consumption decreases as family size increases. Developed nations, however, despite having small households, consume more water48. Findings on income are also ambiguous. For example, refs. 48,49 found that income is statistically significant in water consumption. However, ref. 25 discovered that there are no appreciable differences in water consumption between high and low income groups.

According to ref. 34, educational background does not consistently affect residential water consumption. Refs. 50,51 found that the number of rooms and the housing condition have a significant impact on indoor and outdoor water consumption. Similarly, ref. 52 asserted that new, good-quality houses consume more water. It is also reported that installing high-efficiency indoor fixtures brings relative water savings between 9% and 12%53, whereas complete appliance replacement with high water efficiency equipment might cut indoor water use by 35–50%53. Ref. 32 reported that climatic determinants also have a significant impact on water consumption.

In addition to identifying influential factors that determine water consumption, a reliable, evidence-based model for predicting future water consumption is a pivotal input to a water-conscious management plan54. Over the past couple of decades, water consumption models have proliferated55,56. The model types range from straightforward linear regressions and traditional mathematical equations to more complex, data-driven machine learning algorithms57. Water consumption models can be roughly divided into two categories: deterministic and probabilistic models58. Deterministic models, also known as conventional models, commonly depend on extrapolating previous per capita water consumption trends59. Conventional modeling depends on small data. One popular illustration is linear regression, which evaluates the impact of each independent variable on the dependent variable60.

The probabilistic model predicts future outcomes by accounting for the influence of random occurrences. One of the most significant advantages of the probabilistic modeling approach is its capacity to precisely quantify the uncertainty associated with predictions61. One of the best applications of probabilistic modeling is machine learning58. In the simplest terms, machine learning is used to produce predictions, and the effectiveness of the technique may be gauged by how well it generalizes to previously unseen data60. A growing number of water consumption models are in use, including linear regression, deep neural network (DNN), machine learning (ML), and hybrid models. To date, water consumption models have focused on pinpointing key water consumption drivers and making water consumption predictions based on past water consumption trends62,63,64,65.

Nowadays, machine learning and deep learning are two of the newest models being used to model nonlinear data. This is owing to the expansion of data driven research, which is revolutionizing the modeling and management of complex and large data driven systems66,67. Machine learning-based water consumption modeling has drawn increased attention in recent times58. Generally speaking, the current level of water consumption modeling is changing from linear to data driven or AI-based modeling, which mostly uses machine learning modeling and other hybrid models68,69,70.

Machine learning is a component of artificial intelligence that enables computers to predict the future by learning from their prior experiences71. “Machine learning” is a branch of artificial intelligence that tries to develop systems that can learn from prior knowledge, identify trends, and draw logical conclusions with little to no human input72. Ref. 73 discussed at least six major steps for building a machine learning model: defining the problem, listing the data inputs and anticipated outcomes, converting the data into a format compatible with the machine learning model, dividing the data into training and test sets, selecting the machine learning algorithm that best fits the problem, and evaluating the model’s performance.

Machine learning algorithms fall into two major categories: unsupervised and supervised. Unsupervised learning is typically used to identify links between datasets, while supervised machine learning is used to categorize data or generate predictions74. Supervised machine learning covers both classification and regression, and random forest algorithms can be used for either75. Random Forest is a well-known supervised machine learning algorithm that is constructed from decision trees76. It can be applied in the form of either classification or regression algorithms77. One of the important characteristics of the Random Forest algorithm is its capability to handle data sets with continuous variables, as in regression, and categorical variables, as in classification78,79.

In this study, the target variable, household water consumption per day measured in liters, is a continuous variable. Hence, for this specific case, applying random forest regression is the best option. In particular, regression in the random forest is a method that uses an algorithm to understand the relationship between dependent and independent variables80. A regression model helps to predict numerical values based on multiple data sources74. Random forest is also one of the most popular algorithms for regression problems, particularly for predicting continuous variables, due to its ease of use and superior accuracy75,81. R is one of the statistical languages that support random forest applications82.
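
The idea of random forest regression can be illustrated with a deliberately simplified sketch. It is not the model used in this study, which was built in R: the code below is standard-library Python, each "tree" is only a depth-one split on a single randomly chosen feature, and the household data (household size, number of rooms, liters per day) are hypothetical values invented for the illustration. Real random forests grow deeper trees and sample candidate features at every split.

```python
import random
from statistics import mean


def fit_stump(rows, targets, feature):
    """Fit a depth-one regression tree on one feature by minimizing squared error."""
    best = None
    values = sorted({row[feature] for row in rows})
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [t for row, t in zip(rows, targets) if row[feature] <= threshold]
        right = [t for row, t in zip(rows, targets) if row[feature] > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = sum((t - left_mean) ** 2 for t in left)
        error += sum((t - right_mean) ** 2 for t in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # The bootstrap sample had a single feature value: predict the sample mean.
        overall = mean(targets)
        return feature, float("inf"), overall, overall
    return feature, best[1], best[2], best[3]


def fit_forest(rows, targets, n_trees=50, seed=0):
    """Train each stump on a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]  # draw with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(
            fit_stump([rows[i] for i in sample], [targets[i] for i in sample], feature)
        )
    return forest


def predict(forest, row):
    """Average the predictions of all stumps, as a regression forest does."""
    return mean(
        left if row[feature] <= threshold else right
        for feature, threshold, left, right in forest
    )


# Hypothetical training data: (household size, number of rooms) -> liters per day.
rows = [(2, 1), (3, 2), (4, 2), (5, 3), (6, 3), (7, 4), (8, 4), (9, 5)]
targets = [210, 300, 380, 470, 560, 640, 720, 830]

forest = fit_forest(rows, targets)
small_household = predict(forest, (2, 1))
large_household = predict(forest, (9, 5))
print(small_household, large_household)
```

Averaging many weak, differently sampled trees is what gives the method its stability; with the toy data above, the predicted consumption for the larger household exceeds that for the smaller one.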

In summary, the research is motivated by the fact that water consumption is influenced by both location and context. Hence, there is a need for localized research to comprehend the determinants of water consumption more effectively. In Ethiopia, the existing studies have predominantly concentrated on specific communities with limited sample sizes. Moreover, a significant number of these studies have not methodically tracked, traced, or mapped water sources in a manner that generates the essential information needed for well-informed decisions regarding water-sensitive interventions. According to empirical findings, less research has been done on residential water consumption in developing countries than in industrialized nations. In developing countries, there is currently limited knowledge about how to apply machine learning for water management83,84,85,86.

Given the identified gaps, this study focuses on Adama city to address key questions: Where are the water sources, and to what extent do they support urbanization and growing water supply concerns? What factors influence residential water consumption in Adama city? How do urban neighborhoods differ in terms of water consumption patterns and reliability? Lastly, what practical water-sensitive intervention strategies can be recommended for Adama city? The study makes three significant contributions. First, it provides essential baseline data on water consumption for the city. Second, it adds to the growing body of knowledge through a localized examination of variables affecting household water usage in developing countries. Lastly, it demonstrates the greater reliability of machine learning results over traditional linear regression models for decision support in predicting residential water consumption.

Results

Tracking and baselining city-level water consumption pattern

To understand the city’s water consumption pattern, the study identified the sources of water supply, examined the amount of water distributed, and assessed the discrepancy between water supply and demand. Though historically Adama city’s primary supply of drinking water has come from groundwater, the findings in this regard indicate that groundwater exploration can no longer keep up with the city’s expanding water demand. Groundwater is becoming a less viable source of water due to the declining capacity of the deep wells and the poor water quality caused by rising fluoride content. According to the tracking and tracing analysis, only six of the nine boreholes are operational. The total daily production from the six working boreholes, even with the irregular flow, is about 3024 m3 per day. This implies that serious doubts exist about the well water supply’s sustainability over the long run. Adama’s infrastructure for providing water is far from meeting the city’s expansion areas. The spatial analysis result confirms that Adama’s urban area has grown 22 times in 30 years, from 13,211 hectares in 1991 to 313,211 hectares in 2023. The primary challenges in the water distribution system include aged pipelines, sediment deposits, water leakages, and unauthorized pipeline connections. Additionally, the findings indicate that the water distribution network currently extends to only 45% of the city’s master plan (Fig. 1).

Fig. 1: Water source locations and supply distribution network map of Adama city.

The figure presents the location of water sources and distribution networks. Functional boreholes are denoted by green circles and nonfunctional boreholes by blue circles. A purple pentagon shows the booster pump, and a dark red rectangle represents the service reservoir. Water supply transmission is illustrated by blue lines, and brown lines represent the distribution network. The dataset was obtained from Adama City Water Supply and Sewerage Service Enterprise (2023) and the figure was produced by the researchers using a GIS application.

Currently, the dominant and probably the only water supply source for Adama city is the Awash River. The Awash River is located 11 km from the city and has an output capacity of 43,000 m3 per day. Six service reservoirs are also part of the Adama water supply system, with capacities of 25, 500, 1000, 1500, 400, and 6000 m3 (Fig. 2). The effective water production from the Awash River is estimated to be 40,663 m3 per day. Meanwhile, the water demand projection puts the average daily domestic and non-domestic water demand for 2023 at 65,226.6 m3 per day. This implies that the daily gap between water demand and supply is about 24,563 m3 per day; that is, 38% of the water demand is unmet.

Fig. 2: Sankey water supply flow diagram from source to service reservoirs and to customers.

The figure displays the water supply flow from the Awash River source to six primary service reservoirs, each with distinct capacities (25, 500, 1000, 1500, 400, and 6000 m3), and then to customers. The dataset was obtained from Adama City Water Supply and Sewerage Service Enterprise (2023) and the figure was produced by the researchers.

Many urban neighborhoods still rely on centralized point water supply and shift-based water supply to cope with unreliable water supply. Furthermore, the present analysis found that the water supply and demand balance index (SDBI) is 0.6.
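
The supply and demand figures reported for 2023 can be cross-checked with a few lines of arithmetic. The sketch below uses only the values stated in the text; treating the SDBI as the simple ratio of effective supply to projected demand is an assumption made here for illustration, since the index formula is not given in this section.

```python
# Figures reported in the text (m3 per day).
effective_supply = 40_663.0   # effective production from the Awash River
projected_demand = 65_226.6   # projected 2023 domestic and non-domestic demand

# Daily shortfall and the share of demand that goes unmet.
gap = projected_demand - effective_supply
unmet_share = gap / projected_demand

# Assumption: the balance index is taken as supply divided by demand.
supply_demand_ratio = effective_supply / projected_demand

print(f"gap: {gap:,.1f} m3 per day")
print(f"unmet share of demand: {unmet_share:.0%}")
print(f"supply/demand ratio: {supply_demand_ratio:.2f}")
```

Under that assumption the ratio is about 0.62, which is consistent with the reported SDBI of 0.6 and with roughly 38% of demand being unmet.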

The result confirms that the current supply method ignores internal alternative water harvesting and follows a linear water flow. This study also found that the potential amount of roof water that can be collected annually by residences with roof sizes between 9 m2 and 270 m2 ranges from 6 to 181.9 m3. However, the analysis shows that traditional roof water collection in the current residential settlement only amounts to 1–3 m3. Adama city’s population, water production, and consumption have increased year by year. In the previous three decades, Adama city’s population grew almost 15-fold, its water production 3-fold, and its water consumption 2-fold. The maximum percentage of nonrevenue water (NRW) in Adama city is 20% of total water production (Table 1).
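
The reported roof sizes and harvest potentials imply an effective yield per square meter of roof, which can then be used to scale the estimate to other roof sizes. The sketch below derives that yield from the two reported endpoints only; the rainfall depth and runoff coefficient behind the study's estimate are not stated in this section, so the function name, the averaging of the two endpoints, and the 100 m2 example roof are illustrative assumptions.

```python
# Endpoints reported in the text.
roof_area_m2 = (9.0, 270.0)        # smallest and largest roof sizes
annual_harvest_m3 = (6.0, 181.9)   # reported annual harvest potential for those roofs

# Implied effective yield: m3 of water per m2 of roof per year
# (equivalent to rainfall depth in meters times a runoff coefficient).
implied_yield = [volume / area for volume, area in zip(annual_harvest_m3, roof_area_m2)]


def harvest_potential_m3(area_m2, yield_m_per_year):
    """Annual harvestable volume for a roof, given an effective yield per m2."""
    return area_m2 * yield_m_per_year


# Hypothetical 100 m2 roof, using the average of the two implied yields.
average_yield = sum(implied_yield) / len(implied_yield)
example_volume = harvest_potential_m3(100.0, average_yield)

print([round(y, 3) for y in implied_yield])
print(round(example_volume, 1))
```

Both endpoints imply a yield of roughly 0.67 m3 per m2 per year, far above the 1–3 m3 that households currently collect by traditional means.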

Table 1 Population growth, water production, consumption, and non-revenue water (NRW) (2013-2023)

The result reveals that residential settlements consume 73.13% of the available water supply. The commerce and government sectors account for 15.33% and 8.53% of total water consumption, respectively (Table 2).

Table 2 Proportion of water consumption by sector in Adama city (m3 per year)

Adama city’s monthly water consumption pattern is distinct from that of other cities. Contrary to most cities, the rainy season (June, July, August, and part of September) sees a rise in water consumption. According to the technical head of the Adama city Water Supply and Sewerage Service Enterprise, there are two root causes for such temporal variations. First, during the rainy season the river’s flow increases, allowing adequate water production and an even supply distribution that avoids water rationing. Second, the majority of municipal communities have adequate water during the rainy season because water supply pressure becomes normal. During the dry season, by contrast, water output decreases and the supply is rationed, resulting in less consistent water availability for each settlement (Fig. 3).

Fig. 3: Monthly water consumption variation (m3 per month) in Adama city (2022/23).

The figure presents the monthly temporal water consumption patterns. The findings reveal that residential water consumption, indicated by blue dots, exhibits notable fluctuations over time. Specifically, there is an increase in residential water consumption during the rainy season, attributed to the upsurge of water supply from the source during the rainy season. The dataset was obtained from Adama City Water Supply and Sewerage Service Enterprise (2023) and the figure is produced by the researchers.

Response rate and background of the respondents

A door-to-door household questionnaire survey was used to acquire the residential water consumption characteristics of the households. A guide map was created to locate the sampled households. If residents were not at home, an appointment was made for a convenient time. This strategy resulted in a response rate of 100%. According to the analysis of the household questionnaire survey, just 28% of the 400 sample households are headed by women, while 72% are headed by men. Of the respondents, 79% reported that they are married, 14% are widowed, and 7% are divorced. In terms of educational background, 16% of respondents have a degree or higher, 4% have a diploma or certificate, 28% have finished high school, 34% have finished elementary school, 11% can read and write, and 7% are illiterate. Furthermore, 35% of household heads work for themselves, 15% work for the government, 26% work for businesses, 16% are unemployed, and the remaining 8% work in other sectors.

Household water consumption characteristics

According to the current study, the average household’s daily water consumption is estimated to be 586 liters per household. On the other hand, it is estimated that each person consumes 69.2 liters of water each day on average (Fig. 4).

Fig. 4: Water consumption per household and per capita per day across the sample households.

In the figure, the water consumption metrics per household and per capita per day in liters are presented for the sampled households. The blue data points represent the water consumption per household at 586 liters per day, as well as the water consumption per person per day at 69.2 liters. Additionally, the median and standard deviation for both household and individual water consumption are depicted in black and dark red, respectively. The figure is produced by the researchers from household questionnaire survey (2023).

The findings show significant variations in average daily water consumption among households throughout the urban settlement. Particularly, intermediate settlements exhibit higher water consumption per household compared to both central and peripheral settlements. The city center records the highest average daily water consumption per person (Table 3).

Table 3 Water consumption per household and per capita per day across central, intermediate, and peripheral neighborhood settlements

Furthermore, the study reveals a correlation between housing quality and water consumption, with high-quality houses using three times more water than lower-quality ones (Fig. 5).

Fig. 5: Relationship between average household water consumption and housing conditions.

In the figure, the relation between average household water consumption in liters and housing conditions is demonstrated. The colors denote different categories: blue signifies water consumption for houses categorized as “bad,” black represents “fair” conditions, green corresponds to “good,” and dark red indicates houses in “very good” condition. The figure was produced by the researchers from the household questionnaire survey (2023).

Additionally, formal residential parcels show 2.4 times higher water consumption than informal ones (Fig. 6). This discrepancy is attributed to the remote locations and poor infrastructure of informal parcels, coupled with less reliable water supply in these areas compared to formal residential settlements.

Fig. 6: Relationship between average household water consumption and parcel legal status.

In the figure, the relation between average household water consumption in liters and parcel legal status is illustrated. The black color represents the average daily water consumption of informal houses, while the dark red color shows the average daily water consumption of formal houses. The figure was produced by the researchers from the household questionnaire survey (2023).

The households also reported that, due to the unreliable water supply, local communities are forced to purchase a decentralized water supply from tanker trucks. The result also reveals that 40% of the residential area of the city receives water at most once per week, 24% receives it 1–2 times per week, 18% receives it 2–3 times per week, 13% receives it 3–4 times per week, and 5% receives it more than five times per week. Additionally, 70% of houses get water at least four times per week, compared to 30% that get it no more than three times per week (Table 4).

Table 4 Water supply reliability per week across the sampled households

Water consumption conservation practices across the sampled households

Concerning the installation of water-saving devices and the retrofitting of old fixtures, residents sometimes address water leaks in faucets and toilets. Nevertheless, installing water-saving equipment with flow controls and sensors is still never practiced by families (Table 5).

Table 5 Installing water saving devices and retrofitting old fixtures practice

Meanwhile, households sometimes turn off the water while soaping in the bath, washing their faces, and brushing their teeth. In the kitchen, they sometimes shut off the tap while washing the dishes. Nonetheless, household members only occasionally limit their showers to at most five minutes and hardly ever use recycled water as an alternative source of water (Table 6).

Table 6 Household water saving behavior in indoor engagements

Household members frequently water their gardens in the morning or afternoon, when evaporation is lower. Meanwhile, family members hardly ever use alternative water sources for outdoor purposes (Table 7).

Table 7 Outdoor water conservation practice

Predicting residential water consumption from socioeconomic and spatial information

The K-fold cross-validation method is used in machine learning modeling to evaluate predictions. The dataset is divided into K folds, or subgroups, and the model is trained and evaluated K times, with a different fold held out as the validation set each time87,88. This study employed 10-fold cross-validation repeated 5 times. According to the model training summary, the R2 value is 77.4%. This means that the independent variables explain 77.4% of the variance in the target variable, the water consumption per household per day in liters (Table 8).
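
The resampling scheme described above can be sketched as follows. This is an illustrative standard-library Python version, not the study's R code: the function names are hypothetical, and feeding all 400 surveyed households into the resampling is an assumption made only to show the fold sizes. The `r_squared` helper shows how the reported R2 would be computed from observed and predicted consumption on a held-out fold.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=5, seed=0):
    """Yield (train, validation) index lists for repeated K-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)                       # new random partition per repeat
        folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
        for held_out in range(k):
            validation = folds[held_out]
            train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
            yield train, validation


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Assumption for illustration: all 400 surveyed households are resampled.
splits = list(repeated_kfold(400, k=10, repeats=5))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each of the resulting splits trains on nine folds and validates on the tenth; averaging the validation R2 across all splits gives a more stable performance estimate than a single train and test division.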

Table 8 Model train result

The random forest regression (RFR) approach was used to construct the feature importance plot which shows the factors that are most crucial for predicting household water consumption (water consumption per household per day in liters). Table 9 presents the variables along with their importance scores.

Table 9 Variable importance

Table 10 shows that the model’s performance (R2 score) is 77%. In addition, based on the prediction raster map output, this study found that the minimum water consumption prediction ranges from 229 to 455 liters per household per day. The maximum water consumption prediction is 682 to 909 liters per household per day (Fig. 7).

Table 10 Model testing result

Fig. 7: Residential water consumption prediction raster output map of Adama city.

Figure 7 displays the output of a water consumption prediction raster map, generated through the integration of various factor maps using R Software. The color-coded representation includes cyan for households consuming more than 1135 liters per day, green for water consumption ranging from 909 to 1135 liters per day, orange for consumption between 682 and 909 liters, purple for consumption between 455 and 682 liters, and a peripheral area in pink, representing households with a water consumption ranging from 229 to 455 liters per day. The figure is produced by the researchers using Random Forest Regression in R Software (2023).
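
The legend classes of the prediction map can be expressed as a simple lookup. The sketch below uses the class breaks and colors listed in the caption; assigning a value that falls exactly on a break to the lower class, and returning `None` below 229 liters, are assumptions, since the caption does not say how boundaries are handled. The function name is hypothetical.

```python
from bisect import bisect_left

# Class breaks (liters per household per day) and colors from the Fig. 7 caption.
BREAKS = [229, 455, 682, 909, 1135]
COLORS = ["pink", "purple", "orange", "green", "cyan"]


def consumption_class(liters_per_day):
    """Return the legend color for a predicted household consumption value."""
    if liters_per_day < BREAKS[0]:
        return None  # below the lowest mapped class
    # Assumption: a value equal to a break belongs to the lower class.
    index = max(bisect_left(BREAKS, liters_per_day) - 1, 0)
    return COLORS[index]


# The survey's average household consumption of 586 liters per day.
print(consumption_class(586))
```

With these breaks, the average household consumption of 586 liters per day reported in the survey falls in the 455 to 682 liter class.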

Proposed water sensitive intervention approaches for Adama city

The first water sensitive intervention can be achieved by improving indoor and outdoor water conservation practices at the household level. Water is mostly used in households for drinking, cooking, cleaning, bathing, flushing toilets, and watering lawns and gardens. Therefore, the first step to create water wise household is to improve the household’s water saving practices. Installing water saving fixtures like faucets, toilets, showers, retrofitting old fixtures, educating the entire family to adopt water saving behaviors like shutting off the faucet while preparing dishes, brushing their teeth, and washing their faces, as well as turning off the shower while using soap and taking quick showers, are some examples of this initiative. Outdoor water saving strategies that households should implement include planting drought tolerant trees, reusing indoor water use for gardening, and watering the landscape earlier in the day when it is cooler or water evaporation is less.

The second approach can be linked with ensuring reliable water supply by expanding alternate water sources. Diversification of sources increases the availability of water for households, which is one of the best ways to guarantee a sustainable water supply. Increasing the amount of centralized surface and subsurface water supplies, collecting rainwater, and reusing or recycling wastewater are the major components of the supply diversification strategy.

The third strategy can be associated with understanding the diverse water competitions and introducing the function of a fit for purpose approach. It is a well-known fact that competition for water is growing rapidly as settlements across the world become more urbanized. Adama city is no exception. Therefore, the municipality must develop an approach that balances the competition among sectors centering on a fit for purpose water supply system.

The fourth strategy can be linked with striving to support the transition to a water sensitive city. City level water management should be in line with ensuring flexible systems that integrate different sources of water, operate through both centralized and decentralized systems, and provide a wider range of opportunities to communities at city level.

Discussion

Historically, Adama city has depended on groundwater as its primary drinking water source. At present, however, groundwater is becoming a less viable source because of declining well capacity and deteriorating water quality attributed to an upsurge in fluoride content. Currently, the dominant and probably the only water supply source for Adama city is the Awash River. Ref. 89, which looked into the city’s water supply, claimed that if the Awash River supply failed, Adama city’s water supply problem would get worse. Similarly, scholars report that the “one source supply option” cannot be a sustainable solution for the growing water scarcity7,13. The result also demonstrates that current water consumption management ignores internal alternative water harvesting approaches and follows a linear water flow. Studies also report that diversified water supply harvesting methods are not well developed in developing countries90,91. Ref. 92 also suggests that implementing alternative water sources such as rainwater harvesting can significantly narrow the gap between water supply and demand.

This study specifically pinpointed socioeconomic, spatial, and climatic variables that determine residential water consumption. Accordingly, the study reveals that household income and size are the most influential of the socioeconomic variables. Refs. 32,93,94 also reported that household size is one of the most consistent determinants of water consumption. Refs. 48,49,95 found a positive relationship between income and water consumption. The result demonstrates that better quality houses have a significant impact on water consumption. In a similar vein, ref. 96 reported a significant correlation between household water consumption and housing quality. The number of rooms is also one of the significant variables that affect water consumption in Adama city. Refs. 97,98 observed that water consumption increased in proportion to the number of rooms. The result shows that formal parcels have a higher predictive capacity than informal ones. Ref. 99 also reported that formal parcels consume more water than informal parcels. In terms of climate variables, minimum and maximum temperatures as well as total yearly precipitation stand out as having a major impact on Adama city’s household water consumption. According to refs. 24,100, climatic variables such as maximum temperature have an impact on water consumption patterns. Among the topographic variables, aspect and elevation have the highest influence on water consumption. Similarly, refs. 101,102 reported that natural topography affects water supply distribution and, in turn, supply reliability.

Given that a sizeable portion of urban water consumption comes from residential settlements, it is crucial to fully understand the factors that influence water consumption and to develop a predictive model so that urban planners and water resource managers can make more informed decisions. Machine learning is one of the most innovative approaches for addressing these issues. It can be used to analyze water consumption trends, identify factors that influence water consumption, and develop predictive models of water consumption. This allows municipalities to better anticipate water demand and adjust their supply accordingly. This study also has limitations that call for additional investigation. One shortcoming is that it used only one city as a case study, which makes it difficult to draw broad conclusions or develop knowledge that can be applied generally. Thus, future research should look into other cities to produce comprehensive and generalizable knowledge for decision making.

Methods

Overview of the study approach

The current study used both top down and bottom up data collection approaches. The municipal water supply database served as the foundation for the top down data collection. The database contains monthly city level water consumption records. It was an important source of information for quantifying the share of water consumption in the residential, commercial, industrial, service, and other sectors. The database was also highly helpful for measuring and evaluating city level water production, consumption, and non-revenue water. In contrast, the main approach for obtaining bottom up data was a questionnaire survey, for which households were the data source. In the household survey, the monthly water consumption of each household was collected from its monthly water bill records. To measure water conservation behavior, a seven point Likert scale was used. Machine learning was used in combination with socioeconomic and spatial information to identify the factors that influence residential water consumption and to generate a residential water consumption model.

Dependent and independent variables

The response variable in this study is the household’s daily water consumption, measured in liters per household per day. This study used three classes of independent variables. The first class is socioeconomic and includes household size, monthly income, building quality, parcel characteristics, the reliability of the water supply, and the location of the residence relative to the city center. The second class is topographic and comprises slope, Topographic Position Index (TPI), Topographic Ruggedness Index (TRI), and aspect. The third class is climatic and includes mean monthly minimum temperature, mean monthly maximum temperature, and annual total rainfall (Table 11). A single regression matrix was created by combining all three classes. First, the socioeconomic data were converted to spatial data (point vector data) on the GIS platform using the joining and relating principle. This was possible because the socioeconomic variables were collected with unique identifiers and geographic locations. The joined data were then used to create raster data by interpolation. The socioeconomic, topographic, and climatic raster data were then processed in the “R” software to produce an aggregated raster file. Lastly, the attribute values of the combined raster file were exported to Excel to create the regression matrix, which serves as the final input for model training, cross validation, and model testing.
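The step of turning aligned raster layers into a regression matrix can be illustrated as follows. This is a minimal Python sketch, not the R and Excel workflow used in the study; it assumes that all layers share the same grid, and the function name, layer names, and cell values are hypothetical and purely illustrative.

```python
def build_regression_matrix(layers, nodata=None):
    """Flatten aligned raster layers into one row (dict) per grid cell.

    `layers` maps a variable name to a 2D grid (a list of rows).
    Cells that hold `nodata` in any layer are skipped.
    """
    names = list(layers)
    shapes = {(len(grid), len(grid[0])) for grid in layers.values()}
    if len(shapes) != 1:
        raise ValueError("all layers must share the same grid shape")
    n_rows, n_cols = shapes.pop()

    matrix = []
    for r in range(n_rows):
        for c in range(n_cols):
            values = [layers[name][r][c] for name in names]
            if any(v == nodata for v in values):
                continue  # incomplete cell: leave it out of the matrix
            matrix.append(dict(zip(names, values)))
    return matrix


# Hypothetical 2 x 2 layers (made-up values, not study data).
layers = {
    "income": [[3000, 5000], [None, 9000]],
    "slope": [[2.0, 3.5], [1.0, 4.2]],
    "consumption": [[300, 500], [450, 800]],
}
matrix = build_regression_matrix(layers)
print(len(matrix))  # 3 rows: one cell lacks an income value
```

Each row of the resulting matrix pairs the response value of a cell with the predictor values at the same location, which is the shape a regression model expects.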

Table 11 Data sources and attribute types

Generating a random sample and recruiting respondents

The recruitment or selection of household respondents is influenced by the choice of the sampling frame. The sampling frame for this inquiry is the list of households present in the residential communities. The list of households and their parcel attributes (names, parcel area, and geographic location data) was extracted from the municipal land inventory using GIS software. The database was then converted to Excel format so that Excel’s random number generator could be used. After the sampling frame was created, consecutive parcel “id” numbers were assigned, and simple random sampling was used to select households from the sampling frame, giving each household an equal chance of being chosen. To draw representative samples, the city settlement was first classified into three settlement categories (central, intermediate, and periphery settlements) to obtain a geographical representation of the residences and to assess the spatial difference in water consumption and reliability from the center to the periphery. The target population (N), defined as all residential water consumers in Adama city, was estimated to be 95,823 households. The sample size was determined using the simplified formula provided by Yamane (1967) (Eq. 1). Based on the formula, a total of 400 households were considered: 141 from central, 149 from intermediate, and 110 from peripheral settlements. The proportion approach was used to calculate the sample size for the central, intermediate, and periphery samples (Table 12).

$$n=\frac{N}{1+N({e}^{2})}$$
(1)

where “N” is population size, “n” is sample size and “e” is margin of error (5%).
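As a quick check, Eq. 1 can be evaluated with the reported population size and margin of error. The sketch below is in Python, and the function name is hypothetical; the formula gives about 398 households, which is close to the 400 households reported, although the text does not state how the final figure was rounded.

```python
def yamane_sample_size(population, margin_of_error):
    """Yamane's simplified formula: n = N / (1 + N * e^2)."""
    return population / (1 + population * margin_of_error ** 2)


# Values reported in the text: N = 95,823 households, e = 5%.
n = yamane_sample_size(95_823, 0.05)
print(round(n, 1))  # 398.3
```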

Table 12 Sample size proportion of residential water users at three spatial scopes

Water conservation perception and behavior measurement

A seven point Likert scale was developed to measure respondents’ attitudes towards installing water saving appliances and towards indoor and outdoor water conservation behavior. The rating scale points were 1 never, 2 rarely, 3 occasionally, 4 sometimes, 5 frequently, 6 usually, and 7 every time.

Model input raster data preparation

To predict residential water consumption, a random forest regression (RFR) model was applied. The RFR model was created using 16 rasterized variables. Raster map layers containing geographical data, such as topographical and climatic characteristics, were constructed in a GIS environment as the prediction’s input. A digital topographic map with a 2 m contour interval was subjected to 3D analysis to create elevation, slope, Topographic Position Index (TPI), Topographic Ruggedness Index (TRI), and aspect factor maps. To create a socioeconomic raster dataset, the socioeconomic point data were interpolated using the Kriging method. The following sample maps were used to generate the predicted water consumption raster map (Figs. 8–13).

Fig. 8: Total household income distribution map.

Figure 8 shows the spatial distribution of household incomes in Birr. The colors denote varying income groups, with green representing households earning between 2000 and 4000 Birr per month, aqua for those with incomes between 4000 and 6000 Birr, yellow for the 6000−8000 Birr range, orange for 8000−10000 Birr, and dark red for households earning more than 10,000 Birr. The figure is produced by the researchers using GIS from household questionnaire survey (2023).

Fig. 9: Total beneficiaries per meter connection map.

In the figure, the spatial distribution of water beneficiary households is indicated. The color coded legend shows different household sizes. Green indicates households with 4–7 members, aqua represents households with 7–9 members, yellow denotes those with 9–11 members, orange indicates households with 11–13 members, and dark red shows households exceeding 13 members. The figure is produced by the researchers using GIS from household questionnaire survey (2023).

Fig. 10: Water supply reliability distribution map.

In the figure, the spatial distribution of water supply reliability is shown. The color pattern is indicative of the frequency of water availability for households. Dark red shows households receiving water once per week, orange represents those receiving water 1–2 times per week, yellow indicates households with water supply 2–3 times per week, aqua denotes 3–4 times per week, and green reflects households getting water access more than 5 times per week. The figure is produced by the researchers using GIS from household questionnaire survey (2023).

Fig. 11: Digital elevation map of Adama city.

Figure 11 presents the topographic elevation map of Adama city. The color gradient, transitioning from dark red to green, denotes the elevation variations, with elevation increasing from red to green. The dataset was obtained from National Metrology Institute of Ethiopia (NMIE) and the figure is prepared by the researchers using GIS application.

Fig. 12: Mean monthly maximum temperature.

Figure 12 presents the mean monthly maximum temperature. The color gradient, transitioning from green to red, denotes the temperature variations, with temperature increasing from green to red. The dataset was obtained from National Metrology Institute of Ethiopia (NMIE) and the figure is prepared by the researchers using GIS application.

Fig. 13: Total annual rainfall map of Adama city.

Figure 13 presents the total annual rainfall map of Adama city. The color gradient, transitioning from red to green, denotes the rainfall variations, with rainfall increasing from red to green. The dataset was obtained from National Metrology Institute of Ethiopia (NMIE) and the figure is prepared by the researchers using GIS application.

Model execution procedure

In this study, a machine learning approach based on the Random Forest algorithm was used. The following primary steps were taken to generate the final water consumption raster map.

Preprocessing stage

The data cleaning procedure was mainly conducted to identify and correct data entry errors and outliers. The missing values resulting from incorrect data entry were identified by looking at the frequencies for each variable in SPSS. After looking through the number of “Valid and Missing” values derived from the frequencies result, only two missing values were discovered. To correct them, the original questionnaires were tracked down using their unique ID, and the correct values were recorded. The Z score detection approach was used to find the outliers. The Z score method is one of the most used techniques in machine learning for outlier detection103. If a data point has a Z score greater than 3 or less than −3, it is considered an outlier103,104. To obtain the Z score for an observation, the mean is subtracted from the raw measurement, and the result is divided by the standard deviation. In Eq. 2, Z is the Z score, X is the raw measurement, μ is the mean, and σ is the standard deviation.

$$Z=\frac{X-\mu }{\sigma }$$
(2)

Using the Z score outlier identification approach, two outliers with a Z score greater than three were found among the 400 samples. After examining the outlier characteristics, the researchers decided to replace the outliers using the median value imputation approach rather than removing them. Because the mean value is highly influenced by outliers, it is advised to replace outliers with the median value105. After correcting errors, data transformation was the next step. Data transformation is the process of converting unstructured data into a form that can be used by machine learning106. As part of this step, the integrated Excel data (regression matrix) was first converted to a CSV file, which is compatible with the “R” software.
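The outlier rule of Eq. 2 combined with median imputation can be sketched as follows. This is a minimal Python illustration, not the SPSS and R procedure used in the study; the function name and the sample values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which form was used.

```python
from statistics import mean, median, pstdev


def replace_outliers_with_median(values, threshold=3.0):
    """Flag values with |Z| > threshold and replace them with the median.

    Returns the cleaned list and the indices of the flagged values.
    Assumption: Z uses the population standard deviation.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return list(values), []  # no spread, so nothing can be an outlier
    med = median(values)

    cleaned, flagged = [], []
    for i, x in enumerate(values):
        z = (x - mu) / sigma  # Eq. 2
        if abs(z) > threshold:
            flagged.append(i)
            cleaned.append(med)
        else:
            cleaned.append(x)
    return cleaned, flagged


# Made-up consumption values (liters per household per day) with one extreme entry.
sample = [480, 520] * 10 + [5000]
cleaned, flagged = replace_outliers_with_median(sample)
print(flagged, cleaned[-1])  # [20] 520
```

The median is taken over the full sample, including the flagged values, because it is barely affected by a few extreme entries.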

Data splitting into train and test procedure

Building a model depends on training data, whereas prediction depends on testing data107. In data partitioning, determining the partition ratio is important108. Scholars use various splitting ratios such as 80:20 and 70:30 (ref. 109). However, the ratio is not consistent across researchers and depends on the size of the dataset110. This study adopted a 90:10 ratio, in which 90% of the dataset was assigned to the training dataset and 10% to the testing dataset.
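A 90:10 random partition can be sketched as below. This is a minimal Python illustration under stated assumptions: the function name and the fixed seed are hypothetical, the 400 rows are used only for illustration, and the study itself performed the split in R.

```python
import random


def train_test_split(rows, test_fraction=0.10, seed=42):
    """Randomly partition rows into training and testing subsets."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


rows = list(range(400))  # stand-in for the rows of a regression matrix
train, test = train_test_split(rows)
print(len(train), len(test))  # 360 40
```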

Model training procedure

Following the splitting process, cross validation was carried out using the random forest technique, which relies on bagging (bootstrap aggregating). In random forest regression, bootstrap samples are constructed to obtain R², MSE, and RMSE (ref. 111).
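The idea behind bagging can be illustrated with a deliberately simplified sketch. The Python code below bags single-split regression stumps on one predictor and collects out-of-bag predictions; it is not a full random forest (there is no feature subsampling and no deep trees) and it is not the caret workflow used in the study. All names, parameter values, and data are hypothetical.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression stump on a single predictor."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: predict the overall mean
        overall = mean(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def bagged_oob_predictions(xs, ys, n_estimators=50, seed=1):
    """Average, for each observation, the stumps that never saw it."""
    rng = random.Random(seed)
    n = len(xs)
    oob = [[] for _ in range(n)]
    for _ in range(n_estimators):
        sample = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        stump = fit_stump([xs[i] for i in sample], [ys[i] for i in sample])
        in_bag = set(sample)
        for i in range(n):
            if i not in in_bag:  # out-of-bag observation for this stump
                oob[i].append(stump(xs[i]))
    return [mean(p) if p else None for p in oob]


# Made-up data: consumption jumps once household size exceeds 10.
xs = list(range(1, 21))
ys = [200 if x <= 10 else 800 for x in xs]
preds = bagged_oob_predictions(xs, ys)
```

Comparing the out-of-bag predictions with the observed values gives error measures such as MSE and RMSE without touching the held-out test set.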

Model evaluation against the test data procedure

The hydroGOF package in R was used to measure the goodness of fit between observed and simulated values. The model evaluation checked the Mean Error (ME), Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Normalized Root Mean Square Error (NRMSE), and Coefficient of Determination (R²).
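The evaluation measures listed above can be written out explicitly. The Python sketch below is illustrative only and does not reproduce hydroGOF; normalizing the RMSE by the standard deviation of the observations and computing R² as 1 − SS_res/SS_tot are assumptions, and the package's own conventions may differ. The function name and sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def goodness_of_fit(observed, simulated):
    """Return ME, MAE, RMSE, NRMSE (%) and R2 for paired values."""
    errors = [s - o for o, s in zip(observed, simulated)]
    me = mean(errors)
    mae = mean(abs(e) for e in errors)
    rmse = sqrt(mean(e ** 2 for e in errors))
    # Assumption: normalize by the sample standard deviation of observations.
    nrmse = 100 * rmse / stdev(observed)
    obs_mean = mean(observed)
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    # Assumption: R2 as the coefficient of determination, 1 - SS_res / SS_tot.
    r2 = 1 - ss_res / ss_tot
    return {"ME": me, "MAE": mae, "RMSE": rmse, "NRMSE": nrmse, "R2": r2}


# Made-up observed and simulated values (liters per household per day).
observed = [100, 200, 300, 400]
simulated = [110, 190, 310, 390]
scores = goodness_of_fit(observed, simulated)
```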

Software used

Machine learning, as a branch of data science, primarily involves dealing with large amounts of data and statistics68,69,70. Among the rising number of programming tools for machine learning applications, the “R” programming language is the most prevalent112. Working with “R” makes machine learning tasks simpler, quicker, and more creative113. To determine the variables that influence residential water consumption and to produce a raster map of predicted residential water consumption, this study used the Random Forest algorithm in “R” software. We used “R” version 4.0.5 (https://www.r-project.org/) with the following packages: “caret” for model training and validation (especially the ranger function), “terra” for spatial data analysis, and “dplyr” for working on the data frame/CSV file.

Third-party material

All the socioeconomic figures, including household income, beneficiary size, housing quality, supply reliability, location, parcel legal status, and area, were gathered through household surveys and processed using geographic information system (GIS) applications by the authors. The data sets for climate and topography were sourced from the National Metrology Institute of Ethiopia (NMIE) which are legally and freely provided for academic research purposes. Then the topographical and climatic figures, such as the digital elevation, temperature, and rainfall, were prepared in geographic information system (GIS) software by the authors themselves. Municipal level water consumption data was obtained from the Adama City Water Supply and Sewerage Service Enterprise’s database following proper legal procedures. Ethical considerations were a priority throughout the dataset acquisition. The authors carefully obtained necessary permissions from relevant offices, ensuring compliance with ethical and legal standards.