A robust energy management system for the Korean green islands project

Increasing the penetration of renewable energy sources (RESs) is a core component of Korean green-island microgrid projects. This approach calls for a robust energy management system (EMS) to handle the stochastic behavior of RESs. In this paper, we therefore put forward a novel reinforcement-learning-driven optimization solution for a convex formulation of the energy management of the Gasa island microgrid, one of the prominent pilots of the Korean green islands project. We improve the convergence speed of the alternating direction method of multipliers (ADMM) solution for this convex problem by accurately estimating its penalty parameter with the soft actor-critic (SAC) technique. In this arrangement, however, SAC faces a sparse-reward hindrance, which we address here with a normalizing-flow policy (NFP). Furthermore, we study the effect of implementing demand response (DR) in the Gasa island microgrid to reduce the microgrid's dependency on the diesel generator (DG) and to provide benefits such as peak shaving and reduced gas emissions.

• Providing a convex formulation of the Gasa island EMS, considering DR and load flow constraints, that yields profits for both consumers and the utility grid.
• Arranging a SAC-based solution to estimate the penalty parameter of ADMM, supporting the high-dimensional and complex Gasa island EMS problem.
• Solving the sparse-reward hindrance, with less computational burden on the SAC learning process, by arranging a high-density action space with the NFP approach.
• Exploring the reduced dependency on conventional generators and lower acquisition costs achieved with DR implementation.
The remainder of this paper is organized as follows. "Problem formulation" section formulates the problem by specifying the objective function and the constraints of the microgrid elements. "Proposed method" section presents the solution method, and "Results and discussion" section demonstrates the novelty of the proposed solution by analyzing the results and comparing them with benchmark methods. Finally, "Conclusion" section presents the most relevant conclusions of our work.

Problem formulation
Microgrid objective function. In this paper, we consider two scenarios for the EMS arrangement of the Gasa island microgrid. The first scenario includes photovoltaic cells (PV), wind turbines (WT), DGs, ESSs, and loads, as shown in Fig. 1. In the second scenario, we schedule DR for the residential load to decrease the peak-to-average ratio. This approach reduces DG consumption and carbon emissions, making the green island a more attainable objective. Consequently, we define the objective function of the Gasa island microgrid as minimizing power generation costs for the first scenario, and we extend it by adding the minimization of consumers' power consumption cost through DR implementation. We formulate the objective function of the Gasa island microgrid, considering the two scenarios, as follows. The first term of (1) calculates the cost of power generation, and the second term is the power consumption expense. According to Fig. 1, C_g applies to the power generation units, which include PV, WT, DG, and ESS in discharging mode. The start-up and shut-down costs apply to conventional units, which in this study means the DG. Since we plan to reduce peak load by implementing DR, consumption cost minimization is meaningful for the loads participating in this program.
Microgrid elements modeling and constraints. The main objective of CO2 reduction in the green-island initiative encourages prioritizing RESs in supplying loads. Therefore, we neglect the cost of RES power generation. The only RES constraint is the maximum amount of power that each unit can produce.
where P_max^PV and Q_max^PV are the maximum active/reactive power generation of the PVs, and P_max^WT and Q_max^WT are the WTs' maximum active/reactive power production.
DG is the only conventional generator in the Gasa island microgrid. In addition to its limitations concerning the amount of energy generated, the DG faces constraints regarding its working duration and variation in power output as follows.
where T_up^DG and T_down^DG determine the minimum up and down times, respectively, and u_DG is a binary variable that indicates whether the DG is on or off. The DG's power generation cost is calculated according to (10).
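As a minimal sketch, the quadratic fuel-cost curve of (10) can be evaluated as below; the coefficient values a1, a2, a3 are illustrative placeholders, not the parameters of the Gasa island DG.

```python
# Sketch of the quadratic DG generation cost in (10).
# The default coefficients below are hypothetical, for illustration only.
def dg_cost(p_dg, a1=10.0, a2=0.5, a3=0.01):
    """Fuel cost of the diesel generator at output p_dg (kW)."""
    return a1 + a2 * p_dg + a3 * p_dg ** 2

# Cost grows superlinearly with output, which is why the optimizer
# prefers RES generation whenever it is available.
print(dg_cost(100.0))  # 10 + 50 + 100 = 160.0
```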
P_min^DG ≤ P_DG ≤ P_max^DG,

where η_ch and η_dch are the ESS charging and discharging efficiencies, SoC_min and SoC_max determine the lower and upper limits of the state of charge (SoC), and τ is the time-slot length. This study estimates the battery degradation cost based on (14).
where E_b,rated and E_b(t) are the rated stored energy and the available energy of the battery at time t, respectively. η_leakage determines the leakage loss, σ is a coefficient used to calculate battery degradation over its lifespan, c_b,inv is the initial investment for the battery, and n_ch,dch is the number of full charge and discharge cycles. Gasa island includes 164 households 25. To model the residential load, we use the Enertalk open dataset, which consists of the per-appliance load consumption of 22 houses in Korea 26. This dataset provides power consumption data for the commonly deployed appliances of a Korean household, including the refrigerator, Kimchi refrigerator, water purifier, rice cooker, washing machine, and TV. We increased the aggregated power consumption of the appliances by around 13%, based on 27, to account for the heating and cooling system. We implemented DR on residential consumers to decrease the peak-to-average ratio. Therefore, we categorize the Enertalk appliance records into non-controllable, shiftable, and reducible loads. The TV, Kimchi refrigerator, refrigerator, and rice cooker are non-controllable loads. The washing machine and the heating and cooling system are shiftable and reducible loads, respectively. The electricity consumption of the heating and cooling system during the DR program, based on the inside building temperature (Temp_in), therefore has the following restrictions 28,29.
where Temp_in(τ) is the inside building temperature deviation during time step τ, and Temp_min^in and Temp_max^in are the minimum and maximum desirable inside temperatures. β and γ are the building thermal capacitance (kWh/°C) and thermal reactance (°C/kW), respectively, and P_H&C(t) is the power usage of the heating or cooling system at time t.
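A first-order building thermal model with these symbols can be sketched as below. The exact dynamics of the paper's restriction equations are not reproduced here; the update rule and all parameter values are illustrative assumptions, showing only how a DR schedule must keep the indoor temperature inside its comfort band.

```python
def indoor_temp_step(temp_in, temp_out, p_hc, tau=1.0, beta=2.0, gamma=4.0):
    """One time step of a first-order building thermal model (illustrative).

    temp_in / temp_out: indoor / outdoor temperature (deg C).
    p_hc: cooling power drawn in this step (kW). beta (kWh/degC) is the
    thermal capacitance and gamma (degC/kW) the thermal reactance; the
    specific update below is an assumed sketch, not the paper's equation.
    """
    # Heat leaks in from outside; cooling power removes heat.
    return temp_in + (tau / beta) * ((temp_out - temp_in) / gamma - p_hc)

# A feasible DR schedule must keep temp_in inside [Temp_min, Temp_max]:
TEMP_MIN, TEMP_MAX = 23.0, 25.0
t = 24.0
for _ in range(4):
    t = indoor_temp_step(t, temp_out=30.0, p_hc=1.0)
    assert TEMP_MIN <= t <= TEMP_MAX
```

Reducing p_hc during high-price hours lets the temperature drift toward the upper comfort limit, which is exactly the trade-off the anxiety-rate term later prices in.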
To calculate the cost of the load, we consider the time-of-use (TOU) price (p_t^TOU) according to Table 1 30. To provide a trade-off between the utility's profit from decreasing the peak-to-average ratio with a higher TOU price in peak hours and consumer comfort, we add an anxiety-rate term (A_rate) to the cost function to respect consumers' preferred power consumption pattern.
In the case of the heating and cooling system, DR affects customer comfort when the inside temperature (Temp_in(t)) exceeds the limits defined in (17) 31.
c_DG(t) = a_1 + a_2 P_DG + a_3 P_DG^2, (10)

where K is the anxiety-rate coefficient. Additionally, according to the customers' preferred time for using the washing machine, the anxiety rate of a washing machine (A_rate^WM) participating in DR is determined as follows.
where h_s and h_f are the lower and upper bounds of the preferred time for using the washing machine, respectively, and σ and ζ determine the penalty coefficients for shifting washing-machine operation outside the consumers' preferred time.
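The washing-machine anxiety rate can be sketched as a piecewise penalty on the scheduled hour; the linear form and the coefficient values below are illustrative assumptions, since the paper's exact expression is given in its equation, not reproduced here.

```python
def wm_anxiety(hour, h_s=4, h_f=12, sigma=2.0, zeta=1.0):
    """Illustrative anxiety-rate penalty for running the washing machine
    at `hour`, relative to the preferred window [h_s, h_f].
    sigma and zeta are placeholder penalty coefficients."""
    if hour < h_s:
        return sigma * (h_s - hour)   # penalty for running too early
    if hour > h_f:
        return zeta * (hour - h_f)    # penalty for running too late
    return 0.0                        # no anxiety inside preferred window

assert wm_anxiety(8) == 0.0           # inside [4, 12]: no penalty
assert wm_anxiety(2) == 4.0           # 2 h early: sigma * 2
assert wm_anxiety(15) == 3.0          # 3 h late:  zeta * 3
```

Adding this term to the TOU cost lets the optimizer shift the washing machine to cheap hours only as far as the consumer's stated preference allows.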
We adopt the water station load profile from 32 and simulate the school's daily power consumption based on 33. We take the lighthouse and radar-base power consumption of Gasa island from its average daily load profile. During power dispatch scheduling, we consider load flow constraints for each branch of the Gasa island microgrid as follows.
where i and j are bus indexes, P_i,j(t) and Q_i,j(t) are the active and reactive power flows of the line between buses i and j at time t, and V_min^i and V_max^i are the lower and upper voltage limits of each bus. P_Sch, P_WP, and P_Lh are the school, water pump station, and lighthouse active power consumption, respectively. P_l,b denotes the residential load of non-DR participants, and P_l,dr is the residential load of DR participants. P_H&C,dr and P_WM,dr are the amounts of active power consumption of the heating and cooling systems and the washing machines that contribute to DR. We modify the objective function in (1) based on the Gasa island microgrid constraints as follows.

Proposed method
ADMM. ADMM is a popular and reliable approach for convex quadratic programming. The conventional ADMM method attempts to solve the following problem 34: where λ is the vector of Lagrange multipliers for constraints (32), and ρ is the penalty factor. ADMM iteratively updates the optimization variables as follows.
This iterative procedure converges when the primal and dual residuals of the ADMM technique meet their thresholds ε_p and ε_d, respectively, as follows.
where r_p^k and r_d^k represent the primal and dual residuals of the ADMM technique at iteration k, calculated according to (39) and (40), respectively.
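The residual-based stopping rule can be sketched on a toy problem: scaled-form ADMM on min (x−3)² + (z−1)² subject to x = z, whose optimum is x = z = 2. This is only an illustration of the update/residual mechanics; the microgrid problem itself is far larger and is decomposed differently.

```python
# Toy scaled-form ADMM illustrating the primal/dual residual stopping rule.
def admm_toy(rho=1.0, eps_p=1e-6, eps_d=1e-6, max_iter=1000):
    x, z, u = 0.0, 0.0, 0.0                       # u: scaled dual variable
    for k in range(max_iter):
        x = (6.0 + rho * (z - u)) / (2.0 + rho)   # x-update (argmin over x)
        z_old = z
        z = (2.0 + rho * (x + u)) / (2.0 + rho)   # z-update (argmin over z)
        u = u + x - z                              # dual update
        r_p = abs(x - z)                           # primal residual
        r_d = rho * abs(z - z_old)                 # dual residual
        if r_p < eps_p and r_d < eps_d:            # convergence test
            return x, z, k + 1
    return x, z, max_iter

x, z, iters = admm_toy()
assert abs(x - 2.0) < 1e-4 and abs(z - 2.0) < 1e-4   # optimum is x = z = 2
```

Even in this toy, the iteration count depends strongly on ρ, which is the motivation for learning the penalty parameter instead of fixing it.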
We deploy the decomposition technique introduced in 23 to solve our ADMM problem. To this end, we consider two penalty parameter vectors as follows.
where ρ_PQ is the penalty parameter for active and reactive power, and ρ_Vθ denotes the voltage and phase-angle penalty parameter. n_PQ and n_Vθ are the numbers of constraints for each pair of active and reactive power and of voltage and phase angle, respectively.
Since the penalty parameters are determinant factors for the convergence of the ADMM primal and dual residuals in each iteration, according to (37) and (38), the sequential nature of their calculation encourages using a DRL method to estimate them.

SAC algorithm with enhanced exploration. SAC is an actor-critic DRL method that is sample efficient thanks to its off-policy approach 24. SAC works in both discrete and continuous environments. The main characteristic of SAC is stochastic policy optimization achieved by adding an entropy term to the policy objective. This characteristic provides productive exploration and, consequently, a higher convergence rate than other off-policy methods, such as deep deterministic policy gradient (DDPG) and twin delayed DDPG (TD3). Its sample efficiency, due to learning from experience saved in the replay buffer, is superior to on-policy techniques such as proximal policy optimization (PPO) and trust region policy optimization (TRPO). The entropy term (H(·)) is defined according to (29). The portion of entropy in the learned policy is determined by the temperature parameter α, which is decreased during the learning iterations as follows.
The entropy term updates the Bellman equation of the value-network training process according to (45).
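The entropy bonus that augments these objectives can be sketched for a discrete policy as below; the paper's policy is a continuous Gaussian, but the definition E[−log π] is the same.

```python
import math

# Minimal sketch of the entropy term H(pi(.|s)) that SAC adds to its
# objective, evaluated here for a discrete action distribution.
def policy_entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = [0.25] * 4
greedy = [1.0, 0.0, 0.0, 0.0]
assert abs(policy_entropy(uniform) - math.log(4)) < 1e-12  # max entropy
assert policy_entropy(greedy) == 0.0                        # no exploration
# The temperature alpha scales how much this bonus rewards exploration;
# annealing alpha shifts the agent from exploration toward exploitation.
```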
H(π(·|s_t)) = E_{a∼π(·|s_t)}[−log π(a|s_t)].

The critic in SAC includes the value function V_ψ and the soft Q-function Q_θ. The actor contains the policy network π_φ, which chooses actions from a Gaussian probability distribution through the squashing function f_φ(ε; s_t) as follows.
To use SAC for the penalty parameter estimation of ADMM, we need to arrange the Markov decision process (MDP), including the state, action, transition function, and reward. The state comprises the decision variables of the ADMM problem, which are the active and reactive powers of the Gasa island microgrid elements, as follows.
SAC will predict the suitable penalty parameter from the continuous action space according to (48), (49).
The reward function is defined as follows.
However, this reward function is sparse. The NFP is the technique we use in this study to improve stability and provide efficient action-space exploration, overcoming the sparse-reward issue 35. Normalizing flows are established by a set of invertible functions. Using the change-of-variables rule for probability distributions, normalizing flows sequentially transform a simple distribution into a more expressive density as follows 36.
where z_0 is the base distribution and z_N is the final flow output. The density of the continuous variable z_N, parameterized by φ, is as follows.
One of the simplest methods to construct the invertible function f is RealNVP. We deployed the method introduced in 37 to combine SAC and normalizing flows. We reparameterize the Gaussian distribution of the action selection with the RealNVP invertible transformation as follows.
where π(a|s) = z 0 and the log density of action is as follows.
Therefore, the standard SAC is modified by adding a gradient step on the normalizing-flow layers during the update of φ.
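The change-of-variables rule behind this log density can be sketched with a single scalar affine flow f(z) = a·z + b standing in for a RealNVP coupling layer (an illustrative simplification: a real coupling layer is multivariate with learned scale and shift networks).

```python
import math

# Change-of-variables rule used by normalizing flows:
# log p(z_N) = log p(z_0) - sum_i log |det d f_i / d z|.
def log_normal(z, mu=0.0, std=1.0):
    """Log density of a Gaussian N(mu, std^2)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (z - mu) ** 2 / (2 * std ** 2)

def flow_log_density(y, a=2.0, b=1.0):
    """Log density of y = a*z + b with z ~ N(0, 1), via the flow formula."""
    z0 = (y - b) / a                       # invert the flow
    log_det = math.log(abs(a))             # log |df/dz| for f(z) = a*z + b
    return log_normal(z0) - log_det        # change of variables

# Sanity check: y = a*z + b with z ~ N(0,1) is exactly N(b, a^2),
# so the flow formula must match that Gaussian's analytic log density.
y = 1.7
assert abs(flow_log_density(y) - log_normal(y, mu=1.0, std=2.0)) < 1e-12
```

Stacking several such invertible maps, each with a tractable log-determinant, is what lets the NFP represent richer action densities than a single Gaussian while keeping log π(a|s) computable for the SAC update.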
The proposed technique for determining the penalty parameter of ADMM with the contribution of SAC and NFP is illustrated in Fig. 2.

Results and discussion
We deploy our proposed algorithm to optimize the Gasa island microgrid EMS in this section. The optimization takes place hourly and can be extended to shorter intervals based on the available data. Figure 3 depicts the load profile of the island's power consumers based on "Microgrid elements modeling and constraints" section. Since the residential load profile is for August, we choose the WT and PV output power, taken from 38, and the outdoor temperature 39 for this period, as shown in Fig. 4. We simulate the microgrid and implement both a centralized ADMM solution as a baseline and our optimization solution with the CVXPY package 40, using the OSQP solver 41, in the Python environment. The simulations are performed on a PC with an Intel(R) Core(TM) i5-10400F CPU @ 2.90 GHz.
The SAC algorithm's performance was compared across different numbers of hidden layers and batch sizes to select the DNN parameters. Figure 5 shows how the learning process of the DNN varies depending on the hyperparameters. Based on the results, two layers with a batch size of 128 provide a good trade-off among learning convergence, stability, and complexity. Extending the network to three layers and the batch size to 256 is not advantageous, according to Fig. 5b. Table 2 depicts the microgrid element specifications and the solution algorithm parameters.
a_t = f_φ(ε_t; s_t) = tanh(μ_φ(s_t) + σ_φ(s_t) ε_t), ε_t ∼ N(0, 1). (46)

Figure 6 demonstrates the convergence speed of the proposed algorithm compared with conventional ADMM. The effectiveness of our technique appears in this figure as the excess number of iterations that vanilla ADMM requires to converge. By learning the best policy, the RL agent speeds up the convergence of the dual and primal residuals to 300 and 500 iterations, respectively, while normal ADMM requires twice as many. Figure 7 shows the Gasa island microgrid's economic power dispatch along with the SoC level of the BESS, with and without DR. This figure illustrates the effectiveness of DR planning in reducing DG utilization during the day under study. Without DR, the DG starts working at a higher generation level of 315 kW at 1 a.m. to compensate for the shortage of RES power, as shown in Fig. 7a. With DR, however, this amount is reduced to 210 kW, as can be seen in Fig. 7b. Additionally, without DR, the DG must keep operating at 315 kW due to ramp-down limitations, even though the power shortage is lower than this level of generation. Moreover, we also solved the QP arrangement of Gasa island with the CPLEX solver as an analytical method. Table 3 shows how our proposed SAC-NF-ADMM improves the performance of ADMM in terms of operational cost for the day under study. Figure 8 shows the voltage magnitudes of all nodes of the Gasa island power network; the voltage magnitudes stay within the allowed range between 0.99 and 1.01 p.u.
The effects of DR scheduling on the cooling system and the washing machine are presented in Figs. 9 and 10. The washing machine takes part in the DR program by shifting its operating hours to off-peak and mid-peak periods with lower TOU pricing. The results of the DR implementation show that the anxiety rates effectively steer the optimization problem to keep the washing machines' working hours within the desired time. As discussed before, the washing machine's preferred operating time is between 4 a.m. and 12 p.m. Therefore, as can be seen in Fig. 9, the washing machine power usage between 9 a.m. and 5 p.m., with its higher TOU price, is transferred to other times of the day within the desired range.
The optimization algorithm keeps the indoor temperature within the desired band between 23 and 25 °C. The inside temperature tends to be higher between 10 a.m. and 5 p.m., when the TOU price is highest, resulting in less power consumption by the cooling system compared to the situation without DR scheduling. Figure 11 reveals that DR scheduling results in peak shaving of around 20% for the residential load during peak hours. The green shaded area in Fig. 11 shows peak shaving, while the red shaded area signifies the transferred portion of the peak load relating to washing machine usage. The most significant benefit of DR scheduling for consumers is a 21% drop in total consumer electricity bills, as illustrated in Fig. 12. After DR implementation, the DG cost dropped by around 42%, which is further evidence of reduced DG usage and lower gas emissions. Here, we deployed the TOU policy, currently utilized in the Korean power system, to schedule DR. TOU is one of the price-based DR policies. It is also possible to implement DR on Gasa island using other price-based DR models, such as real-time pricing (RTP) and critical peak pricing (CPP). Furthermore, incentive-based DR methods, including direct load control and emergency demand response programs (EDRP), can be used jointly with price-based DR, with incentive payments from utilities to increase customer profits and encourage them to participate in DR. Therefore, in our future investigation, we will use a combination of price-based and incentive-based DR techniques to study their effect on CO2 emission reduction. On the other hand, to fully reflect the current situation of Gasa island, we deployed historical WT and PV output power data. However, in a future attempt to consider the uncertainties in WT and PV power generation, we will deploy a long short-term memory (LSTM)-based solution to provide an RES predictor.

Conclusion
In this paper, we arranged the EMS unit for the Gasa island microgrid, one of the prominent pilots of the Korean green islands project. The proposed approach advances the main objective of this green-island microgrid from a feasible framework for RES utilization to a microgrid that is profitable for consumers through DR deployment. Additionally, our method resulted in less dependency on DGs thanks to DR schedules. The ADMM-based solution for the EMS provided a fast-converging optimization process through penalty parameter prediction with the state-of-the-art DRL method SAC. We relieved each iteration from the computational cost of transferring dual-variable variants by defining an independent, constant reward for each converged iteration. However, this approach resulted in a sparse-reward hindrance for the agent training process. To overcome this problem, we used the NFP to enrich the probability distribution of the policy. The results showed that the proposed ADMM converged 50% faster than vanilla ADMM. Additionally, the implemented DR scheduling on reducible and shiftable residential loads decreased the peak load by 20%. Since in this paper we considered the current situation of the Gasa island microgrid network, we adopted TOU, the DR policy utilized in the Korean power system, to implement DR. In our future work, we will extend our study by utilizing CPP and EDRP. Furthermore, we will employ an LSTM-based predictor to estimate RES output.

Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.