Autonomous navigation of stratospheric balloons using reinforcement learning

Bellemare, Marc G.; Candido, Salvatore; Castro, Pablo Samuel; Gong, Jun; Machado, Marlos C.; Moitra, Subhodeep; Ponda, Sameera S.; Wang, Ziyu

doi:10.1038/s41586-020-2939-8

Article
Published: 02 December 2020

Marc G. Bellemare¹ (ORCID: orcid.org/0000-0002-6096-0105),
Salvatore Candido³ (ORCID: orcid.org/0000-0002-5847-0617),
Pablo Samuel Castro¹,
Jun Gong³,
Marlos C. Machado¹,
Subhodeep Moitra¹,
Sameera S. Ponda³ &
Ziyu Wang²

Nature volume 588, pages 77–82 (2020)

Abstract

Efficiently navigating a superpressure balloon in the stratosphere¹ requires the integration of a multitude of cues, such as wind speed and solar elevation, and the process is complicated by forecast errors and sparse wind measurements. Coupled with the need to make decisions in real time, these factors rule out the use of conventional control techniques²,³. Here we describe the use of reinforcement learning⁴,⁵ to create a high-performing flight controller. Our algorithm uses data augmentation⁶,⁷ and a self-correcting design to overcome the key technical challenge of reinforcement learning from imperfect data, which has proved to be a major obstacle to its application to physical systems⁸. We deployed our controller to station Loon superpressure balloons at multiple locations across the globe, including a 39-day controlled experiment over the Pacific Ocean. Analyses show that the controller outperforms Loon’s previous algorithm and is robust to the natural diversity in stratospheric winds. These results demonstrate that reinforcement learning is an effective solution to real-world autonomous control problems in which neither conventional methods nor human intervention suffice, offering clues about what may be needed to create artificially intelligent agents that continuously interact with real, dynamic environments.

**Fig. 1: Station-keeping with a superpressure balloon.**

**Fig. 2: Effect of parameters on controller performance.**

**Fig. 3: Performance profile of the reinforcement-learning controller in simulation.**

**Fig. 4: Flight characteristics of controllers tested in the Pacific Ocean experiment.**

Data availability

The data analysed in this paper are available from the corresponding authors on reasonable request.

Code availability

The code used to train the flight controllers is proprietary. The code used to analyse the generated data is available from the corresponding authors on reasonable request.

References

Lally, V. E. Superpressure Balloons for Horizontal Soundings of the Atmosphere Technical report (National Center for Atmospheric Research, 1967).
Anderson, B. D. O. & Moore, J. B. Optimal Control: Linear Quadratic Methods (Prentice-Hall, 1989).
Camacho, E. F. & Bordons, C. Model Predictive Control (Springer, 2007).
Bellman, R. E. Dynamic Programming (Princeton Univ. Press, 1957).
Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction 2nd edn (MIT Press, 2018).
Jakobi, N., Husbands, P. & Harvey, I. Noise and the reality gap: the use of simulation in evolutionary robotics. In Proc. European Conf. Artificial Life (eds Moran, F. et al.) 704–720 (Springer, 1995).
Tobin, J. et al. Domain randomization and generative models for robotic grasping. In Proc. Intl Conf. Intelligent Robots and Systems 3482–3489 (IEEE, 2018).
Levine, S., Kumar, A., Tucker, G. & Fu, J. Offline reinforcement learning: tutorial, review, and perspectives on open problems. Preprint at https://arxiv.org/abs/2005.01643 (2020).
Kaelbling, L. P., Littman, M. L. & Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artif. Intell. 101, 99–134 (1998).
Tesauro, G. Temporal difference learning and TD-Gammon. Commun. ACM 38, 58–68 (1995).
Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
Lauer, C. J., Montgomery, C. A. & Dietterich, T. G. Managing fragmented fire-threatened landscapes with spatial externalities. For. Sci. 66, 443–456 (2020).
Simão, H. P. et al. An approximate dynamic programming algorithm for large-scale fleet management: a case application. Transport. Sci. 43, 178–197 (2009).
Mannion, P., Duggan, J. & Howley, E. An experimental review of reinforcement learning algorithms for adaptive traffic signal control. In Autonomic Road Transport Support Systems (eds McCluskey, T. L. et al.) 47–66 (Springer, 2016).
Mirhoseini, A. et al. Chip placement with deep reinforcement learning. Preprint at https://arxiv.org/abs/2004.10746 (2020).
Nevmyvaka, Y., Feng, Y. & Kearns, M. Reinforcement learning for optimized trade execution. In Proc. Intl Conf. Machine Learning (eds Cohen, W. W. & Moore, A.) 673–680 (ACM, 2006).
Pineau, J., Bellemare, M. G., Rush, A. J., Ghizaru, A. & Murphy, S. A. Constructing evidence-based treatment strategies using methods from computer science. Drug Alcohol Depend. 88, S52–S60 (2007).
Anderson, R. N., Boulanger, A., Powell, W. B. & Scott, W. Adaptive stochastic control for the smart grid. Proc. IEEE 99, 1098–1115 (2011).
Glavic, M., Fonteneau, R. & Ernst, D. Reinforcement learning for electric power system decision and control: past considerations and perspectives. IFAC PapersOnLine 50, 6918–6927 (2017).
Theocharous, G., Thomas, P. S. & Ghavamzadeh, M. Personalized ad recommendation systems for life-time value optimization with guarantees. In Proc. Intl Joint Conf. Artificial Intelligence (eds Yang, Q. & Wooldridge, M.) 1806–1812 (AAAI Press, IJCAI, 2015).
Ie, E. et al. SlateQ: a tractable decomposition for reinforcement learning with recommendation sets. In Proc. Intl Joint Conf. Artificial Intelligence (ed. Kraus, S.) 2592–2599 (IJCAI, 2019).
Ross, S., Gordon, G. & Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proc. 14th Intl Conf. Artificial Intelligence and Statistics (eds Gordon, G. et al.) 627–635 (PMLR, 2011).
Tan, J. et al. Sim-to-real: learning agile locomotion for quadruped robots. In Proc. Robotics: Science and Systems XIV (eds Kress-Gazit, H. et al.) 10 (2018).
Ng, A. Y., Kim, H. J., Jordan, M. I. & Sastry, S. Autonomous helicopter flight via reinforcement learning. In Advances in Neural Information Processing Systems 16 (NIPS 2003) (eds Saul, L. K. et al.) 799–806 (2004).
Abbeel, P., Coates, A., Quigley, M. & Ng, A. Y. An application of reinforcement learning to aerobatic helicopter flight. In Advances in Neural Information Processing Systems 19 (NIPS 2006) (eds Schölkopf, B. et al.) 1–8 (MIT Press, 2007).
Reddy, G., Wong-Ng, J., Celani, A., Sejnowski, T. J. & Vergassola, M. Glider soaring via reinforcement learning in the field. Nature 562, 236–239 (2018).
Lange, S., Riedmiller, M. & Voigtländer, A. Autonomous reinforcement learning on raw visual input data in a real world application. In Proc. Intl Joint Conf. Neural Networks https://doi.org/10.1109/IJCNN.2012.6252823 (IEEE, 2012).
Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J. & Quillen, D. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Int. J. Robot. Res. 37, 421–436 (2018).
Kalashnikov, D. et al. Scalable deep reinforcement learning for vision-based robotic manipulation. In Proc. Conf. Robot Learning Vol. 87 (eds Billard, A. et al.) 651–673 (PMLR, 2018).
Andrychowicz, O. M. et al. Learning dexterous in-hand manipulation. Int. J. Robot. Res. 39, 3–20 (2020).
Zhang, C. Madden–Julian Oscillation. Rev. Geophys. 43, RG2003 (2005).
Domeisen, D. I., Garfinkel, C. I. & Butler, A. H. The teleconnection of El Niño Southern Oscillation to the stratosphere. Rev. Geophys. 57, 5–47 (2018).
Baldwin, M. et al. The quasi-biennial oscillation. Rev. Geophys. 39, 179–229 (2001).
Friedrich, L. S. et al. A comparison of Loon balloon observations and stratospheric reanalysis products. Atmos. Chem. Phys. 17, 855–866 (2017).
Coy, L., Schoeberl, M. R., Pawson, S., Candido, S. & Carver, R. W. Global assimilation of Loon stratospheric balloon observations. J. Geophys. Res. D 124, 3005–3019 (2019).
Rasmussen, C. E. & Williams, C. K. I. Gaussian Processes for Machine Learning (MIT Press, 2006).
Sondik, E. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford Univ. (1971).
Hersbach, H. et al. The ERA5 global reanalysis. Q. J. R. Meteorol. Soc. 146, 1999–2049 (2020).
Perlin, K. An image synthesizer. Comput. Graph. 19, 287–296 (1985).
Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The Arcade Learning Environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013).
Kolter, Z. J. & Ng, A. Y. Policy search via the signed derivative. In Proc. Robotics: Science and Systems V (eds Trinkle, J. et al.) 27 (MIT Press, 2009).
Levine, S. & Koltun, V. Guided policy search. In Proc. Intl Conf. Machine Learning Vol. 28-3 (eds Dasgupta, S. & McAllester, D.) 1–9 (ICML, 2013).
Lin, L. Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach. Learn. 8, 293–321 (1992).
Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In Proc. 27th Intl Conf. Machine Learning (ed. Fürnkranz, J.) 807–814 (ICML, 2010).
Dabney, W., Rowland, M., Bellemare, M. G. & Munos, R. Distributional reinforcement learning with quantile regression. In Proc. AAAI Conf. Artificial Intelligence 2892–2901 (AAAI Press, 2018).
Mnih, V. et al. Asynchronous methods for deep reinforcement learning. In Proc. Intl Conf. Machine Learning Vol. 48 (eds Balcan, M.-F. & Weinberger, K. Q.) 1928–1937 (ICML, 2016).
Munos, R. From bandits to Monte-Carlo tree search: the optimistic principle applied to optimization and planning. Found. Trends Mach. Learn. 7, 1–129 (2014).
Gibson, J. J. The Ecological Approach to Visual Perception (Taylor & Francis, 1979).
Brooks, R. Elephants don’t play chess. Robot. Auton. Syst. 6, 3–15 (1990).
Alexander, M., Grimsdell, A., Stephan, C. & Hoffmann, L. MJO-related intraseasonal variation in the stratosphere: gravity waves and zonal winds. J. Geophys. Res. D Atmospheres 123, 775–788 (2018).
Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, Cambridge Univ. (1989).
Castro, P. S., Moitra, S., Gelada, C., Kumar, S. & Bellemare, M. G. Dopamine: a research framework for deep reinforcement learning. Preprint at https://arxiv.org/abs/1812.06110 (2018).
Bellemare, M. G., Dabney, W. & Munos, R. A distributional perspective on reinforcement learning. In Proc. Intl Conf. Machine Learning Vol. 70 (eds Precup, D. & Teh, Y. W.) 449–458 (PMLR, 2017).
Kingma, D. & Ba, J. Adam: A method for stochastic optimization. In Proc. Intl Conf. Learning Representations (eds Bengio, Y. & LeCun, Y.) (2015).
Golovin, D. et al. Google Vizier: a service for black-box optimization. In Proc. ACM SIGKDD Intl Conf. Knowledge Discovery and Data Mining (eds Matwin, S. et al.) 1487–1496 (ACM, 2017).

Acknowledgements

We are grateful to J. Davidson for early ideation and prototyping. We thank V. Vanhoucke, K. Choromanski, V. Sindhwani, C. Boutilier, D. Precup, S. Mourad, S. Levine, K. Murphy, A. Faust, H. Larochelle and J. Platt for discussions; M. Bowling, A. Guez, D. Tarlow, J. Drouin, and M. Brenner for feedback on earlier versions of the manuscript; W. Dabney for feedback and help with design; T. Larivee for help with visuals; N. Mainville for project management support; R. Carver for information on weather phenomena; and the Loon operations team.

Author information

Authors and Affiliations

Brain Team, Google Research, Montreal, Quebec, Canada
Marc G. Bellemare, Pablo Samuel Castro, Marlos C. Machado & Subhodeep Moitra
Brain Team, Google Research, Toronto, Ontario, Canada
Ziyu Wang
Loon, Mountain View, CA, USA
Salvatore Candido, Jun Gong & Sameera S. Ponda

Contributions

S.C. conceptualized the problem. J.G., S.C., S.S.P., P.S.C. and S.M. built the technical infrastructure. S.C., M.G.B., J.G., M.C.M. and P.S.C. developed and tested the algorithm. M.C.M., M.G.B., P.S.C., S.C., S.S.P., J.G. and Z.W. performed experimentation and data analysis. M.G.B. and S.C. managed the project. M.G.B., S.C., M.C.M., P.S.C. and S.S.P. wrote the paper. Authors are listed alphabetically by surname.

Corresponding authors

Correspondence to Marc G. Bellemare or Salvatore Candido.

Ethics declarations

Competing interests

M.G.B., S.C., J.G. and M.C.M. have filed patent applications relating to navigating aerial vehicles using deep reinforcement learning. The remaining authors declare no competing financial interests.

Additional information

Peer review information: Nature thanks Scott Osprey and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Flight paths of the reinforcement-learning controller during the Pacific Ocean experiment.

The x and y axes represent longitude and latitude, respectively.

Extended Data Fig. 2 Flight paths of StationSeeker during the Pacific Ocean experiment, 1 of 2.

The x and y axes represent longitude and latitude, respectively.

Extended Data Fig. 3 Flight paths of StationSeeker during the Pacific Ocean experiment, 2 of 2.

The x and y axes represent longitude and latitude, respectively.

Extended Data Fig. 4 TWR50 and power consumption for different parametrizations of StationSeeker’s score function.

Grey points indicate settings chosen uniformly at random from the following ranges: w_Δ ∈ [0.4, 0.8] (at close range), k₁ ∈ [0.01, 0.15], g_unknown ∈ [0.4, 0.6] and k₂ ∈ [0, 0.2]. Each parameter was also varied in isolation (coloured points). Semantically interesting parameter choices are highlighted.
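The random sweep described in this caption can be sketched as follows. This is a minimal illustration, assuming each parameter is drawn independently and uniformly; only the ranges come from the caption, while the dictionary keys and the function name `sample_setting` are hypothetical labels and not StationSeeker's actual code.

```python
import random

# Ranges quoted in the caption; the key names are illustrative labels only.
PARAMETER_RANGES = {
    "w_delta": (0.4, 0.8),     # w_Δ (at close range)
    "k1": (0.01, 0.15),
    "g_unknown": (0.4, 0.6),
    "k2": (0.0, 0.2),
}


def sample_setting(rng):
    """Draw one setting with every parameter uniform in its quoted range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAMETER_RANGES.items()}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
settings = [sample_setting(rng) for _ in range(5)]
for setting in settings:
    print({name: round(value, 3) for name, value in setting.items()})
```

Each sampled setting would correspond to one grey point; varying a single key while holding the others fixed would correspond to the coloured points.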

Extended Data Fig. 5 Distribution of returns predicted by the neural network.

Each panel indicates the predicted distributions for a particular state and action. The 51 quantiles output by the network are smoothed using kernel density estimation (σ determined from Scott’s rule with interquartile range scaling). The dashed lines indicate the average of these locations. The states with depicted distributions are from different times (0, 3, 6 and 9 h) into the July 2002 simulation. We use the largest quantile to estimate the return that could be realized in the absence of partial observability.
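The smoothing step in this caption can be illustrated with a small sketch. It assumes one common reading of "Scott's rule with interquartile range scaling", namely a bandwidth of n^(-1/5) times the smaller of the sample standard deviation and IQR/1.349; the caption does not spell out the exact formula, so treat that choice as an assumption. The quantile values are made up, and the function names are hypothetical.

```python
import math
import statistics


def scott_bandwidth(samples):
    """Bandwidth n**(-1/5) * spread (an assumed reading of the caption)."""
    n = len(samples)
    std = statistics.stdev(samples)
    q1, _, q3 = statistics.quantiles(samples, n=4)
    # IQR / 1.349 estimates the standard deviation of a normal distribution.
    spread = min(std, (q3 - q1) / 1.349)
    return spread * n ** (-1 / 5)


def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


# Hypothetical quantile locations standing in for the network's 51 outputs.
quantiles = [100.0 + 0.1 * i * i for i in range(51)]

bandwidth = scott_bandwidth(quantiles)
density = gaussian_kde(quantiles, bandwidth)
mean_return = statistics.fmean(quantiles)  # average of the locations (dashed line)
optimistic_return = max(quantiles)         # largest quantile
print(bandwidth, mean_return, optimistic_return, density(mean_return))
```

Here `mean_return` plays the role of the dashed line and `optimistic_return` the largest quantile mentioned at the end of the caption.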

Extended Data Fig. 6 Average distances and pairwise distances for perturbations of 12 initial conditions.

a, Distance to station, averaged over 125 perturbations. These numbers highlight how the 1 January to 1 June 2002 simulations (May excluded) were challenging station-keeping conditions. The 1 January configuration, in particular, lacked wind diversity. b, Average distance between pairs of balloons (7,750 pairs). Our controller exhibits greater robustness to challenging conditions.
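The pair count in panel b follows from the 125 perturbations: 125 × 124 / 2 = 7,750 unordered pairs. The sketch below checks that arithmetic and averages pairwise distances over made-up positions. The positions, the flat-plane distance and the function name `mean_pairwise_distance` are assumptions for illustration, not the analysis code used in the paper.

```python
import itertools
import math
import random


def mean_pairwise_distance(positions):
    """Average Euclidean distance over all unordered pairs, plus the pair count."""
    pairs = list(itertools.combinations(positions, 2))
    total = sum(math.dist(a, b) for a, b in pairs)
    return total / len(pairs), len(pairs)


rng = random.Random(0)
# Hypothetical balloon positions in km on a flat local plane around the station.
positions = [(rng.uniform(-50, 50), rng.uniform(-50, 50)) for _ in range(125)]

mean_distance, pair_count = mean_pairwise_distance(positions)
print(pair_count)  # 7750
print(round(mean_distance, 1))
```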

Extended Data Fig. 7 Scaled response of controllers to wind bearing and magnitude as a function of distance.

We use the derivative of the network’s action-value estimates, or response, as a proxy for the relative weight of an input. The two inputs tested here are the wind bearing and magnitude at the balloon’s altitude; the curves report the derivative for the ‘stay’ action.
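The idea of using a derivative as a response can be sketched on a toy function. In this minimal example a central finite difference stands in for the gradient of the network's 'stay' action-value estimate; `toy_action_value` is a hypothetical stand-in with an invented functional form, and the caption's additional scaling step is not reproduced here.

```python
import math


def toy_action_value(wind_bearing, wind_magnitude, distance_km):
    """Hypothetical stand-in for the 'stay' action-value estimate."""
    # In this toy model, wind along bearing 0 rad costs more the farther away we are.
    return -distance_km * wind_magnitude * math.cos(wind_bearing) / 100.0


def response(f, inputs, name, eps=1e-4):
    """Central finite-difference derivative of f with respect to one named input."""
    up = dict(inputs, **{name: inputs[name] + eps})
    down = dict(inputs, **{name: inputs[name] - eps})
    return (f(**up) - f(**down)) / (2 * eps)


# Report the response to each wind input as a function of distance to the station.
for distance_km in (10.0, 50.0, 200.0):
    inputs = {"wind_bearing": 0.5, "wind_magnitude": 8.0, "distance_km": distance_km}
    d_bearing = response(toy_action_value, inputs, "wind_bearing")
    d_magnitude = response(toy_action_value, inputs, "wind_magnitude")
    print(distance_km, round(d_bearing, 4), round(d_magnitude, 4))
```

With a real network, automatic differentiation would normally replace the finite difference; the comparison across distances is the part this sketch is meant to show.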

Extended Data Table 1 Inputs to the flight controller


Extended Data Table 2 Hyperparameters defining the deep reinforcement-learning algorithm


Supplementary information

Video 1: Simulation of 125 balloons station keeping in challenging conditions.

Simulation of 125 balloons starting from perturbations of a single initial position, either using the learned controller or StationSeeker. The station is denoted by an 'X', and the 50 km range by a dashed line. Unlike StationSeeker, the learned controller is able to remain near the station irrespective of initial conditions, despite a highly challenging wind field. It achieves this by navigating away from the station to avoid strong winds and remain in a relatively calm area, visible from 0:06 into the video.

Video 2: Simulation of 125 balloons station keeping in easy conditions.

Simulation of 125 balloons starting from perturbations of a single initial position, either using the learned controller or StationSeeker. In this relatively easy scenario, the learned controller arrives at the station faster than StationSeeker and regroups more quickly after the second night. Its performance is generally less dependent on the initial condition.

Video 3: Flight #16 from the Pacific Ocean experiment.

This video depicts flight #16 (learned controller) over the Pacific Ocean. The right panel depicts the controller's observed wind column, with colour representing uncertainty. StationSeeker's proposed choices are highlighted in that panel. When possible, the learned controller remains stationary by remaining at the interface between opposing wind sheets. Its station keeping patterns make use of the full 50 km range, providing it with additional energy late into the night.

Peer Review File

About this article

Cite this article

Bellemare, M.G., Candido, S., Castro, P.S. et al. Autonomous navigation of stratospheric balloons using reinforcement learning. Nature 588, 77–82 (2020). https://doi.org/10.1038/s41586-020-2939-8

Received: 01 April 2020
Accepted: 29 September 2020
Published: 02 December 2020
Issue Date: 03 December 2020
DOI: https://doi.org/10.1038/s41586-020-2939-8
