Efficiently navigating a superpressure balloon in the stratosphere[1] requires the integration of a multitude of cues, such as wind speed and solar elevation, and the process is complicated by forecast errors and sparse wind measurements. Coupled with the need to make decisions in real time, these factors rule out the use of conventional control techniques[2,3]. Here we describe the use of reinforcement learning[4,5] to create a high-performing flight controller. Our algorithm uses data augmentation[6,7] and a self-correcting design to overcome the key technical challenge of reinforcement learning from imperfect data, which has proved to be a major obstacle to its application to physical systems[8]. We deployed our controller to station Loon superpressure balloons at multiple locations across the globe, including a 39-day controlled experiment over the Pacific Ocean. Analyses show that the controller outperforms Loon’s previous algorithm and is robust to the natural diversity in stratospheric winds. These results demonstrate that reinforcement learning is an effective solution to real-world autonomous control problems in which neither conventional methods nor human intervention suffice, offering clues about what may be needed to create artificially intelligent agents that continuously interact with real, dynamic environments.
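The data augmentation cited above (refs 6, 7) refers to randomizing simulated conditions so that a learned policy tolerates forecast error rather than overfitting one simulated wind field. As a hedged illustration only (the function, noise model and parameter values below are our own assumptions, not the paper's implementation), augmenting a simulated wind column might perturb the speed and bearing at each altitude level:

```python
import numpy as np

def augment_wind_column(wind_u, wind_v, magnitude_noise=0.1,
                        bearing_noise=0.05, rng=None):
    """Randomly perturb a simulated wind column (u, v components per
    altitude level) so that a controller trained on it must tolerate
    forecast error.  Noise scales here are illustrative assumptions."""
    rng = rng or np.random.default_rng()
    speed = np.hypot(wind_u, wind_v)
    bearing = np.arctan2(wind_v, wind_u)
    # Multiplicative noise on speed, additive noise on bearing (radians).
    speed = speed * (1.0 + rng.normal(0.0, magnitude_noise, speed.shape))
    bearing = bearing + rng.normal(0.0, bearing_noise, bearing.shape)
    return speed * np.cos(bearing), speed * np.sin(bearing)

u, v = np.ones(10), np.zeros(10)          # 10 altitude levels, wind due east
u2, v2 = augment_wind_column(u, v, rng=np.random.default_rng(0))
```

Training against many such perturbed columns is the spirit of the domain-randomization work cited in refs 6 and 7.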
Data availability
The data analysed in this paper are available from the corresponding authors on reasonable request.
Code availability
The code used to train the flight controllers is proprietary. The code used to analyse the generated data is available from the corresponding authors on reasonable request.
Lally, V. E. Superpressure Balloons for Horizontal Soundings of the Atmosphere Technical report (National Center for Atmospheric Research, 1967).
Anderson, B. D. O. & Moore, J. B. Optimal Control: Linear Quadratic Methods (Prentice-Hall, 1989).
Camacho, E. F. & Bordons, C. Model Predictive Control (Springer, 2007).
Bellman, R. E. Dynamic Programming (Princeton Univ. Press, 1957).
Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction 2nd edn (MIT Press, 2018).
Jakobi, N., Husbands, P. & Harvey, I. Noise and the reality gap: the use of simulation in evolutionary robotics. In Proc. European Conf. Artificial Life (eds Moran, F. et al.) 704–720 (Springer, 1995).
Tobin, J. et al. Domain randomization and generative models for robotic grasping. In Proc. Intl Conf. Intelligent Robots and Systems 3482–3489 (IEEE, 2018).
Levine, S., Kumar, A., Tucker, G. & Fu, J. Offline reinforcement learning: tutorial, review, and perspectives on open problems. Preprint at https://arxiv.org/abs/2005.01643 (2020).
Kaelbling, L. P., Littman, M. L. & Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artif. Intell. 101, 99–134 (1998).
Tesauro, G. Temporal difference learning and TD-Gammon. Commun. ACM 38, 58–68 (1995).
Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
Lauer, C. J., Montgomery, C. A. & Dietterich, T. G. Managing fragmented fire-threatened landscapes with spatial externalities. For. Sci. 66, 443–456 (2020).
Simão, H. P. et al. An approximate dynamic programming algorithm for large-scale fleet management: a case application. Transport. Sci. 43, 178–197 (2009).
Mannion, P., Duggan, J. & Howley, E. An experimental review of reinforcement learning algorithms for adaptive traffic signal control. In Autonomic Road Transport Support Systems (eds McCluskey, T. L. et al.) 47–66 (Springer, 2016).
Mirhoseini, A. et al. Chip placement with deep reinforcement learning. Preprint at https://arxiv.org/abs/2004.10746 (2020).
Nevmyvaka, Y., Feng, Y. & Kearns, M. Reinforcement learning for optimized trade execution. In Proc. Intl Conf. Machine Learning (eds Cohen, W. W. & Moore, A.) 673–680 (ACM, 2006).
Pineau, J., Bellemare, M. G., Rush, A. J., Ghizaru, A. & Murphy, S. A. Constructing evidence-based treatment strategies using methods from computer science. Drug Alcohol Depend. 88, S52–S60 (2007).
Anderson, R. N., Boulanger, A., Powell, W. B. & Scott, W. Adaptive stochastic control for the smart grid. Proc. IEEE 99, 1098–1115 (2011).
Glavic, M., Fonteneau, R. & Ernst, D. Reinforcement learning for electric power system decision and control: past considerations and perspectives. IFAC PapersOnLine 50, 6918–6927 (2017).
Theocharous, G., Thomas, P. S. & Ghavamzadeh, M. Personalized ad recommendation systems for life-time value optimization with guarantees. In Proc. Intl Joint Conf. Artificial Intelligence (eds Yang, Q. & Wooldridge, M.) 1806–1812 (AAAI Press, IJCAI, 2015).
Ie, E. et al. SlateQ: a tractable decomposition for reinforcement learning with recommendation sets. In Proc. Intl Joint Conf. Artificial Intelligence (ed. Kraus, S.) 2592–2599 (IJCAI, 2019).
Ross, S., Gordon, G. & Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proc. 14th Intl Conf. Artificial Intelligence and Statistics, (eds Gordon, G. et al.) 627–635 (PMLR, 2011).
Tan, J. et al. Sim-to-real: learning agile locomotion for quadruped robots. In Proc. Robotics: Science and Systems XIV (eds Kress-Gazit, H. et al.) 10 (2018).
Ng, A. Y., Kim, H. J., Jordan, M. I. & Sastry, S. Autonomous helicopter flight via reinforcement learning. In Advances in Neural Information Processing Systems 16 (NIPS 2003) (eds Saul, L. K. et al.) 799–806 (2004).
Abbeel, P., Coates, A., Quigley, M. & Ng, A. Y. An application of reinforcement learning to aerobatic helicopter flight. In Advances in Neural Information Processing Systems 19 (NIPS 2006) (eds Schölkopf, B. et al.) 1–8 (MIT Press, 2007).
Reddy, G., Wong-Ng, J., Celani, A., Sejnowski, T. J. & Vergassola, M. Glider soaring via reinforcement learning in the field. Nature 562, 236–239 (2018).
Lange, S., Riedmiller, M. & Voigtländer, A. Autonomous reinforcement learning on raw visual input data in a real world application. In Proc. Intl Joint Conf. Neural Networks https://doi.org/10.1109/IJCNN.2012.6252823 (IEEE, 2012).
Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J. & Quillen, D. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Int. J. Robot. Res. 37, 421–436 (2018).
Kalashnikov, D. et al. Scalable deep reinforcement learning for vision-based robotic manipulation. In Proc. Conf. Robot Learning Vol. 87 (eds Billard, A. et al.) 651–673 (PMLR, 2018).
Andrychowicz, O. M. et al. Learning dexterous in-hand manipulation. Int. J. Robot. Res. 39, 3–20 (2020).
Zhang, C. Madden–Julian Oscillation. Rev. Geophys. 43, RG2003 (2005).
Domeisen, D. I., Garfinkel, C. I. & Butler, A. H. The teleconnection of El Niño Southern Oscillation to the stratosphere. Rev. Geophys. 57, 5–47 (2018).
Baldwin, M. et al. The quasi-biennial oscillation. Rev. Geophys. 39, 179–229 (2001).
Friedrich, L. S. et al. A comparison of Loon balloon observations and stratospheric reanalysis products. Atmos. Chem. Phys. 17, 855–866 (2017).
Coy, L., Schoeberl, M. R., Pawson, S., Candido, S. & Carver, R. W. Global assimilation of Loon stratospheric balloon observations. J. Geophys. Res. D 124, 3005–3019 (2019).
Rasmussen, C. E. & Williams, C. K. I. Gaussian Processes for Machine Learning (MIT Press, 2006).
Sondik, E. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford Univ. (1971).
Hersbach, H. et al. The ERA5 global reanalysis. Q. J. R. Meteorol. Soc. 146, 1999–2049 (2020).
Perlin, K. An image synthesizer. Comput. Graph. 19, 287–296 (1985).
Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The Arcade Learning Environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013).
Kolter, Z. J. & Ng, A. Y. Policy search via the signed derivative. In Proc. Robotics: Science and Systems V (eds Trinkle, J. et al.) 27 (MIT Press, 2009).
Levine, S. & Koltun, V. Guided policy search. In Proc. Intl Conf. Machine Learning Vol. 28-3 (eds Dasgupta, S. & McAllester, D.) 1–9 (ICML, 2013).
Lin, L. Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach. Learn. 8, 293–321 (1992).
Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In Proc. 27th Intl Conf. Machine Learning (ed. Fürnkranz, J.) 807–814 (ICML, 2010).
Dabney, W., Rowland, M., Bellemare, M. G. & Munos, R. Distributional reinforcement learning with quantile regression. In Proc. AAAI Conf. Artificial Intelligence 2892–2901 (AAAI Press, 2018).
Mnih, V. et al. Asynchronous methods for deep reinforcement learning. In Proc. Intl Conf. Machine Learning Vol. 48 (eds Balcan, M.-F. & Weinberger, K. Q.) 1928–1937 (ICML, 2016).
Munos, R. From bandits to Monte-Carlo tree search: the optimistic principle applied to optimization and planning. Found. Trends Mach. Learn. 7, 1–129 (2014).
Gibson, J. J. The Ecological Approach to Visual Perception (Taylor & Francis, 1979).
Brooks, R. Elephants don’t play chess. Robot. Auton. Syst. 6, 3–15 (1990).
Alexander, M., Grimsdell, A., Stephan, C. & Hoffmann, L. MJO-related intraseasonal variation in the stratosphere: gravity waves and zonal winds. J. Geophys. Res. D 123, 775–788 (2018).
Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, Cambridge Univ. (1989).
Castro, P. S., Moitra, S., Gelada, C., Kumar, S. & Bellemare, M. G. Dopamine: a research framework for deep reinforcement learning. Preprint at https://arxiv.org/abs/1812.06110 (2018).
Bellemare, M. G., Dabney, W. & Munos, R. A distributional perspective on reinforcement learning. In Proc. Intl Conf. Machine Learning Vol. 70 (eds Precup, D. & Teh, Y. W.) 449–458 (PMLR, 2017).
Kingma, D. & Ba, J. Adam: a method for stochastic optimization. In Proc. Intl Conf. Learning Representations (eds Bengio, Y. & LeCun, Y.) (2015).
Golovin, D. et al. Google Vizier: a service for black-box optimization. In Proc. ACM SIGKDD Intl Conf. Knowledge Discovery and Data Mining (eds Matwin, S. et al.) 1487–1496 (ACM, 2017).
Acknowledgements
We are grateful to J. Davidson for early ideation and prototyping. We thank V. Vanhoucke, K. Choromanski, V. Sindhwani, C. Boutilier, D. Precup, S. Mourad, S. Levine, K. Murphy, A. Faust, H. Larochelle and J. Platt for discussions; M. Bowling, A. Guez, D. Tarlow, J. Drouin, and M. Brenner for feedback on earlier versions of the manuscript; W. Dabney for feedback and help with design; T. Larivee for help with visuals; N. Mainville for project management support; R. Carver for information on weather phenomena; and the Loon operations team.
Competing interests
M.G.B., S.C., J.G. and M.C.M. have filed patent applications relating to navigating aerial vehicles using deep reinforcement learning. The remaining authors declare no competing financial interests.
Peer review information Nature thanks Scott Osprey and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
Extended Data Fig. 1 Flight paths of the reinforcement-learning controller during the Pacific Ocean experiment.
The x and y axes represent longitude and latitude, respectively.
Extended Data Fig. 4 TWR50 and power consumption for different parametrizations of StationSeeker’s score function.
Grey points indicate settings chosen uniformly at random from the following ranges: wΔ ∈ [0.4, 0.8] (at close range), k1 ∈ [0.01, 0.15], g_unknown ∈ [0.4, 0.6] and k2 ∈ [0, 0.2]. Each parameter was also varied in isolation (coloured points). Semantically interesting parameter choices are highlighted.
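The grey points above come from uniform random sampling over the four score-function parameters. A minimal sketch of that sampling scheme (parameter names are transcribed from the caption; evaluating each setting's TWR50 and power consumption would require the flight simulator, which is proprietary, so it is omitted here):

```python
import random

# Parameter ranges taken from the caption above.  Running each sampled
# setting through the flight simulator would yield the TWR50 and
# power-consumption values plotted as grey points.
RANGES = {
    "w_delta": (0.4, 0.8),     # wΔ, at close range
    "k1": (0.01, 0.15),
    "g_unknown": (0.4, 0.6),
    "k2": (0.0, 0.2),
}

def sample_params(rng=random):
    """Draw one score-function parametrization uniformly at random."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}

settings = [sample_params() for _ in range(100)]
```

The coloured points would instead hold three parameters at their defaults and sweep the fourth.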
Extended Data Fig. 5 Predicted distributions for particular states and actions.
Each panel indicates the predicted distributions for a particular state and action. The 51 quantiles output by the network are smoothed using kernel density estimation (σ determined from Scott’s rule with interquartile range scaling). The dashed lines indicate the average of these locations. The depicted states are taken at different times (0, 3, 6 and 9 h) into the July 2002 simulation. We use the largest quantile to estimate the return that could be realized in the absence of partial observability.
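The smoothing step described above can be sketched as a Gaussian kernel density estimate over the 51 quantile locations, with a Scott's-rule bandwidth using a robust interquartile-range spread. This is a standard recipe; the exact constants the authors used are not given, so the 1.349 normalizer and the n^(-1/5) Scott factor below are conventional assumptions:

```python
import numpy as np

def smooth_quantiles(quantile_locs, grid):
    """Turn a network's quantile locations into a smooth density via
    Gaussian KDE.  Bandwidth follows Scott's rule, with the spread
    estimated robustly from the interquartile range (assumed constants)."""
    q = np.asarray(quantile_locs, dtype=float)
    n = q.size
    iqr = np.subtract(*np.percentile(q, [75, 25]))
    # Robust spread: min of sample std and IQR / 1.349 (normal-IQR factor).
    sigma = min(q.std(ddof=1), iqr / 1.349) * n ** (-1 / 5)
    # Density = average of Gaussians centred at each quantile location.
    diffs = (grid[:, None] - q[None, :]) / sigma
    dens = np.exp(-0.5 * diffs ** 2).sum(axis=1)
    return dens / (n * sigma * np.sqrt(2 * np.pi))

quantiles = np.sort(np.random.default_rng(1).normal(0.0, 1.0, 51))
grid = np.linspace(-4.0, 4.0, 200)
density = smooth_quantiles(quantiles, grid)
```

The dashed lines in the figure would then correspond to `quantiles.mean()`, the average of the quantile locations.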
Extended Data Fig. 6 Average distances and pairwise distances for perturbations of 12 initial conditions.
a, Distance to station, averaged over 125 perturbations. These numbers highlight how the 1 January to 1 June 2002 simulations (May excluded) presented challenging station-keeping conditions. The 1 January configuration, in particular, lacked wind diversity. b, Average distance between pairs of balloons (7,750 pairs). Our controller exhibits greater robustness to challenging conditions.
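The 7,750 pairs in panel b are exactly the number of unordered pairs among 125 balloons, C(125, 2). A quick sketch of the pairwise-distance computation on made-up planar coordinates (real balloon fixes are latitude/longitude and would use great-circle distance):

```python
from itertools import combinations
import math

# Toy planar positions for 125 balloons; positions here are invented
# purely to show the computation.
positions = [(float(i), float(i % 5)) for i in range(125)]

pairs = list(combinations(range(len(positions)), 2))  # C(125, 2) pairs
avg_pairwise = sum(math.dist(positions[a], positions[b])
                   for a, b in pairs) / len(pairs)
```

A small average pairwise distance indicates the fleet stays clustered, which is the robustness property panel b measures.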
Extended Data Fig. 7 Scaled response of controllers to wind bearing and magnitude as a function of distance.
We use the derivative of the network’s action-value estimates, or response, as a proxy for the relative weight of an input. The two inputs tested here are the wind bearing and magnitude at the balloon’s altitude; the curves report the derivative for the ‘stay’ action.
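The "response" described above is a derivative of the action-value estimate with respect to one input feature. In a deep-learning framework this would come from automatic differentiation; a framework-free finite-difference sketch conveys the same idea (the toy Q-function below is our own stand-in, not the paper's network):

```python
import numpy as np

def input_sensitivity(q_fn, observation, eps=1e-4):
    """Finite-difference proxy for how strongly each input feature moves
    a scalar action-value estimate q_fn(observation)."""
    obs = np.asarray(observation, dtype=float)
    grads = np.empty_like(obs)
    for i in range(obs.size):
        bumped = obs.copy()
        bumped[i] += eps
        grads[i] = (q_fn(bumped) - q_fn(obs)) / eps
    return grads

# Toy stand-in Q-function: weights wind magnitude (x[0]) twice as
# heavily as wind bearing (x[1]).
q = lambda x: 2.0 * x[0] + 1.0 * x[1]
sens = input_sensitivity(q, [0.5, 0.3])
```

Plotting such sensitivities for the 'stay' action as a function of distance gives curves of the kind shown in the figure.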
Simulation of 125 balloons station-keeping in challenging conditions. Simulation of 125 balloons starting from perturbations of a single initial position, either using the learned controller or StationSeeker. The station is denoted by an 'X', and the 50 km range by a dashed line. Unlike StationSeeker, the learned controller is able to remain near the station irrespective of initial conditions, despite a highly challenging wind field. It achieves this by navigating away from the station to avoid strong winds and remain in a relatively calm area, visible from 0:06 into the video.
Simulation of 125 balloons station-keeping in easy conditions. Simulation of 125 balloons starting from perturbations of a single initial position, either using the learned controller or StationSeeker. In this relatively easy scenario, the learned controller arrives at the station faster than StationSeeker and regroups more quickly after the second night. Its performance is generally less dependent on the initial condition.
Flight #16 from the Pacific Ocean experiment. This video depicts flight #16 (learned controller) over the Pacific Ocean. The right panel depicts the controller's observed wind column, with colour representing uncertainty. StationSeeker's proposed choices are highlighted in that panel. When possible, the learned controller remains stationary by remaining at the interface between opposing wind sheets. Its station-keeping patterns make use of the full 50 km range, providing it with additional energy late into the night.
About this article
Cite this article
Bellemare, M.G., Candido, S., Castro, P.S. et al. Autonomous navigation of stratospheric balloons using reinforcement learning. Nature 588, 77–82 (2020). https://doi.org/10.1038/s41586-020-2939-8