Autonomous navigation of stratospheric balloons using reinforcement learning

Abstract

Efficiently navigating a superpressure balloon in the stratosphere [1] requires the integration of a multitude of cues, such as wind speed and solar elevation, and the process is complicated by forecast errors and sparse wind measurements. Coupled with the need to make decisions in real time, these factors rule out the use of conventional control techniques [2,3]. Here we describe the use of reinforcement learning [4,5] to create a high-performing flight controller. Our algorithm uses data augmentation [6,7] and a self-correcting design to overcome the key technical challenge of reinforcement learning from imperfect data, which has proved to be a major obstacle to its application to physical systems [8]. We deployed our controller to station Loon superpressure balloons at multiple locations across the globe, including a 39-day controlled experiment over the Pacific Ocean. Analyses show that the controller outperforms Loon’s previous algorithm and is robust to the natural diversity in stratospheric winds. These results demonstrate that reinforcement learning is an effective solution to real-world autonomous control problems in which neither conventional methods nor human intervention suffice, offering clues about what may be needed to create artificially intelligent agents that continuously interact with real, dynamic environments.
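The data-augmentation scheme is only summarized in the abstract. As a hedged illustration (not the authors' implementation), wind fields from reanalysis data can be perturbed with a smooth procedural noise field, in the spirit of Perlin's gradient noise (ref. 40), so that a controller trained in simulation also experiences plausible deviations from the recorded winds. All function names, parameters, and the noise construction below are hypothetical.

```python
import numpy as np

def smooth_noise(shape, coarse=4, rng=None):
    """Smooth pseudo-random field: coarse Gaussian noise, upsampled
    bilinearly. A simple stand-in for the gradient (Perlin-style) noise
    one might use for wind augmentation; not Loon's actual procedure."""
    if rng is None:
        rng = np.random.default_rng()
    grid = rng.standard_normal((coarse + 1, coarse + 1))
    ys = np.linspace(0.0, coarse, shape[0])
    xs = np.linspace(0.0, coarse, shape[1])
    y0 = np.minimum(np.floor(ys).astype(int), coarse - 1)
    x0 = np.minimum(np.floor(xs).astype(int), coarse - 1)
    ty, tx = ys - y0, xs - x0
    # Bilinear blend of the four surrounding coarse samples.
    top = (1 - tx) * grid[np.ix_(y0, x0)] + tx * grid[np.ix_(y0, x0 + 1)]
    bot = (1 - tx) * grid[np.ix_(y0 + 1, x0)] + tx * grid[np.ix_(y0 + 1, x0 + 1)]
    return (1 - ty)[:, None] * top + ty[:, None] * bot

def augment_wind(u, v, scale=1.0, rng=None):
    """Return perturbed copies of zonal/meridional wind components (m/s)."""
    if rng is None:
        rng = np.random.default_rng()
    return (u + scale * smooth_noise(u.shape, rng=rng),
            v + scale * smooth_noise(v.shape, rng=rng))

# Example: perturb an 8x8 patch of a (here, zero) wind field.
rng = np.random.default_rng(0)
u_aug, v_aug = augment_wind(np.zeros((8, 8)), np.zeros((8, 8)),
                            scale=0.5, rng=rng)
```

Because the noise is spatially smooth, the perturbed winds remain physically plausible at the balloon's scale while differing from any one reanalysis snapshot.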


Fig. 1: Station-keeping with a superpressure balloon.
Fig. 2: Effect of parameters on controller performance.
Fig. 3: Performance profile of the reinforcement-learning controller in simulation.
Fig. 4: Flight characteristics of controllers tested in the Pacific Ocean experiment.

Data availability

The data analysed in this paper are available from the corresponding authors on reasonable request.

Code availability

The code used to train the flight controllers is proprietary. The code used to analyse the generated data is available from the corresponding authors on reasonable request.

References

1. Lally, V. E. Superpressure Balloons for Horizontal Soundings of the Atmosphere. Technical report (National Center for Atmospheric Research, 1967).

2. Anderson, B. D. O. & Moore, J. B. Optimal Control: Linear Quadratic Methods (Prentice-Hall, 1989).

3. Camacho, E. F. & Bordons, C. Model Predictive Control (Springer, 2007).

4. Bellman, R. E. Dynamic Programming (Princeton Univ. Press, 1957).

5. Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction 2nd edn (MIT Press, 2018).

6. Jakobi, N., Husbands, P. & Harvey, I. Noise and the reality gap: the use of simulation in evolutionary robotics. In Proc. European Conf. Artificial Life (eds Moran, F. et al.) 704–720 (Springer, 1995).

7. Tobin, J. et al. Domain randomization and generative models for robotic grasping. In Proc. Intl Conf. Intelligent Robots and Systems 3482–3489 (IEEE, 2018).

8. Levine, S., Kumar, A., Tucker, G. & Fu, J. Offline reinforcement learning: tutorial, review, and perspectives on open problems. Preprint at https://arxiv.org/abs/2005.01643 (2020).

9. Kaelbling, L. P., Littman, M. L. & Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artif. Intell. 101, 99–134 (1998).

10. Tesauro, G. Temporal difference learning and TD-Gammon. Commun. ACM 38, 58–68 (1995).

11. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).

12. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).

13. Lauer, C. J., Montgomery, C. A. & Dietterich, T. G. Managing fragmented fire-threatened landscapes with spatial externalities. For. Sci. 66, 443–456 (2020).

14. Simão, H. P. et al. An approximate dynamic programming algorithm for large-scale fleet management: a case application. Transport. Sci. 43, 178–197 (2009).

15. Mannion, P., Duggan, J. & Howley, E. An experimental review of reinforcement learning algorithms for adaptive traffic signal control. In Autonomic Road Transport Support Systems (eds McCluskey, T. L. et al.) 47–66 (Springer, 2016).

16. Mirhoseini, A. et al. Chip placement with deep reinforcement learning. Preprint at https://arxiv.org/abs/2004.10746 (2020).

17. Nevmyvaka, Y., Feng, Y. & Kearns, M. Reinforcement learning for optimized trade execution. In Proc. Intl Conf. Machine Learning (eds Cohen, W. W. & Moore, A.) 673–680 (ACM, 2006).

18. Pineau, J., Bellemare, M. G., Rush, A. J., Ghizaru, A. & Murphy, S. A. Constructing evidence-based treatment strategies using methods from computer science. Drug Alcohol Depend. 88, S52–S60 (2007).

19. Anderson, R. N., Boulanger, A., Powell, W. B. & Scott, W. Adaptive stochastic control for the smart grid. Proc. IEEE 99, 1098–1115 (2011).

20. Glavic, M., Fonteneau, R. & Ernst, D. Reinforcement learning for electric power system decision and control: past considerations and perspectives. IFAC PapersOnLine 50, 6918–6927 (2017).

21. Theocharous, G., Thomas, P. S. & Ghavamzadeh, M. Personalized ad recommendation systems for life-time value optimization with guarantees. In Proc. Intl Joint Conf. Artificial Intelligence (eds Yang, Q. & Wooldridge, M.) 1806–1812 (AAAI Press, IJCAI, 2015).

22. Ie, E. et al. SlateQ: a tractable decomposition for reinforcement learning with recommendation sets. In Proc. Intl Joint Conf. Artificial Intelligence (ed. Kraus, S.) 2592–2599 (IJCAI, 2019).

23. Ross, S., Gordon, G. & Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proc. 14th Intl Conf. Artificial Intelligence and Statistics (eds Gordon, G. et al.) 627–635 (PMLR, 2011).

24. Tan, J. et al. Sim-to-real: learning agile locomotion for quadruped robots. In Proc. Robotics: Science and Systems XIV (eds Kress-Gazit, H. et al.) 10 (2018).

25. Ng, A. Y., Kim, H. J., Jordan, M. I. & Sastry, S. Autonomous helicopter flight via reinforcement learning. In Advances in Neural Information Processing Systems 16 (NIPS 2003) (eds Saul, L. K. et al.) 799–806 (2004).

26. Abbeel, P., Coates, A., Quigley, M. & Ng, A. Y. An application of reinforcement learning to aerobatic helicopter flight. In Advances in Neural Information Processing Systems 19 (NIPS 2006) (eds Schölkopf, B. et al.) 1–8 (MIT Press, 2007).

27. Reddy, G., Wong-Ng, J., Celani, A., Sejnowski, T. J. & Vergassola, M. Glider soaring via reinforcement learning in the field. Nature 562, 236–239 (2018).

28. Lange, S., Riedmiller, M. & Voigtländer, A. Autonomous reinforcement learning on raw visual input data in a real world application. In Proc. Intl Joint Conf. Neural Networks https://doi.org/10.1109/IJCNN.2012.6252823 (IEEE, 2012).

29. Levine, S., Pastor, P., Krizhevsky, A., Ibarz, J. & Quillen, D. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Int. J. Robot. Res. 37, 421–436 (2018).

30. Kalashnikov, D. et al. Scalable deep reinforcement learning for vision-based robotic manipulation. In Proc. Conf. Robot Learning Vol. 87 (eds Billard, A. et al.) 651–673 (PMLR, 2018).

31. Andrychowicz, O. M. et al. Learning dexterous in-hand manipulation. Int. J. Robot. Res. 39, 3–20 (2020).

32. Zhang, C. Madden–Julian Oscillation. Rev. Geophys. 43, RG2003 (2005).

33. Domeisen, D. I., Garfinkel, C. I. & Butler, A. H. The teleconnection of El Niño Southern Oscillation to the stratosphere. Rev. Geophys. 57, 5–47 (2018).

34. Baldwin, M. et al. The quasi-biennial oscillation. Rev. Geophys. 39, 179–229 (2001).

35. Friedrich, L. S. et al. A comparison of Loon balloon observations and stratospheric reanalysis products. Atmos. Chem. Phys. 17, 855–866 (2017).

36. Coy, L., Schoeberl, M. R., Pawson, S., Candido, S. & Carver, R. W. Global assimilation of Loon stratospheric balloon observations. J. Geophys. Res. D 124, 3005–3019 (2019).

37. Rasmussen, C. E. & Williams, C. K. I. Gaussian Processes for Machine Learning (MIT Press, 2006).

38. Sondik, E. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford Univ. (1971).

39. Hersbach, H. et al. The ERA5 global reanalysis. Q. J. R. Meteorol. Soc. 146, 1999–2049 (2020).

40. Perlin, K. An image synthesizer. Comput. Graph. 19, 287–296 (1985).

41. Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The Arcade Learning Environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013).

42. Kolter, J. Z. & Ng, A. Y. Policy search via the signed derivative. In Proc. Robotics: Science and Systems V (eds Trinkle, J. et al.) 27 (MIT Press, 2009).

43. Levine, S. & Koltun, V. Guided policy search. In Proc. Intl Conf. Machine Learning Vol. 28-3 (eds Dasgupta, S. & McAllester, D.) 1–9 (ICML, 2013).

44. Lin, L. Self-improving reactive agents based on reinforcement learning, planning and teaching. Mach. Learn. 8, 293–321 (1992).

45. Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In Proc. 27th Intl Conf. Machine Learning (ed. Fürnkranz, J.) 807–814 (ICML, 2010).

46. Dabney, W., Rowland, M., Bellemare, M. G. & Munos, R. Distributional reinforcement learning with quantile regression. In Proc. AAAI Conf. Artificial Intelligence 2892–2901 (AAAI Press, 2018).

47. Mnih, V. et al. Asynchronous methods for deep reinforcement learning. In Proc. Intl Conf. Machine Learning Vol. 48 (eds Balcan, M.-F. & Weinberger, K. Q.) 1928–1937 (ICML, 2016).

48. Munos, R. From bandits to Monte-Carlo tree search: the optimistic principle applied to optimization and planning. Found. Trends Mach. Learn. 7, 1–129 (2014).

49. Gibson, J. J. The Ecological Approach to Visual Perception (Taylor & Francis, 1979).

50. Brooks, R. Elephants don’t play chess. Robot. Auton. Syst. 6, 3–15 (1990).

51. Alexander, M., Grimsdell, A., Stephan, C. & Hoffmann, L. MJO-related intraseasonal variation in the stratosphere: gravity waves and zonal winds. J. Geophys. Res. D 123, 775–788 (2018).

52. Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, Cambridge Univ. (1989).

53. Castro, P. S., Moitra, S., Gelada, C., Kumar, S. & Bellemare, M. G. Dopamine: a research framework for deep reinforcement learning. Preprint at https://arxiv.org/abs/1812.06110 (2018).

54. Bellemare, M. G., Dabney, W. & Munos, R. A distributional perspective on reinforcement learning. In Proc. Intl Conf. Machine Learning Vol. 70 (eds Precup, D. & Teh, Y. W.) 449–458 (PMLR, 2017).

55. Kingma, D. & Ba, J. Adam: a method for stochastic optimization. In Proc. Intl Conf. Learning Representations (eds Bengio, Y. & LeCun, Y.) (2015).

56. Golovin, D. et al. Google Vizier: a service for black-box optimization. In Proc. ACM SIGKDD Intl Conf. Knowledge Discovery and Data Mining (eds Matwin, S. et al.) 1487–1496 (ACM, 2017).


Acknowledgements

We are grateful to J. Davidson for early ideation and prototyping. We thank V. Vanhoucke, K. Choromanski, V. Sindhwani, C. Boutilier, D. Precup, S. Mourad, S. Levine, K. Murphy, A. Faust, H. Larochelle and J. Platt for discussions; M. Bowling, A. Guez, D. Tarlow, J. Drouin, and M. Brenner for feedback on earlier versions of the manuscript; W. Dabney for feedback and help with design; T. Larivee for help with visuals; N. Mainville for project management support; R. Carver for information on weather phenomena; and the Loon operations team.

Author information

Affiliations

Authors

Contributions

S.C. conceptualized the problem. J.G., S.C., S.S.P., P.S.C. and S.M. built the technical infrastructure. S.C., M.G.B., J.G., M.C.M. and P.S.C. developed and tested the algorithm. M.C.M., M.G.B., P.S.C., S.C., S.S.P., J.G. and Z.W. performed experimentation and data analysis. M.G.B. and S.C. managed the project. M.G.B., S.C., M.C.M., P.S.C. and S.S.P. wrote the paper. Authors are listed alphabetically by surname.

Corresponding authors

Correspondence to Marc G. Bellemare or Salvatore Candido.

Ethics declarations

Competing interests

M.G.B., S.C., J.G. and M.C.M. have filed patent applications relating to navigating aerial vehicles using deep reinforcement learning. The remaining authors declare no competing financial interests.

Additional information

Peer review information Nature thanks Scott Osprey and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Flight paths of the reinforcement-learning controller during the Pacific Ocean experiment.

The x and y axes represent longitude and latitude, respectively.

Extended Data Fig. 2 Flight paths of StationSeeker during the Pacific Ocean experiment, 1 of 2.

The x and y axes represent longitude and latitude, respectively.

Extended Data Fig. 3 Flight paths of StationSeeker during the Pacific Ocean experiment, 2 of 2.

The x and y axes represent longitude and latitude, respectively.

Extended Data Fig. 4 TWR50 and power consumption for different parametrizations of StationSeeker’s score function.

Grey points indicate settings chosen uniformly at random from the following ranges: wΔ ∈ [0.4, 0.8] (at close range), k1 ∈ [0.01, 0.15], g_unknown ∈ [0.4, 0.6] and k2 ∈ [0, 0.2]. Each parameter was also varied in isolation (coloured points). Semantically interesting parameter choices are highlighted.
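The grey points amount to a uniform random search over StationSeeker's score-function parameters. A minimal sketch of such a sweep follows; StationSeeker's score function is not public, so `fake_twr50` is a hypothetical placeholder for a simulator rollout that returns TWR50, and the parameter names are transliterations of the symbols in the caption.

```python
import random

# Parameter ranges taken from the figure caption.
RANGES = {
    "w_delta": (0.4, 0.8),
    "k1": (0.01, 0.15),
    "g_unknown": (0.4, 0.6),
    "k2": (0.0, 0.2),
}

def sample_params(rng):
    """Draw one setting uniformly at random from each range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}

def random_search(evaluate, n=100, seed=0):
    """Return the best (params, score) pair over n random settings."""
    rng = random.Random(seed)
    best = None
    for _ in range(n):
        params = sample_params(rng)
        score = evaluate(params)
        if best is None or score > best[1]:
            best = (params, score)
    return best

# Hypothetical objective standing in for a TWR50 evaluation in simulation.
def fake_twr50(params):
    return -abs(params["k1"] - 0.05)

best_params, best_score = random_search(fake_twr50, n=200, seed=1)
```

With enough samples, random search of this kind traces out the same parameter-performance cloud that the grey points show.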

Extended Data Fig. 5 Distribution of returns predicted by the neural network.

Each panel indicates the predicted distributions for a particular state and action. The 51 quantiles output by the network are smoothed using kernel density estimation (σ determined from Scott’s rule with interquartile range scaling). The dashed lines indicate the average of these locations. The states with depicted distributions are from different times (0, 3, 6 and 9 h) into the July 2002 simulation. We use the largest quantile to estimate the return that could be realized in the absence of partial observability.
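The smoothing described in this caption is generic enough to sketch: given the 51 quantile locations for one state-action pair, a Gaussian kernel density estimate with a Scott's-rule bandwidth (scaled by the interquartile range, as the caption states) produces the plotted densities. The quantile values below are synthetic stand-ins, not outputs of the trained network, and the exact bandwidth constants are assumptions.

```python
import numpy as np

def scott_bandwidth(samples):
    """Scott's-rule bandwidth with interquartile-range scaling:
    n^(-1/5) times the smaller of the sample std and IQR/1.34."""
    n = samples.size
    p75, p25 = np.percentile(samples, [75, 25])
    sigma = min(samples.std(ddof=1), (p75 - p25) / 1.34)
    return sigma * n ** (-0.2)

def gaussian_kde(samples, xs, bandwidth):
    """Evaluate a Gaussian kernel density estimate at the points xs."""
    z = (xs[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.sum(axis=1) / (samples.size * bandwidth)

# Synthetic stand-in for the 51 quantile locations of one state-action pair.
quantiles = np.sort(np.random.default_rng(0).normal(10.0, 2.0, size=51))
xs = np.linspace(quantiles.min() - 3.0, quantiles.max() + 3.0, 200)
density = gaussian_kde(quantiles, xs, scott_bandwidth(quantiles))
mean_return = quantiles.mean()   # the dashed line in the figure
optimistic = quantiles.max()     # largest quantile, as used in the caption
```

`scipy.stats.gaussian_kde` offers an equivalent Scott's-rule estimator; the explicit version above just makes the bandwidth choice visible.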

Extended Data Fig. 6 Average distances and pairwise distances for perturbations of 12 initial conditions.

a, Distance to station, averaged over 125 perturbations. These numbers highlight how the 1 January to 1 June 2002 simulations (May excluded) were challenging station-keeping conditions. The 1 January configuration, in particular, lacked wind diversity. b, Average distance between pairs of balloons (7,750 pairs). Our controller exhibits greater robustness to challenging conditions.

Extended Data Fig. 7 Scaled response of controllers to wind bearing and magnitude as a function of distance.

We use the derivative of the network’s action-value estimates, or response, as a proxy for the relative weight of an input. The two inputs tested here are the wind bearing and magnitude at the balloon’s altitude; the curves report the derivative for the ‘stay’ action.
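The "response" used here is the partial derivative of an action-value estimate with respect to a single input. For any black-box value function it can be approximated by central finite differences; the toy `q_stay` below is a hypothetical stand-in with made-up coefficients, not the trained network.

```python
import numpy as np

def response(q_fn, x, idx, eps=1e-4):
    """Central-difference estimate of dQ/dx[idx] for an action-value q_fn."""
    xp, xm = x.copy(), x.copy()
    xp[idx] += eps
    xm[idx] -= eps
    return (q_fn(xp) - q_fn(xm)) / (2.0 * eps)

# Toy stand-in: the value of 'stay' falls off with distance to station and
# with the wind component blowing away from it (hypothetical form).
def q_stay(x):
    distance, wind_mag, wind_bearing = x
    return -0.1 * distance - 0.5 * wind_mag * np.cos(wind_bearing)

x = np.array([20.0, 5.0, 0.0])        # distance (km), speed (m/s), bearing (rad)
dq_dmag = response(q_stay, x, idx=1)  # ≈ -0.5 for this toy function
```

Plotting such derivatives against distance, as in the figure, shows how much weight the controller places on each wind input in different regimes.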

Extended Data Table 1 Inputs to the flight controller
Extended Data Table 2 Hyperparameters defining the deep reinforcement-learning algorithm

Supplementary information

Video 1: Simulation of 125 balloons station-keeping in challenging conditions. Simulation of 125 balloons starting from perturbations of a single initial position, using either the learned controller or StationSeeker. The station is denoted by an 'X', and the 50 km range by a dashed line. Unlike StationSeeker, the learned controller is able to remain near the station irrespective of initial conditions, despite a highly challenging wind field. It achieves this by navigating away from the station to avoid strong winds and remain in a relatively calm area, visible from 0:06 into the video.

Video 2: Simulation of 125 balloons station-keeping in easy conditions. Simulation of 125 balloons starting from perturbations of a single initial position, using either the learned controller or StationSeeker. In this relatively easy scenario, the learned controller arrives at the station faster than StationSeeker and regroups more quickly after the second night. Its performance is generally less dependent on the initial condition.

Video 3: Flight #16 from the Pacific Ocean experiment. This video depicts flight #16 (learned controller) over the Pacific Ocean. The right panel depicts the controller's observed wind column, with colour representing uncertainty. StationSeeker's proposed choices are highlighted in that panel. When possible, the learned controller remains stationary by staying at the interface between opposing wind sheets. Its station-keeping patterns make use of the full 50 km range, providing it with additional energy late into the night.

Peer Review File


About this article


Cite this article

Bellemare, M.G., Candido, S., Castro, P.S. et al. Autonomous navigation of stratospheric balloons using reinforcement learning. Nature 588, 77–82 (2020). https://doi.org/10.1038/s41586-020-2939-8
