  • Perspective

Navigating the development challenges in creating complex data systems

Abstract

Data science systems (DSSs) are a fundamental tool in many areas of research and are now being developed by people with a myriad of backgrounds. This is coupled with a crisis in the reproducibility of such DSSs, despite the wide availability of powerful tools for data science and machine learning over the past decade. We believe that perverse incentives and a lack of widespread software engineering skills are among the many causes of this crisis, and we analyse why software engineering and building large complex systems are, in general, hard. Based on these insights, we identify how software engineering addresses those difficulties and how one might apply and generalize software engineering methods to make DSSs more fit for purpose. We advocate two key development philosophies: one should incrementally grow DSSs rather than plan and then build them, and one should use two types of feedback loop during development, one that tests the code’s correctness and another that evaluates the code’s efficacy.
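The two feedback loops named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the article: the functions `standardize`, `check_correctness` and `evaluate_efficacy`, and the choice of accuracy as the efficacy measure, are hypothetical and chosen only for illustration.

```python
from statistics import mean


def standardize(values):
    """Scale values to zero mean and unit (population) variance."""
    mu = mean(values)
    sigma = (sum((v - mu) ** 2 for v in values) / len(values)) ** 0.5
    if sigma == 0:
        raise ValueError("cannot standardize constant data")
    return [(v - mu) / sigma for v in values]


def check_correctness():
    """Feedback loop 1 (assumed form): does the code do what was intended?"""
    out = standardize([1.0, 2.0, 3.0, 4.0])
    assert abs(mean(out)) < 1e-9                              # zero mean
    assert abs(sum(v * v for v in out) / len(out) - 1.0) < 1e-9  # unit variance


def evaluate_efficacy(predict, examples):
    """Feedback loop 2 (assumed form): how well does the system serve its purpose?

    Returns the fraction of labelled (input, label) pairs predicted correctly.
    """
    hits = sum(predict(x) == y for x, y in examples)
    return hits / len(examples)


if __name__ == "__main__":
    check_correctness()
    # Hypothetical held-out examples for a toy threshold classifier.
    held_out = [(-1.0, False), (0.5, True), (2.0, True), (-0.2, True)]
    print(evaluate_efficacy(lambda x: x > 0, held_out))  # prints 0.75
```

In this sketch the first loop can pass while the second reports poor performance: code can be correct without being efficacious, which is the distinction the abstract draws.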

Fig. 1: Consequences of a code that is or is not correct and is or is not efficacious.
Fig. 2: Visualization of bad and good software architectures.
Fig. 3: Visualization of Agile development.
Fig. 4: Illustration of how the usefulness of feedback loops depends on their alignment θ and cycle time t.
Fig. 5: When growing a DSS, you must be able to support the cherry on top as early as possible.

References

  1. Haibe-Kains, B. et al. Transparency and reproducibility in artificial intelligence. Nature 586, E14–E16 (2020).

  2. Pineau, J. et al. Improving reproducibility in machine learning research: a report from the NeurIPS 2019 reproducibility program. J. Mach. Learn. Res. 22, 7459–7478 (2021).

  3. Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016).

  4. Karpathy, A. A Recipe for Training Neural Networks; https://karpathy.github.io/2019/04/25/recipe/ (2019).

  5. Aboumatar, H. & Wise, R. A. Notice of retraction. Aboumatar et al. Effect of a program combining transitional care and long-term self-management support on outcomes of hospitalized patients with chronic obstructive pulmonary disease: a randomized clinical trial. JAMA. 2018;320(22):2335–2343. JAMA 322, 1417–1418 (2019).

  6. Bhandari Neupane, J. et al. Characterization of leptazolines A-D, polar oxazolines from the cyanobacterium Leptolyngbya sp., reveals a glitch with the ‘Willoughby-Hoye’ scripts for calculating NMR chemical shifts. Org. Lett. 21, 8449–8453 (2019).

  7. Gall, J. General Systemantics (General Systemantics Press, 1975).

  8. Brabban, P., Case, S., Cutts, S., Diniz, C. & Crawford, L. Data Pipeline Playbook; https://data-pipeline.playbook.ee/ (2021).

  9. Roberts, M. et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nat. Mach. Intell. 3, 199–217 (2021).

  10. Parnas, D. L. On the criteria to be used in decomposing systems into modules. Commun. ACM 15, 1053–1058 (1972).

  11. Sutherland, J. & Sutherland, J. V. Scrum: The Art of Doing Twice the Work in Half the Time (Currency, 2014).

  12. Fowler, M., Highsmith, J. et al. The Agile manifesto. Software Dev. 9, 28–35 (2001).

  13. Farley, D. Modern Software Engineering: Doing What Works to Build Better Software Faster (Addison-Wesley, 2021).

  14. Bass, L., Clements, P. & Kazman, R. Software Architecture in Practice (Addison-Wesley, 2003).

  15. Reddy, V. S. The SpaceX effect. New Space 6, 125–134 (2018).

  16. Vance, A. & Sanders, F. Elon Musk (Harper Collins, 2015).

  17. Smith, R. J. Shuttle problems compromise space program: with the shuttle earth-bound, political troubles and cost overruns take off. Science 206, 910–914 (1979).

  18. Perkel, J. M. How to fix your scientific coding errors. Nature 602, 172–173 (2022).

  19. Lakshmanan, V., Robinson, S. & Munn, M. Machine Learning Design Patterns (O’Reilly Media, 2020).

  20. Krekel, H. et al. Pytest x.y; https://github.com/pytest-dev/pytest (2004).

  21. MacIver, D. R. Hypothesis x.y.; https://github.com/HypothesisWorks/hypothesis-python (2016).

  22. Baumgartner, P. Ways I Use Testing as a Data Scientist; https://www.peterbaumgartner.com/blog/testing-for-data-science/ (2021).

  23. Bantilan, N. pandera: statistical data validation of pandas dataframes. In Proc. 19th Python in Science Conference (eds Agarwal, M. et al.) 116–124 (2020).

  24. Goodhart, C. A. in Monetary Theory and Practice 91–121 (Springer, 1984).

  25. Hoskin, K. in Accountability: Power, Ethos and the Technologies of Managing (eds Munro, R. & Mouritsen, J.) 265 (Cengage Learning EMEA, 1996).

  26. Muller, J. Z. The Tyranny of Metrics (Princeton Univ. Press, 2019).

  27. The Turing Way Community. The Turing Way: A Handbook for Reproducible, Ethical and Collaborative Research 1.0.1 (Alan Turing Institute, 2021).

  28. Watts, D. J. & Strogatz, S. H. Collective dynamics of ‘small-world’ networks. Nature 393, 440–442 (1998).

  29. Valverde, S. & Solé, R. V. Hierarchical small worlds in software architecture. Preprint at https://arxiv.org/abs/cond-mat/0307278 (2003).

Acknowledgements

We are grateful to the EU/EFPIA Innovative Medicines Initiative project DRAGON (101005122; S.D. and M.R., AIX-COVNET, C.-B.S.), Trinity Challenge BloodCounts! project (M.R., J.G. and C.-B.S.), EPSRC Cambridge Mathematics of Information in Healthcare Hub EP/T017961/1 (M.R., J.H.F.R., J.A.D.A. and C.-B.S.), Cantab Capital Institute for the Mathematics of Information (C.-B.S.), the European Research Council for Horizon 2020 grant no. 777826 (C.-B.S.), the Alan Turing Institute (C.-B.S.), the Wellcome Trust (J.H.F.R.), Cancer Research UK Cambridge Centre (C9685/A25177; C.-B.S.), the British Heart Foundation (J.H.F.R.), NIHR Cambridge Biomedical Research Centre (J.H.F.R.), HEFCE (J.H.F.R.), Leverhulme Trust project on ‘Breaking the non-convexity barrier’ (C.-B.S.), the Philip Leverhulme Prize (C.-B.S.), EPSRC grants EP/S026045/1 and EP/T003553/1 (C.-B.S.) and the Wellcome Innovator Award RG98755 (C.-B.S.). We are also grateful to Intel for financial support, I. Selby for creative input, and J.-C. Lohmann, S. Griffith, J. Tang and F. Zhang for comments and discussions.

Author information

Corresponding authors

Correspondence to Sören Dittmer or Michael Roberts.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Ben MacArthur and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dittmer, S., Roberts, M., Gilbey, J. et al. Navigating the development challenges in creating complex data systems. Nat Mach Intell 5, 681–686 (2023). https://doi.org/10.1038/s42256-023-00665-x

  • DOI: https://doi.org/10.1038/s42256-023-00665-x
