As scientists avidly use, tinker with and build artificial intelligence tools, best practices are beginning to emerge.
The human-like language skills of the chatbot Eliza, presented in 1966 and modeled to respond as an empathic therapist might, left users eager for more interaction. Many judges in a 2014 Royal Society of London competition found that Eugene Goostman, a chatbot developed by Russian and Ukrainian programmers, had passed the so-called Turing test, with responses that were indistinguishable from human ones. Today’s large language models (LLMs) impress further with their human-like language capabilities1,2,3,4.
As researchers explore integrating LLMs into their science, they see how much more artificial intelligence (AI) needs to ‘learn’ about science and truth-finding. LLM use in teaching and research needs flexible rules. According to Stanford University’s Generative AI Policy Guidance, individual course instructors “are free to set their own policies” for the use of generative AI tools such as ChatGPT, Bard, DALL-E and Stable Diffusion. Instructors have the same freedom to allow or disallow such tools at the University of Oxford. According to Oxford’s guidance, the university supports students and staff in their desire “to become AI-literate.” AI supports learning “as long as that use is ethical and appropriate.”
Sandra Wachter, a social scientist and lawyer at the Oxford Internet Institute, says she wants her students to develop critical thinking skills and does not allow them to use AI tools. At the University of Colorado Anschutz Medical Campus, the AI guidelines are also flexible, says Jennifer Richer, a breast cancer researcher at Anschutz who is also dean of the graduate school. The campus licenses and recommends Microsoft Copilot AI, but highly protected data, such as personal health information, are never entered into these systems. “It seems that the guidelines are working well and that programs appreciate the ability to make specific rules for individual classes,” she says. Graduate students are learning how to use AI in helpful ways with proper attribution.
Educating oneself about AI extends a concept he has always held dear, says Sethuraman Panchanathan, the director of the US National Science Foundation (NSF). “I’ve always looked at this as literacy,” he says. Earlier in his career, at Arizona State University, he remembers saying that every student at the university needed informatics literacy. “And now that matured to data science literacy, now AI literacy,” he says. With AI advances comes a need for upskilling and reskilling at all educational levels and throughout one’s career. “The only constant is change,” he says. “And lifelong learning is an important imperative.”
With AI tools, data preparation can be a bear of a task, says Panchanathan. But if investing time and energy to prepare and clean data builds a system that enables “amazing connections, new revelations that you might not have gleaned, then it is worth it.” Eventually, AI tools will help with data cleaning and organizing. “We will get there; there is always a process to go from here today,” he says.
Use cases
Putting in place verifiable, explainable AI is important, says Panchanathan. “Even if it is a black box, people want to understand: how does this black box work at a macro level?” Science thrives in an open, transparent world. Shortly after the Biden Administration’s executive order on AI, the NSF and other collaborating government agencies launched the National Artificial Intelligence Research Resource (NAIRR) Pilot to build a shared research infrastructure in AI.
One NAIRR-funded project is from University of Missouri computer scientist Dong Xu. He and his team develop deep learning algorithms and software for single-cell data analysis and protein-sequence-based predictions. Along with plant scientists, his team is building, validating and deploying a foundation model tuned specifically for plant biology to address challenges in plant-focused single-cell data analysis. Such models have been developed to analyze human single-cell data, but no similar model exists for plants, despite how much single-cell data there is, he says. “We believe this presents a unique opportunity.”
Part of the funding is 10,000 computing hours at the University of Illinois Urbana-Champaign’s National Center for Supercomputing Applications Delta supercomputer, which has many GPU computing options and is configured for memory-intensive applications. AI is becoming ubiquitous across fields, says Xu, but settling AI into biology means more than using AI for visually appealing graphs or reasonable dialog. In science, he says, “the accuracy of AI predictions must closely align with scientific observations, ensuring reliability and reducing hallucinations.”
In such projects, collaboration is critical, he says, especially because biology is inherently complex and involves sophisticated mechanisms. Domain knowledge is needed, as is “a deep integration between AI and biological expertise to make real progress.”
One AI application he would like to explore is AI-powered instantaneous translation, says Wolfgang Huber, a researcher at the European Molecular Biology Laboratory (EMBL) who also teaches in a bilingual bioinformatics summer school for Ukrainian scientists. At EMBL, he co-organized a meeting on AI and biology and saw excitement about AI applications firsthand. In genomics, scientists see promise in such tools for learning more about enhancers or splicing processes. Among researchers setting up single-cell atlases, he sees that many want to build foundation models that contain all known cell types. With such a system, researchers hope to explore structures in these cells and establish a “grammar of cell types.”
When using, adapting or refining AI tools, says Huber, one should keep in mind that they are trained on material from the past and that they make mistakes. Cross-checking results with multiple LLMs would only help if the mistakes were uncorrelated. But statistically speaking, that’s not the case. “There may be just inherent bias,” he says.
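Huber’s point about correlated mistakes can be made concrete with a small simulation, a hypothetical sketch using NumPy in which the error rates and the shared failure mode are invented purely for illustration: when two models tend to fail on the same questions, their agreement says little about whether an answer is correct.

```python
# Sketch: why cross-checking two LLMs only helps when their errors are uncorrelated.
# All rates below are illustrative assumptions, not measurements, and a question
# is treated as simply right or wrong (two wrong answers count as agreement).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000          # hypothetical questions
p_err = 0.2          # each model's individual error rate

# Case 1: independent errors
err_a = rng.random(n) < p_err
err_b = rng.random(n) < p_err

# Case 2: correlated errors -- both models trained on similar material,
# so they share a common failure mode plus a small independent one
shared = rng.random(n) < p_err
err_a_c = shared | (rng.random(n) < 0.02)
err_b_c = shared | (rng.random(n) < 0.02)

def wrong_when_agreeing(e1, e2):
    """Fraction of agreeing answers that are nonetheless wrong."""
    agree = (e1 == e2)
    return (e1 & e2).sum() / agree.sum()

print("independent errors:", wrong_when_agreeing(err_a, err_b))      # ~0.06
print("correlated errors: ", wrong_when_agreeing(err_a_c, err_b_c))  # ~0.2
```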
He finds AI tools very useful for retrieving information, for finding interesting patterns in data and for generating hypotheses. What then must follow is results validation, just as is the case more generally in ’omics and large-scale informatics analysis. To establish mechanism and causality takes experimental follow-up.
As Huber and colleagues work on how to make AI compatible with the scientific method, they know an AI tool’s results must be reproducible and understandable. In some respects, AI might be akin to less well understood areas of science such as quantum mechanics and small particle behavior. Those are where complex math is applied to behavior that is “alien to our human senses and experience.” Even so, he says, “I think the standards for AI in science will be different than, let’s say, for AI in movie recommendations.”
Wishes and wants
After leading and publishing a survey about chatbots in science, Jeremy Y. Ng keeps hearing how keen scientists are to learn how to use AI tools effectively and to educate themselves about them. Ng is a postdoctoral fellow and health research methodologist at Ottawa Hospital Research Institute’s Centre for Journalology and is in David Moher’s research group. Among other projects, the group assesses research quality and transparency, how AI affects science and how AI chatbots are being integrated into the scientific process. Ng sees an urgent need to offer guidance, training and policies for the use of AI chatbots in research. “Unfortunately, as our research highlighted, there is a notable lack of comprehensive resources and structured guidelines in this area,” he says.
Among the 60,000 people in biomedical research he queried, Ng had over 2,000 respondents from 95 countries across all career stages5. Many said they are familiar with AI chatbots, mainly ChatGPT, and find them useful for research purposes, in literature searches and for writing and editing manuscripts.
Nearly 70% of respondents said their institution offers no training on using AI tools, while 10% said their institution has policies about the use of AI chatbots in the scientific process. Eighty percent of respondents noted a need for training to use AI chatbots effectively in the scientific process, and nearly 70% expressed interest in training and learning about AI chatbots.
Slightly over half the respondents said AI chatbots make data analysis more efficient and help with handling large datasets. Nearly 70% of respondents disagreed or were unsure about whether AI chatbots could increase the reproducibility and transparency of research. Over 70% saw it as a challenge that chatbots gave biased or skewed data outputs.
Personally, Ng is cautiously optimistic about AI chatbots and the potential they offer to streamline aspects of the research process. He has begun using them to generate research ideas and to summarize texts. When organizing information, “I am mindful of the need for human oversight and critical appraisal, and do not rely on it entirely to complete complex research tasks such as manuscript writing,” he says.
Better resources for training and education about AI are needed, says Florian Jug, a group leader at Human Technopole, a research institute in Milan, Italy. Plentiful are the web resources one can tap into to become, he says, “absolute experts without ever interacting in person with others.” Some researchers who have discovered these resources take an 18-month deep dive on their own. When you emerge, you “can even start having a social life again,” he says. But he hopes for better educational approaches so “you don’t have to figure it all out only based on YouTube movies and trial and error in your basement.”
In the bioimaging community, for instance, he and colleagues, with support from the European Molecular Biology Organization, have launched courses such as ‘Deep learning for microscopy image analysis’ that they hope will benefit students, he says, “and a fraction of this fraction might end up being hooked and slowly develop from users to ML contributors themselves.”
Model building
Foundation models at the core of LLMs are defined in different ways, says Jug. Generally, their architecture involves machine learning with neural networks and massive amounts of data scraped from the web and other sources. As a neural network is trained on data, it can extract patterns and capture those as weighted parameters of the network’s many connections. GPT-3 has around 175 billion such internal parameters.
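To make concrete what ‘weighted parameters of the network’s many connections’ means, here is a hypothetical PyTorch sketch, not taken from the article and with arbitrary layer sizes, that builds a tiny network and counts its trainable parameters; GPT-3’s roughly 175 billion parameters are tallied the same way, just at vastly larger scale.

```python
# Sketch: parameters are the learned weights on a network's connections.
# The layer sizes below are arbitrary, chosen only for illustration.
import torch.nn as nn

tiny_net = nn.Sequential(
    nn.Linear(128, 256),   # 128*256 weights + 256 biases
    nn.ReLU(),
    nn.Linear(256, 10),    # 256*10 weights + 10 biases
)

n_params = sum(p.numel() for p in tiny_net.parameters())
print(f"{n_params:,} trainable parameters")  # 35,594 here; ~175 billion for GPT-3
```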
The LLM GPT, which stands for Generative Pretrained Transformer, was developed by OpenAI and can be seen as the foundation model of ChatGPT. GPT and other LLMs are based on a computationally muscular transformer architecture. The final model can assess each word in the context of all other words. GPT was pretrained over months and with thousands of hours of GPU computing on internet-scale datasets, as computer scientist Andrej Karpathy, a founding member of OpenAI who has since left the company and who has just started an AI education platform called Eureka Labs, said in a talk about GPT and LLMs at Microsoft Build 2023. In pretraining, words are converted to numerical tokens, which for GPT-3 meant around 300 billion tokens.
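The tokenization step can be illustrated with the open-source tiktoken library, an assumption made for this sketch; any byte-pair-encoding tokenizer would serve equally well to convert text into the integer tokens a GPT-style model is trained on.

```python
# Sketch: how text becomes the numerical tokens a GPT-style model is trained on.
# Requires the open-source tiktoken package (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("gpt2")          # byte-pair encoding used by GPT-2
tokens = enc.encode("Single-cell data analysis with foundation models")
print(tokens)                                 # list of integer token IDs
print([enc.decode([t]) for t in tokens])      # the text fragment behind each ID
```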
After the computational heavy lifting of pretraining, steps such as supervised fine-tuning, reward modeling and reinforcement learning follow. The model’s training renders it able to deliver generalizable output, which can be not only text but also computer code, itself a text-like object. On GitHub, Karpathy has posted the steps for building GPT-2 from scratch and has a link to an accompanying video. This LLM is trained on documents scraped from the web and is not yet a chatbot.
Working with AI, and generating benefits from it, is much more than showing a model lots of data and then sitting back and watching it do its job, Arati Prabhakar, director of the White House Office of Science and Technology Policy, said on a call with reporters about ‘AI aspirations: R&D for Public Missions’. The event convened participants from government agencies such as the Advanced Research Projects Agency for Health, the National Oceanic and Atmospheric Administration and others to discuss projects that, for instance, harness AI in weather prediction, electrical grid management, materials research and drug development.
When a model has been built, says Jug, it can be used on new data or fine-tuned to handle a task it was not trained for. For instance, in imaging, one can turn a large, capable model used for deconvolution techniques in microscopy into a model for denoising. To do so, one reuses some of the ‘learned’ features. Such transfer learning doesn’t always work, but it’s great when it does, he says.
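The kind of transfer learning Jug describes, keeping ‘learned’ features and retraining only what the new task needs, might look roughly like the following PyTorch sketch; the backbone, head and data here are hypothetical stand-ins, not the actual deconvolution or denoising models he refers to.

```python
# Sketch of transfer learning: keep the early, general-purpose feature layers
# of a pretrained image model and retrain only a new task-specific head.
# The architecture and data are hypothetical stand-ins for illustration.
import torch
import torch.nn as nn

pretrained_backbone = nn.Sequential(          # imagine this was trained for deconvolution
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
)
for p in pretrained_backbone.parameters():    # freeze the 'learned' features
    p.requires_grad = False

denoising_head = nn.Conv2d(32, 1, 3, padding=1)     # new head for the new task
model = nn.Sequential(pretrained_backbone, denoising_head)

optimizer = torch.optim.Adam(denoising_head.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

noisy = torch.randn(4, 1, 64, 64)             # placeholder training batch
clean = torch.randn(4, 1, 64, 64)
loss = loss_fn(model(noisy), clean)           # train only the new head
loss.backward()
optimizer.step()
```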
Fine-tuning a large model can be computationally intensive and can take a long time. “Maybe the slight improvement is worth the extra effort,” he says. This will likely vary from one context to the next.
Foundation models tend to involve giant transformers, but not necessarily so, says Jug. Cellpose6, for instance, has a general-purpose model that can be applied across the domain of fluorescent cell segmentation. Such models can be fine-tuned on new data and the system can apply what was ‘learned’ about segmenting to various types of cellular objects.
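Applying Cellpose’s general-purpose model to a new image might look like the sketch below; the calls follow the cellpose Python package, though argument names can differ between versions, and the image file name is a placeholder.

```python
# Sketch: segmenting cells with Cellpose's general-purpose ('cyto') model.
# API follows the cellpose package; arguments may vary between versions,
# and 'my_image.tif' is a placeholder file name.
from cellpose import models, io

img = io.imread("my_image.tif")
model = models.Cellpose(gpu=False, model_type="cyto")

# channels=[0, 0]: treat the image as grayscale, with no separate nuclear channel
masks, flows, styles, diams = model.eval(img, diameter=None, channels=[0, 0])
print(int(masks.max()), "cells segmented")
```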
The Segment Anything Model7, or SAM, developed by Meta, can be used to ‘cut’ objects out of an image. Segment Anything for Microscopy8 is SAM fine-tuned for bioimage data. Developed by a team at Georg August University of Göttingen in Germany and colleagues at other universities, it applies segmentation to light microscopy and electron microscopy. “Is this super-useful for us?” asks Jug. “It can be.” It will be if its results are better than ones generated with regular neural networks trained for a few hours or days on a single GPU. What’s key, he says, is that a system trained only on huge amounts of light microscopy data will not deliver useful results about electron microscopy data.
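For comparison, here is a hedged sketch of running Meta’s Segment Anything automatic mask generator through its segment-anything package; the checkpoint path and input image are placeholders, and Segment Anything for Microscopy layers its own micro_sam tooling and microscopy-tuned weights on top of this kind of model.

```python
# Sketch: automatic mask generation with Meta's segment-anything package.
# The checkpoint path is a placeholder and the image is a stand-in array;
# Segment Anything for Microscopy supplies microscopy-tuned weights via micro_sam.
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)   # stand-in for an RGB microscopy image
masks = mask_generator.generate(image)            # list of dicts with 'segmentation', 'area', ...
print(len(masks), "objects proposed")
```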
Speaking more generally, says Jug, as a global community, “We must ask ourselves how we can enable a ChatGPT moment for a biomedical foundation model.” Once transformers became powerful enough to capture syntactic and semantic language structure, he says, they became able to ingest many types of information, from books about raising children to political manifestos, math textbooks, lullabies and even computer code. The extracted massive “latent space” combines all this information in actionable ways and offers answers that can be sampled and conditioned with a textual prompt. “How wonderful!” he says. What’s now missing is something just as universal for scientific data.
“If I can wish for something,” he says, it’s that generating useful findable, accessible, interoperable and reusable (FAIR) data — and he emphasizes the reusable data aspect — can be seen as an important scientific contribution. That would put scientists in a position to build a career focused on data generation. This commitment to FAIR data formats and open repositories will let other researchers find the data they need for their AI training tasks. “I think the key challenge for us, the scientific community, is to find ways to incentivize data generation, curation and FAIR data storage.” That will be “to the benefit of us all, as a community.”
Truth and ground truth
LLMs are designed to be convincing and helpful to users but are not designed to tell the truth and will stray from it or hallucinate, Wachter notes9 with co-authors Brent Mittelstadt and Chris Russell. The complexity of truth has long been studied in philosophy, as has the development and justification of truth in human discourse. With LLMs, the concept of truth has been simplified and is often equated with the accuracy measured against the training data’s ‘ground truth.’ But training data are not usually checked for truthfulness.
On this subject, Wachter points to work by the philosopher Harry Frankfurt10. He has noted that a “bullshitter” isn’t someone who distorts the truth on purpose but who is disconnected from a concern for truth. Bullshit, writes Frankfurt, “is a greater enemy of the truth than lies are.” Wachter wants trainees to have, she says, “a very good bullshit detector.”
As she makes decisions about when and when not to use LLMs, she uses a mental model of how the technology works. LLMs can’t deliver truth about a question that lacks an answer. One can’t treat an LLM like the oracle at Delphi or a Magic 8 Ball, says Wachter. “It’s more like a sloppy, unreliable research assistant,” she says. LLMs are eager to help and deliver an answer, but the answer is not a reliably true one. Because this assistant is inherently unreliable, the result must be fact-checked. Applying that mental model, she says, eases decisions about which tasks and applications are best suited for an LLM and which are not.
The language used in relation to LLMs is fraught, she says. The term hallucination is generally associated with a living person, not a machine. The word assumes consciousness, which drives intent and motive and indicates a form of cognitive ability. There’s a “problem with anthropomorphizing everything,” says Wachter. It’s human to project oneself, in this case, onto machines.
LLMs have gone through the process of reinforcement learning, during which human feedback helps to rank the system’s output and assess whether it is helpful or harmful; that assessment includes gauging whether responses will keep a user engaged. LLMs are technology designed to sound both convincing and human-like, which connects to the human tendency to anthropomorphize tools. This all takes place in an environment of claims that AI will soon outperform humans. It’s no wonder, she says, that we are enticed and tempted to believe that LLM output is truthful.
Training an LLM on material from the internet makes its output neither trustworthy nor reliable, and responses can be toxic, she says. Researchers fine-tuning a specialist model with an LLM will strive to build a reliable system. To keep improving the model takes time, resources and money. Yet “the business of tech companies is not slow science,” she says. “It’s fast money.”
ChatGPT recommendations can be wildly false. Some widely publicized chatbot recommendations, she says, include using glue to stop cheese from slipping off pizza or applying for a license to sell meat made from human flesh. Most LLM users will identify such answers as wrong, says Wachter, but, “I worry about the subtle hallucinations.” A number might be wrong, or a word, or another facet that users might not immediately catch. A nuance can be overlooked “because it’s not outrageously wrong.”
This means that scientists must check the accuracy of AI tool output, says Wachter, which can make some think that babysitting an algorithm is not why they chose a science career. AI might not be the method of choice when a researcher is grappling with and thinking through a problem; exploring avenues, including rabbit holes; or refuting or re-examining something, all of which are human tasks. To innovate, she says, takes breaking with the past. Such tasks are not well suited to machine learning, given that it has been trained on the past.
Says Richer, “the Graduate School logic is that on a campus focused on biomedical research, we should not attempt to stop technology, but rather to use it in responsible and helpful ways.” Harnessing AI usefully in research, she says, opens up time for scientists to spend on creativity and innovation. Critical thinking is, in Wachter’s view, such a crucial skill in the scientific process. “That isn’t something that we should be outsourcing.”
References
Birhane, A., Kasirzadeh, A., Leslie, D. & Wachter, S. Nat. Rev. Phys. 5, 277–280 (2023).
Mittelstadt, B., Wachter, S. & Russell, C. Nat. Hum. Behav. 7, 1830–1832 (2023).
Messeri, L. & Crockett, M. J. Nature https://doi.org/10.1038/s41586-024-07146-0 (2024).
Kirk, H. R., Vidgen, B., Röttger, P. & Hale, S. Nat. Mach. Intell. 6, 383–392 (2024).
Ng, J. Y. et al. Preprint at medRxiv https://doi.org/10.1101/2024.02.27.24303462 (2024).
Stringer, C., Wang, T., Michaelos, M. & Pachitariu, M. Nat. Methods 18, 101–106 (2021).
Kirillov, A. et al. Preprint at arXiv https://doi.org/10.48550/arXiv.2304.02643 (2023).
Archit, A. et al. Preprint at bioRxiv https://doi.org/10.1101/2023.08.21.554208 (2023).
Wachter, S., Mittelstadt, B. & Russell, C. R. Soc. Open Sci. https://doi.org/10.2139/ssrn.4771884 (2024).
Frankfurt, H. G. On Bullshit (Princeton Univ. Press, 2005).