Introduction

Traditional Chinese medicine (TCM) is an integral part of the Chinese cultural heritage, with a history of five thousand years and rich clinical experience, and is an important means of treating many diseases1,2. With the development of modern scientific technology, an increasing number of studies have confirmed the pharmacological effects and therapeutic efficacy of Chinese herbs, making TCM a research hotspot worldwide3,4,5. The World Health Organization (WHO) emphasized TCM as a popular and effective complementary and alternative medicine for the prevention and treatment of many ailments in the influential global medical compendium6. For example, studies have shown that Lianhua Qingwen is more effective than modern therapies in the treatment of COVID-197,8. As an important component of TCM treatment, Chinese patent medicine instructions (CPMI) have significant implications for enhancing the clinical application value of Chinese herbs, standardizing the use of Chinese medicine, and ensuring patient safety.

Currently, research on pre-trained language models (PLMs) in the domain of TCM mainly focuses on entity recognition, clinical record classification, and feature extraction9,10,11. Pan et al.12 conducted an in-depth study on electronic medical records (EMRs) in TCM and proposed a named entity recognition (NER) pipeline called the ALBERT-BiLSTM-CRF, which focuses on TCM orthopedic EMRs. The pipeline utilizes the ALBERT as the base model to encode and embed the labeled data, applies BiLSTM to establish comprehensive contextual semantics, and feeds the concatenated vectors into the CRF layer for decoding using the Viterbi algorithm. Experimental results demonstrate that compared to NER models based on BERT, ALBERT-BiLSTM-CRF achieves higher accuracy. To achieve Chinese sentence classification in clinical medical records, Zou et al.13 proposed a domain-adaptive PLM called CEMR-LM. CEMR-LM is pre-trained using a large amount of unlabeled clinical corpus to acquire knowledge in the field of TCM. The model's performance is enhanced by combining fine-tuning strategy and a dual-channel mechanism. Chen et al.14 conducted a study on coronary heart disease and developed a pre-training diagnostic model based on the BERT model trained on TCM texts. They successfully performed a text classification task on coronary heart disease medical records. The performance improvement compared to the model without TCM pre-training was 0.096. In the feature extraction process in the field of TCM, Chen et al.15 combined BERT with a one-dimensional convolutional neural network (1D-CNN) for fine-tuning the pre-trained model. Their model demonstrated a significant improvement in F1 score compared to the traditional 1D-CNN classifier, achieving state-of-the-art performance. Gao et al.16 proposed TCM2Vec, which initializes the relationship features between herbs by constructing two independent encoders. 
One encoder utilizes the unsupervised pre-training model FMh2v based on cross-features, while the other simulates the multi-dimensional features of drugs using a normal distribution. Finally, the relationship and drug features are integrated for deep feature extraction. TCM2Vec serves as an effective method for obtaining feature embeddings of TCM prescriptions, providing crucial insights for the adaptability of artificial intelligence technology in the field of TCM. Overall, research on deep learning in TCM primarily focuses on utilizing structured data and domain knowledge to enhance the automation and analysis efficiency of TCM information processing.

In recent years, researchers have made significant efforts in the field of TCM, paving the way for the development of domain-specific large language models (LLMs) with extensive knowledge in TCM. Wang et al.17 conducted supervised fine-tuning of the LLaMA-based model using over 8000 instruction-based question–answer data to develop HuaTuo, aiming to obtain more reliable medical knowledge in the domain of Chinese medicine. Xu et al.18 employed a dataset comprising 20,000 TCM records to train an auxiliary diagnostic model based on BERT. The objective of their study was to effectively utilize the information from the four diagnostic methods of TCM and provide TCM-based disease diagnosis for patients. In the study by Zhong et al.19, they incorporated semantic features related to TCM acupoints into the corpus and improved BERT through fine-tuning, resulting in a classification model named Bert-Chinese-Acupoint. This model aims to recommend the optimal primary acupoints for treating diseases and diagnose diseases through a classification task. The advantage of domain-specific LLMs over general models lies in their focus and domain expertise. Domain-specific models typically undergo more in-depth learning and understanding of the knowledge within that specific domain, enabling them to provide highly perceptive responses and solutions. This is particularly evident in the domains of TCM consultation, automatic generation of TCM prescriptions, and recommendations for Chinese patent medicine (CPM).

TCM diagnosis is an empirical medical method. TCM practitioners utilize observation, auscultation, inquiry and pulse diagnosis, combined with their personal clinical experience, to comprehensively analyze and diagnose patients20,21. However, during this process, subjective judgments and individual experiences of doctors may introduce potential errors, thereby escalating the risk of medical incidents. The utilization of advanced LLMs can mitigate the influence of subjective factors on medical decision-making, enabling clinicians to make precise, patient-centered determinations22. As a result, this integration can enhance the overall efficacy and quality of healthcare. Furthermore, the reasoning capabilities inherent in LLMs align well with the fundamental principles of TCM's synthesis of the four diagnostic methods, presenting novel avenues and methodologies for TCM research.

In contrast to previous studies, we present a novel approach for constructing LLMs in the field of CPMI. This approach primarily focuses on automatically generating corresponding recommendations for CPM and detailed instructions for usage based on patients' symptoms and complaints. We constructed an instruction dataset using 3,906 labeled consultation records related to CPM treatments. The foundation model, ChatGLM-6B, was trained with parameter-efficient fine-tuning (PEFT) methods, leading to the development of a novel large-scale language model, CPMI-ChatGLM, specific to the domain of TCM. We evaluated CPMI-ChatGLM with a combination of automatic assessment and human evaluation, employing metrics such as BLEU, ROUGE, BARTScore, and SUS score. By constructing a comprehensive dataset and employing PLMs for the analysis and processing of CPMI, we can identify information such as the drug names, ingredients, specifications, usage, and precautions of TCM. The objective of this study is to assist doctors and patients in gaining a better understanding of the efficacy, dosage, and administration of Chinese medicines, thereby enhancing medication safety, alleviating healthcare burdens, and promoting the inheritance and development of Chinese traditional culture. We have released the dataset of CPMI in the GitHub repository https://github.com/liucann/CPMI-ChatGLM.

In summary, our main contributions are as follows:

  • To the best of our knowledge, this is the first comprehensive study of CPMI using LLMs. In this study, we proposed a domain-specific language model named CPMI-ChatGLM, which is the first LLM designed for the CPMI.

  • We performed instruction tuning on the foundation model using a high-quality dataset of CPMI.

  • We constructed and publicly released the first dataset of CPMI, containing 7 medical specialties and 3906 medical records. Our intention is for this dataset to serve as a valuable resource for both research and application in the field of Chinese medicines.

Results

LoRA versus P-Tuning v2

We evaluated two PEFT methods on a server equipped with two RTX 3090Ti (24 GB) GPUs and assessed their impact on the performance of CPMI-ChatGLM. We used the Rouge_chinese23 and NLTK24 toolkits to calculate ROUGE and BLEU scores respectively, and BARTScore calculation was performed using the default parameter configuration. Figure 1 shows the training loss curves of the two PEFT methods, and the models converged at around 3000 steps. The smooth loss curves indicate stable performance improvement and effective training. Furthermore, as shown in Fig. 2, CPMI-ChatGLM fine-tuned using the P-Tuning v2 method outperformed LoRA in all aspects. Table 1 presents the specific numerical comparison results for different fine-tuning methods. The results indicate that P-Tuning v2 outperformed LoRA in terms of F1 scores, with improvements of 33.81% for ROUGE-1, 46.93% for ROUGE-2, 33.53% for ROUGE-L, and 55.09% for BLEU. The BARTScore value of P-Tuning v2 was also significantly better than that of the LoRA method. The size of this gap surprised us.
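To make the F1-based ROUGE scores above concrete, the following is a minimal sketch of ROUGE-1 and ROUGE-L F1 over token lists. It is a simplified illustration, not the Rouge_chinese or NLTK implementation used in the experiments, and the two character-tokenized strings are hypothetical rather than taken from the CPMI dataset.

```python
from collections import Counter


def rouge_1_f1(candidate, reference):
    """Unigram-overlap F1 between two token lists."""
    if not candidate or not reference:
        return 0.0
    # Clipped unigram matches: each token counts at most as often as it
    # appears in both sequences.
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """LCS-based F1 (equal weight on precision and recall)."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy strings, split into characters as a stand-in for
# Chinese word segmentation.
reference = list("口服一次6克一日2次")
candidate = list("口服一次3克一日2次")

print(round(rouge_1_f1(candidate, reference), 3))
print(round(rouge_l_f1(candidate, reference), 3))
```

Production toolkits differ in tokenization, smoothing, and the weighting of precision against recall, so absolute values from this sketch should not be compared with the figures in Table 1.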

Figure 1

The loss curve of the CPMI-ChatGLM training process. The blue line represents LoRA, while the orange line represents P-Tuning v2.

Figure 2

The impact of two fine-tuning methods on the performance of the CPMI-ChatGLM model. It is evident that the P-Tuning v2 fine-tuning method performs better in terms of F1 score. The legend uses blue to represent the LoRA method and orange to represent the P-Tuning v2.

Table 1 The evaluation results of two different PEFT methods.

As a test case, we selected a common complaint, gum swelling and mouth ulcers caused by excessive internal heat, and asked CPMI-ChatGLM to recommend a Chinese patent medicine. The LoRA-tuned model suggested San Huang Tang, while the P-Tuning v2 model recommended Huanglian Shangqing Wan. These recommendations were evaluated by a clinical TCM practitioner.

The TCM practitioner pointed out that San Huang Tang is a traditional Chinese medicine formula, and the version commonly available on the market is San Huang Pian (tablets). San Huang Pian is primarily used to address excessive internal heat in the upper, middle, and lower parts of the body, and it is suitable for symptoms such as gingival swelling, sore throat, and constipation. Huanglian Shangqing Wan, on the other hand, is mainly used to alleviate oral heat-related symptoms, such as mouth ulcers and gingival swelling. Since the patient only mentioned oral manifestations of excessive internal heat and did not indicate such symptoms in other parts of the body, Huanglian Shangqing Wan is considered more suitable in this case.

We also discovered that the manufacturer information provided by LoRA was fabricated. The intention behind including information about the drug manufacturer during our training process was to account for potential variations in specifications and dosages of CPM due to different manufacturers. Patients may rely on the manufacturer information to assist them in determining which specification of the medication to purchase and take. The two recommendations provided by CPMI-ChatGLM, as mentioned above, are listed in Table 2.

Table 2 Comparison of two fine-tuning methods on a real case.

How much data is good enough?

We investigate the influence of dataset size on the performance of CPMI-ChatGLM across five different data scales. These data scales encompass 268 CPM data sourced from the Guidelines (0.3 k), a combined dataset of 694 CPMI data collected and meticulously processed from the Guidelines, Tianchi, and the outpatient department of the Chinese Medicine Hospital (0.6 k), along with 3 k, 6 k, and 10 k data augmented using the Self-chatting method based on the 0.6 k data. The comparison of results from fine-tuning CPMI-ChatGLM with different data scales is shown in Fig. 3.

Figure 3

The line chart of various data scales on the performance of CPMI-ChatGLM. The optimal performance of CPMI-ChatGLM is achieved when the data volume reaches approximately 3 k. BLEU and ROUGE employ the F1 score for calculating the scores.

In general, the performance of a model tends to improve with an increase in data volume25,26,27. This is because larger-scale data provides more information, helping the model better learn features and patterns while reducing the risk of overfitting. Our research findings show that with the inclusion of CPMI data, the performance of CPMI-ChatGLM keeps improving until reaching approximately 3 k data points, as the model learns more data patterns and features. However, as the training data continues to increase, the model starts to overfit the training data, resulting in a decline in performance due to decreased generalization ability on new data.

The answers to the same three problems regarding the recommendation of CPM by different models are provided in the Supplementary Table S1. Specifically, the model that has not undergone fine-tuning, known as the foundation model ChatGLM-6B, exhibits limited capability in recommending CPM. It simply suggests some suitable names of CPM without providing specific instructions on administration methods and precautions. After fine-tuning with small-scale datasets, specifically datasets consisting of 0.3 k and 0.6 k, the CPMI-ChatGLM model's answers mostly adhere to the required format. However, its ability to accurately understand the symptoms and recommend CPM with corresponding efficacy still requires improvement. Conversely, as the dataset reaches a certain scale, specifically 6 k and 10 k, the model's generated unrealistic outcomes gradually increase, resulting in the inclusion of Chinese medicine names that do not exist in reality and even incorrect medication precautions in the answers.

Ablation study of instruction data

By providing specific prompt words, prompt engineering can help the model better understand and generate text, thereby improving its performance and achieving satisfactory results28. Similar to prompts, fine-tuning the model using instruction data can enhance the model's adaptability to specific tasks and scenarios, often leading to more prominent performance outcomes.

In this section, we conducted ablation experiments on the instruction dataset of CPMI. By removing the instruction part from the dataset and fine-tuning the foundation model using the same hyperparameters and the P-Tuning v2 method, we compared the performance of instruction tuning with non-instruction tuning. The experimental results demonstrate that instruction tuning can significantly improve the performance of the CPMI-ChatGLM model, enhancing both the accuracy and coherence of the generated text. The experimental results in Table 3 show that instruction tuning improves the performance of CPMI-ChatGLM by approximately 20%. This also indicates that instruction tuning can help large-scale models better understand the context and intent of the input, resulting in more accurate and expected text generation, thereby improving the quality and effectiveness of the model.
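The difference between the two ablation arms can be sketched as follows. This is only an illustration under assumptions: the field names `instruction`, `input`, and `output`, the helper `build_prompt`, and the instruction wording are hypothetical and are not taken from the released dataset.

```python
def build_prompt(record, use_instruction=True):
    """Return the model input text for one training record.

    With use_instruction=False the instruction part is dropped and only
    the patient's complaint is fed to the model, which mirrors the
    non-instruction arm of the ablation.
    """
    complaint = record["input"]
    if use_instruction:
        return f"{record['instruction']}\n{complaint}"
    return complaint


# Hypothetical record; the output is truncated on purpose.
record = {
    "instruction": "Recommend a Chinese patent medicine and give its usage instructions.",
    "input": "Swollen gums and mouth ulcers caused by excessive internal heat.",
    "output": "Huanglian Shangqing Wan ...",
}

with_instruction = build_prompt(record, use_instruction=True)
without_instruction = build_prompt(record, use_instruction=False)
print(with_instruction)
print(without_instruction)
```

In both arms the target text (`output`) stays the same, so any difference in the metrics can be attributed to the presence of the instruction.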

Table 3 The results of ablation study.

Comparisons to other models

To demonstrate the performance of CPMI-ChatGLM and ensure the fairness of the experiments, we compared it with four widely used models that were pre-trained on Chinese corpora and have a similar number of parameters.

  • Chinese-LLaMA-7B29: This model is an extension of LLaMA-7B that incorporates Chinese vocabulary and continues pre-training with Chinese embeddings, resulting in a Chinese-specific LLaMA model.

  • Chinese-Alpaca-7B29: Building upon the Chinese-LLaMA-7B model, the Chinese-Alpaca-7B was further fine-tuned using an instruction dataset, resulting in an improved Chinese LLaMA model.

  • Qwen-7B30: An open-source model from Alibaba Group Qwen series, with a parameter size of 7 billion. Qwen-7B is a large-scale Transformer-based language model trained on an extensive range of pre-training data, including a diverse collection of web text, professional books, and code.

  • Baichuan-7B31: Developed by Baichuan Intelligent Technology, Baichuan-7B is an open-source, large-scale pre-training model. With 7 billion parameters trained on approximately 1.2 trillion tokens, this transformer-based model supports both Chinese and English. It achieves the best performance among models of the same size on standard Chinese and English benchmarks (C-EVAL32/MMLU33).

The comparative experimental results of the models are presented in Table 4. Among models of comparable scale, our CPMI-ChatGLM achieves the best performance. Regarding the composition of the training corpora, Qwen-7B and Baichuan-7B, which undergo pre-training on a larger volume of Chinese text, outperform Chinese-LLaMA-7B and Chinese-Alpaca-7B, which are based on LLaMA-7B pre-trained using English text. Moreover, Chinese-Alpaca-7B exhibits superior performance compared to Chinese-LLaMA-7B, which lacks prompt fine-tuning.

Table 4 Comparative experimental results of different models.

Human evaluation

Although automatic metrics play a certain role in evaluating the performance of LLMs, human evaluation remains necessary to ensure the model’s performance in terms of safety, validation of professional knowledge, flexibility and adaptability, as well as ethical considerations. For the Chinese medicine QA task, our study introduces the SUS (Safety, Usability, and Smoothness) human evaluation method17. The SUS consists of three dimensions: safety, usability, and smoothness. The “safety” dimension assesses whether the model-generated content has the potential to mislead users and pose a risk to their health. The “usability” dimension evaluates the extent to which the generated content reflects professional knowledge, while the “smoothness” dimension measures the proficiency of the generation model as an LLM. SUS adopts a three-point scoring mechanism, with scores ranging from 1 (unacceptable) to 3 (good), and 2 indicating an acceptable performance. To evaluate the model's performance, we recruited five raters with a background in Chinese medicine to score 20 randomly selected questions regarding CPM recommendations. Table 5 presents the average SUS scores along with their corresponding 95% confidence intervals.
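The averaging step can be sketched as below. The text does not state how the 95% confidence intervals were computed, so this sketch assumes a normal approximation around the sample mean; the ratings are hypothetical values on the 1 to 3 scale, not the study's data.

```python
from statistics import NormalDist, mean, stdev


def mean_with_ci(scores, confidence=0.95):
    """Return (mean, (low, high)) using a normal approximation."""
    n = len(scores)
    m = mean(scores)
    if n < 2:
        # A single rating gives no estimate of spread.
        return m, (m, m)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(scores) / n ** 0.5
    return m, (m - half_width, m + half_width)


# Hypothetical safety ratings (1 = unacceptable, 2 = acceptable, 3 = good).
safety_scores = [3, 3, 2, 3, 2, 3, 3, 2, 3, 3]

avg, (low, high) = mean_with_ci(safety_scores)
print(f"mean={avg:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

With five raters and 20 questions there would be 100 ratings per dimension and model; for small samples a Student's t critical value would give a wider, more conservative interval.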

Table 5 The SUS scores and their corresponding 95% confidence intervals across different models.

Compared to other models, Chinese-LLaMA-7B and Chinese-Alpaca-7B generate content that contains a higher number of English letters and symbols, which significantly impacts usability and smoothness, leading to a diminished user experience and lower usability scores. On the other hand, our CPMI-ChatGLM model markedly enhances the usability of knowledge while ensuring the safety and smoothness of the results.

Model parameter setting and cost

We developed CPMI-ChatGLM by fine-tuning ChatGLM-6B using the PEFT approach. Despite having only 6.2 billion parameters, ChatGLM-6B still performs consistently with human preferences. The detailed hyperparameter information for fine-tuning ChatGLM-6B using the LoRA algorithm and P-Tuning v2, as well as the specific hyperparameter details for other comparative models, are provided below:

For LoRA fine-tuning, we conducted fine-tuning for 5 h on two RTX 3090Ti (24 GB) GPUs. The batch size during training was set to 2, the learning rate for the AdamW34 optimizer was set to 2e−5, the total number of training steps was 6000, and the maximum sequence length was set to 256. No specific settings were applied for warmup and weight decay. Additionally, the rank of the low-rank matrix was set to 8, the scaling factor was set to 32, and the dropout rate was set to 0.1.
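The role of the rank and scaling factor can be illustrated with a small sketch of the standard LoRA update, y = Wx + (alpha / r) · B(Ax), written with plain Python lists. The rank 8 and scaling factor 32 are the values reported above; the 4096-wide square layer and the tiny matrices are hypothetical and serve only to show the arithmetic.

```python
def lora_scaling(rank, alpha):
    """Multiplier applied to the low-rank update."""
    return alpha / rank


def lora_extra_params(d_in, d_out, rank):
    """Trainable parameters added to one weight matrix.

    A has shape (rank x d_in) and B has shape (d_out x rank).
    """
    return rank * (d_in + d_out)


def matvec(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, rank, alpha):
    """Frozen base projection plus the scaled low-rank correction."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = lora_scaling(rank, alpha)
    return [b + scale * d for b, d in zip(base, delta)]


RANK, ALPHA = 8, 32   # values reported in the text
HIDDEN = 4096         # hypothetical layer width, for illustration only

print(lora_scaling(RANK, ALPHA))
print(lora_extra_params(HIDDEN, HIDDEN, RANK), "vs", HIDDEN * HIDDEN)

# Tiny rank-1 example: with B initialised to zero, the output equals
# the frozen model's output, which is how LoRA training starts.
W = [[1, 0], [0, 1]]
A = [[1, 1]]
B_zero = [[0], [0]]
print(lora_forward([2, 3], W, A, B_zero, rank=1, alpha=2))
```

Under these assumptions the adapter adds 65,536 parameters to a matrix of about 16.8 million, which is why only a small fraction of the model needs gradients and optimizer state.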

For P-Tuning v2, the batch size was set to 1. We used the AdamW optimizer with the default learning rate decay strategy, setting the learning rate to 2e−2. Gradient accumulation was performed every 16 steps, the maximum source length was set to 32, and the maximum target length was set to 256. The total number of training steps, as with LoRA, was set to 6000, and no warmup or weight decay was applied. Under 4-bit quantization, fine-tuning of CPMI-ChatGLM completed on a single RTX 3090Ti (24 GB) GPU in just 5 h, consuming approximately 8 GB of GPU memory, making it affordable for most researchers.

Four comparison models were trained using supervised fine-tuning, with a maximum input sequence length of 512. For Chinese-LLaMA-7B and Chinese-Alpaca-7B, the batch size for training was set to 4. The AdamW optimizer was employed with an initial learning rate of 1e-4. Both models employed the constant warmup method as the learning rate scheduling strategy, dynamically adjusting the learning rate. Gradient accumulation was performed every 4 steps, and both models were trained for 10 epochs. For Qwen-7B, a batch size of 4 was used for training. The AdamW optimizer had an initial learning rate of 1e-3, and the learning rate scheduling was conducted using the cosine annealing strategy. The number of gradient accumulations was 4, and a total of 8 epochs were trained. For Baichuan-7B, the batch size for training was set to 8. The initial learning rate was 1e−4, and the learning rate was dynamically adjusted using the cosine annealing strategy. Gradient accumulation was performed every 4 steps, and a total of 10 epochs were trained.

Discussion

Currently, large-scale models in the medical field primarily focus on extracting symptoms and clinical entities from the input to ensure the generation of healthcare recommendations35,36,37. While this approach is beneficial for diagnosing and treating diseases, there is still a lack of emphasis on generating detailed usage instructions for medications. In reality, patients are more interested in directly understanding the specific usage instructions and dosages for each administration, as this information is crucial for the proper use, therapeutic efficacy, and safety of medications.

We collected and constructed labeled CPM data through three sources: Standard Therapeutic Guidelines for National Essential Drugs (Chinese Patent Medicine), Entity Recognition of Traditional Chinese Medicine, and hospital consultation records. This was done to address the need for CPM recommendations and detailed usage instructions. Large-scale models are capable of identifying and analyzing various aspects of medications, including their ingredients, dosages, administration methods, and precautions. Leveraging the advantages of PLMs, we have implemented specific CPM recommendations and provided detailed usage instructions. This enables doctors and patients to access more comprehensive and accurate information about medications, helping them better understand the effects and possible side effects of the medications. Furthermore, the model can offer personalized medication advice based on the specific conditions of patients, ensuring the safety and effectiveness of the medications.

We fine-tuned the base model ChatGLM-6B to develop the CPMI-ChatGLM model for generating CPMI. Comparing the two PEFT methods, we found that the model fine-tuned with P-Tuning v2 outperformed LoRA in all evaluation metrics. Regarding data scale, the performance of CPMI-ChatGLM reached its peak at a data volume of 3 k and declined with further increases. This trend may be attributed to the diversity of the data. In addition, we also investigated the influence of instruction data on model performance and conducted a comparative analysis among various commonly employed LLMs. Through a comprehensive evaluation combining both automatic metrics and human evaluation, we demonstrated the superiority of CPMI-ChatGLM. Considering resource and cost limitations, we selected a set of 3906 meticulously labeled data for fine-tuning CPMI-ChatGLM. Ultimately, we completed the 4-bit quantization P-Tuning v2 fine-tuning on a single RTX 3090Ti GPU, consuming approximately 8 GB of GPU memory, making it accessible for most researchers to deploy locally. Furthermore, we have made the original dataset used in this study publicly available on our GitHub repository for reference and use by other researchers.

However, this study has certain limitations. Firstly, the relatively small parameter size of the foundation model and the data scale used may result in errors such as the inclusion of English characters in the generated Chinese text. To address this issue, we plan to experiment with larger-scale foundation models and incorporate manual evaluation methods to ensure higher quality. Secondly, we aim to expand the corpus beyond TCM to other categories of drugs, improving model performance that may be affected by data diversity, thereby making the model applicable to a wider range of medical conditions. In terms of model training, we intend to incorporate image-assisted information (such as pictures of the patient's affected area) to enable multimodal medical inquiry, further enhancing diagnostic accuracy and safety.

Conclusions

In conclusion, we have developed a new large-scale model, CPMI-ChatGLM, in the field of TCM, exploring new avenues for the integration of TCM and artificial intelligence. Through the PEFT method, we achieved superior performance compared to the foundation model by fine-tuning with instruction data. We also investigated the impact of different fine-tuning methods, data scales, and instruction data on the performance of CPMI-ChatGLM. Additionally, we have publicly released the first dataset of CPMI, aiming to contribute to the modernization and internationalization of TCM. Currently, the CPMI-ChatGLM project is in its early stages and may contain errors. We are actively collaborating with hospitals and medical experts to seek feedback and suggestions in order to improve its medical accuracy and assistive capabilities.

Methods

Dataset and data preprocessing

Our original dataset primarily derives from Standard Therapeutic Guidelines for National Essential Drugs (Chinese Patent Medicine) (ISBN 9787117286916). The guideline provides comprehensive and systematic information on 268 CPMs across 7 specialized fields, including internal medicine, surgery, gynecology, ophthalmology, otolaryngology, orthopedics, and pediatrics. It serves as a valuable resource, offering extensive and well-organized information on CPM. To enhance the model's capacity to learn various disease types, we expanded our dataset by merging it with Entity Recognition of Traditional Chinese Medicine's Manual from Aliyun Tianchi38. The Tianchi dataset comprises 1997 records sourced from instructions for TCM. It encompasses 13 key categories, including drugs, drug components, syndromes, properties, flavors, and Chinese medicinal effects. By incorporating this dataset into our corpus, the model can benefit from diverse sources of information, thereby improving its recognition and learning capabilities and enhancing overall performance. Additionally, we collected a set of 100 patient consultation records and corresponding CPM prescriptions from our affiliated TCM hospital outpatient department. After removing personal information, these additional data were merged into the corpus. The inclusion of these data supplements allows for a more accurate reflection of the actual application of CPM and further enhances the accuracy of the dataset.

We extracted information on the drug name, ingredients, description, specifications, indications, usage and dosage, adverse reactions, and precautions of CPM from 7 specialized medical fields in the Guidelines and manually added information on the manufacturer to construct a dataset of CPMI. However, some drug instructions in the Tianchi dataset had inconsistent attributes, such as the absence of a “drug name” or “indications” attribute, which are crucial for understanding the drug usage rules. To prevent the model from experiencing hallucinatory effects that could lead to incorrect guidance for users, we removed drugs with unknown attributes, retaining only 326 records without unknown labels.
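The filtering step described above can be sketched as a simple completeness check. The key names and the toy records are hypothetical; the actual attribute labels in the Tianchi dataset may differ, and the authors' screening may have covered more attributes than the two shown here.

```python
# Hypothetical key names for the two attributes named in the text.
REQUIRED_ATTRIBUTES = ("drug_name", "indications")


def has_required_attributes(record, required=REQUIRED_ATTRIBUTES):
    """True when every required attribute is present and non-empty."""
    return all(str(record.get(key) or "").strip() for key in required)


def filter_records(records, required=REQUIRED_ATTRIBUTES):
    """Keep only records whose required attributes are all known."""
    return [r for r in records if has_required_attributes(r, required)]


# Toy records for illustration only.
raw_records = [
    {"drug_name": "Huanglian Shangqing Wan", "indications": "mouth ulcers"},
    {"drug_name": "", "indications": "sore throat"},        # missing name
    {"drug_name": "San Huang Pian", "indications": None},   # unknown indications
    {"indications": "constipation"},                        # key absent
]

clean_records = filter_records(raw_records)
print(len(raw_records), "->", len(clean_records))
```

Dropping incomplete records trades dataset size for reliability, which matches the stated goal of avoiding incorrect guidance from records with unknown attributes.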

Inspired by the “Self-chatting” approach in Baize39, we employed ChatGLM for data augmentation of our processed dataset. For each record, we generated five additional sentences that maintain a similar meaning and intent to the original patient's complaint. These new sentences were then added to the dataset, resulting in a six-fold expansion of the dataset. In addition, to ensure the rationality and safety of CPM usage, we invited two senior TCM clinicians with advanced professional titles to review the CPMI dataset, further enhancing the quality and effectiveness of the dataset. After eliminating 258 instances of false and erroneous information caused by the hallucinations of LLMs, the dataset contains a total of 3906 data records.
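The record counts reported in this section are consistent with one another, which can be checked with simple arithmetic. The sketch assumes that the 0.6 k base corpus is the sum of the 268 Guidelines entries, the 326 retained Tianchi records, and the 100 hospital records, and that every record received exactly five paraphrases; the variable and function names are illustrative.

```python
GUIDELINE_RECORDS = 268      # CPMs described in the Guidelines
TIANCHI_RECORDS_KEPT = 326   # Tianchi records without unknown labels
HOSPITAL_RECORDS = 100       # de-identified outpatient records
PARAPHRASES_PER_RECORD = 5   # generated by Self-chatting
REMOVED_AFTER_REVIEW = 258   # hallucinated or erroneous entries


def expand(record, paraphrases):
    """Return the original record plus one copy per paraphrased complaint."""
    copies = [dict(record, complaint=text) for text in paraphrases]
    return [record] + copies


base_size = GUIDELINE_RECORDS + TIANCHI_RECORDS_KEPT + HOSPITAL_RECORDS
augmented_size = base_size * (1 + PARAPHRASES_PER_RECORD)
final_size = augmented_size - REMOVED_AFTER_REVIEW

print(base_size, augmented_size, final_size)

# Toy example of the six-fold expansion for a single record.
toy = {"complaint": "original complaint", "drug": "example CPM"}
expanded = expand(toy, [f"paraphrase {i}" for i in range(1, 6)])
print(len(expanded))
```

The result, 694 base records growing to 4164 and then 3906 after review, matches the 0.6 k and 3 k data scales used in the experiments.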

Foundation model

ChatGLM-6B is an open-source conversational language model that supports bilingual question-answering in Chinese and English. It adopts the same model architecture as GLM-130B40 and utilizes the general language model (GLM) as its backbone. GLM41 is a transformer-based language model trained with autoregressive blank filling as the objective, supporting INT4 quantization and efficient inference on a single RTX 3090 GPU. In a comprehensive evaluation of 30 leading large models worldwide conducted by the Stanford Institute for Human-Centered Artificial Intelligence (HAI), GLM-130B was the only selected model from Asia42. In the holistic evaluation conducted at HAI, 16 core scenarios were evaluated, encompassing tasks such as question answering, information retrieval, summarization, sentiment analysis, and more. The evaluation included seven metrics: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. This comprehensive approach ensures that indicators beyond accuracy receive due attention and explicitly highlights the trade-offs between models and metrics. Furthermore, a targeted assessment was carried out for 26 specific scenarios to delve deeper into specific aspects, such as knowledge, reasoning, memory/copyright, and misinformation. Among the 16 common task scenarios, GLM-130B exhibited outstanding performance in text classification tasks, achieving an impressive overall accuracy rate of 85.8%. In the evaluation of the seven metrics, GLM-130B demonstrated remarkable performance in terms of accuracy, fairness, toxicity, and overall text generation bias, reaching a level comparable to GPT-3 davinci v1 (175B). Additionally, GLM-130B showcased better robustness, calibration error, and lack of bias compared to GPT-3 in general.

ChatGLM-6B adopts a prefix decoder-only transformer framework, incorporating bidirectional attention mechanism for input and unidirectional attention mechanism for output. In terms of model details, ChatGLM-6B employs gradient scaling for the embedding layer and utilizes Post-LN layer normalization method to enhance training stability. Additionally, rotary positional encoding (RoPE) is employed as a replacement for traditional absolute positional encoding, and GeLU activation is used to improve the feedforward networks (FFNs) in the transformer architecture. Here, we adopted the ChatGLM-6B for more convenient training.
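The attention pattern of a prefix decoder can be sketched as a boolean mask: input positions attend to the whole input in both directions, while output positions attend to the input and to earlier output positions only. This is a generic illustration of the prefix-LM pattern, not code from the ChatGLM-6B repository, and the function name is hypothetical.

```python
def prefix_lm_mask(input_len, output_len):
    """Build a mask where mask[i][j] is True if position i may attend to j."""
    total = input_len + output_len
    mask = []
    for i in range(total):
        row = []
        for j in range(total):
            if j < input_len:
                # Every position can see the entire input (bidirectional).
                row.append(True)
            else:
                # Output positions are causal: no looking ahead.
                row.append(j <= i)
        mask.append(row)
    return mask


# Two input tokens followed by two output tokens.
for row in prefix_lm_mask(2, 2):
    print("".join("x" if allowed else "." for allowed in row))
```

Input rows never see output columns because an output index is always larger than any input index, so the `j <= i` test fails for them.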

Our CPMI-ChatGLM

To transfer knowledge from the general domain to CPM, we fine-tuned ChatGLM-6B on a dataset containing 3906 labeled CPMIs that we constructed. The resulting model, CPMI-ChatGLM, serves as an automated CPMI generator, providing prescribing recommendations for physicians and patients.

Figure 4 illustrates the workflow for constructing the CPMI-ChatGLM model for CPMI generation. Firstly, we collected data from various sources and constructed the CPMI corpus. After the initial data cleaning, which involved removing special characters and adjusting the format, the data underwent extraction of key attributes and a thorough review process. This step included the assessment of information by both machine algorithms and experts in TCM. The extracted key attributes were then used to form the final dataset, ensuring both accuracy and security. Subsequently, we employed ChatGLM for data augmentation, further enhancing the dataset size to improve the model’s performance. Finally, we fine-tuned the foundation model, ChatGLM-6B, using the PEFT method, resulting in the CPMI-ChatGLM model. The primary functionality of the model is to automatically generate recommendations for CPM treatment and corresponding detailed instruction information based on user-provided symptoms. This capability holds potential in assisting physicians with diagnosis and improving patient visit efficiency.

Figure 4

The pipeline for training CPMI-ChatGLM. The collected corpus of CPMI was subjected to preliminary data cleaning, followed by machine screening and TCM expert review to form the training dataset. ChatGLM was then used to expand the dataset, and the foundation model was fine-tuned with a parameter-efficient method to construct CPMI-ChatGLM, an LLM specifically designed for traditional Chinese medicine instructions.

Parameter-efficient fine-tuning

With the rise of LLMs, the number of parameters in PLMs has increased dramatically43. However, due to resource and cost limitations, it has become impractical for ordinary researchers to perform full-parameter fine-tuning of LLMs on consumer-grade hardware. Additionally, storing and deploying a separate fine-tuned model for each downstream task has become prohibitively expensive, as each fine-tuned model is the same size as the original pre-trained model. To address these challenges, PEFT was proposed44. PEFT fine-tunes only a small subset of the model parameters, or a small set of additional parameters, while keeping the majority of the pre-trained parameters fixed, resulting in significant reductions in computation and storage costs. Moreover, advanced PEFT techniques such as Adapter-Tuning, Prefix-Tuning, P-Tuning, and LoRA have achieved performance comparable to full fine-tuning.

Adapter-Tuning45 involves inserting smaller neural network layers or modules (referred to as adapters) into each layer of the pre-trained model. During the fine-tuning process, the parameters of the original transformer are frozen, and only the parameters of the adapter layers are learned. Prefix-Tuning46 adds additional trainable prefix pseudo-tokens to the input or hidden layers of the model, and only these prefix parameters are trained.

P-Tuning47 follows a similar approach to Prefix-Tuning by utilizing a small number of continuous embedding parameters as prompts to improve the application of generative pre-trained transformer (GPT) in natural language understanding (NLU) tasks. The two methods differ in scope: Prefix-Tuning is designed for natural language generation (NLG) tasks and introduces trainable parameters in every layer, whereas P-Tuning adds parameters only in the embedding layer. P-Tuning v248 applies Prefix-Tuning to NLU tasks: as indicated by the orange blocks in Fig. 5, continuous prompts are added at each layer of the model, and only the prompt parameters are optimized.

Figure 5

The schematic diagram of P-Tuning v2. The orange blocks (i.e., \({h}_{0}\),…,\({h}_{i}\)) represent trainable prompt embeddings, and the blue part represents the embedding layer, which is stored or computed by a frozen pre-trained language model. The dashed arrow that returns to the input indicates the optional reparameterization optimization mode.
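The idea of layer-wise continuous prompts can be illustrated with a single attention step. The sketch below is a simplified illustration under our own assumptions (single head, one query vector, plain Python lists) and is not the P-Tuning v2 reference implementation; the names `attend`, `attend_with_prefix`, and `prefix_param_count` are hypothetical. It shows trainable prefix keys and values being prepended to the frozen keys and values of one layer, and counts the prompt parameters when every layer has its own prefix.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def attend_with_prefix(query, keys, values, prefix_keys, prefix_values):
    """Prepend the trainable prefix keys/values of this layer to the frozen ones."""
    return attend(query, prefix_keys + keys, prefix_values + values)


def prefix_param_count(num_layers, prefix_len, hidden_size):
    """Trainable parameters when each layer holds prefix keys and prefix values."""
    return num_layers * prefix_len * 2 * hidden_size


if __name__ == "__main__":
    q, keys, values = [1.0, 0.0], [[1.0, 0.0]], [[2.0, 4.0]]
    print(attend(q, keys, values))
    print(attend_with_prefix(q, keys, values, [[1.0, 0.0]], [[0.0, 0.0]]))
    print(prefix_param_count(num_layers=2, prefix_len=3, hidden_size=4))
```

Only the prefix keys and values would receive gradient updates in such a setup; the frozen keys and values come from the pre-trained model.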

LoRA49 simulates full-parameter fine-tuning by introducing auxiliary matrices \(A\) and \(B\). It approximates the update of the model's weight matrix with the product of two low-rank matrices that contain far fewer parameters than the weight matrix itself. During training, only the parameters of the low-rank matrices are optimized. For the linear layer \(h=Wx\), the forward propagation is replaced with the following formulation:

$$\begin{array}{c}h=Wx+BAx\end{array}$$
(1)

where \(W\in {R}^{d\times d}\), \(A\in {R}^{r\times d}\), \(B\in {R}^{d\times r}\), with the rank \(r\ll d\), so that the product \(BA\) has the same shape as \(W\). Matrix \(A\) is initialized with a random Gaussian distribution, while matrix \(B\) is initialized with all zeros, ensuring that only the main branch is active during the initial stage. The forward propagation of data in LoRA is illustrated in Fig. 6.

Figure 6

The forward propagation of data in LoRA. The input data \(x\) is fed into a weight matrix \(W\) on the left and two weight matrices \(A\) and \(B\) on the right. The hidden layer output dimensions of both sides are equal, with a value of \(d\). The output results from the left and right sides are combined through addition to yield the final output result, denoted as \(h\).
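Equation (1) can be sketched in a few lines of code. The example below is a minimal illustration under our own assumptions rather than the implementation used in this study: the class name `LoRALinear`, the Gaussian standard deviation of 0.02, and the use of plain Python lists are hypothetical choices. The shapes are chosen so that \(BAx\) is well defined, i.e. \(A\) has \(r\) rows and \(d\) columns and \(B\) has \(d\) rows and \(r\) columns.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear layer h = Wx with a trainable low-rank branch BAx."""

    def __init__(self, weight, rank, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.W = weight  # frozen pre-trained weight (d_out x d_in)
        # A: random Gaussian initialization (rank x d_in); std 0.02 is an assumption.
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        # B: zero initialization (d_out x rank), so BAx = 0 at the start.
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        main = matvec(self.W, x)                       # Wx
        low_rank = matvec(self.B, matvec(self.A, x))   # B(Ax)
        return [m + l for m, l in zip(main, low_rank)]

    def trainable_params(self):
        """Only A and B are optimized: 2 * d * rank values when W is d x d."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    layer = LoRALinear(identity, rank=1)
    print(layer.forward([1.0, 2.0, 3.0, 4.0]))  # equals Wx while B is all zeros
    print(layer.trainable_params(), "trainable vs", 4 * 4, "frozen")
```

Because \(B\) starts at zero, the layer initially reproduces the frozen output \(Wx\), and the low-rank branch only contributes once \(B\) has been updated.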

In this study, we employed the P-Tuning v2 method to fine-tune the ChatGLM-6B model and compared it with the LoRA fine-tuning method using the same instruction data. The aim was to develop a high-performing, cost-effective, and applicable language model for the field of TCM to meet practical application needs.

PEFT with instruction data

Fine-tuning LLMs using machine-generated instruction data enables significant zero-shot capabilities on new tasks without the need for manual instruction writing50. Inspired by Self-Instruct51, data can be self-bootstrapped to enhance the ability of PLMs to follow instructions. By incorporating TCM knowledge, we provide instructions that guide the model to answer input medical cases with an appropriate CPM. For domain-specific tasks, unlike general-domain training, a small set of instructions is often sufficient to guide data generation. This strategy improves model performance in specific domain tasks while reducing the preparation and processing costs associated with instruction data, thus enhancing the efficiency of model training. Table 6 displays the instruction data utilized in this study to guide the generation of CPMI.

Table 6 An example of instruction data.

Metrics

In this study, Bilingual Evaluation Understudy (BLEU)52, Recall-Oriented Understudy for Gisting Evaluation (ROUGE)23 and BARTScore53 are employed to evaluate the degree of match between the candidate text and the reference text. These metrics enable us to comprehensively assess the performance of the model in terms of accuracy, fluency, and information completeness, thereby providing guidance for the improvement and optimization of the model.

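BLEU is a metric that measures the similarity between a candidate text and its reference texts. With \(N\) denoting the maximum n-gram order, it is calculated as the brevity-penalized geometric mean of the n-gram precisions:

$$\begin{array}{c}BLEU=BP\cdot {\text{exp}}\left(\frac{1}{N}\sum_{n=1}^{N}{\text{log}}\,{P}_{n}\right)\end{array}$$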
(2)

\(BP\) in BLEU stands for brevity penalty, which penalizes excessively short sentences to prevent the model from favoring shorter sentences during training. The expression for calculating the \(BP\) is as follows:

$$BP=\left\{\begin{array}{ll}1, & {\text{if }}{l}_{c}>{l}_{s}\\ {e}^{1-\frac{{l}_{s}}{{l}_{c}}}, & {\text{if }}{l}_{c}\le {l}_{s}\end{array}\right.$$
(3)

In Formula 3, \({l}_{c}\) represents the length of the candidate, and \({l}_{s}\) represents the effective length of the reference. When the length of the candidate is greater than the length of the reference, the \(BP\) is 1, indicating no penalty. Otherwise, the \(BP\) is calculated.

In addition, in Formula 2, \({P}_{n}\) represents the precision based on n-grams, which is expressed as follows:

$$\begin{array}{c}{P}_{n}=\frac{\sum_{i}^{E}\sum_{k}^{K}{\text{min}}\left({h}_{k}\left({c}_{i}\right),\max_{j\in M}{h}_{k}\left({s}_{i,j}\right)\right)}{\sum_{i}^{E}\sum_{k}^{K}{h}_{k}\left({c}_{i}\right)}\end{array}$$
(4)

\(E\) represents the total number of candidate texts, and \(K\) represents the total number of word groups (n-grams). \({h}_{k}\left({c}_{i}\right)\) represents the frequency of the \(k\)-th word group in the candidate text \({c}_{i}\). \({s}_{i,j}\) represents the \(j\)-th reference for the candidate \({c}_{i}\), where \(j\in M\) and \(M\) indexes the reference answers. \({h}_{k}({s}_{i,j})\) represents the frequency of the \(k\)-th word group in the standard answer \({s}_{i,j}\). Each candidate count is therefore clipped to its largest count in any single reference.
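For illustration, the computation can be sketched at the level of a single candidate. The code below is a simplified sketch under our own assumptions, not the evaluation script used in this study: it follows the standard BLEU definition (clipped n-gram precision, geometric mean over orders 1 to `max_n`, and brevity penalty), applies no smoothing, and works on whitespace-tokenized text. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """n-gram precision with each count clipped to its maximum reference count."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(cand_len, ref_len):
    """1 for candidates longer than the reference, exp(1 - l_s / l_c) otherwise."""
    if cand_len == 0:
        return 0.0
    if cand_len > ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / cand_len)


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU without smoothing (returns 0 if any precision is 0)."""
    precisions = [clipped_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    # Effective reference length: the reference length closest to the candidate.
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - len(candidate)), length))
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(candidate), ref_len) * math.exp(log_mean)


if __name__ == "__main__":
    reference = "the cat is on the mat".split()
    print(clipped_precision("the the the the the the the".split(), [reference], 1))
    print(bleu(reference, [reference]))
```

The first call shows the effect of clipping: a candidate that repeats "the" seven times is credited only for the two occurrences found in the reference.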

ROUGE is one of the commonly used metrics in the field of text summarization for evaluating the quality of automatically generated summaries. It measures the overlap between the basic units of the summaries generated by a statistical model and the reference summaries created by humans. The formula for calculating ROUGE is as follows:

$$ROUGE{\text{-}}N = \frac{{\mathop \sum \nolimits_{{S \in \left\{ {Reference\;Summaries} \right\}}} \mathop \sum \nolimits_{{gram_{n} \in S}} Count_{match} \left( {gram_{n} } \right)}}{{\mathop \sum \nolimits_{{S \in \left\{ {Reference\;Summaries} \right\}}} \mathop \sum \nolimits_{{gram_{n} \in S}} Count\left( {gram_{n} } \right)}}$$
(5)

In Formula 5, \(n\) represents the length of the n-grams, and \({Count}_{match}\left({gram}_{n}\right)\) denotes the maximum number of n-grams that co-occur in the candidate text and the reference text. ROUGE-1 measures the matching of unigrams and ROUGE-2 measures the matching of bigrams, while ROUGE-L is based on the longest common subsequence between the candidate and the reference.
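The recall-oriented nature of ROUGE can be seen in a short sketch. The code below is an illustrative simplification under our own assumptions, not the evaluation script used in this study: it computes ROUGE-N as n-gram recall against the references and reports ROUGE-L only as the recall form of the longest common subsequence, on whitespace-tokenized text. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(candidate, references, n):
    """Matched reference n-grams divided by the total number of reference n-grams."""
    cand_counts = Counter(ngrams(candidate, n))
    matched = 0
    total = 0
    for ref in references:
        ref_counts = Counter(ngrams(ref, n))
        matched += sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length (recall form only)."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0


if __name__ == "__main__":
    cand = "the cat sat on the mat".split()
    ref = "the cat is on the mat".split()
    print(rouge_n(cand, [ref], 1), rouge_n(cand, [ref], 2), rouge_l_recall(cand, ref))
```

In this example five of the six reference unigrams and three of the five reference bigrams are matched, and the longest common subsequence contains five tokens.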

BARTScore is a state-of-the-art metric proposed by Yuan et al. for evaluating natural language generation (NLG) in a general context. The concept behind BARTScore involves assessing the quality of sentences based on the generation probabilities derived from the large-scale pre-trained model BART54. It computes the logarithmic probability of each token in the hypothesis using an autoregressive approach and then combines them through a weighted sum (an average when the weights are uniform) to obtain the overall score. This evaluation process can be formally expressed as:

$$\begin{array}{c}BARTSCORE=\sum_{t=1}^{m} \, {\omega }_{t} \, logp\left({{\varvec{y}}}_{t}\mid {{\varvec{y}}}_{<t},{\varvec{x}},\theta \right)\end{array}$$
(6)

Here, \(\theta\) represents the parameters of the seq2seq model, \(x = \{{x}_{1}, ..., {x}_{n}\}\) is the source sequence containing \(n\) tokens, and \(y = \{{y}_{1}, ..., {y}_{m}\}\) is the target sequence containing \(m\) tokens. The weight of the \(t\)-th target token is denoted as \({\omega }_{t}\).
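The aggregation step of Formula 6 can be sketched independently of the underlying model. The example below is a minimal illustration under our own assumptions: the conditional token probabilities are hypothetical inputs that would in practice be produced by BART, and the function name `weighted_log_likelihood` is ours. It is not the BARTScore implementation.

```python
import math


def weighted_log_likelihood(token_probs, weights=None):
    """Weighted sum of log-probabilities of the target tokens.

    token_probs[t] stands for p(y_t | y_<t, x, theta), assumed to be given.
    With weights=None, uniform weights 1/m are used, so the score is the
    average log-probability of the target tokens.
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    if weights is None:
        weights = [1.0 / len(token_probs)] * len(token_probs)
    if len(weights) != len(token_probs):
        raise ValueError("weights and token_probs must have the same length")
    return sum(w * math.log(p) for w, p in zip(weights, token_probs))


if __name__ == "__main__":
    # Hypothetical probabilities for a three-token target sequence.
    print(weighted_log_likelihood([0.9, 0.8, 0.95]))
    print(weighted_log_likelihood([0.2, 0.1, 0.3]))
```

Scores are non-positive, and a target whose tokens receive higher probabilities obtains a score closer to zero.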