Discovering HIV related information by means of association rules and machine learning

Araujo, Lourdes; Martinez-Romo, Juan; Bisbal, Otilia; Sanchez-de-Madariaga, Ricardo

doi:10.1038/s41598-022-22695-y

Download PDF

Article
Open access
Published: 28 October 2022

Discovering HIV related information by means of association rules and machine learning

Lourdes Araujo^1,4,
Juan Martinez-Romo^1,4,
Otilia Bisbal²,
Ricardo Sanchez-de-Madariaga^3,4 &
The Cohort of the National AIDS Network (CoRIS)

Scientific Reports volume 12, Article number: 18208 (2022) Cite this article

2036 Accesses
2 Citations
13 Altmetric
Metrics details

Subjects

Abstract

Acquired immunodeficiency syndrome (AIDS) is still one of the main health problems worldwide. It is therefore essential to keep making progress in improving the prognosis and quality of life of affected patients. One way to advance along this pathway is to uncover connections between other disorders associated with HIV/AIDS—so that they can be anticipated and possibly mitigated. We propose to achieve this by using Association Rules (ARs). They allow us to represent the dependencies between a number of diseases and other specific diseases. However, classical techniques systematically generate every AR meeting some minimal conditions on data frequency, hence generating a vast amount of uninteresting ARs, which need to be filtered out. The lack of manually annotated ARs has favored unsupervised filtering, even though they produce limited results. In this paper, we propose a semi-supervised system, able to identify relevant ARs among HIV-related diseases with a minimal amount of annotated training data. Our system has been able to extract a good number of relationships between HIV-related diseases that have been previously detected in the literature but are scattered and are often little known. Furthermore, a number of plausible new relationships have shown up which deserve further investigation by qualified medical experts.

Identifying diseases symptoms and general rules using supervised and unsupervised machine learning

Article Open access 02 August 2024

Pattern discovery and disentanglement on relational datasets

Article Open access 11 March 2021

A machine learning model identifies patients in need of autoimmune disease testing using electronic health records

Article Open access 25 April 2023

Introduction

According to information provided by World Health Organization (WHO), HIV/AIDS remains one of the world’s most serious public health problems, particularly in low and middle-income countries. The development of AIDS (acquired immunodeficiency syndrome) disease, in patients infected with HIV, causes a progressive deterioration of the immune system and decreases the person’s ability to fight many infections and other diseases as well. AIDS refers to the most advanced stages of HIV infection and is defined by the development of one or more opportunistic infections or related cancers among many other possibilities.

WHO has released a number of policy guidelines to assist countries in implementing programs to improve HIV prevention, treatment, care and support services for affected patients. Several initiatives are underway in each country along these lines. One of these initiatives in Spain has been the launching of the Spanish HIV/AIDS Research Network whose main goal consists of improving the health and quality of those affected. This network has generated the HIV/AIDS Research Network Cohort (CoRIS), which makes available to researchers the data from its main database and associated satellite databases, linking biological samples. CoRIS is an open, prospective and multicenter cohort of adult subjects with confirmed HIV infection, launched in 2004. Patients over 13 years old, and naive to antirretroviral treatment (ART) at study entry, have been recruited in HIV care units of the Spanish Public Health System and all of them have signed an informed consent form.

HIV is associated with the development of a large number of other diseases. In some cases, these diseases can be more or less mild and transient. However, in other cases, they can be very serious and long-lasting. It is essential to know the possible relationships between this diversity of diseases associated with HIV, since increasing the knowledge about them can be of great help in their diagnosis and prevention, thus improving the patients’ quality of life. Knowledge about the conditions that typically appear with these diseases, and in what form, will help in making decisions about their prevention and treatment.

In this study, we have focused on applying machine learning (ML) techniques to extract information about the relationships between diseases associated with HIV. Specifically, we have turned to the extraction of association rules of high reliability and coverage to identify relevant relationships between HIV-related diseases.

Association rules (ARs)¹ are a data mining method that aims to discover patterns of co-occurrence between items in a transactional database. Specifically, we consider a set of n items $I= \{i_1,i_2,\ldots ,i_n\}$ and a set or database of transactions on these items: $I= \{i_1,i_2,\ldots ,i_n\}$. Each transaction is represented by a subset of items ($I= \{i_1,i_2,\ldots ,i_d\}$) that have occurred simultaneously.

ARs present the following form:

$$\begin{aligned} X \Rightarrow Y \end{aligned}$$

where X and Y are two disjoint sets of items. We focus on a particular type of rules whose consequent is a single element.

An example of an association rule, that we could find in a database in which the transactions are the number of diseases suffered by the same patient, could be the following:

Urinary tract infection, abdominal pain, diabetes $\Rightarrow$ renal failure

ARs have been frequently applied to the medical domain for different purposes. Imamura et al.² applied them to find clinical findings associated with diseases. They have also been used to analyze patterns of lifestyle risk behaviors including smoking, heavy drinking, physical inactivity³. There have also been proposals that applied ARs to find relationships between healthcare parameters and specific diseases⁴, such as antimicrobial resistance⁵, psoriasis⁶, COVID-19⁷, or Hospital-Acquired Infections⁸, among others.

Several algorithms have been proposed to systematically generate all ARs that satisfy certain favorable conditions. These conditions refer to parameters such as support (how frequent an itemset is in the transaction set) and confidence (the likeliness of occurrence of consequents in the set, given that the set already has the antecedents).

One of the algorithms that allow the generation of association rules from frequent itemsets is the Frequent-Pattern Growth or FP-Growth algorithm⁹.

By applying this or similar algorithms, we can generate the whole set of rules that satisfy the specified minimum support and minimum confidence thresholds. However, this process leads to a huge number of rules, many of which are uninteresting. Actually, discovering interesting or relevant rules is a difficult problem^10,11,12 that needs to be tackled. For example, considering the following AR:

pollen allergy $\Rightarrow$ renal failure

Since pollen allergy is a very common problem, it can appear in many rules as an antecedent or a consequent. Therefore, ARs, such as the one appearing above, may meet the required frequency threshold, and yet not provide relevant information.

A relevant rule is one that includes at least one relevant relationship between some disease of the antecedent and the one of the consequent. Relevant relationships are those validated by medical experts.

Filtering relevant or interesting rules is a difficult problem that is primarily tackled using unsupervised approaches that do not require expert-annotated training data. These methods attempt to find hidden patterns from raw unlabeled data. Among these unsupervised approaches are those based on associating to the AR a p-value (the likelihood that the association is spurious due to chance)^13,14. Specifically, the p-value of an AR R is the probability of observing R, or one rule stricter than R, when the two sides of R are independent. If a rule found in the data has a low p-value, it is unlikely that the two sides are independent. Rules with high p-values do not provide information about the independence of the two sides of the rule and can be discarded, as they have most likely appeared by pure chance.

However, unsupervised methods have a limit to the accuracy they can achieve. These limitations can be addressed by using supervised methods that are capable of learning from specific relationship examples, taking into account aspects beyond frequencies.

Supervised methods require labeled data that are used to extract knowledge. These methods start by applying a training process with a labeled data set and try to infer a function that fits the training data appropriately. Then, when applied on new data, this function is able to predict the output. In the case of ARs, few supervised systems have been proposed due to the lack of labeled data.

One way to circumvent this problem is to use the semi-supervised¹⁵ approach, which employs both labeled and unlabeled data in the training process. This approach typically uses a small amount of labeled data and a larger amount of raw data. Techniques based on this approach can be adjusted to improve their performance as they have larger amounts of training data, for example in a feedback process.

Sánchez-de-Madariaga et al.¹⁶ proposed a new semi-supervised data mining model, EXTRAE, that combines unsupervised techniques (p-value computed as Fisher’s exact test) with highly limited supervision. The training process starts with a small seed of annotated data, and the model improves its results (F-measure) using a fully supervised system (standard supervised machine learning algorithms). The key idea of this proposal is to enlarge the size of the training data by checking the agreement between the predictions of the supervised system and those of the unsupervised techniques in a series of iterative steps. The remarkable feature of this system is its ability to improve the results of purely supervised methods by combining them with unsupervised techniques. This system has been evaluated on data from the medical domain. Specifically, it has been evaluated on a set of ARs generated from primary care data, which have been manually labeled as true or false (of no interest). The diseases considered were rather common as they came from primary care data and the association rules generated are of limited interest. However, the results obtained showed the potential of the method designed.

The main goal of this study is to find poorly understood relationships between HIV-associated diseases using association rules. To this end, we propose to apply and adapt the semi-supervised algorithm with minimal supervision¹⁶ mentioned above, to filter association rules between HIV-related diseases. To apply the algorithm and also to be able to perform a proper evaluation, we have started by generating a dataset of association rules and then manually labeling a part of them as relevant or not by expert doctors in HIV. The rules considered by the doctor come from applying the FP-Growth algorithm to the CoRIS cohort data. The input data to our algorithm are ARs annotated with a label indicating whether they are relevant or not.

Since, to the best of our knowledge, there is little history of annotation of ARs as relevant or not, this annotation process has required the definition of a new annotation guideline, this being a contribution of the present study. Another important contribution is the associations found between diseases themselves. We also provide a highly accurate method for the classification of ARs as relevant or not, which has been evaluated on the previously created dataset.

In summary, the main contributions of the present study are outlined as follows:

An analysis of the most appropriate conditions for assigning the relevance of the rules based on the relationships between the diseases contained in them.
A collection of association rules between HIV-associated diseases labeled as relevant or not by experts.
Design of a semi-supervised system that requires only a very small amount of annotated data and is able to predict with high accuracy whether new ARs for HIV-associated diseases are relevant or not.
An expert analysis of various relationships between HIV-related diseases revealed by the ARs found.

Figure 1 shows a scheme of the applied semi-supervised system EXTRAE. From the data on the diseases suffered by each patient in the CoRIS database, all the ARs that satisfy minimum conditions of co-occurrence frequency (support and confidence) are generated. A medical expert evaluates a small set of these rules, which serves as a seed for the algorithm, as being relevant or not. That initial seed serves as a training set for a supervised algorithm. It also allows for certain adjustments of an unsupervised algorithm. When the predictions on the ARs of both algorithms match, they are considered sufficiently reliable to be added as reference data to the training set. The process is repeated until convergence is reached when no new ARs are added to the training set.

Our analysis has uncovered multiple relationships of interest between HIV-related diseases. Some of these relationships are well known to medical experts. Others are little known and, even though they could be confirmed in the literature, they were scattered and hence their compilation may represent a breakthrough in HIV research. Finally, other relationships, although plausible, are neither confirmed nor discarded in the literature and deserve to be studied in depth by medical professionals.

Methods

In this section, we include details of the reference collection on HIV-related diseases. We also introduce the process followed to analyze and annotate the association rules to be used to train and evaluate the semi-supervised model. Finally, we present the semi-supervised model EXTRAE applied for ARs filtering.

CORIS data

The Spanish HIV/AIDS research network (CoRIS) is an open, prospective, multicenter cohort of adult subjects with confirmed HIV infection, naïve to ART at study entry, recruited in 47 centers from 14 of 17 Autonomous Regions in Spain, from 2004 onwards. Data are organized and standardized following the HIV Cohorts Data Exchange Protocol (HICDEP) for data collection (details at https://hicdep.org/) and adhere to internal strict annual quality controls. The CoRIS database collects baseline and follow-up socio-demographic, immunological and clinical data. Patients are followed periodically according to routine clinical practice. The CoRIS cohort has been described in detail elsewhere¹⁷.

Each CoRIS participant provided his or her written informed consent prior to enrolling in this study. The CoRIS cohort was approved by the Research Ethic Committee of the Gregorio Marañón Hospital. All methods were carried out in accordance with relevant guidelines and regulations.

The HIV/AIDS Research Network Cohort was established in 2004 in conjunction with the HIV Biobank. These platforms are fundamental resources for improving knowledge about HIV. The coordination team is a multidisciplinary group: epidemiologists, statisticians, clinicians, pharmacists, microbiologist among others. The aim of CoRIS is to collect information on HIV-positive patients in order to study the epidemiological characteristics, the progression of the infection and its determinants, as well as the response to treatment and its influencing factors. CoRIS is organized around three structures:

Coordination center: it organizes and coordinates data collection, supporting hospitals in these procedures and carries out data processing for statistical analysis.
Clinical centers: hospitals and health centers where information and biological samples are collected.
CoRIS Scientific Committee: it decides on the scientific development of the cohort. It also evaluates requests from research groups that need to use CoRIS data in their projects.
Biobank: blood samples are collected (at baseline visit and annually thereafter) from all patients who give specific informed consent, from which aliquots of serum and cells are separated and stored in a centralized biobank.

For all patients enrolled in the cohort, updates of clinical data and biological parameters were requested at a periodicity of 6 ± 2 months. Follow-up was terminated upon death, change of follow-up center to one outside the cohort, or failure of the patient to appear for scheduled visits.

A database including all baseline and follow-up variables was created with the patient data and made available to all centers. After receiving the data, the coordinating center transfers them to a series of files with a common structure for all the centers. In order to connect all the information referring to the same patient and avoid duplication, a unique code combining the patient’s initials, date of birth and sex is used. The files are updated with the data sent by the hospitals every 12 months, both for new patients and for the follow-up of patients already included.

To guarantee the validity of the information and homogeneity between centers, several quality controls are carried out. All the information received at the data coordination center is subjected to a program that automatically detects inconsistencies, out-of-range data and duplicates. The cohort protocol was approved by the ethics committee of each participating hospital. When a patient is recruited, informed consent is requested. All information sent from the hospitals is anonymous.

In addition to clinical and epidemiological information, blood samples are collected from all patients who give specific informed consent, from which samples of serum and cells are separated and stored in a centralized BioBank for the entire cohort.

Manual evaluation of association rules

The development of ML systems requires training data that allows the system to find the appropriate model and configuration to make predictions about new data. In our case, the objective is to have a ML system capable of discriminating whether an AR is relevant or not. For sake of simplicity, sometimes we will call them true (relevant) or false (otherwise). We therefore need a collection of ARs classified as true or false, which will allow us to perform the training. The first problem we faced was to establish the criteria for deciding whether an AR is relevant or not. After making a thorough study of the collection of initial rules extracted by the FP-Growth algorithm, we established the following set of criteria:

An AR R (X $\Rightarrow$ Y, X = ($x_1,\cdots ,x_n$)) is relevant (True) if

$$\begin{aligned} \exists i, i \in \{1,n\}, \text {such that } relation(x_i,Y) \text {is not trivial}. \end{aligned}$$

that is, if there is any non-trivial relationship between any of the antecedent and consequent diseases.

During the annotation process the following considerations have been taken into account:

The association with the consequent can be either directly causal or inversely causal

($x_i \rightarrow Y$ and $Y \rightarrow x_i$ are considered equivalent):

For example:

acute myocardial infarction $\rightarrow$ high blood pressure: true

but also:

high blood pressure $\rightarrow$ acute myocardial infarction: true

We assume that often the cause-effect relationship is not clearly established, so we simply consider the relationship between the two elements, without establishing which is the cause and which is the effect.
If there are several antecedents and some of them are related to the consequent, even if they are not related to each other, the rule is considered true, that is

for the AR R: X $\Rightarrow$ Y, X = ($x_1,\ldots ,x_n$)), relation($x_i, x_j$) are not taken into account.

For example:

Arterial hypertension, Diabetes mellitus Dislipidemia $\rightarrow$ Acute myocardial infarction
If the association with the consequent is that it includes the antecedent or is a part of the antecedent (in a non-trivial way), the AR is true, that is

if for the AR R: X $\Rightarrow$ Y, X = ($x_1,\ldots ,x_n$)) and $\exists i, \text {such that } i \in \{1,n\}, \text {such that } x_i$ is a disease that genalizes Y or vice versa, then the AR R is considered relevant:

lung neoplasm $\rightarrow$ Non-AIDS neoplasm

In the first phase of this study, we considered that, if any of the elements included in the antecedent did not have an established relationship with the consequent, the rule was false. However, this led to eliminate many interesting rules in which an item could be included by chance, but which established a relevant relationship among the rest of the items. Therefore, we chose to classify as true those rules in which there were potentially relevant relations, even if they included some irrelevant items.

Once the general annotation guidelines were established, a HIV expert clinician (Dr. Otilia Bisbal) examined the rules, by gathering information to establish relationships between elements of the antecedent and the consequent based on their own experience in many cases, and also based on scientific literature in cases of less common relationships. Table 1 shows some data about the manually annotated ARs collection, Association Rules HIV/AIDS Dataset (ARAIDS). It is composed of 1000 rules of which, according to the adopted annotation criteria, 613 are relevant. The longest AR includes 4 antecedents and one consequent. The ARs in the collection involve 141 different health disorders, of which the one that occurs most frequently is Non-defining neoplasm AIDS, appearing 387 times.

Table 1 Some data on the Association Rules HIV/AIDS Dataset (ARAIDS), the manually validated collection of ARs.

Full size table

Semisupervised method for filtering HIV/AIDS associated rules

This section presents the EXTRAE algorithm¹⁶, a semi-supervised algorithm which requires an extremely small amount of data to be trained. This system is made up of two modules or components: one that implements an unsupervised method and another that implements a supervised one.

The set of ARs are extracted from CoRIS data by means of the FP-Growth algorithm. This algorithm generates all the possible ARs with certain constraints related to the support and confidence conditions and to the form of the rules.

The support of an AR $X \Rightarrow Y$ is the fraction of transactions that include the set of items in the antecedent or consequent of the rule:

$$\begin{aligned} support(X \Rightarrow Y) = support(X \cup Y) = \frac{count(X \cup Y)}{N} \end{aligned}$$

where N is the number of transactions in the database, and count ($X \cup Y$) the number of transactions containing all items in X (antecedent) or Y (consequent).

On the other hand, the confidence of a rule is defined as the fraction of transactions in which itemsets X and Y appear:

$$\begin{aligned} conf(X \Rightarrow Y)= \frac{support(X \cup Y)}{support(X)} \end{aligned}$$

It can be interpreted as how often a transaction that contains the itemset X also contains itemset Y.

The parameters used for this algorithm are: Min. Support of 0.0001; Min. confidence of 0.6; Max. antecedent length unlimited; Max. consequent length of 1.

Unsupervised component of EXTRAE algorithm

This component applies Fisher’s exact test to obtain the p-value corresponding to the ARs. Specifically, the p-value is used to rank a set of ARs. We rank them in ascending order and establish a threshold. Then, the n rules above that threshold (lower value) are considered as true, and the rules below that threshold (higher value) as false.

In order to compute the p-value, the data set is split into two halves called exploratory (50%) and holdout (50%). Then, the FP-Growth algorithm is applied to extract the ARs in both sets.

These two sets of rules allow us to apply the Fisher test to obtain the p-values for the rules in the holdout set. This is done by building, for each rule R: $A \longrightarrow B$ in the holdout set, a contingency table with the following data for the rule collected in the exploratory dataset:

	Rules with B ($n_2$)	Rules without B ($N - n_2$)
Rules with A ($n_1$)	Rules with A and B (k)	Rules with A (without B) ($n_1 - k$)
Rules without A ($N - n_1)$	Rules with B (without A) ($n_2 - k$)	Rules without A and B ($N - n_1 - n_2 + k$)

The p-value is computed as the hypergeometric distribution of the numbers contained in the cells of the table:

$$\begin{aligned} p(R) = \frac{\left( {\begin{array}{c}n_1\\ k\end{array}}\right) \left( {\begin{array}{c}N-n_1\\ n_2-k\end{array}}\right) }{\left( {\begin{array}{c}N\\ n_2\end{array}}\right) } \end{aligned}$$

(1)

where N is the number of rules in the exploratory set, K is the number of rules in this set containing A and B, $n_1$ is the number of rules containing A, $n_2$ is the number of rules containing B.

The value of the threshold is key to the performance of this unsupervised algorithm. Other studies using Fisher’s test for filtering the relevant ARs¹⁴ set the threshold following some heuristic such as taking the value which provides a certain number of relevant rules. However, in our case, it can be set with high accuracy using the data from the training set. The threshold is chosen so that the number of hits in the set of ARs of the training set is maximized. In this way, the unsupervised component becomes supervised to a certain degree. This is a great improvement for this method, which is later used to improve the results of the supervised component, resulting in a semi-supervised system that improves its two components.

Supervised component of EXTRAE algorithm

Thanks to the availability of manually annotated ARs, we can make use of classical ML systems to build a classifier that indicates whether a new rule is relevant. Specifically, we applied a Random Forest (default parameters provided by WEKA v3.8.2) with the following two groups of features:

The first set of features have been extracted from the content of the association rules and the output of the FP-Growth algorithm:
- Support.
- Confidence.
- Lift. The lift value is the quotient of the posterior and prior confidence of an association rule. That is, if “$\emptyset \rightarrow$ flu” has a confidence of 60% and “cough $\rightarrow$ flu” has a confidence of 72%, then the lift value (of the second rule) is 72/60 = 1.2.
- Number of antecedents. The number of antecedents of an AR “A and B $\rightarrow$ C” is the number of elements of the set S ={ A, B }.
- Number of consequents. The number of consequents of an AR “A and B $\rightarrow$ C” is the number of elements of the set S ={ C }.
The second group of features attempts to capture medical information on diseases:
- CDC. The Centers for Disease Control and Prevention (CDC) is the national public health agency of the United States and provides a list of diseases and conditions. This feature provides the normalised value of the association rule items that belong to this list.
- DIS. In this feature, the items of the association rule that are diseases but do not appear in the CDC list are detected and the normalised value is provided.
- ADD. The CoRIS dataset provides 34 adverse event types. These adverse events are a very serious type of disease like heart attck, lymphoma, cancer, etc.). This rule provides the normalised value of the presence of this type of adverse disease among the items of the association rule.
- COD. The CoRIS dataset provides 123 types of cause of death. This feature identifies whether any item in the association rule belongs to this list and uniquely identifies it.
- CIE. For each item of the association rule we have identified its ICD10 code. Due to the large number of possible labels of the complete code (more than 71K) we have only used the first character corresponding to the ”chapter”. Therefore we have created 24 features in which each rule indicates the number of items belonging to that ICD10 chapter.

Semi-supervised EXTRAE algorithm

The semi-supervised combination of the components described above results in the semi-supervised EXTRAE system capable of classifying relevant ARs with high accuracy while requiring a minimal amount of training data. This system starts from a small seed dataset S of annotated rules. According to previous studies¹⁶, this initial seed can be set to around 10 rules. This seed set is used to train the supervised component leading to a ML model. This ML model is then used to predict the class (i.e. relevant or not) for each AR in the rest of the training set. The last computed seed dataset is also used to compute an accurate p-value threshold for the unsupervised component. After sorting S according to their p-value, we choose as threshold the p-value that maximizes the hits for the seed set. Next, the unsupervised component is applied to the development set to filter the relevant ARs. Afterwards, the results of both components, supervised and unsupervised, are applied to the development dataset for enlarging the seed set. Specifically, the ARs for which the predictions of both components match, i.e. both are true or both are false (coincident set of ARs), are added to the seed set and removed from the development set. The new seed set is used to train the supervised module again, as well as to adjust the threshold of the unsupervised component. This process is repeated until the coincident set of ARs is empty (i.e. the seed set cannot grow anymore). Figure 2 shows a scheme of the system.

Results

In this section we present the experimental results obtained by the system on the ARAIDS dataset along with a parameter analysis.

For the evaluation of our system we have used a set of standard evaluation measures which focus on different aspects of the results. F-measure is a combination of the system precision and recall. AUC ROC estimates the area under the ROC curve for machine learning model comparison, where the ROC curve is a graph representing the performance of the system as its discrimination threshold for binary classification varies. We also use the PR Curve (PRC), the result of drawing the graph between the precision and the recall. This graph shows from which recall we have a degradation of the precision and vice versa. The area under this curve (AU-PRC) provides a value to compare different systems.

We have carried out a fivefold cross-validation for evaluation. Since EXTRAE is a semi-supervised system, only a portion of the training rules are used for training (seed subset), depending on the different configurations analyzed.

Table 2 shows the results of the semi-supervised method based on Incremental Learning (EXTRAE Algorithm). In order to evaluate the results, the three evaluation measures that state-of-the-art systems typically employ (F-measure, AUR-ROC, and AU-PRC) are provided. As for other values shown in the table, seed size is the original size of the training set from which the set is automatically increased. Iterations show the number of times that new rules need to be added to the seed set, in order that a set is reached to which no new rule can be added. The p-value is calculated from the seed set. Results show the performance of the system after n iterations. From the results shown in this Table, the best seed size is 35. A p-value threshold of 1.12E−13 is calculated on this seed size and after 8 iterations an f-measure of 0.84 is obtained.

Table 2 Results of EXTRAE algorithm on ARAIDS dataset using different seed sizes, based on their F-measure, AUC-ROC, and AU-PRC.

Full size table

Table 3 shows the partial results of the EXTRAE Algorithm in each iteration for the best configuration shown in Table 2. These results correspond to one of the 5 partitions used as part of the fivefold cross validation method. In the first iteration, 478 new rules are added and an F-measure of 0.72 is obtained. From the fifth iteration, the number of matching rules is greatly reduced and, in this way, the performance increases slowly until it reaches an F-measure of 0.84. In only eight iterations it increases its performance by 75%, which proves the high quality of the added rules. If we look at the AUC-ROC and AU-PRC, the final results are even higher than the F-measure. This shows that the system performance is robust when using different evaluation criteria.

Table 3 Evolution of learning from a seed set with 35 rules from one of the 5 partitions used as part of the fivefold cross validation, based on their F-Measure, AUC-ROC, and AU-PRC.

Full size table

Table 4 shows the results of the EXTRAE system compared to the two supervised and unsupervised systems of which EXTRAE is composed. For the supervised system, results are provided when using the same number of seeds (35) as used by EXTRAE, and also for the maximum number of association rules (80%) of the training set. Note that the test set is 20% of the dataset. Finally, the results of the unsupervised system are shown taking into account the best performing p-value. The unsupervised system performs significantly worse than EXTRAE, which shows the complexity of the problem. In the case of the supervised system trained with 35 rules, it was clear that the amount of training data was too low for a supervised system, but we wanted to reflect the power of EXTRAE when using the same number of rules.

Finally, EXTRAE obtains better results than the supervised system trained with 800 rules by a slight difference. This comparison is most interesting because, despite initially training with a seed of 35 rules, EXTRAE manages to select the most discriminating rules from the training set to end up with a higher quality rule set than the supervised system and, thus, improve its performance.

Table 4 Results of the best pereformance for the EXTRAE algorithm, and both supervised and unsupervised systems on ARAIDS dataset using the best configuration of EXTRAE, based on their F-measure, AUC-ROC, and AU-PRC.

Full size table

Discussion

In this study, we have proven that it is possible to perform filtering of relevant ARs with high accuracy, using a semi-supervised system capable of operating on an extremely small amount of training data. Specifically, in this case, the optimal results have been reached with only 35 annotated ARs. The results for the data considered here are even better than those obtained by the EXTRAE algorithm applied to other data in a previous study¹⁶. In the latter, the semi-supervised algorithm applied to primary care data obtained an F-measure of 0.75, AUC-ROC of 0.80 and AU-PRC of 0.81 respectively.

We have demonstrated that it is advantageous to perform association rule filtering with a semi-supervised system, and that the results of such a system are able to outperform both unsupervised and supervised systems when using a reduced amount of training data.

Association rules are a fairly simple artifact in their form, unlike, for example, texts. Therefore, a small number of parameters is enough to characterize their form, as well as aspects related to the frequency of the diseases involved and their combinations (captured by the features of support and confidence).

The medical aspects of the system are to a great extent captured by the unsupervised part. In this part, the system includes the most statistically significant ARs according to the diseases that make up each rule. The combination of both parts and their joint evolution leads the system to select the most relevant ARs.

The prediction of relevant ARs provided by the proposed model and its validation by a HIV medical expert has provided an interesting collection of HIV-related disease relationships. We describe below a number of relationships that have appeared during the validation process.

The relationship of depression with cardiovascular disease, diabetes and cancer is described in the literature and is probably due to immune factors, toxic habits, drugs, etc. and has, therefore, been considered to be true. Similarly, the relationship between psychosis with cardiovascular risk and diabetes has also been described for the same reasons and has also been considered to be true^18,19,20,21.

The relationship between acute myocardial infarction (AMI) and dementia (especially when it is of vascular origin) has also been described in the literature, so it has been considered true^22,23.

Likewise, the relationship between diabetes and dementia is also reported in the literature, so it has been considered true²⁴.

A doubtful case is the relationship between fracture and neoplasm: in extended neoplasms when bone metastases occur there may be pathological fractures, but this is not frequent, and it is not the type of fracture referred to in the CoRIS database, so it has been considered false.

The relationship between Kaposi’s sarcoma and bronchial neoplasm has been considered true since Kaposi’s sarcoma is a neoplasm that can affect the lung.

The relationship between cachectic syndrome (cachectic sd) and neoplasms has been considered false, because even though the latter can produce a cachectic sd, in the CoRIS database it is specified that it refers to the former due to HIV or wasting syndrome in particular.

Regarding the term “secondary malignant neoplasm of other specified sites”, if it appears in an AR whose consequent is lung or bladder cancer or head and neck cancer, the AR has been considered true because it can be a metastasis of these neoplasms. However, it has been considered false if the consequent of AR is recurrent bacterial pneumonia, because although the latter increases the risk of lung neoplasm, it does not increase the risk of any neoplasm.

The relationship between lactic acidosis and diabetes and cancer has been described²⁵ so it has been considered true.

Appendix A includes a series of relationships considered to be true since they have been confirmed in the literature, together with some references to them. The appendix also gathers relationships that have been considered false either because they lack consistent support in the literature or because they have been ruled out by the literature.

Data availability

The datasets supporting the conclusions of this article are included within the article and its tables, as well as in the supplementary material. The original HIV data from the CoRIS Cohort can be requested to this organization under the corresponding agreement.

References

Agrawal, R., Imielinski, T. & Swami, A. N. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, DC, USA, May 26-28, 1993., 207–216 (1993).
Imamura, T. et al. A technique for identifying three diagnostic findings using association analysis. Med. Biol. Eng. Comput. 45, 51–59 (2007).
Article PubMed Google Scholar
Park, S. H., Jang, S. Y., Kim, H. & Lee, S. W. An association rule mining-based framework for understanding lifestyle risk behaviors. PLoS ONE 9, e88859 (2014).
Article ADS PubMed PubMed Central Google Scholar
Rao, P. S. & Devi, T. U. Applicability of apriori based association rules on medical data. Int. J. Appl. Eng. Res. 12, 9451–9458 (2017).
Google Scholar
Manolitsis, I. et al. Using association rules in antimicrobial resistance in stone disease patients. Stud. Heal. Technol. Inform. 462–465 (2022).
Ou, J. & Zhang, J. Data mining and meta-analysis of psoriasis based on association rules. J. Healthc. Eng. 2022 (2022).
Babu, S. A., Raj, R. J. S., Varalatchoumy, M., Gopila, M. & Justin, B. V. F. Novel approach for predicting covid-19 symptoms using arm based apriori algorithm. In 2022 6th International Conference on Computing Methodologies and Communication (ICCMC), 1577–1580 (IEEE, 2022).
Nasiri, M. et al. Risk factors affecting death from hospital-acquired infections in trauma patients: Association rule mining. J. Health Manag. Inform. 8, 27–33 (2021).
Google Scholar
Han, J., Pei, J. & Yin, Y. Mining frequent patterns without candidate generation. ACM Sigmod Rec. 29, 1–12 (2000).
Article Google Scholar
Prajapati, D. J., Garg, S. & Chauhan, N. Interesting association rule mining with consistent and inconsistent rule detection from big sales data in distributed environment. Futur. Comput. Inform. 2, 3 (2017).
Google Scholar
García, E., Romero, C., Ventura, S. & Calders, T. Drawbacks and solutions of applying association rule mining in learning management systems. In Proceedings of the International Workshop on Applying Data Mining in e-Learning (ADML 2007), Crete, Greece, 13–22 (sn, 2007).
Dahbi, A., Jabri, S., Ballouki, Y. & Gadi, T. A new method to select the interesting association rules with multiple criteria. Int. J. Intell. Eng. Syst. 10, 191–200 (2017).
Google Scholar
Liu, G., Zhang, H. & Wong, L. Controlling false positives in association rule mining. Proc. VLDB Endow. 5, 145–156 (2011).
Article Google Scholar
Webb, G. I. Discovering significant patterns. Mach. Learn. 71, 131 (2008).
Article Google Scholar
Saravanan, R. & Sujatha, P. A state of art techniques on machine learning algorithms: a perspective of supervised learning approaches in data classification. In 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS), 945–949 (IEEE, 2018).
Sánchez-de-Madariaga, R., Martínez-Romo, J., Escribano, J. M. C. & Araujo, L. Semi-supervised incremental learning with few examples for discovering medical association rules. BMC Med. Inform. Decis. Mak. 22, 20 (2022).
Article PubMed PubMed Central Google Scholar
Sobrino-Vegas, P. et al. La cohorte de la red española de investigación en sida y su biobanco: Organización, principales resultados y pérdidas al seguimiento. Enfermedades Infecciosas y Microbiol. Clínica 29, 645–653 (2011).
Article Google Scholar
De Hert, M. et al. Cardiovascular disease and diabetes in people with severe mental illness position statement from the European Psychiatric Association (EPA), supported by the European Association for the Study of Diabetes (EASD) and the European Society of Cardiology (ESC). Eur. Psychiatry 24, 412–424 (2009).
Article PubMed Google Scholar
Holt, R. I. Association between antipsychotic medication use and diabetes. Curr. Diabetes Rep. 19, 1–10 (2019).
Article Google Scholar
Howell, S., Yarovova, E., Khwanda, A. & Rosen, S. D. Cardiovascular effects of psychotic illnesses and antipsychotic therapy. Heart 105, 1852–1859 (2019).
Article CAS PubMed Google Scholar
Sánchez, M. C., Escurriola, M. F., Baquero, D. B., Arno, A. G. & Callol, J. A. V. Psicosis, riesgo cardiovascular y mortalidad asociada: vamos por el buen camino?. Clínica e Investig. en Arter. 26, 23–32 (2014).
Google Scholar
Deckers, K. et al. Coronary heart disease and risk for cognitive impairment or dementia: Systematic review and meta-analysis. PLoS ONE 12, e0184244 (2017).
Article PubMed PubMed Central Google Scholar
Sundbøll, J. Depression, stroke, and dementia in patients with myocardial infarction. Dan. Med. J. 65, B5423 (2018).
PubMed Google Scholar
Xue, M. et al. Diabetes mellitus and risks of cognitive impairment and dementia: A systematic review and meta-analysis of 144 prospective studies. Ageing Res. Rev. 55, 100944 (2019).
Article CAS PubMed Google Scholar
Kraut, J. A. & Madias, N. E. Lactic acidosis. New Engl. J. Med. 371, 2309–2319 (2014).
Article PubMed Google Scholar

Download references

Acknowledgements

This study has been partially supported by the Spanish Ministry of Science and Innovation within the DOTT-HEALTH Project (MCI/AEI/FEDER, UE) under Grant PID2019-106942RB-C32, the OBSER-MENH Project (MCIN/AEI/10.13039/501100011033 and UE (“NextGenerationEU”/PRTR)) under Grant TED2021-130398B-C21 and the project RAICES (IMIENS 2022), PI18CIII/00004 “Infobanco para uso secundario de datos basado en estándares de tecnología y conocimiento: implementación y evaluación de un infobanco de salud para CoRIS (Info-bank for the secondary use of data based on technology and knowledge standards: implementation and evaluation of a health info-bank for CoRIS) - SmartPITeS” and PI18CIII/00019 - PI18/00890 - PI18/00981 “Arquitectura normalizada de datos clínicos para la generación de infobancos y su uso secundario en investigación: solución tecnológica (Clinical data normalized architecture for the generation of info-banks and their secondary use in research: technological solution) - CAMAMA 4” from Fondo de Investigación Sanitaria (FIS) Plan Nacional de I+D+i.

The RIS cohort (CoRIS) is supported by the Instituto de Salud Carlos III through the Red Temática de Investigación Cooperativa en Sida (RD06/006, RD12/0017/0018 and RD16/0002/0006) as part of the Plan Nacional R+D+I and co-financed by ISCIII-Subdirección General de Evaluación and el Fondo Europeo de Desarrollo Regional (FEDER). The list of members of the Cohort of the Spanish HIV Research Network (CoRIS) is included in the Supplementary Material. Additional relationships between HIV-related diseases confirmed or discarded are included as Supplementary Material.

This study would not have been possible without the collaboration of all patients, medical and nursing staff and data mangers who have taken part in the Project.

Author information

A list of authors and their affiliations appears at the end of the paper.

Authors and Affiliations

Languages and Information Systems Dpt., ETS Ingeniería Informática (UNED), Juan del Rosal 16, 28040, Madrid, Spain
Lourdes Araujo & Juan Martinez-Romo
Hospital Universitario 12 de Octubre, Instituto de Investigación I+12, Madrid, Spain
Otilia Bisbal
Telemedicine and e-Health Research Unit, Instituto de Salud Carlos III, Monforte de Lemos 5, 28029, Madrid, Spain
Ricardo Sanchez-de-Madariaga
Instituto Mixto UNED-ISCIII IMIENS, 28029, Madrid, Spain
Lourdes Araujo, Juan Martinez-Romo & Ricardo Sanchez-de-Madariaga
Hospital General Universitario de Alicante, Alicante, Spain
Joaquín Portilla, Irene Portilla, Esperanza Merino, Gema García, Iván Agea, José Sánchez-Payá, Juan Carlos Rodríguez, Livia Giner, Sergio Reus, Vicente Boix, Diego Torrus, Verónica Pérez & Julia Portilla
Hospital Universitario de Canarias, San Cristóbal de la Laguna, Spain
Juan Luís Gómez, Jehovana Hernández, Ana López Lirola, Dácil García, Felicitas Díaz-Flores, M. Mar Alonso, Ricardo Pelazas & M. Remedios Alemán
Hospital Universitario Central de Asturias, Oviedo, Spain
Víctor Asensi, María Eugenia Rivas Carmenado & Tomás Suarez-Zarracina
Hospital Universitario 12 de Octubre, Madrid, Spain
Federico Pulido, Rafael Rubio, Otilia Bisbal, M. Asunción Hernando, David Rial, María de Lagarde, Octavio Arce, Adriana Pinto, Laura Bermejo, Mireia Santacreu, Roser Navarro & Candela Gonzalez
Servicio de Enfermedades Infecciosas, Hospital Universitario Donostia, Instituto de Investigación BioDonostia, Donostia-San Sebastián , Spain
Jose Antonio Iribarren, M. José Aramburu, Xabier Camino, Miguel Ángel von Wichmann, Miguel Ángel Goenaga, M. Jesús Bustinduy, Harkaitz Azkune, Maialen Ibarguren, Xabier Kortajarena, Ignacio Álvarez-Rodriguez, Leire Gil & Lourdes Martínez
Hospital General Universitario De Elche, Elche, Spain
Félix Gutiérrez, Catalina Robledano, Mar Masiá, Sergio Padilla, Araceli Adsuar, Rafael Pascual, Marta Fernández, Antonio Galiana, José Alberto García, Xavier Barber, Vanessa Agullo, Javier Garcia Abellán, Reyes Pascual, Guillermo Telenti, Lucia Guillén & Ángela Botella
Hospital Universitari Germans Trias i Pujol (Can Ruti), Badalona, Spain
Roberto Muga, Arantza Sanvisens & Daniel Fuster
Hospital General Universitario Gregorio Marañón, Madrid, Spain
Juan Berenguer, Isabel Gutierrez, Juan Carlos López, Margarita Ramírez, Belén Padilla, Paloma Gijón, Teresa Aldamiz-Echevarría, Francisco Tejerina, Cristina Diez, Leire Pérez, Chiara Fanciulli & Saray Corral
Hospital Universitari de Tarragona Joan XXIII, Tarragona, Spain
Francesc Vidal, Anna Martí, Joaquín Peraire, Consuelo Viladés, Montserrat Vargas, Montserrat Olona, Anna Rull, Verónica Alba, Elena Yeregui, Jenifer Masip, Graciano García-Pardo, Frederic Gómez Bertomeu & Sonia Espineira
Hospital Universitario y Politécnico de La Fe, Valencia, Spain
Marta Montero, Sandra Cuéllar, Marino Blanes, María Tasias, Eva Calabuig, Miguel Salavert, Juan Fernández & Inmaculada Segarra
Hospital Universitario La Paz/IdiPAZ, Madrid, Spain
Juan González-García, Ana Delgado-Hierro, José Ramón Arribas, Victor Arribas, Jose Ignacio Bernardino, Carmen Busca, Joanna Cano, Julen Cardiñanos, Juan Miguel Castro, Luis Escosa, Iker Falces, Pedro Herranz, Victor Hontañón, Milagros García, Alicia González-Baeza, Ma Luz Martín-Carbonero, Mario Mayoral, Ma Jose Mellado, Rafael Micán, Rosa de Miguel, Rocío Montejano, Ma Luisa Montes, Victoria Moreno, Luis Ramos, Berta Rodés, Talía Sainz, Elena Sendagorta & Eulalia Valencia
Hospital San Pedro Centro de Investigación Biomédica de La Rioja (CIBIR), Logroño, Spain
Jose Ramón Blanco, Laura Pérez-Martínez, José Antonio Oteo, Valvanera Ibarra, Luis Metola & Mercedes Sanz
Hospital Universitario Miguel Servet, Zaragoza, Spain
Piedad Arazo & Gloria Sampériz
Hospital Universitari Mutua Terrassa, Terrassa, Spain
David Dalmau, Marina Martinez, Angels Jaén, Montse Sanmartí, Mireia Cairó, Javier Martinez-Lacasa, Pablo Velli, Roser Font, Mariona Xercavins, Noemí Alonso & Francesco Aiello
Complejo Hospitalario de Navarra, Pamplona, Spain
María Rivero, Beatriz Piérola, Maider Goikoetxea, María Gracia, Carlos Ibero, Estela Moreno & Jesús Repáraz
Parc Taulí Hospital Universitari, Sabadell, Spain
Gemma Navarro, Manel Cervantes Garcia, Sonia Calzado Isbert, Marta Navarro Vilasaro & Belen Lopez Garcia
Hospital Universitario de La Princesa, Madrid, Spain
Ignacio de los Santos, Alejandro de los Santos, Jesús Sanz, Lucio García-Fraile, Enrique Martín, Ildefonso Sánchez-Cerrillo, Marta Calvet, Ana Barrios, Azucena Bautista, Carmen Sáez, Marianela Ciudad & Ángela Gutiérrez
Hospital Universitario Ramón y Cajal, Madrid, Spain
Santiago Moreno, Santos del Campo, José Luis Casado, Fernando Dronda, Ana Moreno, M. Jesús Pérez, Sergio Serrano, Ma Jesús Vivancos, Javier Martínez-Sanz, Alejandro Vallejo, Matilde Sanchez, Jose Antonio Pérez-Molina & José Manuel Hermida
Hospital General Universitario Reina Sofía, Murcia, Spain
Enrique Bernal, Antonia Alcaraz, Joaquín Bravo, Ángeles Muñoz, Cristina Tomás, Mónica Martínez & M. Carmen Villalba
Hospital Nuevo San Cecilio, Granada, Spain
Federico García, Clara Martínez, José Hernández, Leopoldo Muñoz Medina, Marta Álvarez, Natalia Chueca, David Vinuesa, Adolfo de Salazar, Ana Fuentes, Emilio Guirao, Laura Viñuela, Andrés Ruiz-Sancho & Francisco Anguita
Centro Sanitario Sandoval, Madrid, Spain
Jorge Del Romero, Montserrat Raposo, Carmen Rodríguez, Teresa Puerta, Juan Carlos Carrió, Mar Vera, Juan Ballesteros, Oskar Ayerdi, Begoña Baza & Eva Orviz
Hospital Clínico Universitario de Santiago, Santiago de Compostela, Spain
Antonio Antela & Elena Losada
Hospital Universitario Son Espases, Palma de Mallorca, Spain
Melchor Riera, María Peñaranda, M. Angels Ribas, Antoni A. Campins, Mercedes Garcia-Gazalla, Francisco J. Fanjul, Javier Murillas, Francisco Homar, Helem H. Vilchez, Luisa Martin & Antoni Payeras
Hospital Universitario Virgen de la Victoria, Málaga, Spain
Jesús Santos, María López, Crisitina Gómez, Isabel Viciana & Rosario Palacios
Hospital Universitario Virgen del Rocío, Seville, Spain
Luis Fernando López-Cortés, Nuria Espinosa, Cristina Roca & Silvia Llaves
Hospital Universitario de Bellvitge, Hospitalet de Llobregat, Spain
Juan Manuel Tiraboschi, Arkaitz Imaz, Ana Karina Silva, María Saumoy & Sofía Catalina Scévola
Hospital Universitario Valle de Hebrón, Barcelona, Spain
Adrián Curran, Vicenç Falcó, Jordi Navarro, Joaquin Burgos, Paula Suanzes, Jorge García, Vicente Descalzo, Patricia Álvarez, Bibiana Planas, Marta Sanchiz & Lucía Rodríguez
Hospital Costa del Sol, Marbella , Spain
Julián Olalla, M. José Sánchez, Javier Pérez, Alfonso del Arco, Javier de la Torre & José Luis Prada
Hospital General Universitario Santa Lucía, Cartagena, Spain
Onofre Juan Martínez, Lorena Martinez, Francisco Jesús Vera, Josefina García, Begoña Alcaraz & Antonio Jesús Sánchez Guirao
Complejo Hospitalario Universitario a Coruña (Chuac), A Coruña, Spain
Alvaro Mena, Angeles Castro, Berta Pernas, Pilar Vázquez & Soledad López
Hospital Universitario Basurto, Bilbao, Spain
Sofía Ibarra, Guillermo García, Josu Mirena, Oscar Luis Ferrero, Josefina López, M. Mar Cámara, Mireia de la Peña, Miriam Lopez, Iñigo Lopez, Itxaso Lombide, Victor Polo & Joana de Miguel
Hospital Universitario Virgen de la Arrixaca, El Palmar, Spain
Carlos Galera, Marian Fernández, Helena Albendin, Antonia Castillo, Asunción Iborra, Antonio Moreno, M. Angustias Merlos & Asunción Vidal
Hospital de la Marina Baixa, La Vila Joiosa, Spain
Concha Amador, Francisco Pasquau, Concepcion Gil & Jose Tomás Algado
Hospital Universitario Infanta Sofía, San Sebastián de los Reyes, Spain
Inés Suarez-García, Eduardo Malmierca, Patricia González-Ruano, M. Pilar Ruiz, José Francisco Pascual, Elena Sáez & Luz Balsalobre
Hospital Universitario de Jaén, Jaén, Spain
M. Villa López, Mohamed Omar, Carmen Herrero & M. Amparo Gómez
Hospital Universitario San Agustín, Avilés, Spain
Miguel Alberto de Zarraga & Desiré Pérez
Hospital Clínico San Carlos, Madrid, Spain
Vicente Estrada, Nieves Sanz, Noemí Cabello, Jorge Vergas García, Maria Jose Núñez & Iñigo Sagastagoitia
Hospital Universitario Fundación Jiménez Díaz, Madrid, Spain
Miguel Górgolas, Alfonso Cabello, Beatriz Álvarez, Laura Prieto & Irene Carrillo
Hospital Universitario Príncipe de Asturias, Alcalá de Henares, Spain
José Sanz, Alberto Arranz, Cristina Hernández & María Novella
Hospital Clínico Universitario de Valencia, Valencia, Spain
M. José Galindo & Ana Ferrer
Hospital Reina Sofía, Córdoba, Spain
Antonio Rivero Román, Inma Ruíz, Antonio Rivero Juárez, Pedro López, Isabel Machuca, Mario Frias, Ángela Camacho, Ignacio Pérez, Diana Corona, Ignacio Pérez & Diana Corona
Hospital Universitario Severo Ochoa, Leganés, Spain
Miguel Cervero & Rafael Torres
Nuestra Señora de Valme, Seville, Spain
Juan Antonio Pineda, Pilar Rincón, Juan Macías, Luis Miguel Real, Anais Corma, Marta Fernández & Alejandro Gonzalez-Serna
Hospital Álvaro Cunqueiro, Vigo, Spain
Eva Poveda, Alexandre Pérez, Luis Morano, Celia Miralles, Antonio Ocampo, Guillermo Pousada & Lucía Patiño
Hospital Clínico Universitario de Valladolid, Valladolid, Spain
Carlos Dueñas, Sara Gutiérrez, Elena Tapia, Cristina Novoa, Xjoylin Egües & Pablo Telleria

Authors

Lourdes Araujo
View author publications
You can also search for this author in PubMed Google Scholar
Juan Martinez-Romo
View author publications
You can also search for this author in PubMed Google Scholar
Otilia Bisbal
View author publications
You can also search for this author in PubMed Google Scholar
Ricardo Sanchez-de-Madariaga
View author publications
You can also search for this author in PubMed Google Scholar

Consortia

Contributions

L.A., J.M. and R.S. designed the semi-supervised model, J.M. performed the experiments, L.A. and J.M. wrote a draft of the article, O.B. evaluated the relationships between diseases encountered. All authors analyzed the results and reviewed the manuscript.

Corresponding author

Correspondence to Lourdes Araujo.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Araujo, L., Martinez-Romo, J., Bisbal, O. et al. Discovering HIV related information by means of association rules and machine learning. Sci Rep 12, 18208 (2022). https://doi.org/10.1038/s41598-022-22695-y

Download citation

Received: 28 June 2022
Accepted: 18 October 2022
Published: 28 October 2022
DOI: https://doi.org/10.1038/s41598-022-22695-y

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

	Rules with B (\(n_2\))	Rules without B (\(N - n_2\))
Rules with A (\(n_1\))	Rules with A and B (k)	Rules with A (without B) (\(n_1 - k\))
Rules without A (\(N - n_1)\)	Rules with B (without A) (\(n_2 - k\))	Rules without A and B (\(N - n_1 - n_2 + k\))