Graph neural network recommendation algorithm based on improved dual tower model

In this era of information explosion, recommendation systems play a key role in helping users to uncover content of interest among massive amounts of information. Pursuing a breadth of recall while maintaining accuracy is a core challenge for current recommendation systems. In this paper, we propose a new recommendation algorithm model, the interactive higher-order dual tower (IHDT), which improves current models by adding interactivity and higher-order feature learning between the dual tower neural networks. A heterogeneous graph is constructed containing different types of nodes, such as users, items, and attributes, extracting richer feature representations through meta-paths. To achieve feature interaction, an interactive learning mechanism is introduced to inject relevant features between the user and project towers. Additionally, this method utilizes graph convolutional networks for higher-order feature learning, pooling the node embeddings of the twin towers to obtain enhanced end-user and item representations. IHDT was evaluated on the MovieLens dataset and outperformed multiple baseline methods. Ablation experiments verified the contribution of interactive learning and high-order GCN components.


www.nature.com/scientificreports/
We propose an interactive high-order twin-tower model (IHDT) that considers the speed advantage of the twin-tower model and the accuracy advantage of GNNs.This model is built on a heterogeneous graph of user products and uses an interactive learning mechanism to inject product (user) features related to users (products) into the corresponding encoder.The model uses high-order feature expression based on graph convolutional networks to aggregate multipored neighbor information of nodes and enhance the vector representation of users and products.Finally, the inner product between the augmented representations is calculated as the predicted value of the recommendation system.The effectiveness of the model was verified on public datasets, and the results show that IHDT achieves state-of-the-art performance comparable to multiple powerful benchmarks.The IHDT model thus offers interactive modeling of user products and graph-based high-order feature learning, and as such provides new ideas for large-scale recommendation systems considering both accuracy and diversity.

Model Problem definition
Real-world data often contain multiple types of objects and their interactions, which makes it difficult to model the data using homogeneous information networks, as representing learning using a homogeneous information network captures only some features.However, these heterogeneous data can be naturally modeled using heterogeneous information networks, in which multiple types of objects and interactions can coexist, containing rich network structure information and semantic information.The relevant definitions are shown below.
Definition 1 Information Network.An information network is represented as G = (V, E, ϕ, ψ) , consisting of a set of objects V, a set of links E, a mapping function of object types ϕ : V → A , and a mapping function of relationship types ψ : E → R .A is the set of object types, and R is the set of relationship types.Definition 2 Homogeneous/Heterogeneous Information Network.An information network is heterogeneous if the number of object types is |A| > 1 or the number of relationship types is |R| > 1 .Otherwise, it is called a homogeneous information network.
A simple heterogeneous information network is shown in Fig. 1.It contains three types of nodes: user, item, and brand, and the relationships between them.
Definition 3 Network Model.The network model 45 can be represented as T G = (A, R) .It is a directed graph with object type A as nodes and relationship type R as edges.

Definition 4 Meta-path.
A meta-path is a path of two classes of objects, A and R, in the network pattern T G = (A, R) , which can be expressed as In the heterogeneous information network, as shown in Fig. 2, different meta-paths are selected for user u 1 to obtain different higher-order connectivity.

IHDT model
The IHDT model (see Fig. 3 for the architecture diagram) is based on interactive, high-order learning mechanisms to improve the dual tower model and the accuracy of the recommendation system.The IHDT model introduced in this paper learns node representations and applies them to recommendations in the following steps.
(1) Select different-pattern meta-paths i in the data source to form different-pattern subgraphs G i and finally merge all subgraphs to form the recommendation base graph G (i.e., G = G i ). (a) Higher-order connectivity based on user-item meta-paths of u 1 ; (b) Higher-order connectivity based on user-attribute-item meta-paths of u 1 .www.nature.com/scientificreports/ (2) Use random initialization or a pre-training model to obtain different node initialization expressions of G ; the i-th user node is expressed as e u , the k-th attribute of this user is expressed as e k u , the j-th item is expressed as e i , and the h-th attribute of this item is expressed as e h i .
(3) Using the interaction learning mechanism, the interaction mechanism expressions a u and a v of the user and item nodes are derived as the two inputs of the dual tower model.(4) Higher-order mechanism expression learning based on GCN converges and aggregates the multifaceted representations of nodes to obtain the final representations e * u and e * i of users and items.(5) Finally, the inner product of e * u and e * i is calculated and used as the final prediction value of the model.

Method
The construction of the IHDT model, including implementing the interactive and high-order learning mechanisms, consists of the following steps.

Interactive learning
The framework of the proposed model is shown in Fig. 3. Add user feature vector a u and product feature vector a v to the user and product input terminals, respectively.The user and item embedding representations z i u ,z j v ∈ R 1×d ′ under specific user-item interaction are learned through the interaction mechanism G.The connection layer passes z u ∈ R nu×d and z v ∈ R nv×d into the user tower and product tower, and outputs the vector representations of users and products e u and e i : where a u and a v are derived from the information captured by the interaction behaviors in the corpus.

High-order learning
The final output vectors e u and e i obtained in the dual tower model can only represent the first-order information of the current user (i.e., the rich feature information of the user (item)); the first-order interaction between the user and the item can be obtained in the dual tower model, but the convergence of higher-order information between the user and the item cannot be achieved.Using e u and e i as the input of GCN, first-order and high-order propagation aggregation rules are designed to obtain the final embedding representations of users and products e * u and e * i .GCNs can achieve this purpose by propagating the convergence, ensuring the richness of e u and e i as well as the accuracy brought by the prediction: where W 1 , W 2 ∈ R d ′ ×d is the trainable weight matrix used to extract useful propagation information, and d ′ is the transformation dimension, while p ui is set as the Laplacian matrix of the graph, and N u and N i denote the first-order neighbors of user u and item i, respectively.
On top of first-order propagation and convergence, higher-order propagation and convergence are embedded into the propagation layers by stacking l.Users (and items) can receive messages propagated from their L-Hop neighbors.Then, the final embedding representation of the final l-th layer can be obtained: where E (l) ∈ R (N+M)d l is the representation obtained after the users, and items are embedded in the propagation convergence l layer.

Model prediction
After propagation and convergence through l layers, the embedding representation of each user (item) is obtained at each layer, and the final user e * u and item e * i embedding representations are obtained by a simple join operation: Then, preference prediction is performed by computing the inner product of the final embedding representations of users and items: Finally, model optimization is performed based on the Bayesian personalized ranking (BPR) loss: where O represents sample set u, I represents positive samples, and u, i − represents negative samples.Finally, the above BPR loss function can be minimized, the parameters in the whole model can be optimized end-to-end in the form of backpropagation, and all parameters converge to a fixed value with the optimization process of the model. (1) (

Overall process
According to the above, the overall training process of the IHDT model is as follows.1.
For each dataset, 90% of the historical interactions of each user were randomly selected to form the training set, and the rest were used as the test set.Each pair of training and test sets is complementary, and recombining them can yield the initial dataset.From the training set, 10% of the interactions were randomly selected as the validation set to tune the hyperparameters.Each observed user-item interaction was treated as a positive instance.Then, a negative sampling strategy was used to pair it with a negative item the user had not used before.

Evaluation metrics
For each user in the test set, we considered all items with which the user had no interaction as negative samples and items that the user had interactions with as positive samples.We utilized four commonly used performance evaluation metrics: precision, recall, discounted cumulative gain (NDCG), and hit rate (HR).
We briefly introduce these metrics as follows.
Precision@K is used to evaluate the proportion of products related to the user among the top K products recommended to the user.It is calculated as follows: where d i (K) is the intersection of the top K products recommended to the user and the products in the test set that the user interacts with.
Recall@K is used to evaluate the proportion of the products that are related to the user to those that are recommended to the user: NDCG@K is used to evaluate the accuracy of the ranking results.Assuming that the length of the recommendation list is K , NDCG@K shows the gap between the ranking list and the real user interaction list.It is calculated as follows: where r k = 1 indicates that the Kth item is the item that the user favors; otherwise, r k = 0 .Z k is a normalization constant.
HR@K is a commonly used indicator to measure recall rate and is calculated as follows: where N is the total number of users, and hits(i) represents whether the value accessed by the i-th user is in the recommended list.If yes, then hits(i) equals 1; otherwise, it is 0.

Baselines
The proposed IHDT was compared with the following top-k recommendation algorithms to demonstrate its effectiveness.
MF 4 .This is a matrix decomposition optimized by Bayesian personalized ranking (BPR) loss, which uses only the direct user-item interaction as the objective value of the interaction function.
DMF 46 .This method is a matrix decomposition model with a neural network architecture.A user-item matrix with explicit ratings and non-preference implicit feedback is constructed and used as input.A deep structure learning architecture is proposed to learn a generic low-dimensional space representing users and items.
GCMC 18 .This approach considers the matrix decomposition of recommendation systems from a link prediction perspective, represented by a bipartite user-item graph with labeled edges indicating the observed ratings.This way, a graph autoencoder framework is proposed based on micro-message transferable on bi-directional interaction graphs.
NeuMF 47 .This approach is an advanced neural CF model that uses multiple hidden layers above the element level and connections of user and item embeddings to capture their nonlinear feature interactions.
ConvNCF 48 .This method uses outer products to explicitly model pairwise correlations between embedding space dimensions.A more expressive and semantically sound 2-D interaction graph is obtained using the outer product on top of the embedding layer.On top of the interaction graph, a convolutional neural network (CNN) is used to learn higher-order correlations between the embedding dimensions.( 7) www.nature.com/scientificreports/NGCF 49 .This approach can explicitly encode the higher-order user-item interactions into the representation vector, effectively injecting collaborative signals into the embedding process in an explicit way to improve the representation and, thus, the overall recommendation.
LightGCN 50 This method removes the feature transformation and nonlinear activation in GCN, making the model more concise and efficient.LightGCN only retains the neighbor aggregation operation of GCN, that is, layer-by-layer propagation and aggregation through the user-item interaction matrix.At the same time, a new vertical regularization term is introduced to punish overly complex user and item representations to prevent overfitting.In terms of efficiency, a sparse graph is constructed by discarding some connections to speed up training and inference.Additionally, two dropping strategies, random dropping and weighted sampling, are introduced.LightGCN generally achieves the best balance of accuracy and efficiency by designing a simple model and regularization strategy.It retains the key mechanisms GCN requires while removing unnecessary components, which is its main advantage over other GCN models.
BM3* 51 This method is different from the previous method in that it is a multi-modal recommendation method.It designs a multi-modal contrastive loss function that simultaneously optimizes three goals: reconstructing user-item interaction graphs, aligning learned features between different modalities, and reducing the gap between different augmented view representations from a specific modality.dissimilarity.(*Note: The dataset we tested is a traditional recommendation system dataset which does not provide much additional information for multi-modal learning, so the effect may not ideal.This test is only to illustrate the multi-modal recommendation model does not have advantages for traditional recommendation tasks).

Parameter setting
This study implemented the IHDT model in TensorFlow and Pytorch.The parameters were randomly initialized in TensorFlow for the improved dual tower model, and the number of connected layers was set to two by default.The output vector in the dual tower model was used as the initialization parameter of IHDT in Pytorch.The embedding dimensions of the model default were set to 64, the normalization coefficient was set to 1e−5 by default, the learning rate was uniformly set to 1e−4, the node loss ratio was set to 0.1, the message loss ratio was set to 0.1, the number of layers was set to 3, the layers in the Movielens-1M dataset batch size were set to 1024, the epoch was set to 1000, the dataset batch size in the MovieLens-10M was set to 4096, and epoch was set to 400.Additionally, an early stop policy was set (i.e., if the evaluation indicator recall did not increase within 50 consecutive epochs, it was stopped early).

Comparison of experimental results
Table 2 shows the results of Top-20 recommendations on MovieLens-1M and MovieLens-10M datasets for the IHDT and other benchmark algorithms described in this paper.The table shows that the IHDT algorithm outperformed the comparison algorithms in overall performance and was superior to the comparison algorithms for all indicators.The accuracy, recall, and NDCG were all improved.
Table 2 shows that the IHDT algorithm considerably improved over the benchmark algorithms.The IHDT and NGCF algorithms also performed better.This paper speculates that it may be due to the high sparsity of the MovieLens-10M dataset used in the study.Both the IHDT and NGCF models could aggregate the feature information of high-order neighbors.IHDT in the dual tower model also performed deep semantic interaction, so it still performed well with sparser data.MF and DMF were less effective on sparse datasets.Meanwhile, although NeuMF could also aggregate neighbor information, the converged information was insufficient for higher-order neighbors, resulting in moderate performance.

Analysis of recommendation list length impact
After completing the basic experiments to investigate the effect of different recommendation list lengths on model performance, we continued to compare the model's performance with different recommendation list lengths on Table 2. Algorithm @ 20 Performance of models with Top-20 on MovieLens dataset.Significant values are in bold.
Figure 4 shows the performance trends of the IHDT model on Movielens-1M and MovielensLens-10M datasets for recommendation list lengths from 10 to 100 using line graphs.Based on the above experimental results, it was found that as the recommendation list length increased, both the recall indicator and NDCG indicator showed an increasing trend; the recall indicator increased significantly.The precision indicator decreased smoothly, indicating that the algorithm performed well when the recommendation list length was increased.

Parameter sensitivity study
When the IHDT model was applied to a recommendation system, the main parameters included the number of model layers, the dimensionality d of the user and item representation vector (embedding), the learning rate (lr), and the L2 regularization coefficient.Taking the Movielens-1M dataset as an example, this study investigated the parameter variation of the IHDT algorithm with different parameter values.

Number of model layers
The method in this paper selects the 1st to Kth-order higher-order neighbor information for fusion.To verify the influence of the selected order on the experimental results, different orders were selected for the Movielen-1M dataset.The experimental results are shown in Table 3.
The details in Table 3 reflect that when the model order is small, increasing the order of the model improves the model performance effectively, and the performance of IHDT-3 is much higher than that of IHDT-1 and IHDT-2.The high-order connection better captured the collaborative relationship between node messages, improved the node feature representation, and improved the model performance; when the model order continued to increase, there was a dramatic decrease in the model performance.The table shows that the model performance is lowest when the IHDT model order is 4.This may be due to overfitting caused by introducing noisy information into the representation learning that could be caused by applying an excessively deep architecture.Therefore, it is necessary to select the appropriate model order to determine the maximum performance of the model.

Dimensionality of user and item representation vectors
For the dimension selection of the final representation vector of users and items, the dimensions of the final representation vector were 32, 64, and 128.Table 4 shows the experimental results.
Based on the experimental data in Table 4, the following findings are made: the IHDT indicators do not trend gradually upward with increasing dimensionality of the representation vector.The best performance of the proposed model was when the value of dimensionality was 64, indicating that the model needs sufficient

Learning rate
The learning rate is a principal factor in ensuring the model's performance.Too small a learning rate results in excessive convergence time or a failure to converge because the gradient disappeared; too large a learning rate puts the model at the risk of over-approximating the minimum and failing to converge.Therefore, this study used three learning rates, 1e−5, 3e−5, and 5e−5, to determine the sensitivity of this parameter.The experimental results are shown in Table 5.
Table 5 shows that the model performed better when the learning rate was 1e−5.However, when the learning rate was 3e−5 or 5e−5, its performance decreased significantly, and although the training time was significantly reduced, the performance was poor.

L2 regularization factor
Deep learning models usually have superb prediction and fitting capabilities, and thus, they are prone to overfitting.Various regularization techniques can mitigate overfitting to some extent.We used L2 regularization to adjust the overfitting of the model.The experimental results are shown in Fig. 5.
Figure 5 shows that the model achieved a better result when the regularization coefficient was 0.1.The model was robust in response to L2 regularization (i.e., the change in L2 regularization within a specific range did not significantly affect the model's effectiveness), which also indirectly reflects the stability of the model.(1) This paper proposes an interactive dual tower model; that is, when the dual tower model is constructed, the user (item)-related items (users) are injected into the user (item) features for interaction to improve the accuracy.(2) In this paper, we propose to improve the accuracy of the recommendation system through two design improvements.First, we use GNNs with higher-order learning mechanisms on the graph structure data, which shows a powerful feature extraction ability.Second, we converge the information features obtained from the dual tower model as the initial state of the GNN for higher-order fusion to improve the richness of user and item information.To verify the effectiveness and rationality of these two design improvements, we conducted corresponding ablation experiments and compared the difference in performance between the traditional and complete models.The specific completed comparative ablation experiments are as follows.
IHDT noGCN : removed the GCN module; it verifies the effect of the model only in the dual tower.
IHDT noEV : removed enhancement vector mechanism; it verifies the effect of the model using the traditional dual tower model.
The data used in the ablation experiment were the Movielens-1M dataset.Table 6 shows a comparison of the results of the improved model and the conventional model.
Table 6 shows that the two variants of the IHDT model algorithm show different degrees of effect degradation.The interaction-based dual tower model can obtain semantically richer user and item representations.The original dual tower model algorithm cannot obtain the historical interaction behavior of users, thus leading to poor accuracy.In contrast, the higher-order connectivity of GNNs can explore the importance of higher-order neighbors, assign them information on historical interaction behaviors, and enhance user representation and item representation, thus achieving improved accuracy of recommendation results.

Results overview
We comprehensively compared the IHDT algorithm with multiple mainstream recommendation algorithms on the public MovieLens data set.Our conclusions are as follows: (1) Accuracy and diversity.IHDT's precision index (precision) and recall rate (recall) are at the first-tier level among all state-of-the-art models.(2) Response to long-tail demand.Judging from the hit rate indicator, the IHDT algorithm can better discover long-tail items and meet the needs of unpopular preferences.This benefits from the modeling of high-order feature relationships and the mining of implicit associations between users and items, thereby generating personalized recommendation results.(3) Robustness and adaptability.Parameter sensitivity experiments show that IHDT achieves relatively stable performance.The algorithm also performs well across datasets of different sizes.This makes it easy to migrate applications and is very practical.
In summary, the method of combining interactive twin towers with high-order feature representation proposed in this paper shows significant improvement in recommendation performance and application scalability.

Time cost analysis of introducing GCN high-order learning
The typical twin-tower model is mainly composed of a stack of fully connected layers.Due to the space-for-time strategy, it is very efficient in time.The time complexity of DNN can be expressed as O( k i=1 (UD i,in D i,out + ID i,in D i,out ) where U is the number of users, I is the number of items, D i,in andD i,out is the in/out representation dimension of layer i, and L is the number of fully connected layers.After introducing interactive learning and GCN, its time complexity increases to O(L * UD 2 + ID 2 + ED ) , where E is the interactive edge Number, D is the representation dimension.Usually, UD 2 + ID 2 and ED are in the same order of magnitude.

Definition 5
representing a composite relationship.Recommendation Base Graph.Traversing nodes and edges in the i-th meta-path pattern forms a subgraph, G i , and finally merging all the obtained subgraphs to form the recommendation base graph G, i.e., G = G i .

Figure 2 .
Figure 2. Two different higher-order connectivity of u 1 .(a)Higher-order connectivity based on user-item meta-paths of u 1 ; (b) Higher-order connectivity based on user-attribute-item meta-paths of u 1 .

Figure 4 .
Figure 4. Performance under different datasets and recommended list lengths.

Table 4 -
1Training process of IHDT.User-related meta-paths 1 , 2 ,..., Item-related meta-paths 1 , 2 ,..., Input: Heterogeneous Information Network = ( , , , ) 4. for k←1 to K do 5. Aggregate neighbor information by meta-path to get k-order user representation ( ) 6. end for 7. Through a high-order learning mechanism, obtain the final user representation * ∈ ℝ 1× 8.end for 9.for 1 , 2 ,..., do 10.Get the user representation ∈ ℝ 1× under the current meta-path through the interaction learning mechanism 11. for k←1 to K do 12.Aggregate neighbor information through meta-paths to obtain the k-th order item representation ( ) 13. end for 14.Through a high-order learning mechanism, obtain the final user representation * ∈ ℝ 1× 15. end for 16.Calculate the inner product e * Experiments were conducted on two real-world datasets to evaluate the proposed IHDT model.The effectiveness of the IHDT model was evaluated by experiments conducted on two benchmark datasets, MovieLens-1M and MovieLens-10M.The MovieLens dataset was provided by the University of Minnesota's Group Lens Project.The datasets are rated on a five-point scale, from 1 to 5, representing the user's interest in the movie.They are all publicly desensitized and accessible and vary in domain, size, and sparsity.The statistical information of the datasets is shown in Table

Table 3 .
Performance of models with different orders on Movielens-1M dataset.Significant values are in bold.

20 Recall@20N NDCG@20 HR@20
www.nature.com/scientificreports/dimensionality to encode the preferences of users or items.The model's predictive ability decreases if the dimensionality is too low or too high.

Table 4 .
Algorithm performance on Movielens-1M dataset under different representation vector dimensions.Significant values are in bold.

Table 5 .
Algorithm performance on Movielens-1M dataset under different learning rates.Significant values are in bold.Ablation experimentsThe proposed model has two improvements over other dual tower recommendation models, based on constructing a heterogeneous recommendation base graph.

Table 6 .
Validation of ablation experiments on Movielens-1M dataset.Significant values are in bold.