Persistent Dirac for molecular representation

Molecular representations are of fundamental importance for the modeling and analysing molecular systems. The successes in drug design and materials discovery have been greatly contributed by molecular representation models. In this paper, we present a computational framework for molecular representation that is mathematically rigorous and based on the persistent Dirac operator. The properties of the discrete weighted and unweighted Dirac matrix are systematically discussed, and the biological meanings of both homological and non-homological eigenvectors are studied. We also evaluate the impact of various weighting schemes on the weighted Dirac matrix. Additionally, a set of physical persistent attributes that characterize the persistence and variation of spectrum properties of Dirac matrices during a filtration process is proposed to be molecular fingerprints. Our persistent attributes are used to classify molecular configurations of nine different types of organic-inorganic halide perovskites. The combination of persistent attributes with gradient boosting tree model has achieved great success in molecular solvation free energy prediction. The results show that our model is effective in characterizing the molecular structures, demonstrating the power of our molecular representation and featurization approach.


I. INTRODUCTION
Molecular representation and featurization play an essential role in physical as well as in data-driven learning models.Given the rich interplay between structure and function of molecules, an efficient characterization of structural properties is key for extracting functional information.Various quantitative structureactivity/property relationship (QSAR/QSPR) models have been developed to establish explicit linear (or nonlinear) relations between molecular structure and function [1,2].Different molecular fingerprints have been proposed for machine learning and deep learning models in the prediction of molecular functions and properties [3][4][5][6][7][8].However, despite the great progresses, the design of highly efficient descriptors is still the bottleneck for QSAR/QSPR and learning models in the analysis of molecular data from materials, chemistry and biology [1,2].
Graph models [9][10][11][12][13][14][15][16][17][18] are arguably the most widely used tools for molecular representations in molecular dynamics simulation, coarse-grained models, elastic network models, QSAR/QSPR, graph neural networks, etc.In general, a molecule (or a molecular complex) is modeled as a graph with each vertex representing an atom, an amino acid, a domain, or an entire molecule, and edge representing covalent-bond, non-covalent-bond, or more general interaction.However, graphs are designed for the characterization of pairwise interactions.To capture higherorder interactions, topological representations, such as multilayer networks [19], simplicial complexes [20][21][22], hypergraphs [23,24], etc, should be considered.Among them, multilayer networks have been used in the characterization of higher-order dynamics [25][26][27][28] and synchronization dynamics [29][30][31].As a generalization of graphs, simplicial complexes are made not only by 0simplices (nodes) and 1-simplices (edges), but also by higher-dimensional simplices, such as 2-simplices (triangles), 3-simplices (tetrahedron), etc.Hence, higherorder networks and simplicial complexes can describe the many-body interactions among the atoms of a molecule.Hypergraphs are a further generalization of simplicial complexes.An hypergraph is composed of hyperedges, which are formed by a set of vertices.Recently, simplicial complexes and hypergraphs have been used in molecular representations and have allowed improved performance of drug design algorithms, in particular, in the proteinligand binding affinity prediction.
Based on topological representations, molecular descriptors or fingerprints can be generated and further used as features for learning models.Recently, topological data analysis (TDA) [32,33] and combinatorial Hodge theory based molecular descriptors have achieved great success in various steps of drug design, including protein-ligand binding affinity prediction [4,14,[34][35][36][37][38][39], protein stability change upon mutation prediction [40,41], toxicity prediction [42], solvation free energy prediction [43,44], partition coefficient and aqueous solubility [45], and binding pocket detection [46].These models have also demonstrated great advantages over traditional molecular representations in D3R Grand challenge [47,48].Mathematically, the key idea of TDA is extract topological information by investigating persistent homology, which tracks the change of homology generators (i.e., Betti numbers) from simplicial complexes over a filtration process.In particular, the topological invariant Betti numbers can be obtained from the kernel of com-binatorial Hodge Laplacians (HL) matrix.Interestingly, also the Forman Ricci curvature can be obtained via the Bochner-Weitzenböck decomposition of HL matrix [6].The great success of TDA and combinatorial Hodge theory based molecular descriptors in learning models is due to their characterization of structures with intrinsic invariants, including Betti numbers and Ricci curvatures.These intrinsic descriptors are well defined mathematical observables that characterize fundamental topological and geometrical properties of real datasets, thus they have an excellent transferability for learning models.
Inspired by the success of Hodge Laplacian matrix in molecular sciences, here we propose persistent Dirac based molecular representation and fingerprint.The discrete Dirac operator [49][50][51][52][53][54][55] is a first-order differential operator which can be interpreted as the square root of Hodge Laplace operator.This operator has been developed on graphs and simplicial complexes and used in TDA and for investigating dynamics of topological signals [50,[56][57][58].Moreover, the persistent Dirac model can be used in the quantum algorithm of persistent homology [52,53,59].Here we present a rigorous mathematical theory for persistent Dirac through the commutative diagram of discrete Dirac operator over a filtration process.The commutative diagram is similar to the ones in persistent spectral graph [5,60], persistent Hodge Laplacian [61], and persistent sheaf Laplacian [61,62].Further, we develop a series of persistent attributes from persistent Dirac, and use them as descriptors to characterize molecular structures.
Our work starts with a systematically study of the spectrum of the discrete Dirac matrices.In particular, we identify the geometric and topological properties of both non-homology and homology eigenvectors for molecular structures.We generalize these results to weighted simplicial complexes on top of which the weighted Dirac operator [63] is carefully defined.In particular, here we analyse the influence of weighting schemes on the spectral properties of molecular structures.The persistent Dirac is then introduced and is employed for the clustering of molecular configurations from the molecular dynamic simulations of nine types of organic-inorganic halide perovskites (OIHP).By the comparison with several existing models, we show that our model is highly efficient in clustering the structure configurations.This demonstrates the great potential of our persistent Diracbased fingerprints in molecular representation and featurization.
The paper is organized as follows.Sec.II is devoted for Hodge Laplacian model.It covers general concepts including simplicial complexes, chain groups, boundary operators, and Hodge Laplacian.In Sec.III, persistent Dirac model is present.The eigenspectrum information for (weighted) Dirac matrix and persistent attributes from persistent Dirac are discussed in detailed.Sec.IV is for the application of the persistent Dirac based fingerprints.The paper ends with an conclusion.

A. Simplicial Complex
Generally speaking, a simplicial complex can be viewed as a higher-dimensional generalization of graphs.A pdimensional simplicial complex is formed by simplices of dimension up to p. Every p dimensional simplex consists of a set of p + 1 vertices and this set can be viewed geometrically as a point (0-simplex), an edge (1-simplex), a triangle (2-simplex), a tetrahedron (3-simplex), etc.
More precisely, a p-simplex The i-dimensional face of p-dimensional simplex σ p (indicated with i < p) is the convex hull formed by i + 1 vertices belonging to the set of p + 1 points The simplices are basic components of a simplicial complex.
A simplicial complex K is a finite set of simplices that satisfy two essential conditions: • Any face of a simplex from K is also in K.
• The intersection of any two simplices in K is either empty or formed by shared faces.
Here and in the following we indicate with n p the number of p-simplices belonging to the simplicial complex K.The most commonly used simplical complexes include Čech complex, Vietoris-Rips complex, Alpha complex, Cubical complex, Morse complex, etc. [64].Two p-dimensional simplices σ 1 and σ 2 in a simplicial complex K, are simplex neighbors if (i) σ 1 and σ 2 share a (p + 1)-simplex µ, that is, there exists a µ in K such that µ > σ 1 and µ > σ 2 .
If either condition is satisfied, as long as both conditions do not hold at the same time, σ 1 and σ 2 are called parallel simplex neighbors.Here σ 1 and σ 2 are called upper adjacent neighbors and denoted as σ 1 σ 2 , if they satisfy condition (i).They are lower adjacent neighbors and denoted as σ 1 σ 2 if they satisfy condition (ii).In addition, the d-skeleton of a p-dimensional simplicial complex is the simplicial complex consisting of simplices up to dimension d, where 0 ≤ d ≤ p.The 1-skeleton of a simplicial complex is always the graph of simplicial complex.

B. Homology
In homology, a p-dimensional oriented simplex σ p is the set of ordered p Similarly, this orientation can be written for higher-order simplices in the following way, where α(π) refers to the parity of the permutation π.In this paper, we consider the orientation induced by node labels, i.e. for every simplex in a simplicial complex, we assign a positive orientation to the one provided by the increasing set of node labels.
For an oriented simplicial complex K, its p-dimensional chain group C p (K) is composed by linear combination of positively oriented p-simplices in K.
indicate the generic positively oriented p-simplex σ p ∈ K.We notice that the set of simplices σ p constitute a basis for the p-dimensional chains C p (K). Therefore any pchain f 1 ∈ C p (K) can be written in a unique way as The weighted boundary operator ∂ p : C p → C p−1 can be determined by its action on any given σ p ∈ K: Here a p is a constant in R + dependent on p and the boundary of p-simplex is made of (p − 1)-simplices , where vi means that v i has been removed from the sequence v 0 , • • • , v p .It is also wellknown that ∂ p−1 ∂ p = 0.The unweighted boundary operator can be obtained by setting a p = 1.In other words, the unweighted boundary operator ∂ p : C p → C p−1 for a given σ p ∈ K is defined as For an oriented simplicial complex K, its two oriented p-dimensional simplices σ 1 and σ 2 are similarly oriented and denoted as σ 1 ∼ σ 2 , if they are lower adjacent and have the same sign on the common lower (p − 1)-simplex.Two simplex σ 1 and σ 2 are dissimilarly oriented and denoted as σ 1 σ 2 , if they are lower adjacent but have different signs on the common lower (p − 1)-simplex.
The p-th cycle group Z p is defined as, and p-th boundary group B p is, The p-th homology group is defined as H p = Z p /B p .Its rank is p-th Betti number that satisfies With the boundary operators, we have chain complexes The adjoint of ∂ p , which is It is used in the weighted Hodge Laplacian.

C. Weighted Hodge Laplacian and Hodge Decomposition
The p-dimensional weighted Hodge Laplacian ∆ p : C p → C p is defined as follows: The special case where p = 0 is the well-known graph Laplacian.
Computationally, the information for weighted boundary operators acting from finite dimensional chain groups C p to C p−1 can be stored efficiently in matrix representations.As matrix representations, the weighted boundary operators and its adjoint satisfies ∂ p = ∂ * p .More specifically, let n p−1 and n p be the number of (p− 1)-simplices and p-simplices respectively in a simplicial complex K.The n p−1 × n p weighted boundary matrix B p has entries defined as follows: where 1 ≤ i ≤ n p−1 and 1 ≤ j ≤ n p .Here, σ p−1 i < σ p j represents the i-th (p − 1)-simplex σ p−1 i is a face of j-th p-simplex σ p j and σ p−1 i ∼ σ p j indicates the coefficient of Since the unweighted boundary operator ∂ p = 1 ap ∂ p , note that an unweighted boundary matrix can be similarly written as Using the weighted boundary matrices, the lower and upper weighted Hodge Laplacians can be defined as L For p > 0 the matrix elements of the Hodge Laplacian L up p are given by , have various interesting properties as follows (see [65] or Appendix A for proofs).
The matrix elements of the Hodge Laplacians L p with p = 0 are given by while the matrix elements for p > 0 can be expressed as It follows from Eq. ( 2) that the lower and upper unweighted Hodge Laplacians can be written as L down for p = 0 while for p > 0 the matrix elements of the Hodge Laplacian are given by It is well-known that λ is a non-zero eigenvalue of L p if and only if λ is an non-zero eigenvalue of L down p or L up p .The multiplicity of the zero eigenvalues of L p corresponds to the pth Betti number as follows, where β p is also the rank H p (see [65] or Appendix B).
Further, dim ker L down p can be written as: This means that the number of zero eigenvalues in L down p is equal to the sum of rank B p+1 and β p .Furthermore, since Here, (im B p+1 ) ⊥ is the orthogonal complement of im B p+1 .In fact, (im B p+1 ) ⊥ = ker B p+1 .This is true because from (v), ker B p = (im B p ) ⊥ (see Appendix A for the proof).Hence, this gives In other words, we have This means that for every v ∈ ker L p , This is exactly the Hodge decomposition which states that a p-th chain group C p of a simplicial complex K admits the following orthogonal direct sum decomposition: The vector space of edge flows C 1 admits the following orthogonal sum decomposition in Helmholtz-Hodge Decomposition: It is worth mentioning that such flows have also been extended to five component decompositions with edge and face vector fields [66], applied to the protein B-factor prediction problems via Hodge theory [8] and also in de Rham-Hodge biomolecular data analysis [9,67].

III. PERSISTENT DIRAC
A. Discrete Dirac models a. Weighted Dirac matrix Recently, weighted Dirac matrices have been proposed based on a weighted simplicial complex [63].For a d-dimensional weighted simplicial complex K, let us define the n p × n p metric matrix G p (0 ≤ p ≤ d) to be a diagonal matrix with positive entries.For any two p-chains f 1 = np i=1 c i σ i and f 2 = np i=0 d i σ i in C p , the matrix G p can be used to define the weighted inner product Recall from Eq. ( 2) that the weighted boundary operator can be represented by a matrix B p = a p B p where B p is the unweighted boundary matrix.If a p = 1, then B p reduces to the adjoint operator of B p .Formally, for any p-chain f and any (p − 1)-chain g, the adjoint operator From the inner product relation ( 5), an explicit expression of B * p can be deduced in terms of B p and the matrices G p .Based on the weighted inner product definition (4), this gives Since the expression is true for any arbitrary f and g, this implies Hence, the following becomes an explicit expression for the adjoint operator B * p : Here B * p [65] is the adjoint of the weighted boundary operator [10,68].It is important to note that if the metric matrices G p are the identity matrices, the above expression then reduces to the transpose of the boundary operator multiplied by the constant a p , This also means the transpose of adjoint operator, i.e. (B * p ) , is equal to B p only if G p are identity matrices.To see this, apply the transpose to both sides of Eq. ( 6) and obtain the expression The matrices (B * p ) and B * p can then be used to construct the following weighted Dirac matrix (9).
For a simplicial complex K with n p × n p−1 adjoint operators B * p where n p−1 is the number of (p − 1)-simplices and n p is the number of p-simplices in K, the weighted Dirac matrix D p is In particular, we set a p = (p+1) −1/2 for all p up to the order of the simplicial complex and consider the matrices B * p and (B * p ) in Eq. ( 7) and ( 8) respectively.For p = 2, the weighted Dirac matrix from (9) becomes This definition can be extended easily to higher dimensions.Note that Note that this definition of weighted Dirac is self-adjoint and with eigenvalues smaller than or equal to one.The square of the weighted Dirac also forms a diagonal block of metric Hodge Laplacian matrices where the metric Hodge Laplacian matrices are defined as Depending on the matrices G p , the weighted Dirac matrix may not always be symmetric, despite its eigenspectrum can be shown to be always real (see Appendix H).
For the rest of the paper, the metric matrices G p shall be defined with each metric value for a simplex σ p to be dependent on its (p + 1)-dimensional cofaces in the following way [63]: Here, w σ p > 0 is a positive weight on p-simplex σ p , which can be related to physical, chemical and biological properties.
b. Discrete Dirac matrix With the weighted Dirac matrix D p , a discrete Dirac matrix is simply the special case of D p when G p are identity matrices and a p = 1 for all p ≥ 1.
Previously, a general Dirac matrix has been defined as [49,69,70] where z ∈ C such that |z| = 1.Since |z| = 1, the typical values of z occurs when z = z = 1 or z = −z = i.In general, the parameter z ∈ C extends the real eigenvectors of D p (z) to C while the eigenvalue remains unchanged.By taking the square of the Dirac operator, we have which implies that the eigenvalues of diagonal block realvalued Hodge-Laplacian matrices will also be the eigenvalues of D 2 p (z).Since the Hodge-Laplacians are positive semi-definite symmetric matrices, the eigenvalues of D 2 p (z) are non-negative as well.However, eigenvectors from the Dirac matrix may contain complex numbers.
For a simplicial complex K with n p−1 × n p boundary matrices B p where n p−1 is the number of (p−1)-simplices and n p is the number of p-simplices in K, the discrete Dirac matrix D p [70] is It is of size S1 shows a simple construction of discrete Dirac matrices (10) for a triangle and a tetrahedron.In Fig. S1(a), the triangle is a 2simplex and hence the largest Dirac operator is D 1 .On the other hand, the tetrahedron in Fig. S1(b) is a 3simplex and thus the largest Dirac operator is D 2 .
Note that by taking the square of D p , one would obtain a matrix with diagonal blocks of unweighted combinatorial Hodge Laplacians as shown below.
, where the unweighted Hodge Laplacian L p , is given by In our case, the last term contains only L down p+1 .
Recall that B p+1 B p+1 is also known as the lower Hodge Laplacian L down p+1 while B p+2 B p+2 is known as the upper Hodge Laplacian L up p+1 .
B. Spectrum of the discrete Dirac operator a. Spectral of Dirac matrix Let Q p be the block diagonal matrix , where I np denotes an n p × n p identity matrix and Q p satisfies The Dirac matrix satisfies the supersymmetry condition D p Q p = −Q p D p .That also means that the anticommutator between the Dirac matrix D p and block diagonal matrix Q p vanishes.Further, which implies that Q p v is an eigenvector associated with the eigenvalue −λ.Essentially, the above shows that the Dirac operator of a simplicial complex satisfies D p v = λv where λ is the eigenvalue associated with the eigenvector v if and only if where Q p v is the eigenvector associated to the eigenvalue −λ.
Since λ (resp.−λ) is an eigenvalue of D p with corresponding eigenvector v (resp.Q p v), then for any positive integer s, λ s (resp.(−λ) s ) is an eigenvalue of D s p with corresponding eigenvector v (resp.Q p v).The detailed proof is in Appendix B. Now, we consider the relationship between the eigenspectrum of D 2 p and D p .For the case of zero eigenvalues, It is easy to derive the above cases by considering (D Here, there are two possible cases since by (11), either λ or −λ is the eigenvalue of D p .If −λ is the eigenvalue of D p , then (D p + λI)w = 0 for some non-zero eigenvector w.This implies that (D p − λI)w = 0, otherwise it contradicts (D p + λI)w = 0. Hence, this means that Eq. ( 12) can be rewritten as where w = (D p − λI)v is a non-zero eigenvector for D p with corresponding eigenvalue −λ.
Similarly, if λ is an eigenvalue of D p , then (D p − λI)w = 0 for some non-zero eigenvector w.This implies that (D p + λI)w = 0, otherwise it contradicts (D p − λI)w = 0. Therefore, Eq. ( 12) can be rewritten as (D p − λI)w = 0, where w = (D p + λI)v is a non-zero eigenvector for D p with corresponding eigenvalue λ.
This leads us to the following relations connecting D p , D 2 p and L k (0 , In other words, v is a vector consisting of block vectors w k for 0 ≤ k ≤ p + 1.This means that for every 0 ≤ k ≤ p, w k ∈ ker L k .In the case where k = p + 1, w p+1 ∈ ker L down p+1 .We have, Note that for w p+1 , it is the eigenvector from the kernel of L down p+1 .Hence, the kernel of D 2 p can be decomposed into a direct sum of kernels of L k from k = 0 to k = p + 1: Further, we have where H k refers to the direct sum of homology groups. Therefore, the eigenvectors of D p reveal both k-th homology and k-th non-homology information within the structural data for all 0 ≤ k ≤ p + 1.Instead of eigendecomposing HL matrices for all 0 ≤ k ≤ p + 1, one can simply eigendecompose D p to obtain all of the eigenspectrums.As the number of zero eigenvalues of L down p+1 is the rank B p+2 plus the (p + 1)-th Betti number β p+1 , the multiplicity of zero eigenvalues in D p is the rank B p+2 plus the total sum of all the Betti numbers from dimension 0 to p + 1.That is, Mathematically, the eigenvectors corresponding to the zero eigenvalues are known as homology generators while those from non-zero eigenvalues are the non-homology generators.Both of them can be used in structural clustering.More specifically, the homology generators can be used for clustering structures based on their loop or circle components, while non-homology generators are related to the spectral clustering, in which communities and clusters are based on their distances.Fig. 1 demonstrates the structural clustering with homology and non-homology generators for a protein (PDBID: 1AXC).We only consider the C α atoms in structure.A Vietoris Rips complex is constructed by using a cutoff distance of 10 Å.The Dirac matrix D 1 and its eigenvalues and eigenvectors are calculated.As the non-zero eigenvalues of D 1 come in pairs, it suffices to consider the eigenvectors corresponding to the positive eigenvalues.For all the non-negative eigenvalues of D 1 , the eigenvectors are arranged in ascending order according to its corresponding eigenvalues.
Fig. 1(a) illustrates the loop/circle-based clustering using four one-dimensional (1D) homology generators.Note that these 1D homology generators w 1 are taken from the homology generators of D 1 (with eigenvalues 0).More specifically, these 1D homology generators are defined by the 1-simplices.In Fig. 1(a), a thick edge with dark blue color indicates large magnitude of the value, while a thinner edge with light blue color means the corresponding 1D homology generator has a value with small magnitude on this 1-simplex.It can be seen that edges with large magnitudes are the 1-simplices that form circles or loops.Each 1D homology generator forms an individual loop or circle.In this way, 1D homology generators can be used for loop/circle-based clustering of molecular structures.
Fig. 1(b) illustrates the spectral clustering using four zero-dimensional (0D) non-homology generators.The four 0D non-homology generators w 0 are taken from the non-homology generators of D 1 with the four smallest positive eigenvalues.Note that these 0D nonhomology generators are defined on nodes (0-simplices).
In Fig. 1(b), nodes with negative values are colored in red while nodes with positive values are of blue color.It can be seen that the nodes in the structure can be naturally clustered into groups based on the signs of these 0D non-homology generators.This approach is known as spectral clustering and widely used in data analysis.It should be noticed that using the higher order Dirac matrices, we can cluster not only nodes (0-simplices), but also higher dimensional simplices.
b. Spectral of weighted Dirac matrix The weighted Dirac matrix has different spectral properties based on the different weighting schemes.Fig. 2 illustrates the spectrum of the weighted Dirac matrix defined from the guanine molecule structure (using all-atom representation).We construct an unweighted Vietoris Rips complex using a cutoff distance of 1.2 Å.The discrete Dirac matrix D 1 can be computed using (10).The discrete Dirac matrix D 1 is eigendecomposed to obtain its eigenvalues and eigenvectors.Moreover, a weighted simplicial complex is constructed by assigning simplex σ with different weight w σ .The metric matrices G p are computed and weighted Dirac matrix D 1 can then be constructed.Fig. 2 shows the homology generators and Fiedler vector for an unweighted simplicial complex and three different weighted simplicial complexes.Among the three weighted simplicial complexes, Fig. 2(b) shows a weighted simplicial complex where all weights w σ are equal to 1. Two modified weighted simplicial complexes are constructed by modifying the weights of edge e 1 ranging from 10 and 0.01 with all the other weights kept unchanged.With the same underlying simplicial complex, they share the same three homology generators, one 1D component and two 2D circles.Fig. 2 shows the corresponding eigenvectors for these homology generators.The magnitude of the eigenvectors are represented by the thickness and darkness.An edge (or vertex) with thicker lines and darker blue color indicates a larger magnitude.
In general, the weight of a simplex has an inverse effect on the corresponding element of the homology eigenvectors (i.e., homology generators).When the simplex has a smaller weight, the corresponding element of the homology eigenvectors has larger magnitude.Similar patterns also appear in non-homology generators.Fig. 2 illustrates the Fiedler vectors (i.e., eigenvector corresponding to the first smallest non-zero eigenvalue) of the nonweighted and weighted simplicial complexes.Simplices are colored in red/blue if the element of the nonhomology eigenvectors has value positive/negative.The thickness of simplices represents the magnitude of their values in non-homology generators.It can be seen clearly that the weight of a simplex has an inverse effect on its magnitude of the values of eigenvectors.sheaf Laplacians have been developed [62,71].Their essential idea is to explore the persistence of spectral information during the filtration process.Here we develop the rigorous mathematical framework for persistent Dirac.
Let (R, ≤) be a category of real numbers with morphisms given by a → b for any a ≤ b.A functor F : (R, ≤) → Simp gives a filtration of simplicial complexes of finite type, i.e.F maps from a category of real numbers to a category of simplicial complexes of finite type.For any two real numbers a ≤ b, the functor F satisfies the inclusion which induces a morphism of chain complexes  p can then be written as follows. .
The maps and spaces are also illustrated in the diagram below Further, the p-th persistent Hodge Laplacian can be defined as b.Persistent attributes For any Dirac matrix, its non-zero eigenvalues come in pairs.Each pair contains one negative eigenvalue and one positive counterpart.For the set of all its positive eigenvalues, a Dirac Zeta function can be defined as follows [69], is the m-th spectral moments of DO matrices and ζ(−1) is the Laplacian graph energy.Another way to define Dirac Zeta function is to consider its negative eigenvalues by replacing the λ −s j with (1 + e −iπs )|λ j | −s .Here λ j can be negative.For instance, ζ(2) = 2 j=1 λ −2 j .Furthermore, the q-Dirac complexity of a simplicial complex K can be defined as The case where q = 1 is introduced in [70].c 1 (D p ) is equal to the product of all non-zero eigenvalues in spectra of D p since the non-zero eigenvalues come in pairs.The number of non-zero eigenvalues pairs in D p is the (signless) Euler-Poincaré number defined as follows, = 1 2 where n k is the number of k-simplices and dim ker D p is the multiplicity of zero eigenvalues of D p .Using Eq. ( 14), can be computed as follows: Interestingly, the spanning tree number, introduced as one of the spectral indices in molecular descriptors [1], can be written as

Alternatively, t(D
To summarize, we consolidate and consider a set of statistical and combinatorial attributes as molecular descriptors for each given set of positive eigenvalues {λ 1 , λ 2 , • • • , λ } where is the number of non-zero eigenvalue pairs: also known as the Fiedler value. • (Signless) Euler-Poincaré Number (number of nonzero eigenvalue pairs) • Quasi-Wiener Index (n + 1)ζ(1).
• Spanning Tree Number t(D p ).
Fig. 3(c) shows the persistent multiplicity, persistent mean, persistent standard deviation and persistent (signless) Euler-Poincaré number for the filtration of guanine molecule.Further information such as the persistent multiplicities of L k (0 ≤ k ≤ 2) and L down k (1 ≤ k ≤ 3) can be found in Appendix F. Recall that the persistent multiplicity is equivalent to the persistent Betti number.Here, the persistent multiplicity and persistent (signless) Euler-Poincaré number of D p can be quantitatively analysed by comparing the persistent multiplicity of L down p+1 and the kth persistent Betti numbers for 0 ≤ k ≤ p.It can be seen that these persistent attributes change with the filtration value.Each variation of the persistent attribute indicates a certain change in the simplicial complex.
At the very start of the filtration, there are 16 isolated atoms which means that there are 16 connected components.Hence, the persistent multiplicity of L 0 is 16 since β 0 = 16.As all other Betti numbers are zero and there are no higher order simplices present at the start of the filtration, D 0 , D 1 and D 2 are all-zero 16 × 16 matrices.Therefore, the persistent multiplicity of D 0 , D 1 and D 2 are all equal to 16.Using Eq. ( 15), the persistent (signless) Euler-Poincaré number is zero.
As filtration parameter f increases, the size of D 0 , D 1 and D 2 matrix increases as well.This differs from the Hodge Laplacian matrix L 0 , whose size remains unchanged.
At filtration size 4.7 Å, a complete simplicial complex is achieved, i.e., any p + 1 vertices will form a p-simplex.When this happens, the size of D p no longer increases any further.Here, the size of D 0 , D 1 and D 2 are distinct.The size of D 0 is 136 × 136 since 16×15 2 (no. of 1-simplices) + 16 (no. of 0-simplices) = 136.Similarly, the sizes of D 1 and D 2 are 696 × 696 and 2516 × 2516 respectively.Furthermore, the persistent multiplicity of D 0 , D 1 and D 2 are also distinct.Using Eq. ( 14), the persistent multiplicity of D 0 is 105 (persistent multiplicity of L down ) and 1 (0-dimensional persistent Betti number) which sums up to 106.Since the persistent multiplicity of L 1 (see Appendix F) is zero, then Eq. ( 3) implies that the rank of B 2 is 105.In addition, the persistent multiplicity of D 1 and D 2 are 456 and 1366 respectively.Based on the non-zero eigenvalues, the persistent (signless) Euler-Poincaré number of D 0 , D 1 and D 2 is 15, 120 and 575 due to Eq. (15).

IV. PERSISTENT DIRAC FOR MOLECULAR STRUCTURE REPRESENTATION
Recently, a series of persistent models, including persistent homology, persistent spectral, persistent Ricci curvature, and persistent Laplacian, have demonstrated their great power in molecular representations [3,6,38,72].They have consistently outperformed traditional graph-based models in various tasks of drug design.Here we study the representation capability of Persistent Dirac in molecular data analysis.
We consider the Organic-inorganic halide perovskite (OIHP) dataset.More specifically, three kinds of Methylammonium lead halides (MAPbX 3 , X=Cl, Br, I), i.e., orthorhombic, tetragonal, and cubic phase of MAPbX 3 are used.For each kind, there are 3 types of X atoms, including chlorine Cl, bromine Br and iodine I.The molecular dynamic simulations are systematically carried out on these molecular structures with the initial configurations based on pre-defined crystal cell parameters.For each MAPbX 3 structure, 1000 configurations are equally sampled from its MD simulation trajectory and the last 500 configurations, which represent stable structures, are selected for the test of our persistent Dirac model.Essentially, a total of 4500 configurations from the 9 types of MAPbX 3 structures are mixed together and our persistent Dirac based molecular fingerprint is used in the clustering of these configurations.
Computationally, our persistent Dirac is generated based on Alpha complex and the filtration parameter is the distance.More specifically, for each frame, an Alpha complex is constructed from its coordinate data and applied in a filtration process.The Dirac matrices D 0 and D 1 are computed from 1 Å to 6.5 Å with stepsize 0.25 Å throughout the filtration process.Hence, the eigenvalues of D 0 and D 1 each contribute to 12 statistical attributes for 23 timesteps per frame.The feature size sums up to 552.By considering with and without hydrogen atoms, the total feature size for persistent Dirac is 552 × 2 = 1104.Likewise, for coordinate-only model, the input features are xyz-coordinates of all the atoms.Since each structure consists of 553 atoms, the feature size is of 553 × 3 = 1659.For the discrete Dirac model, the feature size is 552.The clustering of these MAPbX 3 structures is then studied using unsupervised learning models, in particular t-distributed stochastic neighbor embedding (t-SNE).where L up p+1 = 0 np+1×np+1 and L down 0 = 0 n0×n0 .For convenience, we simply denote diag(L up 0 p ) up and s(D 2 p ) down be the spectrum of the upper and lower D 2 p .Following ( 16) and ( 17), the multiplicity of zero eigenvalues for (i) s(D 2 p ) up can be computed as

DERIVATION OF THE REAL SPECTRUM OF THE WEIGHTED DIRAC OPERATOR
By setting G n as identity matrices, we obtain the Dirac matrices whose eigenvalues has been shown to be always real [50].In [50], the special case is discussed and has been applied to signal processing by proposing the use of topological spinors obtained from eigenspectrum of Dirac matrices.Essentially, an n-dimensional topological spinor s can be written as where C n is the space of all n-dimensional topological spinors.Here, the n-dimensional topological spinor s is a direct sum of block vectors (signals) s k defined for ksimplices, 0 ≤ k ≤ n.Now, we shall provide a similar treatment to the weighted Dirac matrices which may or may not be symmetric but can be shown to always have real eigenvalues.To be concrete we will focus on the case in which the simplicial complex is two dimensional, i.e. formed by nodes, links and triangles.Extension of these results to higher-order Dirac operators is straightforward.
Recall from the definition of weighted Dirac matrix that D 1 is written as We shall write D 1 as D [0] + D [1] where and Note that D This means that the weighted Dirac matrix D 1 admits the following Dirac decomposition [50]: where ker D 1 = ker L [0] ⊕ ker L [1] ⊕ ker L [2] .
The above Dirac decomposition implies that the nonzero eigenvectors of D 1 are either non-zero eigenvectors of D [0] (corresponding to an eigenvalue λ 0 ) or non zero eigenvectors of D [1] (corresponding to an eigenvalue λ 1 ).Here, define the matrix Φ of the eigenvectors of D 1 as where Φ n is the matrix of the eigenvectors φ n ∈ im(D [n] ) with n ∈ {0, 1} and Φ harm is the matrix of eigenvectors forming a basis for ker(D 1 ).Now, denote u 0 as the eigenvector of L [0] and v 0 as the eigenvector of L down [1] corresponding to the same non zero eigenvalue Λ 0 , i.e. satisfying the relations and similarly, we have v 1 as the eigenvector of L up [1] and z 1 as the eigenvector of L down [2] corresponding to a same non zero eigenvalue Λ 1 , i.e. satisfying the relations with u 0 , v 0 , v 1 and z 1 being eigenvectors normalized to one.Using a notation from Appendix G, we can also write D ).This implies that eigenvectors D [0] and the eigenvectors of D [1] takes the form where U 0 , V 0 , V 1 , Z 1 are the matrices formed by vectors proportional to the eigenvectors u 0 , v 0 , v 1 and z 1 respectively.In particular we have that the eigenvectors φ n with n ∈ {0, 1} can be written as Let us indicate with u harm , v harm and z harm the eigenvectors corresponding to the zero eigenvalue of L [0] , L [1] and L [2] respectively, i.e. satisfying L [0] u harm = 0, L [1] v harm = (L up [1] + L down [1] )v harm = 0, L [2] z harm = 0.
We have that where U harm , V harm and Z harm are the matrices of eigenvectors u harm , v harm and z harm respectively.The weighted Dirac matrix has eigenvalues which can be null, positive or negative.The positive part of the spectrum is given by the square root of the eigenvectors of the (normalized) Hodge Laplacian and for each positive eigenvector there is a negative eigenvector with the same absolute value.The eigenvectors of the weighted Dirac matrix are formed by the direct sum of the eigenvectors of the weighted Hodge Laplacians.The normalized weighted Dirac matrix has positive, zero and negative eigenvalues λ n that have absolute value smaller or equal to one [50] |λ n | ≤ 1.
It is easy to show that the eigenvalues Λ n are equal to the eigenvalues Λn , i.e.
where µ n is the singular value of Bn and that It follows that the spectrum of the normalized Dirac operator is real although the operator is not symmetric with and the eigenvectors φ ± 0 and φ ± 1 are given by where ũ0 , ṽ0 are the left and the right singular vectors of B1 respectively and where ṽ1 , z1 are the left and the right singular vectors of B2 respectively.

=
B p B p and L up p = B p+1 B p+1 respectively.More specifically, the entries of L down p (p > 0) are as follows, (i) ker B p = ker L down p .(ii) ker B p = ker L up p−1 .(iii) λ is a non-zero eigenvalue of L down p with corresponding eigenvector v if and only if λ is a non-zero eigenvalue of L up p−1 with corresponding eigenvector B p v. Hence, L up p−1 and L down p always have the same non-zero eigenvalues.(iv) v ∈ ker L down p if and only if p = B p B p and L up p = B p+1 B p+1 respectively.Hence, the p th unweighted combinatorial Laplacian L p = L down p + L up p have elements given by Hence, D p shares the same eigenvectors as D 2 p for zero eigenvalues.If λ 2 is a non-zero eigenvalue of D 2 p with eigenvector v, then we have the following possible cases for D p : (i) λ is an eigenvalue of D p with eigenvector w = (D p + λI)v.i.e. (D p − λI)w = 0. (ii) −λ is an eigenvalue of D p with eigenvector w = (D p − λI)v.i.e. (D p + λI)w = 0.

C
FIG. 1. Illustration of loop/circle-based clustering using four one-dimensional (1D) homology generators (a) and spectral clustering using four zero-dimensional (0D) non-homology generators (b).The Dirac matrices D1 are generated from the Vietoris Rips complex of the Cα atoms in PDBID: 1AXC at 10 Å.(a) Here 1D homology generators w 1 are taken from the homology generators of D1 with eigenvalues as 0. A thick edge with dark blue color indicates large magnitude of the value, while a thinner edge with light blue color means the corresponding the 1D homology generator has a value with small magnitude on this 1-simplex.Each 1D homology generator forms an individual loop or circle.(b) The four 0D non-homology generators w 0 are taken from the non-homology generators of D1 with the four smallest positive eigenvalues.Note that these 0D nonhomology generators are defined on nodes (0-simplices).Nodes with negative values are colored in red while nodes with positive values are of blue color.It can be seen that the nodes in the structure can be naturally clustered into groups based on the signs of these 0D non-homology generators.

a∈RF
(a) and C * = C * (F(∞), R).Note that C * can be endowed with an innerproduct •, • .Further, a subspace C * (F(a), R) would inherit the inner product structure of C * and a boundary operator given by the restriction ∂ a p = ∂ p | Cp(F (a),R) : C p (F(a), R) → C p−1 (F(a), R).Here ∂ * is the boundary operator of C * .For convenience, we shall write C a p = C p (F(a), R).For a pair of simplicial complexes F(a) ⊂ F(b), we consider the inclusion map ι : F(a) → F(b).For p ∈ N, the subspace C a,b p := {x ∈ C b p : ∂ b p (x) ∈ C a p−1 } ⊆ C b p , which consists of the p-chains in C b p such that their images are under the boundary operator ∂ b p in the subspace C a p−1 of C b p−1 .Also, we have a linear operator ∂ a,b p = ∂ b p | C a,b p : C a,b p → C a p−1 , which induces an adjoint operator (∂ a,b p ) * : C a p−1 → C a,b p with respect to the inner product •, • .Let n a,b p := dim(C a,b p ). Then following commutative diagram is thus induced by ι.
FIG. 2. Illustration of three homology generators and Fiedler vector from (a): discrete Dirac matrix and (b)-(c): weighted Dirac matrix (from weighted simplicial complexes).For the discrete Dirac matrix, the three homology generators represents one 1D component and two 2D circles.By assigning simplex σ with different weight wσs, three weighted simplicial complexes are constructed in (b) and (c).In Fig. 2(b), the weighted simplicial complex consists of all weights wσ equal to 1. Fig. 2(c) shows two weighted simplicial complexes by changing the weights of edge e1 from 1 to 10 and 0.01 while the rest of weights remain unchanged.The magnitude of the homology generators are influenced by these weights and are reflected based on their thickness and darkness.For the homology generators, the edges (or vertices) are thicker and in darker blue color if they have a larger magnitude.Similarly, the edges and vertices are colored in red/blue if their elements in the Fiedler vectors have positive/negative sign.The magnitudes of their values in Fiedler vectors are represented by the thickness of edges and size of vertices.
Similarly, the matrices (B a,b p ) B a,b p and B a,b p+1 (B a,b p+1 ) are the p-th persistent lower and upper Hodge Laplacians (L down p+1 ) a,b and (L up p+1 ) a,b respectively.Based on (13), the following result shows that the nullity of p-th persistent Dirac operator equals to the rank of B p+2 plus the sum of k-th persistent Betti numbers, where 0 ≤ k ≤ p + 1. ker D a,b p = ker(D a,b p ) 2 = ker(L down p+1 ) a,b ⊕ ) a,b refers to the direct sum of (a, b)persistent homology groups.The (a, b)-persistent homology groups characterizes the homology generators that are born at time a and survive to time b.Fig. 3 illustrates the persistent Dirac analysis of the guanine molecule (using all-atom representation).More specifically, Fig. 3(a) shows the Vietoris-Rips complex of the guanine molecule when filtration parameter f = 0.0 Å, 0.75 Å, 1.2 Å, 1.5 Å and 1.8 Å.In particular, triangles first appear around 1.2 Å and tetrahedron starts to appear at 1.5 Å. Fig. 3(b) shows the corresponding Dirac matrix D 2 .The size of the Dirac matrix D 2 consistently increases during the filtration process.

FIG. 3 .
FIG. 3. Illustration of the filtration process of the guanine molecule (a), its associated Dirac matrices (b), and persistent attributes (c).In the filtration process, more simplices are formed in simplicial complex and the size of Dirac matrix increases.The eigenspectrum of Dirac matrices changes in the filtration process.The changes in eigenspectrum are being converted into a series of 12 statistical and combinatorial attributes.One of the statistical attribute, persistent multiplicity, provides quantitative analysis to the change in zero eigenvalues of Dirac matrices while the remaining 11 persistent attributes are derived from the non-zero eigenvalues.

Fig. 4
FIG. 4. The clustering of 9 types of OIHP molecular dynamics (MD) trajectories.Three feature generation schemes are considered, including (a) XYZ-coordinates, (b) Discrete Dirac at 3.5 Å and (c) Persistent Dirac.Each trajectory contains 1000 configurations and t-SNE model is used for clustering (of the last 500 configurations at equilibrium).The x-axis and y-axis are the two principal components obtained from the t-SNE model.