PGxCorpus, a manually annotated corpus for pharmacogenomics

Pharmacogenomics (PGx) studies how individual gene variations impact drug response phenotypes, which makes PGx-related knowledge a key component of precision medicine. A significant part of the state-of-the-art knowledge in PGx is accumulated in scientific publications, where it is hardly reusable by humans or software. Natural language processing techniques have been developed to guide the experts who curate this body of knowledge, but existing work is limited by the absence of a high-quality annotated corpus focusing on the PGx domain. In particular, this absence restricts the use of supervised machine learning. This article introduces PGxCorpus, a manually annotated corpus designed to fill this gap and to enable the automatic extraction of PGx relationships from text. It comprises 945 sentences from 911 PubMed abstracts, annotated with PGx entities of interest (mainly gene variations, genes, drugs and phenotypes) and the relationships between them. In this article, we present the corpus itself, its construction, and a baseline experiment that illustrates how it may be leveraged to synthesize and summarize PGx knowledge.


Sentence representation with word embeddings
Both our NER and RE models are fed with word embeddings (i.e., continuous vectors) of dimension $d_w$, along with extra entity embeddings of size $d_e$. The RE model is additionally fed with nested entity embeddings of size $d_n$.
Regarding word embeddings, given a sentence of $N$ words, $w_1, w_2, \ldots, w_N$, each word $w_i \in \mathcal{W}$ is embedded in a $d_w$-dimensional vector space by applying a lookup-table operation: $LT_W(w_i) = W_{w_i}$, where the matrix $W \in \mathbb{R}^{d_w \times |\mathcal{W}|}$ holds the parameters to be trained in this lookup-table layer. The dictionary $\mathcal{W}$ is composed of all the words of the corpus. Each column $W_{w_i} \in \mathbb{R}^{d_w}$ is the embedding vector of the word $w_i$ in our dictionary $\mathcal{W}$.
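As a minimal sketch of the lookup-table operation, assuming a toy three-word dictionary and $d_w = 4$ (both illustrative; the function names `build_lookup_table` and `embed` are hypothetical and not taken from the original implementation), each word is mapped to its own column of the embedding matrix:

```python
import random

def build_lookup_table(vocabulary, d_w, seed=0):
    """Map each word to a d_w-dimensional vector (one column of W).

    Values are random here for illustration; in the experiment the
    word embeddings are pre-trained rather than randomly initialized.
    """
    rng = random.Random(seed)
    return {word: [rng.uniform(-0.1, 0.1) for _ in range(d_w)]
            for word in vocabulary}

def embed(sentence, table):
    """Apply the lookup-table operation to every word of a tokenized sentence."""
    return [table[word] for word in sentence]

vocabulary = ["CYP2C9", "genotypes", "acenocoumarol"]  # toy dictionary (assumption)
table = build_lookup_table(vocabulary, d_w=4)
vectors = embed(["CYP2C9", "genotypes"], table)
```

Storing the table as a dictionary of columns rather than as a dense matrix is only a convenience of this sketch; the two views are equivalent.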
Besides word embeddings, two additional embeddings, called entity embeddings, are used to feed our models: (1) one represents the type of entity a word belongs to; (2) the other represents whether the word starts, continues or ends the mention of an entity. Both use a standard encoding of tags with Begin, Intermediate, Other, End and Single (BIOES) prefixes [6]. These two entity embeddings are constructed slightly differently for NER and RE: for NER, they encompass tags for entities pre-annotated with PubTator and tags for entities annotated with PGxCorpus types, whereas for RE, they consider tags for the entity types of the corpus, plus special tags that mark the pairs of entities between which a relationship may stand.
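The BIOES encoding can be illustrated with a short sketch. It assumes entities are given as non-overlapping token spans `(start, end, type)` with an exclusive end; the function name `bioes_tags`, the example tokens and the type labels are illustrative choices, not the corpus's exact label set.

```python
def bioes_tags(n_tokens, spans):
    """Return one BIOES-prefixed tag per token.

    spans: list of (start, end, entity_type), end exclusive, non-overlapping.
    """
    tags = ["O"] * n_tokens                 # Other: token outside any entity
    for start, end, entity_type in spans:
        if end - start == 1:
            tags[start] = "S-" + entity_type        # Single-token entity
        else:
            tags[start] = "B-" + entity_type        # Begin
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + entity_type        # Intermediate
            tags[end - 1] = "E-" + entity_type      # End
    return tags

tokens = ["VKORC1", "and", "CYP2C9", "genotypes", "and", "acenocoumarol", "sensitivity"]
spans = [(0, 1, "Gene"), (2, 4, "Variation"), (5, 6, "Drug")]  # toy annotation
tags = bioes_tags(len(tokens), spans)
# tags == ["S-Gene", "O", "B-Variation", "E-Variation", "O", "S-Drug", "O"]
```

Each distinct tag would then index its own embedding vector, in the same way words index the word lookup table.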
For the RE model only, a nested entity embedding of size $d_n$ is added to the word and entity embeddings to represent the entity types that may be included in the nested entities involved in relations. For each word, one nested entity embedding is added per entity type. Given an entity type, this embedding can take one of two values: (a) absent, if the word is not part of one of the two potentially related entities, or if it is part of one of them but that entity includes no entity of the given type; (b) present, if the word is part of one of the two entities and that entity includes another entity of the given type.
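The present/absent rule can be sketched as follows, assuming each of the two candidate entities is described by its token span and by the set of entity types nested inside it. The function name `nested_flags`, the data layout and the type labels are hypothetical.

```python
def nested_flags(n_tokens, candidate_entities, entity_types):
    """Return, for each token, a dict {entity_type: "present" | "absent"}.

    candidate_entities: the two potentially related entities, each given as
    (start, end, included_types) with an exclusive end.
    """
    flags = [{t: "absent" for t in entity_types} for _ in range(n_tokens)]
    for start, end, included_types in candidate_entities:
        for i in range(start, end):
            for entity_type in included_types:
                flags[i][entity_type] = "present"
    return flags

tokens = ["CYP2C9", "genotypes", "and", "acenocoumarol", "sensitivity"]
entity_types = ["Gene", "Drug"]                    # toy type dictionary
candidates = [(0, 2, {"Gene"}), (3, 5, {"Drug"})]  # two candidate entities
flags = nested_flags(len(tokens), candidates, entity_types)
```

Here the tokens of "CYP2C9 genotypes" are flagged present for the Gene type only, and the token "and", which belongs to neither candidate entity, is absent for every type.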
Finally, word, entity and nested entity embeddings are concatenated to form the input corresponding to a given word. We denote by $x_i$ the concatenated input corresponding to the $i$-th word.
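A short sketch of the concatenation, using the RE sizes reported in the experimental settings ($d_w = 200$, two entity embeddings of size 20, nested embeddings of size 5 per entity type). The number of entity types is set to 10 purely for illustration, and `concat_input` is a hypothetical helper name.

```python
def concat_input(word_vec, entity_vecs, nested_vecs=()):
    """Concatenate word, entity and (for RE) nested entity embeddings into x_i."""
    x = list(word_vec)
    for vec in entity_vecs:
        x.extend(vec)
    for vec in nested_vecs:
        x.extend(vec)
    return x

d_w, d_e_each, d_n_each = 200, 20, 5
n_entity_types = 10  # assumption: the size of the type dictionary is not given here

word_vec = [0.0] * d_w
entity_vecs = [[0.0] * d_e_each, [0.0] * d_e_each]            # type tag, pair tag
nested_vecs = [[0.0] * d_n_each for _ in range(n_entity_types)]

x_i = concat_input(word_vec, entity_vecs, nested_vecs)
input_size = len(x_i)  # d_w + 2 * 20 + 5 * n_entity_types
```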

Named entity recognition model
The core of the CNN model used for NER is described in [2]. We adapted it, along with the experimental settings, to fit a particularity of PGxCorpus: about one third of its entities are discontiguous or nested (2,347 of 6,761 entities, see Table 2).
Recognizing discontiguous entities is a complex and open problem in NLP, and this baseline experiment does not aim at tackling it. For this reason, we discarded the annotations of discontiguous entities from both our train and test sets (265 of 6,761 entities). Nested entities are handled in our experiment by applying the NER model recursively, as many times as there are nesting levels. Entities discovered during one iteration of the model are used as input for the next iteration. In the example of Figure 1, a first iteration recognizes the three entities "VKORC1", "CYP2C9" and "acenocoumarol". The second iteration then takes them as input to recognize "CYP2C9 genotypes" and "acenocoumarol sensitivity". "VKORC1 genotypes" is discontiguous and is consequently discarded from the experiment.
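The recursive application can be sketched with a stand-in tagger. Everything below is illustrative: `toy_tagger` is a hand-written rule that replaces the trained CNN, the token sequence is a simplified fragment rather than the actual sentence of Figure 1, and stopping early when no new entity is found is a convenience of this sketch.

```python
LEXICON = {"CYP2C9": "Gene", "acenocoumarol": "Drug"}         # toy first-level entities
HEADS = {"genotypes": "Variation", "sensitivity": "Phenotype"}  # toy nesting rule

def toy_tagger(tokens, known_entities):
    """Stand-in for the NER model: uses already known entities as extra input."""
    entities = [(i, i + 1, LEXICON[t]) for i, t in enumerate(tokens) if t in LEXICON]
    for start, end, _ in known_entities:
        # A known entity followed by a head word forms a larger, nesting entity.
        if end < len(tokens) and tokens[end] in HEADS:
            entities.append((start, end + 1, HEADS[tokens[end]]))
    return entities

def recursive_ner(tokens, tag_once, max_levels):
    """Apply the tagger once per nesting level, feeding back what was found."""
    found = []
    for _ in range(max_levels):
        new = [e for e in tag_once(tokens, found) if e not in found]
        if not new:
            break
        found.extend(new)
    return found

tokens = ["CYP2C9", "genotypes", "and", "acenocoumarol", "sensitivity"]
entities = recursive_ner(tokens, toy_tagger, max_levels=3)
```

The first pass finds the two single-token entities; the second pass uses them to find "CYP2C9 genotypes" and "acenocoumarol sensitivity"; the third pass finds nothing new and the loop stops.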
Formally, given an input sequence $x_1, \ldots, x_N$, a classical sliding-window approach is followed by applying a two-layer neural network (NN) on each possible window of size $k$. We denote by $P$ the set of BIOES-prefixed tags. Given the $i$-th window, the NN computes a vector of scores $s_i = [s_1, \ldots, s_{|P|}]$, where $s_t$ is the score of the BIOES-prefixed tag $t \in P$ associated with the input $x_i$. Scores of window $i$ are given by the following formula:
$$s_i = W_2 \, h\big(W_1 [x_{i-\frac{k-1}{2}}, \ldots, x_{i+\frac{k-1}{2}}]\big),$$
where the matrices $W_1 \in \mathbb{R}^{d_h \times k|W|}$ and $W_2 \in \mathbb{R}^{|P| \times d_h}$ are the trained parameters of the NN, $h$ is a pointwise non-linear function such as the hyperbolic tangent, $d_h$ is the number of hidden units and $k$ is the size of the window. Inputs with indices exceeding the input boundaries, i.e., when $i - \frac{k-1}{2} < 1$ or $i + \frac{k-1}{2} > N$, are mapped to a special padding vector, which is also learned. Scores of each window are finally given to a lattice module that aggregates the BIOES-prefixed tags from our tagger module in a coherent manner, to recover the predicted labels. For more details about this layer, please see [2].
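A minimal sketch of the window scorer, under stated assumptions: toy sizes, random untrained weights, a fixed zero padding vector (it is learned in the experiment), and a first layer of shape $d_h \times (k \cdot \text{input size})$ so that it can multiply the concatenated window. The lattice module is replaced by a plain per-position argmax, which is a simplification and not the decoding used in [2].

```python
import math
import random

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def window_scores(inputs, W1, W2, padding, k):
    """Return one score vector per position, computed on a centred window of size k."""
    half = (k - 1) // 2
    n = len(inputs)
    scores = []
    for i in range(n):
        window = []
        for j in range(i - half, i + half + 1):
            # Positions outside the sentence are replaced by the padding vector.
            window.extend(inputs[j] if 0 <= j < n else padding)
        hidden = [math.tanh(a) for a in matvec(W1, window)]
        scores.append(matvec(W2, hidden))
    return scores

d_in, d_h, k, n_tags, N = 6, 8, 5, 4, 7   # toy sizes (assumptions)
inputs = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(N)]
padding = [0.0] * d_in
W1 = rand_matrix(d_h, k * d_in)
W2 = rand_matrix(n_tags, d_h)

scores = window_scores(inputs, W1, W2, padding, k)
# Simplified decoding: best tag per position, ignoring tag-sequence coherence.
predicted = [max(range(n_tags), key=s.__getitem__) for s in scores]
```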

Relation extraction model
The model used for RE is a multichannel CNN (MCCNN) described in [5], where it was successfully applied to the extraction of drug-drug and protein-protein interactions. It takes as input a sentence and two recognized entities, and computes a fixed-size representation by composing the input word embeddings. This representation is given to a scorer, which computes a score for each possible type of relationship. Sentences with more than two entities are processed iteratively, once for each possible pair of entities between which a relation may stand, and in both directions since relations may be oriented.
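Enumerating the candidate pairs in both directions can be sketched with ordered pairs. The predicate `may_relate` is a hypothetical placeholder for whatever constraint decides that a relation may stand between two entities; it accepts every pair by default here.

```python
from itertools import permutations

def candidate_pairs(entities, may_relate=lambda source, target: True):
    """Return every ordered pair (source, target) of distinct entities."""
    return [(a, b) for a, b in permutations(entities, 2) if may_relate(a, b)]

entities = ["CYP2C9 genotypes", "acenocoumarol", "acenocoumarol sensitivity"]
pairs = candidate_pairs(entities)
n_pairs = len(pairs)  # 3 entities give 3 * 2 = 6 ordered pairs
```

Each ordered pair would be presented to the RE model as a separate instance of the same sentence, with the pair marked by the dedicated entity tags.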
The MCCNN applies CNNs with several kernel sizes to each input channel of word embeddings. In other words, it considers different embedding channels, i.e., different versions of the word embeddings associated with each word, which allows it to capture different aspects of the input words. Formally, given an input sequence of word representations (i.e., concatenations of word and entity embeddings) $x_1, \ldots, x_N$, applying a kernel to the $i$-th window of size $k$ is done using the following formula:
$$c_i = h\Big(\sum_{j} W^{\top} [x_i, \ldots, x_{i+k-1}]_j + b\Big),$$
where $[\cdot]_j$ denotes the concatenation of inputs from channel $j$, the sum runs over the input channels, $W \in \mathbb{R}^{(d_w+d_e) \times d_h}$ and $b \in \mathbb{R}^{d_h}$ are the parameters, $d_h$ is the size of the hidden layer, $h$ is a pointwise non-linear function such as the hyperbolic tangent, and $N - k + 1$ is the number of windows. For each kernel, a fixed-size representation $r^* \in \mathbb{R}^{d_h}$ is then obtained by applying a max-pooling over time (here, "time" means the position in the sentence):
$$[r^*]_m = \max_{1 \le i \le N-k+1} [c_i]_m, \quad m = 1, \ldots, d_h.$$
We denote by $K$ the number of kernels with different sizes. A sentence representation $r \in \mathbb{R}^{d_s}$ (with $d_s = K \cdot d_h$) is finally obtained by concatenating the outputs of the $K$ kernels: $r = [r^*_1, \ldots, r^*_K]$. The sentence representation is then passed to a single-layer NN, which outputs a score for each possible relation type:
$$s(r) = W^{(s)\top} r + b^{(s)},$$
where $W^{(s)} \in \mathbb{R}^{d_s \times |S|}$ and $b^{(s)} \in \mathbb{R}^{|S|}$ are the trained parameters of the scorer and $|S|$ is the number of possible relation types. The scores are interpreted as probabilities using a softmax layer [1].
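The kernel, max-pooling, concatenation and scoring steps described above can be sketched in plain Python. This is an illustrative forward pass under stated assumptions, not the implementation of [5]: sizes are toy values, weights are random and untrained, windows are taken without padding, each kernel matrix is stored with shape $d_h \times (k \cdot \text{input size})$, and the scorer matrix is stored as $|S| \times d_s$ (the transpose of the layout used in the text).

```python
import math
import random

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def kernel_representation(channels, W, b, k):
    """Apply one kernel of size k to every window, then max-pool over time."""
    n = len(channels[0])
    pooled = [-math.inf] * len(b)
    for i in range(n - k + 1):
        pre = list(b)
        for sequence in channels:               # sum the contributions of all channels
            window = [v for x in sequence[i:i + k] for v in x]
            pre = [p + a for p, a in zip(pre, matvec(W, window))]
        out = [math.tanh(p) for p in pre]
        pooled = [max(m, o) for m, o in zip(pooled, out)]
    return pooled

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

N, d_in, d_h, n_relations = 7, 6, 4, 3          # toy sizes (assumptions)
kernel_sizes = (3, 5)
channels = [[[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(N)]
            for _ in range(2)]                  # two embedding channels

r = []                                          # sentence representation
for k in kernel_sizes:
    W, b = rand_matrix(d_h, k * d_in), [0.0] * d_h
    r.extend(kernel_representation(channels, W, b, k))

W_s, b_s = rand_matrix(n_relations, len(r)), [0.0] * n_relations
scores = [s + bias for s, bias in zip(matvec(W_s, r), b_s)]
probabilities = softmax(scores)
```

Max-pooling over time is what makes the representation independent of sentence length: whatever the number of windows, each kernel contributes exactly $d_h$ values.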

Experimental settings
Word embeddings were pre-trained using the method described in [4] on about 3.4 million PubMed abstracts, corresponding to articles published between Jan. 1, 2014 and Dec. 31, 2016. Our models were trained by minimizing the negative log-likelihood over the training data. All parameters (embeddings, weights $W$ and biases $b$) were iteratively updated via backpropagation. We used a hard tanh function as the activation function $h$. Hyper-parameters were tuned using a 10-fold cross-validation by selecting the values leading to the best averaged performance, and were then fixed for the rest of the experiment.
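For reference, the hard tanh activation and the negative log-likelihood objective can be written in a few lines. This is a generic sketch of the two standard definitions; the helper names and the toy probabilities are illustrative.

```python
import math

def hard_tanh(x):
    """Piecewise-linear approximation of tanh: identity on [-1, 1], clipped outside."""
    return max(-1.0, min(1.0, x))

def negative_log_likelihood(probabilities, gold_index):
    """Loss of one example: minus the log-probability assigned to the gold label."""
    return -math.log(probabilities[gold_index])

def mean_nll(batch):
    """Average loss over (probabilities, gold_index) pairs."""
    return sum(negative_log_likelihood(p, g) for p, g in batch) / len(batch)

batch = [([0.7, 0.2, 0.1], 0), ([0.1, 0.8, 0.1], 1)]  # toy model outputs
loss = mean_nll(batch)
```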
For NER, the CNN was fed with word embeddings of size $d_w = 100$ and two types of entity embeddings (one with PubTator tags, used only for the first iteration of the model, and one with PGxCorpus tags, used in the following iterations) of size $d_e = 20 \times 2$ (20 for each type of tags). The size of the hidden layer was fixed to $d_h = 200$, the kernel size to $k = 5$ and the learning rate to 0.01.
For RE, the MCCNN was fed with word embeddings of size $d_w = 200$ and two types of entity embeddings (one with PGxCorpus entity tags; one to identify the pairs of entities between which a relation may stand) of size $d_e = 20 \times 2$. The size of the nested entity embeddings was set to $d_n = 5 \times |E|$, where $E$ is the entity type dictionary.
We used two kernels of size 3 and 5. Following [3], both channels were initialized with pre-trained word embeddings, but gradients were backpropagated only through one of the channels. The size of the hidden layer was fixed to $d_h = 200$ and the learning rate to 0.01.
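The two-channel initialization can be sketched as two copies of the same pre-trained table, only one of which receives gradient updates. The toy vectors, the gradient values and the helper `sgd_update` are illustrative; only the learning rate of 0.01 comes from the settings above.

```python
import copy

# Toy pre-trained vectors (assumption: real ones are pre-trained on PubMed abstracts).
pretrained = {"CYP2C9": [0.1, -0.2], "acenocoumarol": [0.05, 0.3]}

static_channel = copy.deepcopy(pretrained)     # never updated during training
trainable_channel = copy.deepcopy(pretrained)  # fine-tuned by backpropagation

def sgd_update(channel, word, gradient, learning_rate=0.01):
    """One gradient step on the embedding of a single word."""
    channel[word] = [v - learning_rate * g for v, g in zip(channel[word], gradient)]

# Gradients are applied to the trainable channel only.
sgd_update(trainable_channel, "CYP2C9", gradient=[1.0, -1.0])
```

Keeping one channel frozen preserves the general-purpose pre-trained vectors, while the other channel adapts to the task.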
For both NER and RE, we applied a dropout regularization after the embedding layers [7], with a dropout probability fixed to 0.5. Both models were evaluated using a 10-fold cross-validation. Each result of this evaluation is an average over 100 experiments: 10 experiments for each of the 10 folds, each starting with a different random initialization. Random initialization concerns entity embeddings, weights and biases, but not word embeddings, which are pre-trained rather than randomly initialized.
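The evaluation protocol (10 folds, 10 random initializations per fold, 100 scores averaged) can be sketched as follows. The callback `toy_train_and_score` is a hypothetical stand-in that returns a pseudo-random number instead of training a model, the fold assignment by index modulo is an assumption, and the way the standard deviation is computed here may differ from the one reported in the tables.

```python
import random
import statistics

def cross_validate(n_items, train_and_score, n_folds=10, n_runs=10):
    """Run n_runs differently seeded experiments on each of n_folds folds."""
    scores = []
    indices = list(range(n_items))
    for fold in range(n_folds):
        test_idx = [i for i in indices if i % n_folds == fold]
        train_idx = [i for i in indices if i % n_folds != fold]
        for run in range(n_runs):
            seed = fold * n_runs + run          # a different initialization each time
            scores.append(train_and_score(train_idx, test_idx, seed))
    return statistics.mean(scores), statistics.stdev(scores), scores

def toy_train_and_score(train_idx, test_idx, seed):
    """Placeholder: a real callback would train a model and return its F1-score."""
    return 0.5 + random.Random(seed).uniform(-0.05, 0.05)

# 945 is the number of sentences in the corpus.
mean_f1, sd_f1, all_scores = cross_validate(945, toy_train_and_score)
```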
Evaluation settings (table columns): entity matching exact, exact, partial, partial; considering hierarchy no, yes, no, yes.
Table S1: Detailed performances of the task of named entity recognition in terms of Precision (P), Recall (R), F1-score (F1) and F1-score standard deviation in brackets (SD F1).
Table S2: Detailed performances of the task of relation extraction in terms of Precision (P), Recall (R), F1-score (F1) and F1-score standard deviation in brackets (SD F1). Note that for leaves, performances are unchanged when considering the hierarchy.