Cartography of rhodopsin-like G protein-coupled receptors across vertebrate genomes

We conduct a cartography of rhodopsin-like non-olfactory G protein-coupled receptors in the Ensembl database. The most recent genomic data (releases 90–92, 90 vertebrate genomes) are analyzed through the online interface, and receptors are mapped on phylogenetic guide trees that were constructed from a set of ~14,000 amino acid sequences. This snapshot of genomic data suggests that vertebrate genomes harbour 142 clades of GPCRs without human orthologues. Among those, 69 have not, to our knowledge, been mentioned or studied previously in the literature, of which 28 are distant from existing receptors and are likely new orphans. These newly identified receptors are candidates for more focused evolutionary studies, such as chromosomal mapping, as well as for in-depth pharmacological characterization. Interestingly, we also show that 37 of the 72 human orphan (or recently deorphanized) receptors included in this study cluster into nineteen closely related groups, which implies that there are fewer ligands to be identified than previously anticipated. Altogether, this work has significant implications for the discussion of nomenclature issues for GPCRs.


Procedure to manually annotate the Ensembl tree (examples NPFFR1 and HTR7)
The annotation of the trees was essentially manual. Each annotation is based on reasoning that accounts in particular for the annotation of orthologues, the size of the monophyletic groups, the existence of duplicates, and the validity of the sequences (assessed visually). The text in the manuscript is kept to a minimum: for reasons of size, we do not provide the reasoning beyond the identification of 147 receptor clades. Some of the annotations hinge on a single critical species, while others are more robust. The most difficult clusters of receptors were left unannotated (see text). Altogether, we aim to provide the annotation that is the most parsimonious explanation of the Ensembl.R92 data.
The data are publicly available, so readers should be able to evaluate the robustness of each newly suggested receptor clade.
Below, we illustrate the reasoning behind the annotation of the additional NPFFR1 and HTR7 receptors.

[Figure: annotated Ensembl gene tree for NPFFR1. Annotations recovered from the figure follow.]
Manuscript text: "Two reptile genes branch with either of these receptors." The text assumes that the branching is not strong enough to assign them to NPFFR3, but this is not the case in the guide tree.
gar/fishes(11), see ENSLOCP00000019807: named NPFFR3. Note the internal duplication in cave fish/zebrafish, consistent with the 3R and annotated by "2" on the guide tree.
gar/fishes(3), see ENSLOCP00000018374: named NPFFR4.
Other groups: assigned to mammalian NPFFR1; assigned to sauropsid/amphibian/coelacanth NPFFR1; assigned to fish NPFFR1.
The Xenopus sequence should be closer to the coelacanth NPFFR1, but this is a small inconsistency; it is not parsimonious to suggest an internal duplication based on the data at hand.
Once the NPFFR1 from fish/sauropsid are assigned, these three groups of genes require manual annotation.
Distant branching and the presence of gar/zebrafish/cave fish and a sauropsid clade suggest new subtypes independent from NPFFR1. Duplication in gar and internal duplication in NPFFR3 suggest that these receptors arose before the 3R, i.e. there are two subtypes, one being internally duplicated.

[Figure: annotated Ensembl gene tree for HTR7. Annotations recovered from the figure follow.]
Fish group: these are closest to mammalian HTR7 and have been assigned as its fish orthologues.
Other groups: assigned to sauropsid HTR7; assigned to mammalian HTR7.
Manuscript text: "5-HT7B (HTR7B) found in gar/fishes(10)/coelacanth, see ENSLACP00000008875, that has likely been cloned in zebrafish". Note the inconsistencies in the automated naming.
Manuscript text: "a set of genes in gar/coelacanth/sauropsids(7), see ENSLACP00000011078, that we name 5-HT7C (HTR7C)".
Hagfish (R.92) is not included in our dataset and is not described in this article.

Procedure to extract the sequence data used to construct the guide trees

Data collection
The predicted GPCR transcripts and associated contents, such as gene trees, automatically computed in Ensembl.R67 (May 2012 release) were used as a starting point for this study (Flicek et al. 2012). In total, about 14,000 amino acid sequences divided into 71 groups were retrieved.

Data curation
Sequence alignments of transcripts (about 14,000 in total), automatically constructed by Ensembl, were visually inspected and found to be free of apparent alignment errors in the TM regions; this is because each alignment is composed of closely related sequences that are easy to align. Individual transcripts may nevertheless contain errors due to sequencing errors or mistranslation; in particular, a failure to properly detect the start or stop codon of a gene can lead to fragmentary sequences. Furthermore, in poorly assembled genomes, two or more occurrences of one gene, resembling duplicates, may be found. Automated curation was therefore conducted. First, for each transmembrane segment we manually annotated as a reference position the well-defined conserved position in class A GPCRs (see introduction): N1.50, D2.50, R3.50, W4.50, P5.50, P6.50, P7.50. Where the amino acid at that position varied, the canonical motif of that transmembrane segment was used in its entirety to identify the equivalent position. These "pivot" positions were assigned to each of the seven TMs for each of the 71 groups of aligned sequences extracted from Ensembl, which is de facto equivalent to aligning all 14,000 sequences in one large multiple sequence alignment. A human receptor sequence (the first one to occur in the file) was furthermore taken as the reference for the curation of each of the 71 groups. The gene trees were further manually annotated to assign working names to all transcripts (including sequences without close relatives), and scripts were used to replace Ensembl IDs with these working names in order to ease the analysis.
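The role of the pivot positions can be illustrated with a minimal sketch. It assumes ungapped toy sequences and 0-based pivot indices supplied by hand; the function names, the sequences and the window sizes are hypothetical and are not taken from the scripts used in this study. The point is only that, once the X.50 residue of a helix is known in each sequence, residues from different alignments can be compared by their offset from that pivot.

```python
def bw_label(tm: int, pivot_index: int, residue_index: int) -> str:
    """Label a residue by its offset from the pivot (X.50) of helix `tm`."""
    return f"{tm}.{50 + residue_index - pivot_index}"


def tm_window(seq: str, pivot_index: int, before: int, after: int) -> str:
    """Return the residues from pivot-before to pivot+after (0-based indices)."""
    start = max(0, pivot_index - before)
    return seq[start:pivot_index + after + 1]


# Toy TM1 fragments from two hypothetical groups; the conserved N is the pivot.
seq_a, pivot_a = "ALGNLLV", 3
seq_b, pivot_b = "MMTVGNVLI", 5

# Windows centred on the pivot have the same length and a shared register,
# so column i in one window corresponds to column i in the other.
window_a = tm_window(seq_a, pivot_a, before=2, after=2)
window_b = tm_window(seq_b, pivot_b, before=2, after=2)
print(window_a, window_b)
print(bw_label(1, pivot_a, 5))  # residue two positions after the pivot in TM1
```

In a real alignment the offsets would have to be counted in residues rather than alignment columns wherever gaps occur, which this sketch does not handle.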
Secondly, taking advantage of the reference sequences and the annotated pivot positions, we kept only sequences that have at most four deletions (gaps) in each of the seven TM segments and a minimum of 10% sequence identity to their reference sequence. These values allowed us to eliminate mistranslated sequences and fragments while keeping the more distant sequences of invertebrates, if any. Using a more stringent criterion (<15%) would lead to similar overall results (data not shown). Human receptors were chosen as references since they are well characterized, appropriately named, and should not include missing regions in their amino acid sequences. The curation step left us with ~11,000 sequences to work with (Table S1). The procedure worked well, and only in a few cases (GPR119, Adenosine A3, PTGER family) did we apparently eliminate too many sequences. Visual inspection shows that the multiple protein sequence alignments used to build the trees are of very good quality, in particular at the seven "pivot" positions.

Phylogenetic and syntenic analysis suggest four GnRH receptors, each divided in two groups in zebrafish; see Tello et al., 2008. Chromosomal mapping: see Kim et al., 2011.