Enriched atlas of lncRNA and protein-coding genes for the GRCg7b chicken assembly and its functional annotation across 47 tissues

Gene atlases for livestock are steadily improving thanks to new genome assemblies and new expression data improving the gene annotation. However, gene content varies across databases due to differences in RNA sequencing data and bioinformatics pipelines, especially for long non-coding RNAs (lncRNAs) which have higher tissue and developmental specificity and are harder to consistently identify compared to protein coding genes (PCGs). As done previously in 2020 for chicken assemblies galgal5 and GRCg6a, we provide a new gene atlas, lncRNA-enriched, for the latest GRCg7b chicken assembly, integrating "NCBI RefSeq", "EMBL-EBI Ensembl/GENCODE" reference annotations and other resources such as FAANG and NONCODE. As a result, the number of PCGs increases from 18,022 (RefSeq) and 17,007 (Ensembl) to 24,102, and that of lncRNAs from 5789 (RefSeq) and 11,944 (Ensembl) to 44,428. Using 1400 public RNA-seq transcriptome representing 47 tissues, we provided expression evidence for 35,257 (79%) lncRNAs and 22,468 (93%) PCGs, supporting the relevance of this atlas. Further characterization including tissue-specificity, sex-differential expression and gene configurations are provided. We also identified conserved miRNA-hosting genes with human counterparts, suggesting common function. The annotated atlas is available at gega.sigenae.org

Six databases containing lncRNAs and PCGs -for five of them -have been selected to create an enriched genome annotation.This set includes: (i) "NCBI-RefSeq" (abbreviated in "RefSeq") and "EMBL-EBI Ensembl/ GENCODE" (abbreviated in "Ensembl") databases, that are frequently updated and widely considered as references; (ii) two databases from FAANG multi-tissue projects, namely the Fr-AgENCODE annotation ("FrAg") and the UC Davis annotation ("Davis"); (iii) the INRAE annotation ("Inrae") previously used in Jehl et al. 13 , for the gene enriched-atlas according to the GRCg6a assembly; (iv) the "Noncode" database dedicated to non-coding RNAs.Comparison of the content of the gene models in the databases (Fig. 1A) shows that PCGs overlap more with each other between databases than lncRNAs do.Thus, for PCGs, the "RefSeq"/"Ensembl"/"FrAg" dataset trio shows a high overlap rate around 95% globally, while consistency drops with the other annotations (75% for "Davis" and around 50% for "Inrae").For lncRNAs, the overlap rate ranges from 50% for "RefSeq"/"FrAg" to 7% for "Ensembl"/"Davis".Note that despite their reference status, the overlap rate does not exceed 37% between both "RefSeq" and "Ensembl" reference databases.Consequently, as indicated by the low percentage of lncRNA overlapping, but also by the admittedly high but lower than 100% for PCGs, these resources appear complementary.As shown in Fig. 1B-top (and Sup.Table 6), while PCG numbers are quite constant globally, with 18,022, 17,007, 14,078 and 18,341 for "RefSeq", "Ensembl, "FrAg" and "Davis" respectively, the number of lncRNAs is more variable, ranging from 5789 at minimum for "RefSeq" to over 10,000 for the other databases, with, interestingly, a higher proportion of mono-exonic lncRNAs for "Inrae" and "Davis" (more than 65% against less than 24% for the other databases).This gene model variability between the six databases is also observed at the transcript level through the number of transcripts, which supports gene models (Fig. 1B-bottom).Overall, the number of transcripts per gene is higher for PCGs than for lncRNAs and shows a greater variability.While the median number of transcripts is between 1 and 3 across the databases for PCGs, it does not exceed 1 for lncRNAs.As the number of transcripts supporting gene locus still low, regardless the PCG or lncRNA biotype, we chose to focus more on gene loci, level at which expression analyses are mostly performed, than on transcripts.
Based on these observations, we integrated the six resources (Fig. 2A) by sequentially adding gene loci from each database, keeping only gene loci that had no overlapping transcripts with transcripts already present in www.nature.com/scientificreports/ the growing catalog (see "Methods" for more details).Two gene models were considered overlapping if at least one of their transcripts had at least one of their exons with a base pair (1 bp) in common and on the same strand (Fig. 2B).Since the conserved gene models in the enriched genome annotation -with their associated transcript models -are the ones that appear first during the successive additions of annotations, the gathering order is crucial.To better characterize the precision of transcripts models from each database, we computed the concordance between the annotated transcription start sites (TSS) and CAGE peaks from the FANTOM project (see Mat. & Meth.) (Fig. 1C).The resulting support was higher for PCG promoters than for lncRNAs: the overlap rate between TSS and CAGE peaks varies between 60% for "RefSeq" and 40% for the other databases for PCGs (except for "Davis" which reaches 15%) whereas this overlap rate do not exceed 15% for lncRNAs.However, the rank of each database with respect to CAGE peaks is preserved, except for "Ensembl" that is lower for lncRNAs with only 5% of concordance.
Interestingly, the PCG and lncRNA gene density per chromosome is correlated (R = 0.62, p val = 10 -5 ) with a higher gene density in micro-chromosomes, which are better annotated since the GRCg7b update (Fig. 1E).

Gene expression across 47 chicken tissues
In order to functionally characterize the 78,323 gene models, especially PCGs and lncRNAs, their expressions were quantified through 47 tissues (40 tissues stricto sensu and 7 cell types) coming from 36 datasets for a total of 1,400 individuals (see "Methods"), as presented in the Fig. 3A and Sup.Table 8.This whole dataset is not exhaustive but tends to represent an important part of the physiology of the chicken by including tissues representing different specific systems such as the nervous (shades of grey), digestive (shades of green), respiratory (shades of purple), sexual (shades of pink), circulatory (shades of brown), immune (shades of blue), or metabolic/energetic systems (shades of red).A total of 63,513 (81%) genes are considered as expressed (Fig. 3B-top), considering inter alia, a normalized expression threshold of 0.1 TPM (Transcripts Per Million) and TMM (Trimmed Mean of M values, see Mat. & Meth.).This includes 22,468 (93%) PCGs and 35,257 (79%) lncRNAs.Interestingly, among the 6,238 genes with no defined biotype, identified as "other", 4,490 (72%) are also considered as expressed.The number of expressed genes per source (Fig. 3B-bottom) averaged 75% but varied from 91% for "RefSeq" to 49% for "Noncode", which is below the other databases due to older gene models and its addition as a final step in the sequential aggregation of gene models.
Regardless of the biotype, the PCAs performed on the expression data (Fig. 3C and Sup.Fig. 1), resulted in a tissue-dependent clustering across all datasets, validating the consistency of the expression data.Interestingly, lncRNAs clustered first the data according to the tissues with the most tissue-specific genes, i.e. testis, brain and immunity (two sub-groups) related tissues.Moreover, considering all expressed genes, the 47 tissues are globally well classified across 14 classes with common biological functions (Fig. 3D).
However, depending on the considered biotype and expression threshold, the number of expressed gene is variable: 88% (19,819/22,468) of PCGs have an expression ≥ 1 TPM in at least one tissue against 57% (20,252/35,257) of lncRNAs.In details, for a threshold of 0.1 TPM, the number of expressed PCGs varies from 9887 (43.8%) in the caecal tonsils to 17,747 (78.6%) in utricle with an average of 14,837 (65.6%)PCGs expressed per tissue.Interestingly, the number of PCGs expressed in all tissues reached 7435, i.e. 75% of PCGs of the tissue with the lowest number of expressed PCGs and 33% of PCGs considered expressed in at least one tissue.For lncRNAs, a higher variability between tissues is observed.The number of expressed lncRNAs ranges from 1189 (3.3%) for the caecal tonsil to 16,708 (46.5%) in testis with an average of 7646 (21.3%) lncRNAs expressed per tissue.The number of lncRNAs expressed across all tissues reaches only 103, i.e. 9% of lncRNAs for the tissue with the lowest number of expressed lncRNAs and 0.3% of lncRNAs considered expressed in at least one tissue.An expression threshold at 1 TPM lowers the average of expressed PCGs to 11,139 (FC = 1.3) and sharply drops the average of expressed lncRNAs to 1972 (FC = 3.9) indicating that, as expected, lncRNAs are less expressed than PCGs within each tissue.All figures of PCGs and lncRNAs per tissue expressions are provided in Sup.Fig. 2 and Sup.Table 9.

Tissue specific expression across 47 chicken tissues
The tissue specificity, computed by the tau value (τ), seems to vary according to the expression threshold applied to consider a gene as expressed.For instance, considering a threshold of 0.1 and 1 TPM in at least one tissue, 86% (15,276/17,654) and 46% (18,417/40,071) of genes are tissue-specific (TS), respectively.According to this, we chose to work only with genes with an expression ≥ 1 TPM in at least one tissue (20,252 lncRNAs and 19,819 PCGs).PCGs and lncRNAs show different τ-values distributions, with lncRNAs globally more TS than PCGs, as already reported in Jehl et al., for chicken and dog 13 .Indeed, 23% (4631) of PCGs have a τ ≥ 0.9 against 68% (13,786) for lncRNAs (Fig. 4A).A local maximum around τ = 0.4 is specifically observed for PCGs, suggesting more ubiquitously expressed genes.
Interestingly, genes that are considered as TS based on their tau value with an expression ≥ 1 TPM in at least one tissue, can still be expressed in several tissues, with highly variable expression profiles across tissues.For instance, by comparing for each TS gene the expression in the two tissues with the highest expression, resulting fold-change values can range from 1 to 10 5 TPM.To consider these cases, TS genes were split into three categories according to the expression pattern across the 47 tissues ("mono_TS", "poly2to7_TS", and "poly8to47_TS", see "Methods").Results showed that 3378 (73%), 1073 (23%) and 180 (4%) PCGs were specific to a unique tissue, a set of n tissues (n ≤ 7) or without a specific group (n > 7), respectively.Same proportions were obtained for lncRNAs with 9858 (72%), 3225 (24%) and 703 (5%) genes, respectively (Fig. 4B).More precisely, Fig. 4C (top) indicates that the proportion of TS genes of each category was very variable across tissues.As an example, 74% of the 4905 genes which are TS for "testis" are mono-specific.In contrast, 0.8% of the 510 TS genes for the "duodenum" are mono-specific (all numbers per tissue are provided in Sup.Table 9).This variability in proportion is related to the other tissues present in the dataset and their common functions.Thus, tissues belonging to a common function tend to be express concomitantly for poly-specific genes.For example, tissues associated to the intestinal system as the duodenum, jejunum, ileum, cecum and colon, tend to express genes concomitantly as do the tissues associated to the brain system (Fig. 4C-bottom).Thus, it should be noted that a gene can be considered as TS despite that no break in its expression pattern is observed.However, the opposite is also possible, a gene may be highly expressed in a tissue without being TS.For example, in the liver, one of the 5 most highly expressed PCG was not identified as TS as well as 5 of the top 15.
To illustrate the interest of gene expression tissue patterns, we examined expression profiles of causal genes associated to Mendelian traits.Out of the 54 Mendelian traits referenced by OMIA 20 , 36 have strong hypotheses regarding the tissue in which the causal gene/variant was likely to affect, according to the trait's name or to the associated literature (Sup.Table 10).Out of these 36 traits, 17 had a causal gene where one of the top two tissues with the highest expression was consistent with the tissue hypothesis.Some examples are shown in Fig. 5A: (i) GNB3, encoding a cone transducing subunit, causal gene of "Retinopathy globe enlarged" 21,22 with a retina-specific expression; (ii) RBP, causal gene of "Riboflavin-binding protein deficiency" associated to  www.nature.com/scientificreports/embryonic death, with a magnum-specific expression which is consistent with the function of riboflavin-binding protein that transports the water-soluble vitamin from the oviduct into the egg white and also from serum into oocyte 23,24 ; (iii) KRT75L4, causal gene of "Frizzle, KRT75L4-related" responsible for a developmental defect of the feather 25,26 with a skin-specific expression.An intriguing case is the "LOC430486" gene (iv) responsible for chicken epilepsy 27,28 and encoding the synaptic vesicle glycoprotein 2A (SV2A) acting in the brain-related tissues.This gene was initially identified in 2011 in the galgal2 assembly (chr25:776,500-777,079-1 exon 29 ) before being removed in subsequent releases because it was no longer predicted.It then reappeared in the galgal5 assembly both in "RefSeq" (LOC101748017) and in "Ensembl" (ENSGALG00000044909) but with a different gene structure and notably on a scaffold (KQ759566.1:4207-4692).Gene models were later harmonized between the two databases in the GRCg6a assembly (LOC101748017/ENSGALG00000050830) and the gene returned to its original position (chr25:1,854,812-1,880,902).It is also present in the GRCg7b assembly (LOC101748017/ ENSGALG00010028753) for which, interestingly, the originally predicted sequence has a unique hit with 100% identity to the gene.Even if it is not tissue specific (τ = 0.81), it is highly expressed in the cortex, brain, hypothalamus and cerebellum like its human ortholog (ENSG00000159164.9;τ = 0.58; Fig. 5B).However, some traits deserve a more in-depth analysis, as illustrated by the blue eggshell.This trait for which the expected "causal" tissue should be uterus (the tissue responsible for eggshell formation) has for causal gene, SLCO1B3 30 which is liver specific (τ = 0.95) like its human ortholog (ENSG00000111700.12;τ = 0.98), tissue where the associated protein transports a wide range of substrates including bile salts.The blue eggshell is due to a variant that leads to an ectopic expression of SLCO1B3 in uterus 31 .

Differential expression between sexes
We also provide a list of 4206 differentially expressed genes (DEG) between sexes.These genes were identified in six tissues for which at least eight birds per sex were available from the same dataset: 2475, 1003, 768, 759, 659, and 233 DEGs were identified for liver, adipose tissue, bone marrow-derived macrophages, bursa of Fabricius, feathers, and the Harderian gland respectively (Sup.Table 11).These genes exhibited sex-biased expression in at least one of the six tissues, and correspond to 816 lncRNAs, 3,276 PCGs (i.e.8.3% and 19.9% of the total lncRNAs and PCGs expressed respectively) and 114 other gene biotypes.Of these, 3384 (80.5%) genes are tissue-specific, i.e. sex-biased in only a single tissue, with similar percentages for lncRNAs (85.9%) and PCGs (79.5%).Most of these tissue-specific sex-biased PCG (75.7%) are expressed in more than three analysed tissues, this percentage is lower for lncRNAs (36.8%) (Fig. 5C).The majority (691/822 genes, 84.1%) of genes showing sex-bias in two tissues or more has consistent fold-change directions between tissues.Of the 4,206 sex-biased genes, we observed an enrichment of Z-linked genes (821 genes, 19.5%) whereas only 5% of the total expressed genes Z-linked.They are characterized by a lower percentage of sex-biased expression in a single tissue (383 genes, 46.6%) compared to total DEG.As shown in Fig. 5D, the incomplete sex chromosome dosage compensation known in chicken was observed with a median of log(fold-change "male/female") reaching 0.76.As for autosomal genes, the majority (419/438 genes, 95.7%) of Z-linked genes with sex biased expression in more than one tissue exhibit consistent effect directions across tissues.

LncRNAs host miRNA genes
Using FEELnc, we classified the 991 chicken miRNAs into positional categories relatively to their closest lncRNA or PCG.We found that 244 (24.6%) and 717 (72.4%) miRNAs are hosted within an intron or an exon of 194 lncRNAs and 627 PCGs respectively.For lncRNAs, 43.8% (107) of miRNAs are within an intron against 51.6% (126) within an exon; for PCGs, 65.4% (469) are within an intron against 32.8% (235) within an exon.Note that 34 lncRNAs and 68 PCGs host more than one miRNA (six at most).Of the 194 lncRNAs, 77 (40%) come from the four resources excluding "RefSeq" and "Ensembl".Focusing on the 179 lncRNAs which are expressed (i.e.TPM ≥ 0.1) in at least one tissue (hosting 228 miRNA), 133 (74.3%) have an expression ≥ 1 TPM, a significantly higher proportion compared to the expected proportion with total lncRNA (74.3% vs. 56.3%,χ 2 = 1.6e-06); the same tendency was found for the 622 expressed PCGs associated to 712 miRNAs (98.2% vs. 87.8%,χ 2 = 2.2e-16).Out of the 179 lncRNAs, 110 (61.5%) are tissue-specific (same proportion as total lncRNA) with 59, 19 and 13 lncRNAs specifically expressed in 1, 2 and 3 tissues, respectively.As expected, this tissue specific rate for lncRNAs is higher than that observed for PCGs for which only 61 genes (9.8%) are tissue specific.Except for MIR155HG, gene names of chicken lncRNAs hosting miRNA(s) are not standardized.We then observed tissue specific cases for miRNA hosted by lncRNA which are conserved in human with consistent tissue patterns between both species.For example, LOC124417505 (ENSGALG00010012701), hosting MIR122-1 within an exon, is identified as liver specific 29 like its human ortholog MIR122HG (Fig. 6A).Similarly, LOC107052837 (ENSGALG00010019651), which hosts within an intron MIR217, is pancreas specific as its human ortholog MIR217HG.In addition, MIR217 is known to play a key role in pancreatic tumors 30 .Other tissue-specific lncR-NAs which host miRNA(s) and newly modeled in this atlas also appear to be orthologous with known human lncRNAs hosting miRNA(s).For instance, NONGGAG008246, considered to be specific to the brain system, contains both gga-mir-219-a and gga-mir-219-b in an intron.Its presumed human ortholog, MIR219A2HG, also contains MIR219A and MIR219B 34,35 , all three specific to the brain system (Fig. 6B).

Classification of the lncRNA with respect to the closest PCG and co-expression
In order to detect biologically meaningful relationships between lncRNAs and PCGs based on the "guilt-byassociation" principle 19 , genes from both biotypes were classified according to their configuration with the closest PCG.Co-expressions between both genes constitutive of all lncRNA:PCG and PCG:PCG pairs were computed across the 47 tissues (Fig. 6C).Out of the 35,257 lncRNAs and 22,468 PCGs considered as expressed, 33,907 (94.4%) and 20,656 (91.9%) are associated to a PCG within a 1 Mb window respectively (see "Methods").No negative and significant co-expressions were identified.Thus, while coexpression can be used to generate hypotheses about the functionality of an lncRNA, the case of data from short-read sequencing must be considered with caution.Indeed, the length of the reads coupled with the low depth locally can sometimes lead to the erroneous modelling of new lncRNA genes (mono-or multiexonic) upstream/5'UTR (untranslated transcribed region) or downstream/3'UTR of the PCG gene of the same strand, due to the inability to join adjacent genes.This phenomenon can lead to erroneous co-expression and is expected to be more intense for downstream/3'UTR that are not well defined for PCG transcript models in our livestock species and can be much longer compared to the upstream/5'UTR.In line with this, we observed that lncRNAs in downstream/3'UTR of a PCG (noted "SS.down" -12.6%) are more co-expressed with it compared to other intergenic configurations, especially lncRNAs in upstream/5'UTR (i.e."SS.up" -5.2%) of a PCG.To illustrate these possible erroneous lncRNA model in downstream/3'UTR of a PCG, some lncRNA:PCG pairs in same strand, coming from different databases were tested by PCR for reliability.Three lncRNA:PCG pairs (LOC121113202/VSIG10L; NONGGAG001811/SARDH; FRAGALG000000006896/PA2G4) were identified in which the lncRNA was in the downstream/3'UTR of the PCG and can be considered as an extension of it.However, three other tested lncRNAs (DAVISGALG000044072/ADBR2 hosted, ENSGALG00010022678/PRPSAP2 in 5'UTR and ENSGALG00010016012/AMOT in 3'UTR) were found to be independent of the associated PCG (Sup.Fig. 3).
Moreover, both lncRNAs and PCGs in "SS.up" and "divergent" configurations with another PCG show higher co-expression values than those in the "convergent" configuration.Excluding pairs in "SS.down" on focusing on intergenic pairs, we observed an enrichment in co-expressed genes ≤ 5 kb compared to those ≥ 5 kb for the "divergent" (11.6% vs. 3.0% for lncRNAs; 29.5% vs. 16.8% for PCGs) and "SS.up" configurations but not for the "convergent" one (1.8%vs 1.6% for lncRNAs; 10.5% vs. 9.5% for PCGs).

Overlap with the previous enriched annotation galgal5 and GRCg6a
This work proposes a genome annotation (.gtf) and a gene annotation (.tsv) built on the GRCg7b assembly, that is considered as the new reference since April 2021 and July 2022 for "RefSeq" and "Ensembl" respectively 10 .This change in assembly and its coexistence with the previous GRCg6a and the alternate one GRCg7w, has led to a significant change in gene identifiers in some databases -particularly for "Ensembl" -which can complicate the transition and lead to uncertainties between studies performed on variable assemblies and annotations.For example, the SLC27A4 well-known protein coding gene is known as LOC417220 in "RefSeq" for galgal5, GRCg6a, GRCg7b and GRCg7w assemblies but in "Ensembl", the associated gene ID is ENSGALG00000004965 for galgal5 and GRCg6a, ENSGALG00010027394 for GRCg7b, and ENSGALG00015027711 for GRCg7w.To enhance the comparison between studies and different genome assemblies, we provide an equivalence table for i) the "Refseq" and "Ensembl" gene identifiers of GRCg7b for genes referenced in both databases, ii) the "Ensembl" gene identifiers of GRCg7b and GRCg7w, iii) the gene identifiers from our previous annotation in galgal5 and GRCg6a to the one in GRCg7b (Sup.Table 12).

Discussion
Our study proposes a solution for enriching the gene atlas of the two "RefSeq" and "Ensembl" chicken reference databases.This involves initially gathering these databases and then, supplementing them with four additional multi-tissue gene model resources, after determining a successive order of addition based on gene model quality criteria.While the use of a unique gene modeling pipeline including all raw sequencing data would be the best solution, our approach offers a good alternative.Indeed, (i) it unifies the two most used genome annotations as the MANE (Matched Annotation from NCBI and EMBL-EBI) project which currently focused on the human 32 (ii) it retains the identifiers of both "RefSeq" and "Ensembl" for common gene loci, (iii) it is faster than a de novo annotation, and is adaptable to major changes in successive versions of the reference databases.Moreover, to facilitate the comparison between studies associated to different genome assemblies and genome annotations, we provided an identifier correspondence between galgal5 and GRCg6a to that of GRCg7b based on our previous gene-enriched model atlases anchored on the "Ensembl" genome annotation (v101 for GRCg6a; v94 for gal-gal5) 13,33 .This atlas increases the completeness of the chicken genome annotation, especially for lncRNAs, which are more difficult to identify than PCGs due to their low tissue-and condition-specific expression 8,34,35 .However, as the vast majority of current gene databases for livestock species are based on short-read data, transcript models are poorly described, regardless of gene biotype, even if this tendency is greater for lncRNAs than for PCGs 10 .As an example, across the six databases used in our study, the maximum median number of transcripts per lncRNA and PCG was one and three, respectively.These numbers are lower than those observed in human, with three and seven transcripts in average per lncRNA and PCG, respectively 10,34 .On the other hand, the overlap rate between transcript TSS and CAGE peaks, which are far from 100%, even for PCGs, underlines incomplete transcript modelling.These models will be clarified with long-read technologies, whose shortcoming today is the ability to obtain sequencing depths comparable to short-read technologies, thus limiting their massive use for studies focusing on gene expression 36 .Surprisingly, whereas the "Davis" database is the only one mainly based on long-read RNA-seq, we can note that this database has a poor overlap rate between TSS and CAGE for PCGs compared to other databases.Moreover, these lncRNAs, mainly mono-exonic, are generally located in the same strand of PCG introns, as for the "Inrae" ones.Indeed, 30.5% and 21.1% of the lncRNAs of "Davis" and "Inrae", respectively are in this case, a higher proportion compared to the other databases which oscillated between 4 and 7% (Sup.Table 13).One interpretation could be that the low sequencing depth makes it difficult to build a full transcript model.Another limitation is that some gene loci can be erroneous, as illustrated in the Vol:.(1234567890 www.nature.com/scientificreports/manuscript, especially for lncRNAs that are on the same strand to a close PCG, and highly co-expressed.These lncRNAs could be in practice an untranslated transcribed region (UTR) of the PCG which are, as lncRNAs, challenging to model and need some complementary analyses 37,38 .Therefore, PCR validation is required to verify the existence of such lncRNAs (i.e. on the same strand to a close PCG) before further analyzing their functions using time-consuming molecular biology studies.Nevertheless, gathering genome annotations from multiple databases gives access to numerous new lncRNAs -precisely 44,428 lncRNAs including all the 5789 and 11,944 loci from "RefSeq and "Ensembl" -since these datasets cover various tissues and conditions.We then provide a gene annotation based on the expression across 47 tissues using 1400 samples from 36 datasets and found 81% of the gene models expressed in at least one tissue.As reported in the literature in crossspecies analysis, lncRNAs are preferentially expressed in sexual tissues such as testis, potentially associated to a pervasive chromatin environment facilitating transcription of putatively non-functional elements enabling the emergence of new genes 39,40 , and in a second time by tissue related to brain 1,[41][42][43] .As expected, we found a higher tissue-specific proportion of lncRNAs compared to PCGs 1,13 .Expression profiles across tissues provide essential information for selecting relevant cell lines to study gene functions using different molecular biology methods 34 .It can also be a first indication of its function, especially for tissue-specific genes, as illustrated by the expression profile analysis of causal genes associated with Mendelian traits.However, it should be noted that tissue specificity is a relative measure, which depends on multiple factors including metric, threshold value or number of tissues.Among these factors, tissue specificity is particularly sensitive to the number and type of tissues.Adding another tissue can greatly vary gene tissue specificity values, especially when just few tissues are considered.Thus, the 40 tissues and 7 cell populations used in our study represent a strong resource.As an example, using a chicken dataset of 21 tissues, we showed in 2020 13 a tissue specificity rate of 25% for lncRNAs vs. 10% for PCGs, against 68% and 23% observed respectively in this study.Tissue specificity also depends on the relationship between analyzed tissues, which explains why some genes are specific to several tissues, often sharing similar functions.
We also provided a list of 4206 genes with a sex-biased expression within six tissues corresponding to 19.8% of the total expressed PCGs, a lower percentage than reported by the human GTEx consortium due to the higher number of analyzed tissues (37% of all genes with 44 tissues 44 ).Interestingly, 80% of sex biased genes are tissue-specific (sex DE observed in a single tissue), suggesting tissue-dependent regulation, even if this percentage is likely over-estimated in our study due to the low number of analyzed tissues (n = 6).This sexbiased tissue specificity does not reflect gene expression patterns across tissues since sex-biased genes tend to have ubiquitous expression across tissues, as previously reported by Oliva et al. 44 .Most of genes with sex biased expression in two or more tissues show consistent effect direction across tissues, especially for Z-linked genes, as previously reported 44 .Some genes reported in previous studies as differentially expressed between sexes in mammals have also been found in chicken: here some examples in liver with genes coding CYP3A4 related to drug metabolism 44,45 , von Willebrand factor C and EGF domains (VWCE alias urg11) predicted to enable calcium ion binding activity 44,46 , polycystin 2 (PKD2), a membrane protein involved in a calcium-permeant cation channel 46 or calcitonin-related polypeptide alpha (CALCA) 47 .
Our findings indicate that most 991 chicken miRNAs are located within a gene, with 75% of them within a PCG and 25% within a lncRNA.These results are in line with those of Liu et al., 2018, who demonstrated that a large fraction of miRNAs in miRBase v21 (1325 out of 1881) are also hosted in a gene 48 and with those of Dhir et al., 2015, who reported, in human, a small fraction of miRNA (17.5%) hosted by a lncRNA 15 .Among the hosted chicken miRNAs, we observed that nearly all of them are embedded in an intron or an exon of its hosting gene.The location of miRNAs according to the nearest gene is an important factor to consider for investigating the transcriptional regulation of primary miRNAs, which is not yet fully understood.Previous studies in human have reported that more than half of miRNAs reside in PCG introns (no study focusing specifically on lncR-NAs) and are thought to be co-expressed with their host genes, deriving from common primary transcripts [49][50][51][52] .This assumption needs to be moderated since Ozsolak et al. 53 reported that a significant fraction of intragenic miRNAs were independently initiated from the PCG transcripts.Additional data would be needed to test the co-expression of miRNA and its host gene.In the absence of the aforementioned data, miRNAs that exhibited conserved genomic localization with their host lncRNA and with similar expression profiles in both human and chicken were analyzed.We then demonstrated that the expression profile of the host lncRNA matched that of the human, providing strong evidence of co-regulation.Notably, three cases of interest were highlighted, including MIR122-1, which is hosted by LOC124417505/ENSGALG00010012701, MIR217 hosted by LOC107052837/ ENSGALG00010019651, and MIR219A and MIR219B hosted by NONGGAG008246, all corresponding to miR-NAs nested within an intron or an exon.Gene names of chicken lncRNAs hosting miRNA(s) are not standardized and should be called MIRxxxHG as MIR155HG, the only lncRNA correctly named, following our work published in 2020 which provided a first functional annotation table of chicken genes related to the chicken genome assemblies, galgal5 and GRCg6a and which identified it as the INRAGALG00000001802 lncRNA 13 .
Analyses of lncRNA:PCG configurations shows that lncRNAs tend to be more genic rather than intergenic.Although, while this observation may vary according to different sources 1,54 it can be explained by (i) the use of unoriented RNA-seq data for the oldest publications, (ii) the consideration of only multi-exonic transcript models by the bioinformatics pipelines to avoid potential false positives corresponding to poorly covered transcripts and, (iii) the drop of short-read RNA-seq cost allowing now to sequence in greater depth and to better consider low expressed transcripts.In our study, we observed an over-evaluation of intragenic lncRNAs, which may be explained by the use of a long-read sequencing database, limited in depth.Focusing on intergenic genes and as shown in the literature 1,54 , an enrichment in "same-strand" is observed.LncRNAs involved in such configurations should be considered with caution, since, as illustrated in the manuscript, some of them are part of a not well-modeled PCGs.Indeed, a lot of PCG isoforms are still poorly annotated, especially for non-model species.For example, as shown by Lagarrigue et al. 34 , for a stable number of gene models, the number of PCG transcripts oscillates between 28,000 and 50,000 for farm species while it exceeds 100,000 for mouse and 150,000 www.nature.com/scientificreports/for human.A very high co-expression value across tissues (or intra tissue according to the study) and a low distance between gene models can be considered as a distrust indicator.As an example, Muret et al. showed with a PCR validation that the FLRL7 lncRNA in "same-strand down" of FADS2 in the mouse constituted in reality a single gene model 5 .However, if some lncRNA:PCG pairs in "same-strand" must be considered with precaution, a considerable part of the constitutive lncRNAs seems to exist independently.Consequently, as well as for the "divergent" or genic lncRNA:PCG co-expressed pairs, it is possible to propose hypotheses concerning the lncRNA function applying the "guilt-by-association" principle 19 .Indeed, a significant expression correlation and a short distance between two gene models can supposed a common regulation or even an implication of the lncRNA in the regulation of the PCG [55][56][57][58][59] .The co-expression of lncRNA:PCG pairs in "divergent" configuration could be related to a bidirectional-promoters which could activate the expression of the PCG through an alteration of the promoter regions by the lncRNA (named pancRNA for promoter-associated non-coding RNA) 55,60,61 .For example, Hamazaki et al. showed that the lncRNA pancIl17d, in "divergent" configuration with the PCG Il17d is crucial for pre-implantation development of mouse through an upregulation 62 .This pancRNA expression leads to a DNA demethylation and an upregulation to its associated PCG.Interestingly, across all the lncRNA:PCG and PCG:PCG configurations, no significant negative correlation was identified.Indeed, as observed in other species such as human 1 , dog 63 , and even in plants 35 , only a tiny fraction of lncRNA:PCG pairs showed a significant negative co-expression.Even if some cases of silencing are well-known, this suggest that lncRNAs tend to act as positive regulators or cofactors improving the expression of near genes through various mechanisms 3 .Finally, considering all configurations, lncRNA:PCG pairs have lower co-expressed pairs across tissues compared to PCG:PCG.This observation highlights the tissue (and condition) specificity of lncRNAs compared to the ubiquity of PCGs 1,64,65 .Thus, in order to establish robust hypotheses about the association of function between a lncRNA and a nearby PCG, it is essential to consider the co-expression within the tissue(s) of interest and for a unique condition.The combined use of the configuration of lncRNA:PCG pairs and their co-expression can help to orient the hypotheses and the biological experiments to set up in order to better understand the regulatory functions of lncRNAs.
In conclusion, if your research field is focused on gene expression analysis in chicken and you use this enriched atlas, 24,102 PCGs and 44,428 lncRNAs containing all gene loci from "RefSeq and "Ensembl" instead of only 18,022 and 17,007 PCGs and 5789 and 11,944 lncRNAs for "RefSeq and "Ensembl" respectively.Among them, note that 19,819 PCGs and 20,252 lncRNAs have an expression ≥ 1 TPM in at least one tissue, ensuring an easy handling for further investigation by molecular biology methods to gain insight into their function.For all these genes, we also provide a table containing different genomic and functional information/feature (Sup.Table 1 to 5) soon available through a web interface.The atlas and related information will be valuable for researchers working on gene expression (PCGs and/or lncRNAs), such as those interested in unraveling the molecular mechanisms linking non-coding variants and relevant phenotypes.

Gene-enriched atlas construction
Origin of the six genome annotations Gene models used to build the enriched genome annotation come from 6 genome annotations, all based on multitissue resources: (i) both reference genome annotations according to the GRCg7b assembly: "RefSeq" v106 67 and "Ensembl" v107 68 , this latter has integrated the GENESWitCH project data; (ii) both gene model datasets from FAANG pilot projects 69,70 according to the GRCg6a assembly (GCF_.5): the FR-AgENCODE project 71 involving 11 tissues represented by 2 males and 2 females by tissue and the FarmENCODE proect including 15 tissues with 1 male and 1 female; (iii) and two other datasets including gene models from the previous atlas as presented in Jehl et al. 13 produced according to the galgal5 assembly (GCF_000002315.4)and NONCODE v6.0 72 including only non-coding gene models from the literature and other public databases according to the galgal4 assembly (GCF_000002315.2).Contrary to all projects which used short-read sequencing, FarmEN-CODE includes samples sequenced with Oxford Nanopore long read Technology as presented in Guan et al. 73 .For genome annotations produced on a previous assembly, a remapping to GRCg7b was performed using the NCBI genome remapping service 74 .
Prioritization criteria CAGE data used to prioritize the different gene models come from the FANTOM5 project 75 .Peaks coordinates considered as robust according to the project were converted from galgal5 (GCF_000002315.4) to GRCg7b using the NCBI genome remapping service 74 .The transcript is then considered to be well modelled in 5' if its TSS (Transcription Start Site) overlap a peak within + /-30 bp.All genome annotations previously presented were added sequentially considering gene model quality characteristics, the concordance with CAGE peaks, and the popularity of each databases as presented in the "Results" part, namely: 1-"RefSeq"; 2-"Ensembl"; 3-"FrAg"; 4-"Davis"; 5-"Inrae"; 6-"Noncode" (Fig. 2A).

Rules of aggregation
Two gene models were considered overlapping if at least one of their transcripts had at least one of their exons with a base pair (1 bp) in common and on the same strand (Fig. 2B).Overlapping detection was performed using the "intersect" function (parameters -wo -s) of the BEDTool v2.25.0 toolset 76 .To iprove the successive addition of the different gene models, a decomposition by biotype class was used (See Sup.Table 14).This approach limited the overlap of similar gen patterns, but with different or unassigned biotypes, and was more sensitive to genes hosting other genes, such as miRNA-hosting PCGs for example.

Biological sample used for gene expression
36 datasets including a total of 1400 samples were used to represent the 47 tissues composing the atlas.As these datasets are publicly available (on SRA and/or ENA), the project numbers and the number of samples are available in the Sup.Table 8.

Gene expression quantification and expression criteria
FASTQ files were mapped on the GRCg7b reference genome (GCF_016699485.2) and expression quantification according to the enriched .gtfannotation file was performed by projects and using the "rnaseq" v3.8.1 pipeline (-aligner star-rsem) from nf-core 77,78 providing raw counts and TPM (Transcripts Per Million) normalized counts.For each tissue in each project, a median of TPM normalized expressions across samples was calculated.For tissues present in several projects, the median was calculated using the TPM medians previously calculated in each project.
A gene was considered as expressed if its median expression (see previous §) was ≥ 0.1 TPM in at least one tissue and if at least 50% of samples of a tissue for a given project have a reads number ≥ 6 and the normalized TPM and TMM (Trimmed Mean of M values) expression ≥ 0.1.TMM normalized expression was obtained from the raw counts by the trimmed mean of M-values scaling factor method 79 using the R package edgeR (v3.32.1) 80 with the "calcNormFactors" function (to scale the raw library sizes) and "rpkm" function (to scale the gene model size).Finally, genes were classified into three expression categories: genes with expression (i) < 0.1 TPM in all tissues (ii) ∈ [0.1, 1[TPM in at least one tissue iii) ≥ 1 TPM in at least one tissue.

PCA and clustering
PCA was performed with the "PCA" function (scale.unit= T) of the FactoMineR (v2.7) 81 package and considering the log 2 (TPM + 1) expression of the expressed genes.The dendrogram was based on the distance matrix computed with the (1-Pearson correlation) of the log 2 (TPM + 1) expression of the expressed genes and the hierarchical cluster analysis was done using the "ward.D" agglomeration method and the "hclust" function.

Tissue-specificity analysis
Tissue-specificity was assessed with the log 10 median expression of tissues.The tau (τ) metric was used 82 , providing a score between 0 (gene expressed at the same level in all tissues) and 1 (gene expressed in exactly one tissue) and following the formula: with x t the log 10 median expression of the gene of interest in the tissue t and among the T tissues.
A gene was considered as tissue specific (TS) for a τ ≥ 0.90 and in some analyses (related to Fig. 4) with a filter on the expression (≥ 1 TPM in at least one tissue).Genes considered as tissue-specific (τ ≥ 0.90) were split into three categories based on the expression profile and whether or not a gap -define as a difference in expression by a factor of 2, i.e.FC ≥ 2 -was observed between tissues expressions when they were ordered in descending order.The three categories of tissue specific expression were defined as follows: genes specifically expressed in (i) a unique tissue (mono_TS), (ii) a group of 2 to 7 tissues (included) (poly2to7_TS) or (iii) a group of 8 or more tissues (poly8to47_TS).

GTEx data analysis
The median gene-level TPM for 53 tissues from RNA-seq data of GTEx Analysis V8 was used (https:// gtexp ortal.org/ home).The list of the 53 tissues, their abbreviations and color codes used are available in Sup.Table 16.

OMIA gene lists
Genes related to a known Mendelian trait or disorder were obtained from the OMIA (Online Mendelian Inheritance in Animals) catalog 20 .A manual reassignment was performed for C1H12ORF23, GC1, KIAA0586, LOC430486 genes that were updated in the GRCg7b assembly (Sup.Table 10).

Figure 1 .
Figure 1.Characteristics of the gene-enriched annotation and its component sources.(A) % of overlapping PCGs (blue) and lncRNAs (red) having at least 1 bp in common for exons on the same strand, between the databases.% in upper triangle refer to x-axis.The number of loci per database is indicated in line.(B) Number of PCGs and lncRNAs and number of transcripts per gene by databases.Diamonds indicate the average value.mono.: monoexonic.(C) % of PCG and lncRNA TSSs overlapping a CAGE peak within + /-30 bp.(D) Proportion of gene biotypes in the "RefSeq" and "Ensembl" reference databases and in the enriched genome annotation.(E) Correlation between lncRNA density and PCG density across the chicken macro-, meso-, micro-and sexual chromosomes.

Figure 2 .
Figure 2. Gene-enriched annotation construction.(A) Origin and order of the successive addition of the 6 genome annotations used to build the gene-enriched annotation.TSS: transcription start site of the transcript models, ovlp.: overlap.(B) Aggregation rules applied each time a new genome annotation is added with respect to the pre-existing gene models.

Figure 3 .
Figure 3. Gene expression across 47 chicken tissues.(A) Illustration of the 47 tissues used for gene expression.Numbers in parentheses correspond to the number of samples and the number of constitutive datasets.Corresponding colours are indicated in the adjacent circles.Full tissue names are available in Sup.Table 15 (B) Top: Numbers of PCGs (blue) and lncRNAs (red) considered as expressed applying a normalized expression threshold of 0.1 TPM and TMM.Bottom: % of expressed genes according to the constitutive sources of the enriched annotation.(C) Principal component analysis based on the gene expression of expressed PCGs (left) and lncRNAs (right).(D) Hierarchical clustering of the expressed genes for the 47 tissues and performed using "1-Pearson correlation" distance and "ward" aggregation criteria.

Figure 4 .
Figure 4. Tissue specificity across 47 chicken tissues.(A) Distribution of τ values for lncRNAs (red) and PCGs (blue) with an expression ≥ 1 TPM for chicken (left) and human (right).The red dotted line indicates the first local maximum associated to ubiquitous genes.(B) Distribution of lncRNAs (red) and PCGs (blue) with an expression ≥ 1 TPM according to the number of tissues for which the gene is considered as tissue-specific.(C) Number of mono_TS (light brown), poly2to7_TS (pink-brown) and poly8to47_TS (dark brown), tissue-specific (TS) genes per tissue (top) and clustered heatmap based on pairwise association for tissue-specific genes for a group of several tissues (2 to 7) (bottom).Full tissue names are available in Sup.Table 15.

Figure 5 .
Figure 5. Illustrative cases of gene expression interest for functional analyses.(A) Expression profiles in TPM of 3 tissue-specific genes associated with a Mendelian trait in chicken: GNB3 retina-specific (top), RBP magnumspecific (middle), and KRT75L4 (bottom) feather-specific (bottom right).Both "RefSeq" and "Ensembl" gene identifiers are provided.(*) indicates that the gene identifier equivalence is not provided by BioMart but was found by overlap between the two reference genome annotations.(B) Expression profile of SV2A in TPM in chicken (top) and human (bottom).Full tissue names for chicken are available in Sup.Table 15.The 53 human GTEx tissues are ordered, abbreviated and coloured as indicated in the Sup.Table 16.(C) Left: Number of differentially expressed genes (DEG) shared between the 6 tissues.Right: Number of genes identified as DEG in at least one tissue and considered as expressed across the 6 tissues.(D) log2(Fold Change) of differentially expressed genes (DEGs) between sexes for 6 tissues and excluding the "Z" chromosome.

Figure 6 .
Figure 6.Genomic configuration and co-expression using the extended annotation.(A,B) Conservation of the genomic configuration (left) and expression profile in TPM (right), between the 47 chicken tissues (top) and the 53 human GTEx tissues (bottom).Mir expression is shown in the yellow rectangle.(A) MIR122HG gene, host of mir122 identified in human, has an equivalent locus in the chicken reference databases but is unnamed.(B) MIR219A2HG gene, host of mir219a2 and mir219b identified in human, has an unnamed equivalent locus in the extended chicken annotation but not in the reference databases.(*) indicates the old gene identifier for the human "RefSeq" database which is no longer used, the gene model being removed.(C) Classification of lncRNAs (top) and PCGs (bottom) according to their closest PCG and co-expression.SS. up: Same strand up, SS. down: same strand down. https://doi.org/10.1038/s41598-024-56705-y