Not all birds have a single dominantly expressed MHC-I gene: Transcription suggests that siskins have many highly expressed MHC-I genes

Passerine birds belong to the most species rich bird order and are found in a wide range of habitats. The extremely polymorphic adaptive immune system of passerines, identified through their major histocompatibility complex class I genes (MHC-I), may explain some of this extreme radiation. Recent work has shown that passerines have higher numbers of MHC-I gene copies than other birds, but little is currently known about expression and function of these gene copies. Non-passerine birds have a single highly expressed MHC-I gene copy, a pattern that seems unlikely in passerines. We used high-throughput sequencing to study MHC-I alleles in siskins (Spinus spinus) and determined gene expression, phylogenetic relationships and sequence divergence. We verified between six and 16 MHC-I alleles per individual and 97% of these were expressed. Strikingly, up to five alleles per individual had high expression. Out of 88 alleles 18 were putatively non-classical with low sequence divergence and expression, and found in a single phylogenetic cluster. The remaining 70 alleles were classical, with high sequence divergence and variable degrees of expression. Our results contradict the suggestion that birds only have a single dominantly expressed MHC-I gene by demonstrating several highly expressed MHC-I gene copies in a passerine.

: Comparison of the three primer combinations used to amplify MHC-I in siskins. X indicates an allele that have been amplified in gDNA. The expression has been determined based on the relative read depth of each allele within individual and H indicates an allele that have been identified as highly expressed, L alleles that have been identified as having low expression and H* indicates allele that have been determine to have high expression after corrections.   X  L  L  X  X  L  L  X  X  L  L  Spsp-UA*02  X  X  L  L  0  0  0  0  X  X  L  0  Spsp-UA*03  X  X  H  H  X  X  H  H  0  0  0  0  Spsp-UA*04  X  0  0  L  X  0  0  L  X  0  0  L  Spsp-UA*05  0  0  0  0  0  0  0  0  0  0  0  0  Spsp-UA*06  0  0  0  0  0  0  0  0  0  0  0      Supplementary Table 2: There was a discrepancy between the number of non-classical (N=13) and classical (N=54) MHC-I exon 3 siskin alleles and in order determine how this affected how many sites that were indicated as positively selected by the CODEML analysis we also run a random subset of 13 classical alleles which was repeated 10 times, each time on a new subset of 13 randomly picked classical alleles. When the analysis was run on all classical alleles six sites were identified as positively selected (position 1, 4, 6, 41, 45 and 53 of the alignment in Figure 5) whereas between two and five sites were identified in the repeated runs on 13 randomly selected classical alleles. The positions of the sites in the repeated runs completely corresponds to those in the identified when the analysis was run on all classical alleles.

Run 1 4
Supplementary    Spsp-UA*60  . . . . . . . . . . . . . . . . . S . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . -. . . . W . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fig. 2: In order to estimate the repeatability and reliability of the Illumina sequencing of MHC-I exon 3 in siskins the duplicates were compared. Four duplicates were included for the gDNA samples amplified with primer combinations 1 and 2, respectively, three duplicates were included for the cDNA samples amplified with primer combinations 2 and 3, respectively. For primer combination 1 two of the duplicates failed and no sequences were retrieved, for primer combination 3 one duplicate failed. For all duplicates, independent of primer combination and DNA type (gDNA/cDNA), the same alleles were identified and the relative read depth per allele was also highly similar when comparing the duplicates.      Fig. 4: Primer combination 2 was used to amplify both gDNA and cDNA alleles, and for the alleles that were amplified with this primer combination (the 68 expressed siskin MHC-I alleles that are illustrated in this figure) the degree of expression could be determined by comparing the relative read depths per allele in gDNA and cDNA. This adjusted relative read depth was calculated as relative read depth in gDNA minus relative read depth in cDNA. A low expression allele (purple) should have an adjusted read depth below zero and a high expression allele (green) should have an adjusted read depth higher than zero. The degree of expression could not be determined for four alleles (white, see Methods section for further details). In the full data set (not shown in this figure) 80 out of the 84 expressed siskin MHC-I alleles were divided into the high or low expression groups depending on their relative read depths in cDNA. The division of alleles into these groups were determined, when possible, by combining information from both primer combination 2 and 3 (primers used when amplifying cDNA). Fifty-nine of the 80 expressed alleles were amplified with both primer combination 2 and 3, six alleles were only amplified with primer combination 2 and 16 alleles were only amplified with primer combinations 3. All alleles that had an adjusted relative read depth below zero (indicated in purple in this figure) were also determined as having low expression when using the full data set and information from primer combinations 2 and 3. Likewise all alleles that had an adjusted relative read depth above zero (indicated in green in this figure) were also determined as having high expression when using the full data set and information from primer combinations 2 and 3. Hence, the two methods used to determine degree of expression level agreed to 100%.

Non-classical Spsp-UA*01 R G L H T W L R V S S C E L L S D G S V R G Y Y R V G Y D G R D F I S F D L G S E R F V P A D S A A E I T R R R W E K E E -V A E G F T N Y L K H E C P E W L R K Y V G Y R
Adjusted relative read depth (gDNA−cDNA)

Illumina MiSeq sequencing of exon 3
We tested five different primer combinations that amplify MHC-I exon 3 alleles and out of these three primer combinations showed good results and were used for the Illumina sequencing ( Supplementary Fig. 4). The binding site of HNalla is partly located in introns between exon 2 and exon 3 and hence it was only used on gDNA whereas Spsps_fwd3 is located in the end of exon 2 and because of this it was only used on cDNA. To each of these primers the Illumina-specific overhang was added before ordering the primers and then used to amplify MHC-I exon 3. Each 25 µl PCR reaction contained either 25 ng gDNA or 25 ng cDNA, 12.5 µl 2X Phusion High-Fidelity PCR Master Mix (ThermoFisher Scientific), 0.5 µM of each primer. The cycling conditions for all three primer combinations were set to 25 cycles at 98°C (10s), 66°C (10s), 72°C (10s) followed by 72°C for ten minutes.
The PCR products were cleaned with Agencourt AMPure XP-PCR Purification kit (Beckman Coulter), following the manufacturer's protocol with slight modifications. A ratio of 1:0.8 between PCR product and beads was used, 80% EtOH was used for cleaning the beads and the elution was done with 43 µl double distillated water and incubation was done in room temperature for 2 minutes. The cleaned PCR products were then checked on a 2% agarose gel in order to determine that fragments of the correct length had been amplified, and the

Filtering of Illumina Miseq data
The primers were removed from all reads with cutadapt v 1.14 2 . In order to make sure that the primers had been properly removed from all reads we searched for the primer sequences and found that it still remained in some of the reads and these reads were also longer than expected. In order to remove the primer sequences and the extra bases all reads where cut to the expected fragment length for each primer combination (primer combination 1: 215bp, primer combination 2:196, primer combination 3: 229). Then we searched for the primer sequences again which were not found. To determine the quality of the sample and evaluate the samples we used FastQC v 0.11.7 3 and MultiQC v 1.4 4 . Overall, all the majority of the raw reads had good quality with an average phred score above 30, although primer combination 3 had slightly lower quality. To further process the raw reads we used DADA2, a program that can remove reads that have too low quality, by clustering the reads (e.g. reassign errors to the parental sequences it arose from) and merging the forward and reverse read 5 . The first step was to trim away bases at the end of the reads that had low quality, i.e a phred score bellow 30, based on the MultiQC output as well as the quality plots within the DADA2 pipeline. For the forward reads Primer combination 1 and 2 had high enough quality and hence did not need any further trimming, but for primer combination 3 the reads were cut to 200 bp. The reverse reads were cut to 150 bp (primer combination 1), 131 bp (primer combination 2, both gDNA and cDNA), 120 bp (primer combination 3) respectively. Next step was to identify and remove any sequences that still remained in the data set which had an overall too low quality. For this DADA2 uses maximum number of expected errors instead of average quality score. The expected errors are calculated as the mean number of errors that would be observed in a large pool of sequences given that the error rate at each position is based on the quality score and that the errors at different positions are independent of each other 6 . Using maximum number of expected errors is an alternative to using average quality score in determining if a given read have high enough quality or if it should be removed from further analysis. The maximum number of error was here set to 2 in the forward read and 3 in the reverse read. To check that these were good thresholds the quality was checked with FastQC and MultiQC (most sequences had a quality of 30 or above). DADA2 then preforms a clustering step, in order to re-assign errors to the parental sequences that they arose from.
Briefly describe, DADA2 starts with the most abundant sequences and then cluster putative errors around this sequence. Error rates are estimated for the data and used, along with the abundance of the sequences, to calculate the probability that the putative error have arisen from the parental sequences. Based on this the sequences are either merged or the putative error is defined as real sequences and moved to a new cluster. This is repeated until no sequences are left in the data set that have a p-value that corresponds to having high probability of being a real sequence. After the clustering the forward and the reverse reads were merged. Finally, chimeric sequences were identified and removed from the data set.
Since DADA2 was not originally designed to filter MHC data, we double checked the chimera results by running the program once with and once without the chimera checking step and then comparing the results. Most sequences identified as chimeras had low frequencies and since chimeric sequences should occur at lower frequency these were discarded. But four sequences in the gDNA data (three with primer combination 1 and one with primer combination 2) had high frequencies and these were checked manually. These sequences could not be explained as chimeras (they had unique bases) and where thus added to the final data set.
After this, the data was further filtered, still separate for each primer combination and DNA type. For gDNA, first any sequences with different lengths than expected, and not varying by three bases, were deleted. For primer combination 2, there seemed to be a general error where an extra base was kept at the start of the sequence, this happened in 15% of the sequences, this extra base pair was removed and then the sequences were kept. Next, a minimum per amplicon frequency threshold was determined and any sequences having a frequency below this threshold was deleted form the data set. Since errors should occur in low frequencies this threshold was determined by identifying a clear jump in frequency. The threshold was set to 1% for primer combination 1 and 3% for primer combination 2. To check that this was a good threshold, the replicates were compared and all matched 100%. All sequences remining after this were considered true alleles. For the cDNA, the filtering was done in a different way since here we wanted to keep alleles that had a low per amplicon frequency. First sequences with the wrong length were removed (again primer combination 2 had sequences with one extra base pair in the beginning). The cDNA data was then compared to the gDNA data (since all expressed alleles should also be found in the gDNA data) and all alleles that had previously been found within an individual in the gDNA were considered to be real expressed alleles. The next steps were to determine if the sequences that had not been amplified in the gDNA should be considered real or artefacts. A threshold was set based on the average read depth for each primer combination and corresponded to that each allele should have at least 100 reads (0.25% for primer combination 2 and 0.3% for primer combination 3). All sequences remaining after these filtering steps were considered real expressed alleles.