Long non-coding and coding RNA profiling using strand-specific RNA-seq in human hypertrophic cardiomyopathy

Hypertrophic cardiomyopathy (HCM) represents one of the most common heritable heart diseases. However, the signalling pathways and regulatory networks underlying the pathogenesis of HCM remain largely unknown. Here, we present a strand-specific RNA-seq dataset for both coding and lncRNA profiling in myocardial tissues from 28 HCM patients and 9 healthy donors. This dataset constitutes a valuable resource for the community to examine the dysregulated coding and lncRNA genes in HCM versus normal conditions.

First-strand cDNA synthesis was performed using M-MuLV reverse transcriptase and random hexamer primers. Second-strand cDNA was synthesized using RNase H and DNA Polymerase I, with dTTP replaced by dUTP in the reaction buffer. Following end repair and adenylation, cDNA fragments were ligated to adaptors. Then, 3 μl USER Enzyme was incubated with the cDNA for 15 min at 37 °C followed by 5 min at 95 °C before PCR. Following PCR amplification, products were purified using the AMPure XP system. Finally, library quality was assessed on the Agilent Bioanalyzer 2100 system. The resulting libraries were sequenced on the Illumina HiSeq X Ten System in 2 × 150 bp paired-end mode.
Read alignment and transcript assembly. Figure 1b shows the bioinformatic analysis workflow. The raw sequencing reads 10 were subjected to adapter trimming and base quality filtering with fastp v0.7.0 11 . The resulting clean reads were aligned to the human reference genome (GRCh37) using hisat2 v2.1.0 12 under default settings. Following alignment, the quality of each RNA-seq dataset was assessed through a variety of metrics generated by QoRTs 13 . De novo transcript assembly for each sample was performed using StringTie v1.3.4b 14 under default settings, guided by a reference annotation (GENCODE GRCh37 release 27, -G option). The assembled transcripts of all samples were merged into a single file using the merge function of StringTie with the reference annotation provided (-G option). Other parameters were left at their defaults (-m 50 -T 1 -f 0.01 -g 250).
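The trimming, alignment and assembly steps above could be chained roughly as in the following sketch, which only builds the command lines and does not run them. The file names, the output directory layout and the intermediate `samtools sort` step are assumptions made for illustration (StringTie expects coordinate-sorted alignments, but the text does not say how sorting was done). Since the text reports default settings, no strandedness flags are shown.

```python
def build_sample_commands(sample, fq1, fq2, hisat2_index, ref_gtf, outdir="work"):
    """Return per-sample command lines as argument lists (nothing is executed).

    All paths are hypothetical placeholders, not the authors' file names.
    """
    clean1 = f"{outdir}/{sample}_1.clean.fq.gz"
    clean2 = f"{outdir}/{sample}_2.clean.fq.gz"
    sam = f"{outdir}/{sample}.sam"
    bam = f"{outdir}/{sample}.sorted.bam"
    gtf = f"{outdir}/{sample}.gtf"
    return [
        # Adapter trimming and base quality filtering of the paired reads.
        ["fastp", "-i", fq1, "-I", fq2, "-o", clean1, "-O", clean2],
        # Alignment of clean reads to the reference genome index.
        ["hisat2", "-x", hisat2_index, "-1", clean1, "-2", clean2, "-S", sam],
        # Assumed step: coordinate-sort the alignments for StringTie.
        ["samtools", "sort", "-o", bam, sam],
        # Reference-guided assembly (-G) of this sample's transcripts.
        ["stringtie", bam, "-G", ref_gtf, "-o", gtf],
    ]


def build_merge_command(gtf_list_file, ref_gtf, merged_gtf):
    """Command line for merging per-sample assemblies with the stated defaults."""
    return [
        "stringtie", "--merge", "-G", ref_gtf,
        "-m", "50", "-T", "1", "-f", "0.01", "-g", "250",
        "-o", merged_gtf, gtf_list_file,
    ]


if __name__ == "__main__":
    for cmd in build_sample_commands("S1", "S1_1.fq.gz", "S1_2.fq.gz",
                                     "grch37_index", "gencode.v27lift37.gtf"):
        print(" ".join(cmd))
    print(" ".join(build_merge_command("gtf_list.txt", "gencode.v27lift37.gtf",
                                       "stringtie_merged.gtf")))
```

Each list could be passed to `subprocess.run(cmd, check=True)` in order; building the commands separately makes them easy to log or inspect before anything is executed.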
Novel lncRNA gene prediction. Transcripts in the StringTie merge output that did not match any known transcript were predicted to derive from novel lncRNA genes if they met the following criteria: (1) the assembled transcript must have definite strand information; (2) the transcript must have more than one exon; (3) the transcript must be at least 200 bp in length; and (4) the coding potential of the transcript, as predicted by CPC2 15 , must be labelled "noncoding" in the output. We ultimately identified 205 novel lncRNA genes (ALL_GENE_EXPR_DEG_ANALYSIS.xlsx) 16 .
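The four criteria above amount to a simple conjunction of filters, which might be sketched as follows. The `Transcript` record and its field names are hypothetical, and transcript length is taken here as the summed exon length, which is an assumption about how the 200 bp cutoff was applied.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical record for one assembled transcript without a known match."""
    transcript_id: str
    strand: str          # "+", "-", or "." when the strand is undetermined
    exon_lengths: list   # exon lengths in bp
    cpc2_label: str      # CPC2 output label: "coding" or "noncoding"


def is_novel_lncrna_candidate(tx, min_length=200):
    """Apply criteria (1)-(4): stranded, multi-exonic, >= 200 bp, noncoding."""
    has_strand = tx.strand in ("+", "-")
    is_multi_exonic = len(tx.exon_lengths) > 1
    # Assumption: length means the spliced length, i.e. the sum of exon lengths.
    is_long_enough = sum(tx.exon_lengths) >= min_length
    is_noncoding = tx.cpc2_label == "noncoding"
    return has_strand and is_multi_exonic and is_long_enough and is_noncoding


if __name__ == "__main__":
    candidates = [
        Transcript("tx1", "+", [150, 120], "noncoding"),  # passes all criteria
        Transcript("tx2", ".", [150, 120], "noncoding"),  # no strand information
        Transcript("tx3", "-", [400], "noncoding"),       # single exon
        Transcript("tx4", "-", [300, 300], "coding"),     # predicted coding
    ]
    kept = [tx.transcript_id for tx in candidates if is_novel_lncrna_candidate(tx)]
    print(kept)
```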
Expression abundance quantification. All coding genes and lncRNA genes, including predicted novel lncRNA, lincRNA, sense intronic lncRNA, sense overlapping lncRNA and antisense lncRNA genes, were incorporated in expression abundance quantification (stringtie_merged.strand.lncRNA.proteincoding.gtf) 16 . First, the transcript sequences (stringtie_merged.strand.lncRNA.proteincoding.fa) 16 were extracted from the reference genome using gffread (https://github.com/gpertea/gffread). Then, the expression of the transcripts was quantified with kallisto v0.43.1 17 under default settings. For comparison among samples, transcript abundance for each sample was normalized to Transcripts Per Million (TPM) 18 . The expression of each gene was determined by summing the expression of all of its transcript isoforms. Along with transcript abundance estimates, 100 bootstraps per sample were generated (kallisto quant -b 100), which serve as proxies for technical replicates. Figure 2a,b show the expression profiles of coding genes and lncRNA genes in each sample, respectively. Hierarchical clustering of these expression profiles revealed distinct expression landscapes between the normal and HCM groups for both coding and lncRNA genes. However, samples from each of the three HCM groups did not cluster together, indicating that there may be no significant transcriptomic difference among HCM patients with different genetic backgrounds, at least at the time of sampling.
Differential expression analysis. Following quantification, differentially expressed genes (DEGs) between HCM and normal samples were identified using sleuth v0.29.0 19 , which leverages the kallisto bootstraps to account for technical variation. The biological significance threshold was set to a fold change of at least 2 in either direction, and the statistical significance threshold was set to a q-value of 0.05 (−log10 q-value > 1.3). Only genes that met both the biological and the statistical threshold were considered DEGs. We identified 132 up-regulated and 241 down-regulated coding genes in HCM versus normal samples (Fig. 2c). We also found 67 up-regulated and 83 down-regulated lncRNA genes in HCM versus normal samples (Fig. 2d). For each sample, we provide the expression abundance of each gene, together with the test statistics and DEG calls (ALL_GENE_EXPR_DEG_ANALYSIS.xlsx) 16 .
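To make the quantification and filtering logic concrete, the sketch below shows the standard TPM definition, gene-level aggregation by summing isoform TPMs, and the dual-threshold DEG call. It is illustrative only: kallisto reports TPM directly, the function and variable names are hypothetical, and the fold change is assumed to be expressed on a log2 scale (a two-fold change corresponds to |log2 FC| ≥ 1), which the text does not specify.

```python
from collections import defaultdict


def tpm(est_counts, eff_lengths):
    """TPM_i = (c_i / l_i) / sum_j (c_j / l_j) * 1e6 for transcripts i."""
    rates = {t: est_counts[t] / eff_lengths[t] for t in est_counts}
    total = sum(rates.values())
    return {t: rate / total * 1e6 for t, rate in rates.items()}


def gene_level_tpm(transcript_tpm, transcript_to_gene):
    """Sum the TPM of all isoforms belonging to the same gene."""
    genes = defaultdict(float)
    for transcript, value in transcript_tpm.items():
        genes[transcript_to_gene[transcript]] += value
    return dict(genes)


def is_deg(log2_fold_change, q_value, min_abs_log2_fc=1.0, max_q=0.05):
    """A gene is called a DEG only if it passes both thresholds.

    Assumption: a fold change of exactly 2 counts as passing.
    """
    return abs(log2_fold_change) >= min_abs_log2_fc and q_value < max_q


if __name__ == "__main__":
    # Hypothetical counts and effective lengths for three transcripts.
    counts = {"tx1": 100.0, "tx2": 300.0, "tx3": 600.0}
    lengths = {"tx1": 1000.0, "tx2": 1000.0, "tx3": 2000.0}
    tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
    print(gene_level_tpm(tpm(counts, lengths), tx2gene))
    print(is_deg(1.4, 0.01), is_deg(-1.4, 0.2), is_deg(0.6, 0.001))
```

Because TPM values sum to one million within each sample, gene-level values obtained by summing isoforms remain comparable across samples.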

Data Records
The sequencing data in the fastq format have been deposited in NCBI Sequence Read Archive (SRA) 10 . The transcript abundance file for each sample has been deposited in Gene Expression Omnibus (GEO) 18 . Other processed files were uploaded to figshare 16 .

Technical Validation
After quality control, the number of sequenced bases exceeded 11 Gb in all samples, and the Q20 (the percentage of bases with a Phred-scaled quality score ≥20) exceeded 97% in all samples (Q30 over 93%), indicating that the base quality was sufficiently high for downstream analyses (Table 1). When the clean reads were aligned to the human reference genome, the overall alignment rate was high (over 97%) in all samples, suggesting little contamination from microorganisms (Table 1).
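As a reminder of what the Q20 and Q30 figures measure, the following minimal sketch computes the fraction of bases at or above a Phred threshold from FASTQ quality strings. It assumes the Phred+33 encoding and uses made-up quality strings; in practice these metrics come from the quality-control report rather than from custom code.

```python
def fraction_at_or_above(quality_strings, threshold, offset=33):
    """Fraction of bases whose Phred score is >= threshold.

    Assumption: qualities are Phred+33 encoded ASCII characters.
    """
    total = 0
    passed = 0
    for quality in quality_strings:
        for char in quality:
            total += 1
            if ord(char) - offset >= threshold:
                passed += 1
    if total == 0:
        raise ValueError("no bases supplied")
    return passed / total


if __name__ == "__main__":
    # Hypothetical reads: 'I' encodes Q40, '5' encodes Q20, '+' encodes Q10.
    qualities = ["IIII5+", "II"]
    print(f"Q20: {fraction_at_or_above(qualities, 20):.3f}")
    print(f"Q30: {fraction_at_or_above(qualities, 30):.3f}")
```

A Phred score Q corresponds to an error probability of 10^(−Q/10), so Q20 and Q30 mark bases with at most a 1% and 0.1% chance of being miscalled, respectively.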
Taking advantage of QoRTs 13 , a toolkit for quality assessment of RNA-seq datasets, we made cross-comparisons of samples to identify any outliers or systematic errors associated with biological conditions, i.e., different groups (Fig. 3a-f). Figure 3a shows the distribution curve of estimated insert size for each sample. The curves were relatively smooth (no "spikes") and consistent across samples and conditions, reflecting little technical bias across samples. Figure 3b shows the gene body coverage profile for each sample; no significant 3' bias was found, indicating that the datasets were not affected by RNA degradation. Figure 3c shows the read mapping rates for different location categories in each sample. We did not observe any outlier within each condition, suggesting consistency across samples in terms of alignment. Similarly, we did not observe a disproportionate identification of novel splice junctions in any one sample or condition (Fig. 3d). Except for the nucleotide composition bias in the first few cycles that normally occurs in Illumina RNA-seq data, the base composition was quite uniform across all other cycles (Fig. 3e). Figure 3f shows the alignment soft clipping rate by cycle in each sample. We did not observe any "spikes" in the curves for any sample, and the clipping profiles were generally consistent across samples and conditions.
To visualize the high-dimensional transcriptomic datasets, we performed dimension reduction with principal component analysis (PCA). Consistent with the observation in the hierarchical clustering analysis (Fig. 2a,b), we found that all HCM samples clustered together and were distant from normal samples (Fig. 3g), suggesting that our data are suitable for differential expression analysis. As expected, the transcriptomic variance among samples was greater in the normal condition than in the diseased HCM condition.
Taken together, we presented a high-quality dataset that was suitable for differential expression and splicing analysis of both coding and lncRNA genes in myocardial tissues between HCM and normal conditions.