Metagenomic discovery of novel CRISPR-Cas13 systems

Dear Editor, CRISPR-Cas systems are crucial adaptive immune components of microbial resistance against the invasion of mobile genetic elements (MGEs) and serve as the core of current cutting-edge genome engineering technolo-gies 1 . Unlike the widely applied Cas9 or Cas12 DNA editing tools in present use, Cas13 is an RNA-guided programable RNA-targeting single effector system that enables gene manipulation at the transcriptional level 2 . At present, only four subtypes of Cas13 have been identi- ﬁ ed 1,3,4 . An expanded catalog of CRISPR-Cas13 systems can provide phylogenetic insights and may offer oppor-tunities for the development of novel RNA-editing tools. By mining bulk metagenomic data (> 10 TB) from various environments, we identi ﬁ ed hundreds of orthologs of known and novel Cas13 systems in this study, the latter of which could be classi ﬁ ed into ﬁ ve novel subtypes based on protein sequence similarity. Notably, the novel Cas13 systems discovered in this study can be developed into ef ﬁ cient RNA editors and expand the RNA-editing toolbox. we ﬁ CRISPR sequenced 20 kb regions of DNA ﬂ anking the CRISPR arrays were for the open reading frames (ORFs). Proteins


Identification of novel Cas13 systems
Over 10 TB of assembled metagenomic sequences were retrieved from the Joint Genome Institute (JGI) 1,2 , NCBI 3 , and European Nucleotide Archive (ENA) 4 databases. The CRISPR arrays were identified using the MinCED software 5 with default parameters. Two 20 kb DNA regions flanking the CRISPR array were extracted for protein prediction. ORFs larger than 400 aa were annotated using Meta-GeneMark 1 6 . The protein sequences of all known subtypes from Cas13a to Cas13d were retrieved from NCBI, and used to generate HMMs using the HH-suite 7 for constructing the Cas13 library. The library was then used to align the newly discovered proteins for identifying the potential novel Cas13 sequences. Incomplete or known proteins were removed to enable detection of the intact and novel Cas13 sequences.

Phylogenetic analyses
For phylogenetic analyses and classification, the candidate sequences were clustered using MMseqs2 8 using the following criteria: sequence identity of 0.5 and minimum coverage of 0.7.
The proteins within each cluster were clustered using MMseqs2 8 with the following parameters for reducing redundancy: sequence identity of 0.9 and minimum coverage of 0.8. Each nonredundant cluster was aligned using MAFFT 9 , with default parameters. The aligned non-redundant clusters were converted into profile HMMs using HHmake 7 , and the profile HMMs were next aligned using the HMM-HMM alignment method in HHalign 7 . The resulting alignment score between clusters was sij, where i and j represent clusters i and j, respectively. The average score was first calculated using the function Sij = (sij+sji)/2. The pseudo-distance was next calculated using the function dij = (log(min(Sii,Sjj))-log(Sij))/2. These distances were used for constructing a dendrogram with the UPGMA algorithm for classification.
The protein sequences were aligned using MUSCLE 10 . The phylogenetic tree was constructed with FastTree 11 and visualized using the online iTOL tool 12 . Owing to the inaccuracy of the metagenomic data during sequencing and assembly, the CD-HIT software was used to remove redundant sequences with a sequence identity threshold of 80% prior to determining the number of proteins 13,14 . The consensus secondary structures and RNA secondary structures were predicted using RNAalifold and RNAfold webservers, respectively 15 . The RNA sequence logos were generated with WebLogo 16 . CRISPSRTarget 17 was used to search for the potential targets of natural crRNA and the positive hits are enlisted in Supplementary Table S4.

Vector construction
The human codon-optimized sequences of Cas13 enlisted in Supplementary Table S1 were synthesized de novo by GenScript. The DNA sequences encoding the Cas13 proteins were cloned into the pCAG-2A-EGFP backbone digested with XmaI and NheI 18 , using the Gibson assembly method. The sequence, which consisted of the U6 promoter and crRNA, was inserted into the pUC19 backbone digested with EcoRI and HindIII. The sequences of the crRNA scaffolds and spacers used in this study are enlisted in Supplementary Tables S2 and S3. The mCherry sequence was cloned downstream of the EF-1α promoter for generating the EF-1α-mCherry plasmid for the mCherry reporter system.
For transfection, the cells were plated in a 24-well dish (Corning) until they reached a confluence of 90%. Lipofectamine 3000 Reagent (Invitrogen) was used for all the transfections.

mCherry disruption system used in mammalian cells
The mCherry disruption system was used to evaluate the mRNA cleavage activity of the Cas13 proteins. To this end, Cas13 and EGFP proteins were co-expressed from the CAG promoter (designated as CAG-Cas13), while its cognate mature crRNA targeting the mCherry mRNA was expressed by the U6 promoter (designated as U6-crRNA). The mCherry was expressed by the EF-1α promoter cloned in a separate plasmid (designated as EF-1α-mCherry). The three vectors were In order to verify the configuration of the crRNA generating efficient RNA knockdown efficiency, the U6-crRNA plasmid expressing crRNA with 5′-DR or 3′-DR sequences was constructed. The MFI was normalized to that of the negative control, comprising HEK293T cells transfected with CAG-Cas13 and EF-1α-mCherry plasmids.
In order to analyze the PFS preference of Cas13e3, targets flanking 16 kinds of PFS combinations were cloned upstream of the mCherry gene in the EF-1α-mCherry plasmid. The MFI was normalized to that of the negative control, comprising HEK293T cells transfected with the CAG-Cas13 plasmid, U6-crRNA plasmid, and EF-1α-mCherry plasmid without PFS targets.

Mammalian mRNA knockdown assays
For the mammalian mRNA knockdown experiments, HEK293T cells were transfected with 600 ng of a plasmid expressing Cas protein and 300 ng of a plasmid encoding crRNA. HEK293T cells were lysed with TRIzol® reagent (Life Technologies) 48 hours post-transfection, and the total RNAs were extracted with a PureLink™ RNA Mini Kit (Thermo Fisher). Finally, quantitative reverse transcription PCR (RT-qPCR) was performed using a HiScript® II One Step qRT-PCR SYBR Green Kit (Vazyme). The sequences of the primers used for RT-qPCR are enlisted in Supplementary Table S5. The results of RT-qPCR were analyzed using the 2 −ΔΔCT method.
Relative expression was normalized to that of a negative control comprising HEK293T cells transfected with a plasmid expressing Cas13 protein and a plasmid encoding non-targeting (NT) crRNA.

RNA-seq analysis
The total RNA extracted from each sample was subjected to RNA-seq analysis using an Illumina NextSeq system. The sequencing reads were mapped onto the human genome (GRCh38) using the Hisat2 (v2.2.1) software 19 , and the reads were next counted using the htseq-count software 20 . The read counting results were used for calculating the fragments per kilobase million (FPKM). The differentially expressed genes were identified using the DEseq2 package 21 . Genes with a fold change greater than 2 and false discovery rate (FDR) less than 0.05 were regarded as differentially expressed genes.