CellectSeq: In Silico Discovery of Antibodies Targeting Integral Membrane Proteins Combining In Situ Selections of Phage Displayed Synthetic Antibodies and Next-Generation Sequencing

Synthetic antibody (Ab) technologies are efficient and cost-effective platforms for the generation of monoclonal proteomic tools against human antigens. Yet, they typically depend on purified proteins, which excludes from interrogation integral membrane proteins that require the lipid bilayer to support their native form or function. Here, we present a novel Ab discovery strategy, termed CellectSeq, for targeting integral membrane proteins presented on native cells in a complex environment. As proof of concept, we targeted the challenging tetraspanin receptor CD151, a target linked to cancer. First, we optimized in situ cell-based selections to enrich Ab pools for antigen-specific binders. Then, we designed novel NGS procedures to explore Ab pool diversities and abundances with enhanced accuracy. Finally, we developed novel motif-based scoring and error filtering algorithms for the comprehensive interrogation of NGS data to identify Abs with high diversities and specificities, even at extremely low abundances. We identified highly selective and diversified Abs against CD151 with abundances as low as 0.00009%, for which manual sampling or identification using Ab abundances in NGS data would have been impossible. Here we show that CellectSeq enables the rapid discovery of diversified and selective antibodies against CD151, with implications for other integral membrane proteins and cell-surface receptors.


Introduction
The application of antibodies (Abs) 1 for targeting cell surface proteins has prompted the development of synthetic human Abs 2 . By this method, synthetic phage-displayed libraries containing > 10^10 unique Abs can be constructed to rival the combinatorial diversity of natural in vivo immune repertoires and, in many ways, outperform natural repertoires for the production of Abs with high affinities and specificities 2, 3 .
Synthetic Ab technologies have also proven amenable to automation to enable high-throughput methods of selection to target large families of soluble antigens 4-6 .
However, a major limitation of both in vitro and in vivo methods for antibody generation is the difficulty of targeting multi-pass integral membrane proteins, which generally cannot be purified in a native form in the absence of a cell membrane. Integral membrane proteins remain a recalcitrant group of critical targets for Ab development due to their inherent association with the lipid bilayer, differential multiconformational states 7,8 , and interactions with other cell surface proteins 9,10 . Moreover, multi-pass integral membrane proteins often lack large, structured domains in their extracellular regions 11,12 , and thus pose a particular challenge for recombinant expression and purification 13 . Given that many essential biological processes and diseases depend on integral membrane proteins, the difficulty of targeting this large subset of the human proteome is a major roadblock in many areas of biological research and drug development 14,15 .
Moreover, the elevated expression of CD151 is correlated with cancer patient mortality and enhanced metastasis of tumors 28, 29 . The primary role of CD151 in cancer appears to be its ability to organize the distribution and function of growth factor receptors and integrins 25,30 . Consequently, CD151 may guide the migratory activity of tumor cells to induce invasiveness and metastasis. CD151 also modulates the pharmacological response to therapeutics that antagonize other cell surface receptors 31 , and appears to synergize with and modulate intracellular signaling activities in cancer. For example, integrin-associated CD151 may drive HER2-evoked mammary tumor onset and metastasis, and may enhance the activation of HER2 and other receptor tyrosine kinases by regulating dimerization [32][33][34] . Thus, CD151 is an integral membrane protein that may be a promising target for the development of antibodies that can antagonize the interactions mediated by its extracellular domains. However, the recalcitrant nature of the CD151 receptor, which protrudes only 4-5 nm above the membrane and displays limited surface-exposed regions 35 , makes it challenging to target (Fig. 1).
Recently, we reported optimized methods for in situ selections with phage-displayed synthetic antibody libraries with native antigen on live cells to develop a large panel of selective antibodies for integrin-α11/β1, a marker of aggressive tumors that is involved in stroma-tumor crosstalk 36 . Manual screening of over one thousand phage clones identified unique Abs with strong and selective binding to cells expressing integrin-α11/β1, and notably, most of these Abs did not recognize the purified antigen, suggesting that cell-based selections were essential for targeting native epitopes 36 . Moreover, next-generation sequencing (NGS) analysis of the abundance of unique clones in the selection pools showed that most of the Abs identified by clonal screening were among the most abundant and enriched amongst the NGS sequences, but intriguingly, many other sequences were also identified, suggesting that clonal screening had only isolated a small subset of antigen-specific clones 36 .
Here, we have further optimized in situ cell-based selection procedures to enrich Ab-phage pools for antigen-specific binders. In addition, we implemented a novel in silico analysis to efficiently explore and identify unique antigen-specific clones against CD151 in the enriched pools. In conjunction with rapid and cost-effective gene synthesis and recombinant Ab production strategies, the antigen-specific Ab-phage sequences were purified as Abs for direct assessment of cell-surface antigen recognition. We have collectively termed this methodology "CellectSeq", which utilizes phage display, in situ selections, next-generation sequencing, and motif-based scoring and error filtering algorithms for the comprehensive interrogation of candidate Abs in enriched but highly diverse Ab-phage pools. We used CellectSeq to target native CD151 displayed on cells and discovered specific anti-CD151 Abs with frequencies as low as one in a million NGS reads. Thus, we show that CellectSeq can identify rare but highly selective and diversified Abs targeting integral membrane proteins, without the need for screening of individual clones at the phage level. The technology should be applicable for the generation of Abs targeting many integral membrane proteins that have proven recalcitrant to conventional in vivo and in vitro methods.

Cell-based in situ selection for anti-CD151 Abs
To generate Abs targeting CD151, we used the phage-displayed Library F 37 of synthetic antigen-binding fragments (Fabs), which offers advantageous features for CellectSeq ( Figure S1). First, Library F is extremely diverse (> 10^10 unique members) and precisely designed to ensure that most members are stable and well-displayed on phage 38 . Second, the library has proven functional for selections with either purified antigens 37 or cell-surface antigens 36 and has yielded numerous selective Abs per selection 36 . Third, the library was constructed with a single, highly stable human framework, resulting in negligible display bias, with most library members presented at similar levels 37 . Also, the abundances of individual clones in pools enriched for target antigens are highly correlated with relative affinities 39 ; this property enhances NGS analysis based on enrichment ranking, allowing for the identification of highly selective and high-affinity clones. Fourth, the synthetic Abs are diversified at only four complementarity-determining regions (CDRs; H1, H2 and H3, and L3), which permits standard NGS procedures utilizing primers that anneal to common framework regions in a cost-effective manner ( Figure S2 & S3). Fifth, each of the four CDRs is composed of defined amino acid positions with restricted diversities. Therefore, the NGS data quality can be very accurately evaluated by assessing any deviations from the fixed framework or the occurrence of unexpected codons at diversified positions. For instance, CDRs H1 and H2 contain only six or eight binary degenerate codons and offer a diversity of 64 and 256 unique sequences, respectively ( Figure S1D). Conversely, CDRs L3 and H3 are much more diverse in terms of loop lengths (3-7 or 1-17 degenerate codons, respectively), and in terms of sequence composition (encoded by defined ratios of nine codons encoding nine amino acids). The CDRs L3 and H3 offer a theoretical diversity of the order of 10^7 and 10^17 unique sequences, respectively ( Figure S1D). Ultimately, the four CDRs combined offer a practical diversity approximating 10^11 unique clones 37 . Thus, the highly diverse Library F, with defined length and chemical diversity encoded in CDRs L3, H1, H2, and H3, permits the precise probabilistic detection and elimination of artifactual CDR sequence combinations from NGS data, such as those derived from the PCR sequence amplifications required for the Illumina NGS process 40 (see Materials and Methods).
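The H1 and H2 diversities quoted above follow directly from the number of binary degenerate codons (2^6 = 64 and 2^8 = 256). The counting can be sketched in Python as follows. The function names are hypothetical, and the L3/H3 count assumes that every position can hold any of nine residues with equal weight, ignoring the defined codon ratios and any other design constraints, so it is only a coarse estimate and need not reproduce the reported orders of magnitude.

```python
def binary_codon_diversity(n_positions: int) -> int:
    """Number of sequences when each position offers two possible codons."""
    return 2 ** n_positions


def variable_length_diversity(n_residues: int, min_len: int, max_len: int) -> int:
    """Coarse count over all allowed loop lengths, assuming each position can
    hold any of n_residues amino acids (an illustrative simplification)."""
    return sum(n_residues ** length for length in range(min_len, max_len + 1))


h1 = binary_codon_diversity(6)             # CDR H1: six binary degenerate codons
h2 = binary_codon_diversity(8)             # CDR H2: eight binary degenerate codons
l3 = variable_length_diversity(9, 3, 7)    # CDR L3: 3-7 degenerate codons
h3 = variable_length_diversity(9, 1, 17)   # CDR H3: 1-17 degenerate codons

print(h1, h2)
print(f"{l3:.1e} {h3:.1e}")
```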
We performed in situ selections against cell-surface CD151 on live cells, where CD151 is targeted in its native cell-surface environment. For cell engineering, we selected the HEK293T cell line because it grows rapidly in suspension and exhibits high display of transgenic cell surface proteins 41 . To enrich binders for CD151, we engineered the HEK293T cells to stably overexpress CD151 (HEK293T-CD151+; positive cells) ( Figure S4A). Conversely, to deplete binders that are not selective for the target, we engineered HEK293T cells that stably expressed a short hairpin RNA that depleted CD151 mRNA and, consequently, reduced cell-surface display of CD151 (HEK293T-CD151-; negative cells) ( Figure S4A). The strategy of CellectSeq in situ selections utilizes multiple rounds of selection against antigen-positive and antigen-negative cells, with the aim of producing a positive Ab pool enriched with selective clones for the target antigen, and a negative Ab pool enriched with non-specific clones (background).
To this end, the naïve phage pool representing Library F was subjected to four rounds of selections with the engineered cell lines (Fig. 2). Round 1 consisted of a positive selection on HEK293T-CD151+ cells to enrich for Fab-phage that bound to CD151, followed by elution of bound phage and amplification by passage through E. coli. In round 2, we employed a strategy whereby phage pools were exposed to control HEK293T-CD151- cells to deplete clones that bound to other cell-surface antigens, followed by positive selections with HEK293T-CD151+ cells. Round 3 repeated the round 2 process using the amplified phage pool from round 2. For the last round, the amplified phage pool from round 3 was split into two pools, and then subjected to a round 4 selection process that involved elution and amplification of phage bound to either HEK293T-CD151+ cells (positive selection) or HEK293T-CD151- cells (negative selection) (Fig. 2). Thus, the round 4 phage selection output consisted of two pools, a positive and a negative pool. After the four rounds of selection for binding to in situ CD151, we manually isolated 96 random Ab-phage clones derived from the round 4 phage output of HEK293T-CD151+ cells (positive pool). We screened all 96 clones by cellular phage ELISA 42 , where phage signals were measured for binding to HEK293T-CD151+ cells and compared to control HEK293T-CD151- cells. Here, we identified 49 phage clones with binding signals 5-fold or greater over controls, which were deemed positive binders for cellular CD151 ( Figure S5). Sanger DNA sequencing analysis showed that all 49 clones shared the same sequence, that of clone CD151-1 (Table 1), indicating that the Ab selection enriched for an immunodominant clone. Accordingly, manual Ab screening failed to derive multiple unique and diversified CD151-selective clones; consequently, we next performed NGS analysis of the selection output.

NGS enrichment ranking selection for anti-CD151 Abs
To identify unique CD151-specific Fab-phage clones in the round 4 selection output, we performed NGS analysis to explore the output diversity and relative abundance of every Ab clone. To this end, we deep sequenced the round 4 output derived from the positive and negative pools. This allowed us to obtain CD151-selective sequences (derived from the positive pool) and non-specific background sequences (derived from the negative pool). The phage DNA from the Ab selection output pools was subjected to PCR amplification, resulting in amplicons with Illumina NGS adaptor sequences and unique barcode identifiers that flanked the region of CDRs L3 and H3 ( Figure S2 & S3). The amplicons from each output pool (positive and negative) were quality controlled for correct size, purified, and quantified, then normalized and pooled, and finally sequenced using an Illumina HiSeq 2500 instrument (see Materials and Methods). Besides the Illumina universal sequencing primers (PE1 and PE2), the NGS runs also included a custom primer that allowed for the complete sequencing of CDRs H1 and H2. Thus, the three primer reads (PE1, PE2, and custom; Figure S2 & S3) provided complete sequence coverage of the four diversified CDRs in Library F 37 ( Figure S1). We performed duplicate NGS runs, and each run was controlled for high sequence quality scores 43,44 . Instrument sequencing errors were filtered out using a per-base quality score cut-off of Q = 30, which corresponds to a 1 in 1,000 probability of an incorrect base call 43 . Subsequently, all sequencing reads from the duplicate NGS runs were combined and deconvoluted. The three different primer reads (PE1, PE2, and custom) for each clone were combined into a single sequence to derive the complete synthetic Ab sequence (see Materials and Methods).
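The Q = 30 cut-off can be related to the error probability through the Phred definition Q = -10 log10(P). The following is a minimal Python sketch of a per-base filter, not the pipeline used in this work. It assumes that quality strings use the Phred+33 ASCII encoding, and the function names are hypothetical.

```python
def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def passes_quality(quality_string: str, min_q: int = 30, offset: int = 33) -> bool:
    """True if every base in the read has a quality score of at least min_q.

    Assumes Phred+33 encoding: score = ASCII code of the character - 33.
    """
    return all(ord(ch) - offset >= min_q for ch in quality_string)


# Q = 30 corresponds to an error probability of 1 in 1,000.
print(phred_to_error_probability(30))

# 'I' encodes Q40 and '?' encodes Q30, so this read is kept;
# '5' encodes Q20, so the second read is rejected.
print(passes_quality("II?I"), passes_quality("II5I"))
```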
The high-quality nucleotide sequences obtained were then compared to the designed sequence repertoire of Library F 37 to remove technical errors inherent to Illumina sequencing and PCR amplification. For each Ab clone, the nucleotide sequences were evaluated for codon deviations from the synthetic design of the fixed framework and restricted CDR positions ( Figure S1A-B). Any sequences that diverged from the synthetic library design were discarded. Subsequently, the sequences were filtered for potential PCR-induced artifacts that may arise during the NGS sample preparation and Illumina sequencing process (see Materials and Methods). Such artifacts may arise from incorrect annealing that produces amalgams (i.e. combinations) of different clones 40 , which in our case may be driven by the fixed Ab framework coding region (non-CDR). Therefore, for every sequence we obtained the frequencies (i.e. number of observations) of its CDR H3 and L3 sequences, since these two CDRs are the most diversified in the synthetic library and drive the majority of affinity interactions with the antigen 37 . We then identified valid L3/H3 pairs by calculating a frequency cut-off to determine a minimal threshold of valid occurrences, with all below-threshold pairs filtered from the selection pool (see Materials and Methods). Thus, we obtained 7,541,189 and 7,250,873 high-quality NGS reads for the positive and negative pools, respectively. The reads were then translated into amino acid sequences. This process ultimately yielded 23,671 and 56,352 unique amino acid sequences in the positive and negative pools, respectively.
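The L3/H3 pair filter described above can be sketched as follows. The text does not give the calculation of the frequency cut-off, so this Python sketch takes the threshold as a user-supplied parameter; `min_count` and the function name are hypothetical, and the toy CDR strings are invented for illustration only.

```python
from collections import Counter


def filter_rare_pairs(reads, min_count):
    """Keep only reads whose (L3, H3) pair was observed at least min_count times.

    reads: sequence of (l3, h3) tuples, one tuple per sequencing read.
    min_count: hypothetical occurrence threshold; how it is derived is not
    specified here and must be supplied by the caller.
    """
    reads = list(reads)
    pair_counts = Counter(reads)
    return [pair for pair in reads if pair_counts[pair] >= min_count]


# Toy data: two genuine clones plus one chimeric pair that mixes their CDRs.
toy_reads = (
    [("QSYW", "ARGYYFDY")] * 5
    + [("QGSA", "ARSWPMDY")] * 4
    + [("QSYW", "ARSWPMDY")]  # seen once: treated as a putative PCR amalgam
)
kept = filter_rare_pairs(toy_reads, min_count=2)
print(len(toy_reads), len(kept))
```

Counting pairs rather than single CDRs is what lets a chimera be rejected even when each of its CDRs is individually abundant.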
To perform NGS Ab enrichment ranking selection of potential CD151-selective clones, the unique high-quality sequence reads from each pool were parsed based on CDR sequences and observation counts. For each unique paratope we plotted the counts in the positive pool (x-axis) versus the ratio of its abundance (i.e. frequency) in the positive pool relative to the negative pool (y-axis) (Fig. 3). To estimate the number of potential unique CD151-binding clones in the plot, we defined an upper-right quadrant of putative binders. Here, the upper-right quadrant sequences have observation counts of more than 200 in the positive pool and are more than four-fold enriched relative to the negative pool (Fig. 3). Comparative analysis of the unique sequences showed that all upper-right quadrant clones identified by NGS enrichment ranking were close homologs of clone CD151-1, all showing more than 80% sequence identity in both L3 and H3 sequences. This finding reveals that the Ab selection is enriched for homologous clones that potentially target a similar (immunodominant) epitope, where CD151-1 is the most abundant and selective clone.
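The quadrant rule above (more than 200 observations in the positive pool and more than four-fold enrichment over the negative pool) can be sketched in Python as follows. The function name is hypothetical, and the pseudocount used for clones that are absent from the negative pool is an assumption of this sketch, not a detail taken from the text.

```python
def putative_binders(pos_counts, neg_counts, min_count=200, min_fold=4.0,
                     pseudocount=1.0):
    """Return {clone: fold enrichment} for clones in the upper-right quadrant.

    pos_counts / neg_counts: {clone: read count} for each pool.
    pseudocount: added to negative-pool counts so that the ratio is defined
    for clones never seen in the negative pool (an assumption of this sketch).
    """
    total_pos = sum(pos_counts.values())
    total_neg = sum(neg_counts.values())
    selected = {}
    for clone, count in pos_counts.items():
        freq_pos = count / total_pos
        freq_neg = (neg_counts.get(clone, 0) + pseudocount) / total_neg
        fold = freq_pos / freq_neg
        if count > min_count and fold > min_fold:
            selected[clone] = fold
    return selected


# Toy counts: A is abundant and enriched, B is abundant but not enriched,
# C is too rare to pass the count threshold.
pos = {"A": 900, "B": 300, "C": 50}
neg = {"A": 9, "B": 400, "D": 591}
binders = putative_binders(pos, neg)
print(sorted(binders))
```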

Motif-based algorithm identifies selective and diversified Abs against CD151
Due to the lack of Ab diversity derived by both manual selection of Ab clones and NGS enrichment ranking, we developed a novel motif-based algorithm to identify highly selective Abs for CD151 from the deep sequenced phage pools. The in silico strategy for scoring CD151-selective Abs is based on exploring all possible sequence motifs (i.e. consensus motifs) in the positive pool and scoring their enrichment over the negative pool (Fig. 4A). This follows the premise that highly selective Abs are enriched with paratope motifs (i.e. linear information) that recognize the target antigen, whereas non-selective Abs lack such enrichment 45 ( Figure S6). Therefore, for each Ab clone in the positive pool (i.e. candidate) (Fig. 4A1) we explored the entire space of linear information by exhaustively enumerating all possible motifs matching its CDR sequences 46 , and obtained the frequencies (number of matching sequences / total number of sequences) of every motif in the positive and negative pools (Fig. 4A2). According to the premise above, high enrichment of the motifs in the positive pool relative to the negative pool implies that the Ab candidate is potentially highly selective (Fig. 4A3). Thus, we analyzed each Ab in the positive pool for selective binding to CD151 by scoring the separation between the two distributions of motif frequencies in the positive and negative pools (see Materials and Methods for details). To this end, we used the t-test 47 to score the separation of the two distributions, and then calculated the p-value to evaluate the statistical significance of the t-test 48-50 . The lower the p-value, the greater the separation between the two distributions and, therefore, the higher the selectivity of the candidate Ab. Finally, we applied a stringent p-value cut-off of 10^-10 to identify highly selective Ab clones (see Materials and Methods).
Therefore, this motif-based in silico strategy allowed us to explore rapidly and exhaustively the selectivity of all Ab clones in the positive pool. We were able to identify potentially selective CD151 binders regardless of their individual frequencies in the total pool of sequences, thus bypassing the limitations of standard NGS analyses based solely on enrichment counts of individual clone sequences, which have difficulty discriminating between selective Ab clones and background.
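The motif-scoring idea can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the implementation used in this work: a motif is modeled as a CDR sequence with any subset of positions replaced by wildcards, matching requires equal length, a single CDR string stands in for the full paratope, the separation is scored with a Welch (unequal-variance) t statistic, and the p-value uses a normal approximation rather than the t distribution. All function names are hypothetical.

```python
from itertools import product
from math import erfc, sqrt
from statistics import mean, variance


def enumerate_motifs(cdr):
    """Yield every motif matching cdr: each position is either kept or
    replaced by a wildcard (None). The all-wildcard motif is skipped."""
    for mask in product((True, False), repeat=len(cdr)):
        if any(mask):
            yield tuple(aa if keep else None for aa, keep in zip(cdr, mask))


def motif_frequency(motif, pool):
    """Fraction of sequences in pool that match motif (same length assumed)."""
    hits = sum(
        len(seq) == len(motif)
        and all(m is None or m == aa for m, aa in zip(motif, seq))
        for seq in pool
    )
    return hits / len(pool)


def selectivity_score(cdr, positive_pool, negative_pool):
    """Return (t, p): Welch t statistic and a two-sided p-value from a
    normal approximation (an assumption of this sketch)."""
    pos = [motif_frequency(m, positive_pool) for m in enumerate_motifs(cdr)]
    neg = [motif_frequency(m, negative_pool) for m in enumerate_motifs(cdr)]
    standard_error = sqrt(variance(pos) / len(pos) + variance(neg) / len(neg))
    t = (mean(pos) - mean(neg)) / standard_error
    return t, erfc(abs(t) / sqrt(2))


# Toy pools of invented three-residue "CDRs", one entry per read.
positive_pool = ["YSW", "YSW", "YAW", "YSF", "GGG"]
negative_pool = ["GGG", "AAA", "YGG", "GGG", "SSS"]
t, p = selectivity_score("YSW", positive_pool, negative_pool)
print(t > 0, p < 0.05)
```

Because a candidate is scored through all of its motifs rather than through its own read count, a clone observed only a handful of times can still score well when related sequences share its motifs in the positive pool. The number of motifs grows as 2^n with CDR length n, so a practical implementation would need a more efficient matching scheme than this direct scan.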

Filtering PCR-induced sequence artifacts improves the in silico Ab selection results
As previously mentioned, PCR-induced artifacts may arise during the NGS sample preparation and Illumina sequencing process 40 . These artifacts represent invalid amalgams of existing CDR L3, H1, H2, and H3 sequences, which may be mistaken for novel Ab clones 40 . These artifacts may significantly bias the frequencies of individual clones, which will inevitably affect the in silico Ab discovery strategy. Therefore, for both the positive and negative pools, we obtained the frequencies (i.e. number of observations) of CDR H3/L3 pairs, as these two CDRs are the most diverse in terms of length and amino acid composition in Library F 37 . We calculated a frequency cut-off to determine valid L3/H3 pairs utilizing a minimal occurrence threshold, with all invalid pairs filtered from the selection pool as potential PCR and NGS artifacts (see Materials and Methods). We then applied the motif-based in silico Ab discovery strategy to predict highly selective CD151 binders (p-values < 10^-10) for both scenarios, before and after filtering. The application of error filtering to the positive pool Abs reduced the number of unique Abs by 80% (Fig. 4B1-C1-D1). Similarly, the application of error filtering before the motif-based in silico prediction of CD151 clones reduced the number of unique predicted Abs by 85% (Fig. 4B2-C2-D2). Interestingly, before error filtering the in silico predicted Abs clustered into 183 distinct families of similar L3/H3 sequences (> 80% identity), whereas after filtering the Abs were reduced to only 4 distinct families (Fig. 4B3-C3). To experimentally assess the validity of predicted antibodies in both scenarios, we selected the Abs with the best specificity scores (p-values; Fig. 4A3) from each of the 4 families predicted after filtering, as well as 23 additional Abs predicted before filtering (Fig. 4B3-C3-D3 & Table S1).
Due to the low NGS enrichment of the identified Ab clones, instead of using PCR rescue or similar methods 51, 52 , all 27 candidate clones were synthesized as Ab DNA sequences and cloned into Fab protein expression plasmids. After Fab purification, we tested the activity of each clone by flow cytometry for binding to HEK293T-CD151+ cells compared with HEK293T-CD151- cells (control). All four Ab clones predicted after filtering were confirmed as CD151 binders (Pass validation; Table S1), with fluorescence signals 3-fold or greater than controls. On the other hand, all 23 pre-filtering Abs failed to bind to CD151 (Table S1). The success rates of the motif-based in silico Ab discovery before and after filtering are therefore 4 of 27 (~15%) and 4 of 4 (100%), respectively. This difference between the success rates highlights the requirement to filter PCR-induced and NGS artifacts in order to derive selective clones accurately and effectively. Furthermore, the abundance (enrichment) of the 4 identified clones (based on motif-based in silico Ab selection) varied from high (30%) to extremely low. In fact, clones CD151-2 and CD151-3 have frequencies below 0.01%, and clone CD151-4 has an extremely low frequency of 0.00009% (Table 1). These latter clones would be impossible to identify using manual sampling or standard NGS analyses based solely on enrichment.

Characterization of motif-based in silico identified Abs against CD151
To demonstrate the advantage of the motif-based in silico Ab discovery strategy, termed CellectSeq ( Figure S7), we measured dose-dependent binding of all 4 clones (CD151-1 through CD151-4), as Fab versions, to HEK293T-CD151+ cells. Quantitative flow cytometry showed tight and saturable binding of each Fab to HEK293T-CD151+ cells (Fig. 5A), with EC50 values in the low-nanomolar range (Table 1). We also used flow cytometry to assess epitope overlap by measuring the ability of immunoglobulin (IgG) versions of each clone to block binding of each Fab to HEK293T-CD151+ cells. As expected, preincubation of HEK293T-CD151+ cells with each IgG reduced subsequent binding of the cognate Fab. Moreover, all IgGs blocked binding of the different Fab clones (Fig. 5B), implying that all four distinct clones share a similar CD151 binding epitope.
Further corroboration of specificity for CD151 was provided by performing immunoprecipitation mass spectrometry (IP-MS) experiments with each Fab on HEK293T-CD151+ cells and HT1080 cells (which express native levels of CD151 protein). Tandem mass spectra were searched against a human database to validate MS/MS protein identifications. Protein identifications were accepted if they could be established at greater than 99% probability based on identified peptides. After background filtering to remove keratin, immunoglobulin and cytoplasmic proteins, the highest peptide counts for all four Fabs were for CD151 in both cell lines ( Fig. 5C and S8). Integrin β1 (ITGB1), a receptor known to associate with CD151 53 , also immunoprecipitated with Fabs CD151-1, CD151-3, and CD151-4 ( Fig. 5C), further supporting the selectivity of the Fabs for CD151. Taken together, the results show that the four in silico Abs recognize cell-surface CD151 with high affinity and specificity, with all clones likely binding to overlapping epitopes.

Discussion
Multi-domain membrane proteins represent about 70% of current drug targets, especially for their role in the progression and tumorigenesis of numerous cancers 14 . However, the cellular surface component of many integral membrane proteins makes their production and purification extremely difficult for in vitro Ab selections 13 . The instability of membrane proteins makes them challenging targets to work with, as many of these proteins depend on the membrane environment for their correct structure. The Ab selection strategy presented in this work, termed CellectSeq ( Figure S7), bypasses the need for purified antigens because Ab library selections are performed directly on cell-surface antigens. Moreover, CellectSeq may target difficult receptors, such as those containing minimal loop protrusions and present in complex mixtures in situ, as is the case for tetraspanin receptors.
By this method, we targeted the challenging CD151, a cell surface protein that is linked to disease and tumor progression. The in situ Ab selection against CD151 followed by conventional manual Ab screening yielded a unique immunodominant clone, CD151-1. NGS enrichment ranking analysis of the same selection identified only clones highly homologous to CD151-1, with greater than 80% sequence identity in both CDRs L3 and H3. On the other hand, the motif-based and error filtering in silico analysis of CellectSeq yielded multiple diversified clones, several of which were present at extremely low frequencies in the output pool. Moreover, all four distinct paratopes identified by CellectSeq share a similar target epitope (Fig. 5B); this observation highlights the recalcitrant nature of the CD151 receptor 35 , which displays limited surface-exposed regions that limit the available epitopes (Fig. 1). Furthermore, the advantage of CellectSeq over conventional strategies for deriving Abs from NGS data, such as enrichment ranking [54][55][56] , is the exhaustive analysis of all paratope motifs in the NGS dataset, rather than unique observations of clonal sequence identities, which enables the discovery of low-abundance selective Abs. The statistical evaluation of paralogs allows for the successful prediction of representative Abs against CD151, including low-abundance clones at observed frequencies as low as 7 in 7.3 million reads. We also attribute the success of CellectSeq to the design of the synthetic Ab repertoire itself. In contrast to existing strategies for enhancing sequencing fidelity in Illumina datasets [57][58][59][60] , as demonstrated in this report, the restricted synthetic framework of Library F permits simple and accurate detection of erroneous Illumina reads. Here, the design of positional codon frequencies in the restricted CDRs allows for rapid deconvolution of NGS datasets and the removal of errors and artifacts, whereas natural repertoires are more random, which makes NGS errors difficult to assess. The synthetic framework also enables a deep analysis of paratope diversities in the NGS data by utilizing motif-based in silico strategies that predict infrequent but target-specific Abs, which was demonstrated by the successful prediction of diversified and selective CD151 Abs with frequencies below 0.01%.
The implementation of NGS analysis facilitates the rapid and successful discovery of Abs, and highlights that membrane-associated antigens are accessible to synthetic Abs in situ. As demonstrated in this report, the CellectSeq strategy surpasses standard methods of manual Ab screening and NGS analysis in its ability to interrogate all potential binders in the output Ab pool. Additionally, because NGS is an attractive technology for generating meta-data, with regard to cost and access for Ab discovery 61, 62 , we foresee the expanded implementation of NGS in conjunction with CellectSeq to identify diversified paratopes and low-abundance clones. This is evidenced by recent reports of deep sequencing approaches that characterize human Ab libraries and V-gene repertoires from immunized mice 63, 64 . This combination of cellular Ab selections coupled with NGS analysis and CellectSeq may create a proteomic gateway for modulating the surfaceome.

Materials And Methods
Cell lines and culture practices
Both the CD151 knockdown (HEK293T-CD151-) and CD151 overexpressing (HEK293T-CD151+) cell lines were gifts from the Dr. Rottapel lab at University of Toronto, Princess Margaret Cancer Centre. Briefly, the HEK293T-CD151- cells were generated using the Tet-pLKO-puro plasmid and the HEK293T-CD151+ cells were generated using the pLX304 plasmid, both as previously described 66 . The HEK293T-derived cell lines were cultured in Dulbecco's Modified Eagle medium (DMEM) with 10% fetal bovine serum (FBS). The human fibrosarcoma HT1080 cell line (ATCC; CCL-121) was cultured in Eagle's Minimum Essential Medium (EMEM) with 10% FBS. All cells were cultured at 37 °C in a humid incubator with 5% CO2.

Antibody Selections with Cellular Antigen
Phage pools representing synthetic antibody Library F 37 were cycled through four rounds of binding selections using a HEK293T-CD151- cell line as the background depleting step, and a HEK293T-CD151+ cell line as the target selection step (Fig. 2). The adherent cell lines were suspended using PBS containing 10 mM ethylenediaminetetraacetic acid (EDTA) (Sigma-Aldrich). For round 1, ten million re-suspended HEK293T-CD151+ cells (greater than 90% viability) were incubated with Fab-phage (3 × 10^12 cfu) in cell growth media under gentle rotation for 2 hours at 4 °C. For rounds 2 and 3, the Fab-phage were cycled from antigen-negative cells (to remove non-specific phage binders) to antigen-positive cells. Here the Fab-phage were first incubated with the HEK293T-CD151- cell line for 2 hours at 4 °C, then the cells were spun down utilizing a chilled centrifuge and the Fab-phage supernatant was collected. Similarly, the HEK293T-CD151+ cells were spun down utilizing a chilled centrifuge and the supernatant was discarded. Next, the HEK293T-CD151+ cells were resuspended in the Fab-phage supernatant and incubated for 2 hours at 4 °C. For round 4, both HEK293T-CD151- and HEK293T-CD151+ cells were independently presented with Fab-phage and incubated for 2 hours at 4 °C. The HEK293T-CD151- and HEK293T-CD151+ cell lines were washed four times with chilled PBS and 1% BSA. For all rounds, after washing, the bound phages were eluted from the cell pellet by resuspending the cells in 0.1 M hydrochloric acid and incubating for 10 minutes at room temperature. The cell solutions were neutralized using 11 M Tris buffer (Sigma-Aldrich), cellular debris was removed by high-speed centrifugation, and the eluent was transferred to clean vials. The output phages were amplified by infection and growth in E. coli OmniMAX™ cells (Thermo-Fisher). After round 4, infected E. coli OmniMAX™ cells were plated on 2YT/carbenicillin (Sigma-Aldrich) plates for isolation of single colonies.

Phage ELISAs
Colonies of E. coli OmniMAX harboring phagemids were inoculated into 450 µl 2YT broth supplemented with carbenicillin and M13-KO7 helper phage, and the cultures were grown overnight at 37 °C in a 96-well format. Culture supernatants containing Fab-phage were diluted two-fold in PBS buffer supplemented with 1% BSA and incubated for 15 minutes at room temperature. To test binding to native antigen on cells, phages were added directly to the cellular media of HEK293T-CD151- and HEK293T-CD151+ adherent cells (95-100% confluence) in tissue-culture-treated 96-well plates (Thermo-Fisher). After incubation for 45 minutes at room temperature, the plates were washed gently with PBS and the cells were fixed with 4% paraformaldehyde (Sigma-Aldrich). The cells were washed with PBS and incubated for 30 minutes with horseradish peroxidase/anti-M13 Ab conjugate (Sigma-Aldrich) in PBS buffer supplemented with 1% BSA. The plates were washed, developed with TMB Microwell Peroxidase Substrate Kit (KPL Inc.), and quenched with 1.0 M phosphoric acid; the absorbance was determined at a wavelength of 450 nm. Clones were identified as positive if they produced at least three-fold greater signal on wells with HEK293T-CD151+ cells over antigen-negative HEK293T-CD151- cells. All positive clones were subjected to Sanger DNA sequence analysis (Genewiz).
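The positive-clone call described above reduces to a fold-change test on paired absorbance readings. A minimal Python sketch follows. The function name and the example readings are hypothetical, and the fold threshold is left as a parameter.

```python
def positive_clones(signals, min_fold=3.0):
    """Return clones whose signal on antigen-positive cells is at least
    min_fold times the signal on antigen-negative cells.

    signals: {clone: (A450 on CD151+ cells, A450 on CD151- cells)}.
    The comparison is written as a product to avoid dividing by a
    zero background reading.
    """
    return [
        clone
        for clone, (on_positive, on_negative) in signals.items()
        if on_positive >= min_fold * on_negative
    ]


# Invented A450 readings for three wells of a 96-well screen.
example = {
    "well_A1": (1.80, 0.20),  # 9-fold over background
    "well_A2": (0.45, 0.30),  # 1.5-fold over background
    "well_A3": (0.66, 0.22),  # about 3-fold, so it passes a 2.5-fold threshold
}
print(positive_clones(example))
print(positive_clones(example, min_fold=2.5))
```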

Fab Protein Purification
Fab proteins were expressed in E. coli BL21 (ThermoFisher), as described 38 . Following expression, cells were harvested by centrifugation and cell pellets were flash-frozen using liquid nitrogen. The cell pellets were thawed, resuspended in lysis buffer (50 mM Tris, 150 mM NaCl, 1% Triton X-100, 1 mg/ml lysozyme, 2 mM MgCl2, 10 units of benzonase), and incubated for 1 hour at 4 °C. The lysates were cleared by centrifugation, applied to rProtein A-Sepharose columns (GE Healthcare), and washed with 10 column volumes of 50 mM Tris, 150 mM NaCl, pH 7.4. Fab protein was eluted with 100 mM phosphoric acid buffer, pH 2.5 (50 mM NaH2PO4, 140 mM NaCl, 100 mM H3PO4) into a neutralizing buffer (1 M Tris, pH 8.0). The eluted Fab protein was buffer exchanged into PBS and concentrated using an Amicon-Ultra centrifugal filter unit (EMD Millipore). Fab protein was characterized for purity by SDS-PAGE, and concentration was determined by spectrophotometry at an absorbance wavelength of 280 nm.

IgG Purification
Full-length IgG proteins were expressed in mammalian cells, as described 67 . Briefly, plasmids designed to express heavy and light chains were co-transfected into Expi293 cells (ThermoFisher) using the FuGENE® 6 Transfection Reagent kit (Promega), according to the manufacturer's instructions. After 5 days, cell culture media was harvested and applied to an rProtein-A affinity column (GE Healthcare). IgG protein was eluted with 25 mM H3PO4, pH 2.8, 100 mM NaCl and neutralized with 0.5 M Na3PO4, pH 8. Fractions containing eluted IgG protein were combined, concentrated, and dialyzed into PBS, pH 7.4. IgG protein was characterized for purity by SDS-PAGE, and concentration was determined by spectrophotometry at an absorbance wavelength of 280 nm.

Next-Generation Sequencing Analysis
The Fab-phage output pools from HEK293T-CD151− and HEK293T-CD151+ cell lines were used as input templates for PCR reactions with forward and reverse primers that flanked CDRs L3 and H3, respectively. The primers included a 24 base-pair template annealing region followed by a 6-8 base-pair unique nucleotide barcode identifier and an Illumina universal adapter tag (PE1 or PE2 for the reverse or forward primer, respectively). Duplicate PCR amplicons were generated per Fab-phage pool, isolated by agarose gel electrophoresis, and recovered by gel extraction (Qiagen). The duplicate PCR amplicons were combined, and the sample concentrations were determined by spectrophotometry (BioTek). The amplicons for antigen-positive and antigen-negative Fab-phage pools were normalized, pooled, and sequenced using a HiSeq 2500 instrument (Illumina) with 300 paired-end cycles. Besides the PE1 and PE2 Illumina universal primers, the sequencing runs also included a custom primer that allowed for complete sequencing of CDRs H1 and H2. Thus, the three primer reads together provided complete sequence coverage of the four CDRs that were diversified in Library F 37 (Figure S2). We performed duplicate NGS runs and then combined them. Subsequently, the sequencing reads were deconvoluted for each clone, and the three primer reads (PE1, PE2, and custom) were combined into a single sequence to derive the complete sequence. Sequencing errors were filtered out using a per-base quality score cut-off of Q = 30, which corresponds to a 1 in 1,000 probability of an incorrect base call 43 . The resulting high-quality nucleotide sequences were translated into amino acid sequences and compared to the designed sequence repertoire of Library F 37 to filter out technical errors inherent to sequencing and PCR amplification.
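The per-base quality filter described above can be illustrated with a minimal Python sketch. It assumes FASTQ quality strings in Phred+33 encoding, which is not stated in the text, and the function names and example reads are hypothetical; it is not the pipeline used in this work.

```python
def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score Q into the probability of a wrong base call."""
    return 10 ** (-q / 10)


def passes_quality_filter(quality_string: str, min_q: int = 30, offset: int = 33) -> bool:
    """Return True if every base of the read has a Phred quality >= min_q.

    Assumes Phred+33 ASCII encoding of the quality string (an assumption).
    """
    return all(ord(ch) - offset >= min_q for ch in quality_string)


# Hypothetical reads given as (sequence, quality string) pairs.
reads = [
    ("ACGT", "IIII"),  # 'I' encodes Q = 40 for every base
    ("ACGT", "II5I"),  # '5' encodes Q = 20, below the cut-off
]
kept = [seq for seq, qual in reads if passes_quality_filter(qual)]
```

In this sketch a read is discarded as soon as a single base falls below the cut-off; a pipeline could instead mask or trim low-quality bases, which is a design choice the text does not specify.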

Filtering hybridization errors in NGS selection pools
The diversity, i.e. the number of combinatorial possibilities, of our phage-displayed synthetic Ab library 37 is dominated by the CDR sequences L3 and H3 (Figure S1B). In fact, the theoretical diversities of L3 and H3 are on the order of $10^{6}$ and $10^{16}$, respectively, while H1 and H2 cover diversities of $2^{6}$ and $2^{8}$. Thus, we assumed that the majority of PCR-induced artifacts and biases representing amalgams of existing sequences, i.e. hybridization errors, would be present in the L3/H3 sequence pairs. In addition, we assumed that among all possible L3/H3 pairs, the valid pairs are overrepresented compared to the invalid pairs. Thus, for every H3 sequence in the Ab selection pool, we obtained the frequencies (i.e. number of observations) of all its paired L3 sequences, and we calculated a frequency cut-off according to the maximum interclass inertia method using the Koenig-Huygens theorem 68 . The cut-off serves as a minimum frequency threshold to identify valid L3/H3 pairs; all Ab sequences with invalid pairs are filtered out of the selection pool.
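As an illustration of the cut-off calculation, the following Python sketch splits the L3 frequencies observed for one H3 into a low and a high class and keeps the split that maximizes the between-class inertia. By the Koenig-Huygens decomposition, total inertia equals within-class plus between-class inertia, so this is equivalent to minimizing the within-class inertia. The function names, the tie handling, and the example counts are assumptions made for this sketch, not the authors' implementation.

```python
from collections import defaultdict


def inertia_cutoff(frequencies):
    """Return the smallest frequency of the high class in the two-class split
    that maximizes the between-class (interclass) inertia."""
    values = sorted(frequencies)
    n = len(values)
    if n < 2 or values[0] == values[-1]:
        return values[0]  # nothing to separate: every pair is kept
    total = sum(values)
    overall_mean = total / n
    best_inertia, best_cutoff = -1.0, values[0]
    low_sum = 0.0
    for i in range(1, n):  # low class = values[:i], high class = values[i:]
        low_sum += values[i - 1]
        if values[i] == values[i - 1]:
            continue  # equal frequencies must stay in the same class
        low_mean = low_sum / i
        high_mean = (total - low_sum) / (n - i)
        between = (i * (low_mean - overall_mean) ** 2
                   + (n - i) * (high_mean - overall_mean) ** 2)
        if between > best_inertia:
            best_inertia, best_cutoff = between, values[i]
    return best_cutoff


def filter_pairs(pair_counts):
    """Keep the (H3, L3) pairs whose count reaches the cut-off of their H3.

    pair_counts maps (h3, l3) tuples to observation counts.
    """
    by_h3 = defaultdict(dict)
    for (h3, l3), count in pair_counts.items():
        by_h3[h3][l3] = count
    valid = {}
    for h3, l3_counts in by_h3.items():
        cutoff = inertia_cutoff(l3_counts.values())
        for l3, count in l3_counts.items():
            if count >= cutoff:
                valid[(h3, l3)] = count
    return valid
```

For example, with hypothetical counts of 500, 480, 5, 3, and 2 for five L3 partners of one H3, the sketch returns a cut-off of 480, so the two dominant pairs are kept and the three rare ones are treated as hybridization errors.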

Enumeration of consensus motifs in the CDR sequences
Consensus motifs, or motifs, are used to represent the linear information that is shared among groups of sequences. While certain positions in a motif are defined (e.g. P as proline and R as arginine in the motif PXXR), others are not and are called wildcards (e.g. X as any amino acid in the motif PXXR). Here, we use motifs to explore the linear information in the CDR sequences of each candidate Ab. To this end, we adapted the algorithm DALEL 46 , which was first developed to explore the linear information in proteins. To avoid an explosion in the number of motifs, we restricted the number of wildcards allowed in each motif to at most 55% of its length. In addition, we considered only motifs whose wildcards each match more than one amino acid in the matching sequences of the positive pool (e.g. wildcard X in motif PXR matches amino acids Y and S in the sequences PYR and PSR).

[...] the means, $s_P$ and $s_N$ are the standard deviations, and $n_P$ and $n_N$ are the sizes, all respectively. We calculate the p-value to evaluate the statistical significance of the t-score following the procedure described below 48-50 .
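The two motif constraints stated at the start of this section (the wildcard limit of 55% of the motif length, and wildcards that match more than one amino acid in the positive pool) can be illustrated with a short Python sketch. This is not DALEL itself: the fixed motif length `k`, the rule that the first and last positions stay defined, and all function names are assumptions made for illustration.

```python
from itertools import combinations
import re


def candidate_motifs(kmer, max_wildcard_fraction=0.55):
    """Yield motifs obtained from one k-mer by replacing inner positions with 'X'.

    The first and last positions are kept defined (an assumption of this sketch).
    """
    k = len(kmer)
    max_wildcards = int(max_wildcard_fraction * k)
    inner = range(1, k - 1)
    for n_wild in range(1, min(max_wildcards, len(inner)) + 1):
        for positions in combinations(inner, n_wild):
            yield "".join("X" if i in positions else aa for i, aa in enumerate(kmer))


def wildcards_are_informative(motif, sequences):
    """True if every wildcard matches more than one distinct residue across
    all occurrences of the motif in the given sequences."""
    # A lookahead with a capture group reports overlapping occurrences too.
    pattern = re.compile("(?=(" + motif.replace("X", ".") + "))")
    seen = [set() for _ in motif]
    for seq in sequences:
        for match in pattern.finditer(seq):
            for i, aa in enumerate(match.group(1)):
                seen[i].add(aa)
    return all(len(seen[i]) > 1 for i, c in enumerate(motif) if c == "X")


def enumerate_motifs(sequences, k):
    """Collect all length-k motifs that satisfy both constraints."""
    motifs = set()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            motifs.update(candidate_motifs(seq[start:start + k]))
    return {m for m in motifs if wildcards_are_informative(m, sequences)}
```

With the hypothetical positive-pool sequences APYRG and APSRG and `k = 3`, only PXR survives, since its wildcard matches both Y and S, whereas a motif such as AXY has a wildcard that only ever matches P.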
We first calculate the t-score by: [...] We finally calculate the p-value by:

$$p = \int_{-\infty}^{t} \frac{\Gamma\left(\frac{f+1}{2}\right)}{\sqrt{f\pi}\,\Gamma\left(\frac{f}{2}\right)} \left(1 + \frac{x^{2}}{f}\right)^{-\frac{f+1}{2}} dx$$

where $t$ is the t-score, $f$ is the number of degrees of freedom, $\Gamma(\cdot)$ is the Gamma function, and $p$ is the probability that a single observation from the t distribution with $f$ degrees of freedom will fall in the interval $(-\infty, t]$. In other words, $p$ is the probability of obtaining by chance a t-score equal to or below $t$. Thus, the lower the p-value $p$, the higher the significance of the t-score $t$, and consequently the greater the separation between the two distributions of frequencies in the positive and the negative selections.
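Because the t-score formula itself is not legible in this section, the following Python sketch rests on explicit assumptions: it uses Welch's unequal-variance t-test, takes the difference as negative-pool mean minus positive-pool mean so that enrichment in the positive pool yields a negative t-score and a small lower-tail p-value, and uses the Welch-Satterthwaite approximation for the degrees of freedom. The lower-tail probability is evaluated through the regularized incomplete beta function, which is mathematically equivalent to integrating the t density from $-\infty$ to $t$. All function names are hypothetical.

```python
from math import exp, lgamma, log, sqrt
from statistics import mean, variance


def _beta_continued_fraction(a, b, x, max_iter=200, eps=3e-14):
    """Continued fraction of the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        # Each iteration applies one even-step and one odd-step coefficient.
        even = m * (b - m) * x / ((a + 2 * m - 1.0) * (a + 2 * m))
        odd = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1.0))
        for coeff in (even, odd):
            d = 1.0 + coeff * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + coeff / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def regularized_incomplete_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = exp(lgamma(a + b) - lgamma(a) - lgamma(b) + a * log(x) + b * log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_continued_fraction(a, b, x) / a
    return 1.0 - front * _beta_continued_fraction(b, a, 1.0 - x) / b


def student_t_cdf(t, f):
    """P(T <= t) for Student's t distribution with f degrees of freedom."""
    tail = 0.5 * regularized_incomplete_beta(f / 2.0, 0.5, f / (f + t * t))
    return tail if t < 0 else 1.0 - tail


def welch_lower_tail_test(positive, negative):
    """Return (t, f, p) for a motif's frequencies in the two selection pools.

    Assumes Welch's t-test with t = (mean_N - mean_P) / standard error;
    requires at least two values per pool and a non-zero pooled variance.
    """
    se2_p = variance(positive) / len(positive)
    se2_n = variance(negative) / len(negative)
    t = (mean(negative) - mean(positive)) / sqrt(se2_p + se2_n)
    f = (se2_p + se2_n) ** 2 / (
        se2_p ** 2 / (len(positive) - 1) + se2_n ** 2 / (len(negative) - 1)
    )
    return t, f, student_t_cdf(t, f)
```

Calling `welch_lower_tail_test([10, 12, 11, 13], [1, 2, 1, 2])` on these hypothetical frequencies gives a strongly negative t-score and a small p-value, which would then be compared against the chosen cut-off.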
We filtered the p-values using a stringent cut-off of $10^{-10}$ to identify highly specific Ab clones. To [...]

Table S1. Validation summary of CD151-selective Ab clones derived from NGS enrichment ranking analysis. The antibody sequences were synthesized as Fab proteins and assayed for cellular binding on HEK293T-CD151+ cells via flow cytometry. In situ validation result "Pass" = fluorescence signal 3-fold or greater than background (HEK293T-CD151− cells).