Abstract
With the availability of complete DNA sequences for many prokaryotic and eukaryotic genomes, and soon for the human genome itself, it is important to develop reliable proteome-wide approaches for a better understanding of protein function1. As elementary constituents of cellular protein complexes and pathways, protein–protein interactions are key determinants of protein function. Here we have built a large-scale protein–protein interaction map of the human gastric pathogen Helicobacter pylori. We have used a high-throughput strategy of the yeast two-hybrid assay to screen 261 H. pylori proteins against a highly complex library of genome-encoded polypeptides2. Over 1,200 interactions were identified between H. pylori proteins, connecting 46.6% of the proteome. The determination of a reliability score for every single protein–protein interaction and the identification of the actual interacting domains permitted the assignment of unannotated proteins to biological pathways.
During the past few years, interaction maps have been proposed for viral3, 4, 5 and eukaryotic (Saccharomyces cerevisiae2, 6, 7 and Caenorhabditis elegans8, for example) genomes. Here we describe the first prokaryotic interaction map, for which the strategy used (Fig. 1) is also a variation of the two-hybrid assay2. It differs considerably in both the type of results generated and the throughput of the experimental procedures. We constructed a library of random genomic fragments of the H. pylori strain 26695 that had been previously sequenced9. A high-complexity library was first obtained in Escherichia coli (over ten million clones). Ninety-seven per cent of the plasmids contained a single genomic insert (mean size 1,000 ± 550 nucleotides). This library was then introduced into yeast by transformation. Two million independent yeast colonies were collected, pooled and stored at -80 °C as equivalent aliquot fractions of the same library.
Figure 1: Outline of the strategy for building an H. pylori (Hp) proteome-wide interaction map.

A production database (the PIMBuilder) was built that contains information related to the genomic sequence of H. pylori, which codes for 1,590 putative proteins or ORFs. It was populated with raw data from screening experiments. The PIMBuilder tracks all biotechnological or bioinformatics operations performed during the production processes, stores information about all biological objects produced during experiments, and interfaces with robots and bioinformatics modules. It also implements the procedure used to construct PIMs from raw experimental data. After identification of almost all positive clones, overlapping prey fragments were clustered into families to define SIDs. Those families that had no biological coding capability (antisense or intergenic region, out of frame fragments occurring in a single frame) were discarded. A PIM biological score (PBS; see Methods) was then calculated for H. pylori ORF-encoded SIDs. Interactions were grouped into categories A to D (from high to low heuristic values). The global connectivity of the PIM was also analysed to detect highly connected prey polypeptides. Those interactions were grouped in the E category. Processing of data and visualization of interactions were performed by an in-house bioinformatic platform (PIMRider).
In parallel, a large set of bait plasmids was constructed in a bait vector designed to decrease the level of transcriptional auto-activation2. Bait constructs were specifically adapted for interaction screens, yielding 'validated baits'; for example, hydrophobic putative trans-membrane domains were discarded to avoid any non-nuclear localization of the bait protein in the yeast cell. In some cases, specific open reading frame (ORF) domains were selected for the bait design. For every single bait construct, we performed a preliminary small-scale screening experiment that determines the selective pressure (that is, modifying the selective medium, see Methods) to be applied to obtain a small number of independent positive clones per million interactions tested (usually less than 10). All positive colonies were then picked, and prey fragments were individually identified by sequence analysis and comparison with the genomic database through a dedicated integrated laboratory production management system (Fig. 1).
Protein–protein interaction maps are built on experimental data that ideally yield a heuristic value for each connection. Our procedure involves several steps of processing of raw two-hybrid results (Fig. 1). First, positive prey fragments are clustered into families of overlapping fragments. The common sequence shared by these fragments is referred to as the selected interacting domain (SID). Second, SIDs that do not code for part of an H. pylori ORF are discarded. Third, for every remaining SID, a PIM biological score (PBS) is computed. The PBS is based on a statistical model of the competition for bait-binding between fragments. It is computed like a classical expected value (E value), and ranges from 0 (specific interaction) to 1 (probable artefact). For practical use, the scores were divided into four categories, from A (score very close to 0) to D (close to 1). A fifth category, E, was added to distinguish interactions involving only highly connected prey domains (SIDs found as prey with frequency greater than a fixed threshold). These are most probably two-hybrid artefacts. Although they may have some biological significance, they add little specific information to the interaction map. It should be emphasized that domains, rather than whole proteins, are tagged. For example, the carboxy-terminal region of protein HP0705/UvrA was found as a prey in 16 different interactions that were scored in the E category, but the same protein was selected through another domain as a specific interacting prey. Because global connectivity is taken into account, the PBS is computed incrementally over the whole PIM and its discriminatory power increases as screening results accumulate.
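The first processing step, clustering overlapping prey fragments and taking their shared region as the SID, can be illustrated with a minimal Python sketch. This is not the PIMBuilder implementation. The greedy rule used here (a fragment joins a family only while it still overlaps the region shared by all current members), the coordinate convention (inclusive positions on a single strand) and the fragment coordinates are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Fragment = Tuple[int, int]  # (start, end), inclusive, same strand (assumed convention)


def cluster_into_sids(fragments: List[Fragment]) -> List[Dict]:
    """Group overlapping fragments into families and track their shared region.

    A fragment is added to the current family only if it overlaps the region
    common to all fragments already in that family, so the SID is never empty.
    """
    families: List[Dict] = []
    for start, end in sorted(fragments):
        if families and start <= families[-1]["sid"][1]:
            family = families[-1]
            family["fragments"].append((start, end))
            sid_start, sid_end = family["sid"]
            # The SID shrinks to the intersection of all member fragments.
            family["sid"] = (max(sid_start, start), min(sid_end, end))
        else:
            families.append({"fragments": [(start, end)], "sid": (start, end)})
    return families


# Invented prey fragments: three overlapping ones and one isolated one.
toy_fragments = [(100, 400), (150, 450), (120, 380), (900, 1200)]
toy_families = cluster_into_sids(toy_fragments)
for family in toy_families:
    print(family["sid"], len(family["fragments"]))
```

With these invented coordinates the three overlapping fragments form one family whose shared region is 150 to 380, and the isolated fragment forms its own family.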
We carried out 285 screens on 261 H. pylori bait ORFs chosen as follows: first, a core set of 50 proteins known to be involved in complexes and/or in pathogenicity was used to validate the approach; second, 211 baits were picked randomly, with a slight bias toward regions of the proteome that were still unexplored. Positive colonies (13,962) were selected (Table 1), and more than 95% of them were identified by sequencing the prey insert. From these prey fragments, 2,680 independent SIDs were defined, of which 1,100 fell into regions that do not code for H. pylori ORFs and were discarded (Fig. 1). Thirty-one SIDs were classified in the E category. The remaining SIDs define 1,280 interactions, including 62 homo-oligomeric interactions. In total, 46.6% of the proteins in the H. pylori proteome were connected, which corresponds to an average connectivity of 3.36 partners per connected protein (without counting homodimeric connections). Only 14 screens out of 285 yielded either no positive clones or only interactions with nonsignificant score values, which is consistent with our technology reducing the rate of false negatives.
Given that little information is available on protein interactions in H. pylori, we used a data set of known interactions in E. coli to further validate our experimental strategy, and to evaluate the correlation between the PBS and the actual biological significance of interactions. For each H. pylori protein present in the interaction map, significant E. coli orthologues (FastA score < 0.01) were selected and their annotations in the SwissProt database (release 38)10 were verified manually for known interactions shown by various biochemical means. The resulting E. coli interaction list was compared with interactions found in H. pylori (Fig. 2). Among these E. coli interacting pairs, 53% of the homodimers and 67% of the heterodimers were found. Among heterodimers, five out of six negative pairs were tested only in one direction, suggesting that performing the reciprocal screens would decrease this number. Most interactions that were described for orthologous proteins in E. coli fell into the high-scoring interaction category (A) according to our PBS calculation (7/10 homodimers and 9/12 heterodimers), confirming the heuristic value of the classification. The interaction map was also analysed according to the classification of H. pylori proteins into 14 functional categories previously proposed9. For 10 categories out of 14, more intra-category protein–protein interactions were observed than expected from a random theoretical distribution (Z-scores ranging from 2 to 50; in all cases, P < 0.05), suggesting the existence of a significant correlation between functional grouping and detection of interactions.
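The idea behind the intra-category comparison can be sketched in Python as follows. The text refers to a random theoretical distribution without giving its form, so this sketch substitutes a label-shuffling (permutation) null model, which is an assumption and not necessarily the procedure used here. The protein names, category names and interactions are invented.

```python
import random
from statistics import mean, pstdev


def intra_category_count(interactions, category_of, category):
    """Count interactions whose two partners both belong to `category`."""
    return sum(
        1
        for a, b in interactions
        if category_of[a] == category and category_of[b] == category
    )


def enrichment_z(interactions, category_of, category, n_shuffles=1000, seed=0):
    """Z-score of the observed intra-category count against shuffled labels.

    Returns None when the shuffled counts show no variation, because the
    Z-score is undefined in that case.
    """
    rng = random.Random(seed)
    observed = intra_category_count(interactions, category_of, category)
    proteins = list(category_of)
    labels = [category_of[p] for p in proteins]
    null_counts = []
    for _ in range(n_shuffles):
        rng.shuffle(labels)  # reassign functional categories at random
        shuffled = dict(zip(proteins, labels))
        null_counts.append(intra_category_count(interactions, shuffled, category))
    sigma = pstdev(null_counts)
    if sigma == 0:
        return None
    return (observed - mean(null_counts)) / sigma


# Invented toy map: three 'motility' proteins that all interact with each other.
toy_categories = {
    "a1": "motility", "a2": "motility", "a3": "motility",
    "b1": "metabolism", "b2": "metabolism", "b3": "metabolism",
}
toy_interactions = [("a1", "a2"), ("a2", "a3"), ("a1", "a3")]
toy_z = enrichment_z(toy_interactions, toy_categories, "motility")
print(toy_z)
```

A positive Z-score means that more interactions fall within the category than the shuffled labels produce on average. The fixed seed only makes the sketch reproducible.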
Figure 2: Sets of E. coli interaction data for which H. pylori orthologous proteins were identified and assayed in interaction screens.

a, Homodimers; b, heterodimers. Names of E. coli proteins are boxed. Unidentified interactions between H. pylori orthologues are scored as '-' and shown in white boxes. Identified H. pylori interactions are indicated in grey boxes with their PBS score category (A, B, C or D). When several orthologues were identified, only the best scoring homologue was considered. For heterodimers, arrows indicate the direction of the screen (bait to prey) that was performed. The black (or white) colour of the arrow indicates a positive (or negative) result in the interaction screen.
To display and analyse the interaction data, we developed a software platform composed of a database, a web-based graphical interface layer and various query and analysis tools (the PIMRider, Fig. 3; see also http://pim.hybrigenics.com). Starting from a gene name (or an ORF name), the PIMRider draws an automatic layout of the neighbourhood of this protein in the protein interaction map (Fig. 3a). Paths connecting two proteins can also be queried. Connections are displayed with their PBS scores, and can be filtered according to score categories. A graphical summary of information describing all interaction domains within a given protein can be displayed (Fig. 3b). Raw data on every interaction can be retrieved. Finally, the PIMRider supplies a description of each gene with functional and genomic information, and includes links to significant bibliographic references and to relevant external databases, such as PyloriGene (http://genolist.pasteur.fr/PyloriGene/). PyloriGene is a manually annotated database of the two H. pylori sequenced genomes9, 11, that integrates all publicly available information on genes and proteins and has been elaborated with a structure similar to that of the Bacillus subtilis SubtiList database12.
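A path query with filtering by score category could look like the following Python sketch. It is an illustration only and does not describe how the PIMRider is implemented: the breadth-first search, the function name find_path and the data layout are assumptions. The three protein pairs are interactions mentioned in this paper, but the categories attached to them here are invented.

```python
from collections import deque


def find_path(interactions, source, target, allowed=("A", "B", "C", "D")):
    """Return a shortest path between two proteins, or None if none exists.

    `interactions` is an iterable of (protein_a, protein_b, category) tuples;
    only interactions whose category is in `allowed` are used.
    """
    graph = {}
    for a, b, category in interactions:
        if category in allowed:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)

    previous = {source: None}  # also serves as the set of visited proteins
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Categories below are placeholders, not values from the map.
toy_map = [
    ("CheA", "CheY", "A"),
    ("CheA", "CheW", "A"),
    ("TlpA", "CheW", "B"),
]
print(find_path(toy_map, "TlpA", "CheY"))
print(find_path(toy_map, "TlpA", "CheY", allowed=("A",)))
```

Restricting the query to category A removes the placeholder B-category link, so no path is found in the second call.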
Figure 3: PIMRider screen shots.

a, The PIMViewer displays a portion of the protein interaction map around the CheA protein. Links between proteins identify connections with their colour-coded PBS score; b, the MultiSID Viewer exhibits the various interacting domains in the CheA protein (for details see http://pim.hybrigenics.com).
Exploring the protein–protein interaction map reveals biological pathways and allows prediction of protein function. A first example concerns chemotaxis (see Fig. 3a). The H. pylori genome reveals three homologues of E. coli proteins that are involved in the chemotactic pathway (CheA, CheW and CheY) and proteins such as TlpA similar to chemotaxis receptors (MCPs). CheA was found to interact with CheY and CheW, and distinct interacting domains were precisely identified (Fig. 3b). The domain of CheA which binds to CheY precisely overlaps with the interacting domain assigned by a structural study in E. coli13. The TlpA-binding site for CheW was localized in a domain known, in E. coli, to be methylated and implicated in the transduction of the chemotactic signal.
The urease complex was also examined. Urease activity is essential for H. pylori pathogenicity and its synthesis requires two structural subunits, UreA and UreB, and the products of four accessory genes: ureE, ureF, ureG and ureH14. Complexes between accessory proteins and their role in nickel incorporation at the urease active site have been described for orthologues, but little information is available for H. pylori (for review, see ref. 15). The protein interaction map revealed the connection between UreA and UreB, and one of the two expected homo-oligomeric interactions of structural subunits (UreA); the UreB homodimer could not be detected. A connection between accessory proteins and the structural subunits occurs via UreH and UreA, which is consistent with the presumed chaperone role of UreH15. A new structural link was found between UreG and UreE. The UreF and UreH proteins were connected, but no connection between UreG and UreF or UreH was detected. In addition to the accessory proteins, the urease operon codes for an inner-membrane protein, UreI, essential for resistance to acidity16 and recently described as an H+-gated urea channel17. The third cytoplasmic domain of this protein reveals a potential interaction with the ExbD protein, which is involved in transmission of PMF (proton motive force) energy to outer-membrane receptors.
Combination of genomic and proteomic data also permits function prediction. The H. pylori proteome contains a homologue of the E. coli HolB protein. In E. coli, this protein interacts with HolA to form part of the DNA polymerase core18. We found one high-scoring interaction between H. pylori HolB and an uncharacterized polypeptide, HP1247. A pairwise alignment between E. coli HolA and HP1247 highlighted structural homology (Fig. 4) not found by previous sequence analysis (Fig. 2). We thus assign the HolA function to the H. pylori HP1247 protein. The organization of bacterial genomes into operons suggests a functional relationship between the corresponding gene products that can be directly compared with our protein interaction map. Indeed, we found interactions between proteins that were likely to be expressed from a single operon (Table 2). Among these, we detected interactions between proteins known to interact in H. pylori (ScoA–ScoB) or in other organisms (RpsR–RpsF, MoaE–MoaD, FtsA–FtsZ), and between polypeptides involved in the same enzymatic activity and not yet described as interacting (HypE–HypF).
Figure 4: Alignment between H. pylori HP1247 protein and E. coli HolA.

The two sequences were aligned using the FastA algorithm. Identical (black) and similar (grey) residues are outlined. The position of HP1247 C-terminal domain interacting with HolB is indicated by a line.
Selected interacting domains can also be analysed in terms of protein structure. The prokaryotic RNA polymerase, composed of a core enzyme (α2ββ′) associated with a σ-factor, is one of the best studied multisubunit enzymes. In H. pylori, the β- and β′-subunits usually found in other bacteria are fused into a single polypeptide (RpoB). One of the two alternative σ-factors present in H. pylori (HP1032) is similar to the E. coli FliA protein (sigma 28) necessary for transcription of genes involved in flagellar biosynthesis19. We identified a precise region of RpoB that interacts with the H. pylori FliA. This selected interacting domain (residues 841–959) maps exactly to a structural domain called the flexible flap20. The RpoB-interacting domain of FliA falls in the regions 3.2–4.2 of this σ-factor (residues 175–255). Biochemical studies suggest an interaction between the flap domain of RpoB and region 4 of sigma factors20. Our experiments thus characterize this interaction and support the role of the flap domain and σ-factors in the transition from an open complex to a processive elongation complex21.
Our work provides a way to characterize proteome-wide protein interaction maps. Our results identify complexes that have been shown or postulated to exist in other organisms, such as E. coli. They also complement sequence information about homologous proteins and operon prediction from the location of genes on the chromosome. Finally, identified interacting domains could be mapped on three-dimensional structures of proteins. Taken together, these results lead to the assignment of a functional role for many as yet uncharacterized proteins and provide tools, such as interacting domains, for further biological experiments. Our technology was designed specifically to address the main known causes of false-negative and false-positive results in two-hybrid assays. It is now clear that an interaction detected in one direction of a two-hybrid assay is not necessarily detected in the reciprocal direction (Fig. 2; see also refs 4, 8, 22). Parallel screening against highly complex libraries of fragments increases the number of 'two-hybrid-capable' candidates and reduces the rate of false negatives that arise with the classical two-hybrid matrix approach (that is, pairwise testing of a collection of proteins6, 7). Concerning false positives, the specific design of selection procedures that permit a strong selectivity for all baits, together with the statistical analysis made possible by the experimental procedure (that is, reproducible exhaustive screening of fragment libraries), allows us to detect and tag nonspecific partners through a global scoring scheme.
Ultimate validation of biological significance should come from additional biological information, but comparison of our results with previously described interactions of H. pylori proteins and also of orthologous E. coli proteins supports the reliability of the approach. Protein interaction maps can be built at the scale of a proteome. Our technology is applicable to higher eukaryotes, for which highly complex random-primed complementary DNA libraries are screened for interacting domains. The identification of interacting domains is a direct consequence of the library screening approach and presents key advantages such as mapping of new functional domains or correlation between sequence similarity and functional homology. These interacting domains also constitute a first step towards the construction of dominant-negative mutants or the development of an assay for interaction modulation, applicable to new drug design.
Methods
Bait cloning
Baits were constructed by PCR amplification and cloning in the pB6 plasmid derived from the original pAS2 (ref. 2). Primer design was proposed automatically by software and validated for each ORF. PCR fragments were cloned by classical enzymatic methods in a 96-well-plate format. All bait constructs used for interaction screens were fully sequenced and compared to the genome sequence; any mutant clone was discarded.
Library construction
We extracted genomic DNA from H. pylori 26695 as described23, and nebulized and blunted it with a cocktail of mung bean nuclease, T4 and Klenow polymerases (NEB). Adapters containing SfiI sites were ligated to the blunt-ended DNA. The adapted DNA was cloned into the pP6 plasmid derived from the original pACT2 and transformed into E. coli (DH10B; LifeTechnologies). Sequence analysis was performed on one hundred randomly chosen clones to establish the general characteristics of the library.
Screening procedure
The principle of the technique has been described2. Briefly, the screening conditions were adapted for each bait during a test screen, before performing the full-size screening experiment. The selectivity of the HIS3 reporter was modulated with 3-aminotriazole (3AT). For about 15% of the screens, diploid cells were plated on selective medium containing 3AT. In 44% of the screens, the second reporter (lacZ) was used directly on plates for the selection of clones. In other cases, the lacZ reporter was used as a second round of selection only for the selected clones. In all cases, LacZ activity was measured in a quantitative luminometric assay (Tropix).
Identification of interacting fragments
The prey fragments of the positive clones were amplified by PCR, analysed on agarose gel, partially sequenced at their 5' junction on a PE 3700 Sequencer and mapped on the genomic sequence. Many clones were also sequenced at their 3' junction to map the SID precisely. All the steps after picking of positive clones were performed in bar-coded 96-well plates and automated with Beckman Biomek 2000 and Multimek robots. At the end of each screening experiment, the identity of the bait plasmid was verified on a few positive clones.
PIM biological score
The PIM biological score (PBS) computation relies on two different levels of analysis: first, a local (that is, taking into account only the results of one screen) score is computed for each screen; and second, the global score is computed from the local scores by integrating results from all screens performed within the same genomic library. Local scores are thus computed only once, while global scores are recomputed each time new screens are performed. For each screen, fragments are clustered by overlap to delimit SIDs. Fragments that have no or very improbable coding capability (antisense, intergenic region, and out-of-frame fusion fragments selected in a single frame) are then eliminated from the set of prey fragments identified from positive clones. Assuming that prey fragments compete for the bait with 'equal chances', the probability p for a given fragment to be selected in an experiment is proportional to its expected number of occurrences within the library. p is computed as a function of the fragment length and position, and of the length and position distributions of fragments in the prey library (these distributions are calibrated using data from random sequencing).
The local score is the probability for a given SID to be obtained under the equal chance hypothesis, that is, as a result of random noise. It is deduced by combining the probabilities p (using a binomial law) of each of the independent fragments defining it. A (global) PBS is computed for each protein interaction after pooling results from all screens. On the basis of an independence hypothesis, scores from different screens are combined when the same protein domain pair is involved. The resulting PBS thus represents the probability that the protein–protein interaction is due to noise. Scores are real numbers ranging from 0 to 1, but are grouped in four categories (A, B, C and D) for practical purposes. Finally, the global connectivity of the interaction map is analysed to tag separately (category E) SIDs found as prey with frequency greater than a fixed threshold: the PBS of each protein–protein interaction involving highly connected SIDs is set to 1. Both the intercategory thresholds and the high-connectivity threshold were defined manually, taking into account the nature of the studied organism, the relevant library and the current coverage of the proteome (A < 1e-10 < B < 1e-5 < C < 1e-2.5 < D; the E category corresponds to prey SIDs selected with more than 4 baits and was arbitrarily attributed a PBS value of 1).
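The scoring scheme can be illustrated with a short Python sketch. It uses the category thresholds given above, but everything else is a simplifying assumption and not the actual PBS computation: the local score is taken as a binomial upper tail over the clones of one screen, scores from different screens are multiplied under the independence hypothesis, and the probability p is supplied as an input instead of being derived from the library distributions. All function names are hypothetical.

```python
from math import comb, prod


def local_score(n_clones, k_fragments, p):
    """Probability of seeing at least k_fragments hits among n_clones by chance.

    Assumed form: upper tail of a Binomial(n_clones, p) distribution.
    """
    return sum(
        comb(n_clones, i) * p**i * (1 - p) ** (n_clones - i)
        for i in range(k_fragments, n_clones + 1)
    )


def global_score(local_scores):
    """Combine local scores for the same domain pair, assuming independence."""
    return prod(local_scores)


def categorise(pbs, n_baits_selecting_prey_sid):
    """Return (category, final PBS) using the thresholds quoted in the text.

    The treatment of values exactly on a threshold is an assumption.
    """
    if n_baits_selecting_prey_sid > 4:
        return "E", 1.0  # highly connected prey SID: PBS set to 1
    if pbs < 1e-10:
        return "A", pbs
    if pbs < 1e-5:
        return "B", pbs
    if pbs < 10**-2.5:
        return "C", pbs
    return "D", pbs


# Invented numbers: one SID supported by 4 fragments among 50 clones in each
# of two screens, with a hypothetical chance probability of 0.001 per clone.
toy_local = local_score(n_clones=50, k_fragments=4, p=0.001)
toy_pbs = global_score([toy_local, toy_local])
print(categorise(toy_pbs, n_baits_selecting_prey_sid=2))
```

Because the global score is a product of local scores, each additional screen that recovers the same domain pair lowers the PBS, which matches the statement that discriminatory power increases as screening results accumulate.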
Bioinformatics
Several algorithms and software tools were implemented in the production database to facilitate experimental steps, such as a 'bait program' that automatically designed oligonucleotides for PCR amplification and sequencing of bait constructs, and a 'prey program' that determined the position of each fragment in the genome and its coding capacity (such as intergene, antisense, nucleotide position in an ORF, coding frame). The interactions were then analysed through a web-based software platform, the PIMRider, developed at Hybrigenics and accessible through the web interface (http://pim.hybrigenics.com). Academic users will be granted a free licence. Other users will have to purchase a commercial licence. The H. pylori PIMRider platform is linked to the PyloriGene database.
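The kind of classification performed by the 'prey program' can be sketched in Python as below. This is a simplified illustration and not the production code: it looks only at the 5' end of the fragment, uses 1-based inclusive genome coordinates, and reports the frame as the offset from the ORF start modulo 3 without modelling the vector junction. The Orf class, the function name and the ORF coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Orf:
    name: str
    start: int   # 1-based, inclusive, genome coordinates (assumed convention)
    end: int
    strand: str  # "+" or "-"


def classify_fragment(frag_start: int, frag_end: int, frag_strand: str,
                      orfs: List[Orf]) -> dict:
    """Classify a prey fragment by the location of its 5' end."""
    # The 5' end is the first base in the fragment's reading direction.
    five_prime = frag_start if frag_strand == "+" else frag_end
    for orf in orfs:
        if orf.start <= five_prime <= orf.end:
            if orf.strand != frag_strand:
                return {"orf": orf.name, "kind": "antisense"}
            if orf.strand == "+":
                offset = five_prime - orf.start
            else:
                offset = orf.end - five_prime
            return {
                "orf": orf.name,
                "kind": "sense",
                "nt_position": offset + 1,  # position within the ORF, 1-based
                "frame": offset % 3,        # 0 means same frame as the ORF
            }
    return {"orf": None, "kind": "intergenic"}


# Hypothetical annotation with one ORF on each strand.
toy_orfs = [Orf("orfX", 100, 399, "+"), Orf("orfY", 500, 799, "-")]
print(classify_fragment(103, 300, "+", toy_orfs))
print(classify_fragment(120, 300, "-", toy_orfs))
print(classify_fragment(420, 480, "+", toy_orfs))
```

A fragment whose 5' end lies between ORFs is reported as intergenic even if it extends into a downstream ORF. A fuller implementation would need to handle that case and overlapping ORFs.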
