In the recent decision Association for Molecular Pathology v. Myriad Genetics1, the US Supreme Court held that naturally occurring sequences from human genomic DNA are not patentable subject matter. Only certain complementary DNAs (cDNA), modified sequences and methods to use sequences are potentially patentable. It is likely that this distinction will hold for all DNA sequences, whether animal, plant or microbial2. However, it is not clear whether this means that other naturally occurring informational molecules, such as polypeptides (proteins) or polysaccharides, will also be excluded from patents.

The decision underscores a pressing need for precise analysis of patents that disclose and reference genetic sequences, especially in the claims. Similarly, data sets, standards compliance and analytical tools must be improved—in particular, data sets and analytical tools must be made openly accessible—in order to provide a basis for effective decision making and policy setting to support biological innovation. Here, we present a web-based platform that allows such data aggregation, analysis and visualization in an open, shareable facility. To demonstrate the potential for the extension of this platform to global patent jurisdictions, we discuss the results of a global survey of patent offices that shows that much progress is still needed in making these data freely available for aggregation in the first place.

Mapping the patent landscape

There have been numerous studies and publications about the scope, social or economic impact, and policy and practice implications of patenting of biological sequences—commonly known as 'gene patenting'3,4,5,6. Many of these studies contain incomplete data sets, use analytical tools that cannot distinguish the nature of the sequence similarities or fail to parse and analyze patent claims. Few of these studies make the primary data available in a form allowing review by others. In an effort to move beyond opinion pieces and to provide a facility that can be used to ask and answer specific questions in an open, verifiable manner, we have created a biological facility (http://www.lens.org/lens/biological_search) as a public resource within 'The Lens,' an open, global cyber infrastructure dedicated to increasing the efficiency and fairness of the innovation system by making access to patent documents more transparent and inclusive. We have used this facility to create tools that allow for dynamic mapping and shared analysis of the scope of patenting over several genomes, beginning with the human genome.

The single most important consideration in gene patenting is the critical difference between disclosure of sequences and claiming of sequences. Before the recent decision in Myriad, the literature on gene patenting led to more confusion than insight. Although some claim that the concerns about gene patents were exaggerated and based on outliers and wrong perceptions5,7,8, others maintain that patent protection for genetic sequences was excessive and led to obvious inventions, questionable patents and opaque innovation systems that may have harmed the integrity of the market and constrained scientific progress9,10,11,12,13.

Within 'The Biological Lens' facility, the sequence database currently holds 147,565,858 nucleotide and amino-acid sequences disclosed in 323,721 global patent documents comprising both applications and grants. Of these sequences, 67% are repeated at least once in the corpus. Some level of redundancy is to be expected, as the same sequence may be either referenced in a single patent document for different purposes or mentioned in many related or unrelated patent documents. Although a majority of patent documents list only one or a few sequences, a substantial number list thousands or even millions of sequences. For example, US Pat. No. 7,777,022 discloses 4.2 million sequences. As millions more sequences become available, patent offices face a difficult challenge to render that information accessible to and usable by the public.
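
As an illustration of how such a redundancy figure can be computed, the following minimal sketch counts repeated entries in a toy sequence listing. It is not the pipeline used for The Lens: the function name, the (patent_id, sequence) input layout and the case-insensitive exact-match definition of a repeat are all assumptions made for this example.

```python
from collections import Counter


def redundancy_summary(entries):
    """Summarize redundancy in a list of (patent_id, sequence) pairs.

    Assumption: two entries are "the same sequence" when their residues
    are identical after trimming whitespace and upper-casing.
    """
    normalized = [seq.strip().upper() for _, seq in entries]
    counts = Counter(normalized)
    # An entry is "repeated" if its sequence occurs more than once overall.
    repeated_entries = sum(1 for seq in normalized if counts[seq] > 1)
    total = len(normalized)
    return {
        "total_entries": total,
        "nonredundant": len(counts),
        "repeated_fraction": repeated_entries / total if total else 0.0,
    }


# Toy data: the same sequence appears in two hypothetical documents.
toy_entries = [("DOC-A", "acgt"), ("DOC-B", "ACGT"), ("DOC-C", "TTTT")]
summary = redundancy_summary(toy_entries)
```

The nonredundant count is the size of the data set that would be carried forward to mapping, and the repeated fraction is the share of listing entries that have at least one duplicate elsewhere in the corpus.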

Major patent offices claim to have sophisticated search tools and databases that likely comprise a very substantial set of sequences; however, information about the effectiveness of these algorithms and the scope of these sequence databases is not generally available to the public, and they may even be off limits to the dozens of patent offices in jurisdictions with emerging intellectual property (IP) protection or with limited budgets. Some commercial vendors claim to offer comprehensive data and sophisticated analysis, but this is an expensive means of accessing what is fundamentally public information, and it creates one of many entry barriers that disadvantage small-to-medium enterprises (SMEs) and innovation-focused, impact-driven public-sector and philanthropic organizations. In addition, these commercial databases are incomplete14. For example, the millions of sequences published in the US Patent and Trademark Office's (USPTO; Washington, DC) patent applications since 2001 are not incorporated within GenBank (http://www.ncbi.nlm.nih.gov/genbank/), or in any other global public facility, yet as published data they must clearly be considered as potential prior art. This lack of access is also problematic for patent applicants, who may not know whether the sequence for which they seek protection has been previously claimed or not.

To provide a basis for better understanding the complex landscape of gene patenting, we have mapped patent-disclosed sequences onto the human genome and developed a patent-sequence (PatSeq) toolkit to find, align, browse and explore these sequences. To illuminate the scope of patenting of known genes on the human genome, we selected those mapped sequences referenced in granted claims (GC) of the USPTO and performed homology-based analysis with three publicly available transcriptome or proteome data sets: RefSeq15, GENCODE16 and Ensembl's Consensus CDS17. We found that the percentage of known genes referenced—not necessarily claimed—ranges from 26% to 62%, depending on the reference data set and the homology threshold chosen.

Mapping of patent-disclosed sequences onto the human genome

Because many patent claims provide rights over sequences with as little as 70% identity to a disclosed sequence (for example, US 7,229,976, Claim 1 or US 7,919,474, Claim 2), we selected a range of homology thresholds to determine alignment and location of candidate sequences on the human genome. Homology thresholds were specified by two metrics: patent sequence similarity, and coverage in proportion to the sequence length. The similarity rate reflects the number of matching nucleotides between the patent sequence and the reference genome, and the sequence coverage reflects the proportion of the patent sequence that was included in the alignment. Because of the high repeat rate in the sequence listing corpus, a non-redundant data set of patent sequences was used for the mapping against the reference human genome (assembly GRCh37) of the GRC (http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/index.shtml). For mapping highly homologous genomic sequences, we used the Burrows-Wheeler Aligner suite18. Potential mRNA and protein sequences were mapped to the reference genome using BLAT19.
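
The two metrics can be made concrete with a short sketch. The code below is illustrative only and does not reproduce the BWA or BLAT output processing used in the study; the Alignment fields, the threshold labels and the choice to express similarity as matches divided by aligned length are assumptions. The three threshold pairs are the similarity and coverage values reported in Figure 1a.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    query_length: int    # full length of the patent sequence
    aligned_length: int  # patent-sequence positions included in the alignment
    matches: int         # aligned positions identical to the reference


def similarity(aln):
    """Assumed definition: matching positions / aligned positions."""
    return aln.matches / aln.aligned_length if aln.aligned_length else 0.0


def coverage(aln):
    """Proportion of the patent sequence included in the alignment."""
    return aln.aligned_length / aln.query_length if aln.query_length else 0.0


# (minimum similarity, minimum coverage); the labels are hypothetical names.
THRESHOLDS = {
    "relaxed": (0.90, 0.50),
    "intermediate": (0.98, 0.95),
    "stringent": (1.00, 1.00),
}


def passes(aln, level):
    min_similarity, min_coverage = THRESHOLDS[level]
    return similarity(aln) >= min_similarity and coverage(aln) >= min_coverage


near_exact = Alignment(query_length=200, aligned_length=196, matches=195)
exact = Alignment(query_length=50, aligned_length=50, matches=50)
```

In this toy case, near_exact is retained at the relaxed and intermediate thresholds but rejected under the stringent condition, whereas exact is retained at all three.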

Under stringent conditions (100% similarity and 100% coverage rate), 15.6 million sequences were matched to the human genome at one locus or more (Fig. 1a). These correspond to 31.4 million sequence listing entries after reintroduction of the redundancy in the corpus. Although the majority of these sequences were declared within the patent as of human origin (Fig. 1a), 20% were unspecified or annotated as unknown, artificial or sequences derived from other organisms. In the granted patents, 131,339 nucleotide and 15,054 amino acid sequence listing entries were referenced in the claims of 13,985 US patent documents by August 2012 (Fig. 1b).

Figure 1: Patent sequences mapped on the human genome (GRCh37 at http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/index.shtml).

(a) Mapping was based on various similarity and query length coverage rates (90% and 50%, 98% and 95%, and 100% and 100%, respectively). Unique patent sequences refer to sequences with only one mapping locus. Although the majority of nonredundant sequences were declared human in the patent documents, around 20% were unspecified or nonhuman. (b) The internal chart shows only mapped sequences that are referenced in the granted claims (1% of the data); redundant sequence counts (R), nonredundant sequence counts (NR) and their corresponding patent grant counts. (c) Homology-based human transcriptome and proteome analysis based on two filters: E value and percentage of minimum hit coverage.

Both sets of sequence listing entries (nucleotide and amino acid) were then compared against three public transcriptome and proteome data sets: RefSeq (24,592 genes), GENCODE v16 (20,564 genes) and the April 30, 2013 release of CCDS (18,552 genes). The analyses revealed that 26% to 62% of known human genes are referenced in the granted claims. This wide range results mainly from the chosen Expect value (E value), the minimum hit coverage percentage and the data set used for comparison (Fig. 1c). For example, under our most stringent condition (75% minimum hit coverage and E value of 0), the percentage of known genes was calculated as 26% based on RefSeq, 32% based on GENCODE v16, or 37% based on the CCDS data set, whereas with 50% minimum hit coverage and 0.001 E value, the percentage of known genes reached 49%, 57% or 62%, respectively.
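
To show why the reported percentage depends so strongly on the two filters, the sketch below computes the fraction of reference genes with at least one qualifying hit. It is a simplified illustration rather than the analysis code behind Figure 1c: the hit record layout (gene, evalue, hit_coverage), the function name and the toy values are assumptions.

```python
def referenced_gene_fraction(hits, gene_ids, max_evalue, min_hit_coverage):
    """Fraction of reference genes with at least one hit passing both filters.

    hits: iterable of dicts with keys 'gene', 'evalue' and 'hit_coverage'
          (hit_coverage expressed as a fraction between 0 and 1).
    """
    known = set(gene_ids)
    referenced = {
        hit["gene"]
        for hit in hits
        if hit["gene"] in known
        and hit["evalue"] <= max_evalue
        and hit["hit_coverage"] >= min_hit_coverage
    }
    return len(referenced) / len(known) if known else 0.0


# Toy reference set and hits (hypothetical values, not study data).
genes = ["g1", "g2", "g3", "g4"]
toy_hits = [
    {"gene": "g1", "evalue": 0.0, "hit_coverage": 0.80},
    {"gene": "g2", "evalue": 1e-5, "hit_coverage": 0.60},
    {"gene": "g3", "evalue": 0.01, "hit_coverage": 0.90},
]

stringent = referenced_gene_fraction(toy_hits, genes, 0.0, 0.75)
relaxed = referenced_gene_fraction(toy_hits, genes, 0.001, 0.50)
```

With these toy hits, one gene in four qualifies under the stringent filters and two in four under the relaxed filters, mirroring on a small scale how the same data yield different percentages.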

A 2005 paper by Jensen and Murray20 forms the basis for the widespread assertion that 20% of the human genome had been patented before Myriad20,21. But as Holman observed5, Jensen and Murray's 20% coverage conflated genetic sequences that were merely referenced in patent claims with genetic sequences that were explicitly claimed. In 2008, Cook-Deegan provided a more conservative estimate that 3,000 to 5,000 human genes had been patented in the United States6. How can we differentiate disclosed from claimed sequences and provide public tools to clarify perceptions and enable navigation of complex gene patents?

Claimed versus disclosed sequences

After optimizing and extending the algorithms that select patent documents referencing a sequence in the claims22, typically though not exclusively as 'SEQ ID NO', we initially analyzed the claims associated with the fully aligned 131,339 nucleotide sequence entries referenced in 2,716 patents. We found that 76,910 sequences mapped uniquely with 100% homology to the human genome and corresponded to 2,685 patents, whereas the remaining 54,429 sequences were simply repeated in duplicated versions of patent documents. Claims referencing the unique sequences were then individually and manually analyzed. Analysis of grants that reference the amino-acid sequence entries in the claims and that may potentially encompass additional nucleotide sequences was not included in this analysis, but will be shared online in an upcoming patent landscape on our website.
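
A rough idea of how claim text can be screened for sequence references is sketched below. This is a simple pattern-matching heuristic written for illustration, not the published algorithm (ref. 22); the pattern, the function name and the example claim are assumptions, and the sketch does not expand ranges such as 'SEQ ID NOS: 1-10' or catch references that avoid the 'SEQ ID NO' label.

```python
import re

# Label variants such as "SEQ ID NO:", "SEQ ID NOS:" or "Seq. Id. No."
LABEL = r"SEQ\.?\s*ID\.?\s*NOS?\.?\s*:?\s*"
# One number, optionally followed by more numbers joined by ",", "and" or "or".
NUMBER_LIST = r"\d+(?:\s*(?:,\s*(?:and|or)?|and|or)\s*\d+)*"
SEQ_ID_PATTERN = re.compile(LABEL + "(" + NUMBER_LIST + ")", re.IGNORECASE)


def referenced_seq_ids(claim_text):
    """Return the sorted sequence identifiers mentioned in a claim."""
    ids = set()
    for match in SEQ_ID_PATTERN.finditer(claim_text):
        ids.update(int(number) for number in re.findall(r"\d+", match.group(1)))
    return sorted(ids)


# Hypothetical claim wording, for demonstration only.
claim = (
    "An isolated nucleic acid comprising SEQ ID NO: 2 or a primer "
    "selected from SEQ ID NOS: 5, 7 and 11."
)
found = referenced_seq_ids(claim)
```

A heuristic of this kind only establishes that a sequence is referenced in a claim; deciding whether it is actually claimed still requires reading the claim, which is why the unique sequences were reviewed manually.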

Under Myriad, these fully aligned sequences would all be considered natural nucleotide sequences from the human genome, and thus potentially nonpatentable as they do not include those with excised introns (cDNAs) in the patent sequence.

We categorized the sequences in the claims on the basis of the role of that sequence in the claim. We created ten claim categories; their detailed description and distribution by patent as well as by sequence are depicted in Figure 2. Among granted US patents, the distribution is fairly even (Fig. 2a). Patents that actually claim, as opposed to merely reference, the sequence, comprise the largest single category of patents analyzed, but this accounts for only a third of all documents. The remaining two-thirds of the patents analyzed use the referenced sequences in a claimed method, or claim them in combination with other sequences or compositions, but do not claim the sequences by themselves.

Figure 2: Categorization of patent sequences referenced in the claims.

(a,b) We categorized 76,910 unique patent sequences that map with 100% homology to the human genome and are referenced in the claims of 2,685 granted US patents according to the role of the sequence in the claim. In the distribution based on patents (a), if a patent had different sequences in multiple categories, the patent was counted more than once. However, for the distribution based on sequences (b), if the sequences were referenced in two different contexts in the same set of claims, they were categorized only once, according to the broadest category. For example, a sequence referenced both as a primer for use in a method claim and also claimed as a primer sequence would be categorized as 'Sequence claimed.' We used ten categories. 'Sequence claimed' includes claims for isolated nucleotide sequences, sequences specifically claimed as primer or probe sequences and antisense sequences. 'Subpart' includes sequences that are part of a larger sequence, sequences that are one of several sequences claimed as a set, and sequences claimed alongside nonsequence substances, such as a pharmaceutical carrier. 'Alter phenotype' includes sequences in method claims in which the sequence is used to alter a cell, tissue or organism. 'Comparison/target' includes sequences in claims that employ the sequence in a comparison (e.g., of methylation or expression), in a screening assay or as a target for some claimed product or method. 'Probe or primer used in a method claim' includes sequences that are referenced as a probe or primer to be used in a claimed method. 'RNA/double-stranded' includes sequences that are specifically claimed as RNA sequences along with sequences that are claimed as double-stranded. 
'Artificial' includes sequences claimed that differ from the wild-type version of the sequence (so that manipulation of the wild-type sequence would be unlikely to infringe the claim), as well as sequences that are identical to the wild-type sequence, but with chemical modifications to the backbone or sugar residues. 'Peptide' includes claims for a peptide made with reference to the coding nucleotide. 'Reference' includes sequences that are generally referenced either as a placeholder or in the negative. 'Submarine' sequences are claims wherein the sequence ID is not referenced in the claim but the patent claims a broad set of sequences (e.g., US 6258540). By way of comparison, Merz et al. write that “Gene patents cover three distinct types of invention: diagnostics, compositions of matter and functional uses.”10 Our categories do not correspond perfectly to Merz's, but generally Sequence claimed, Subpart, RNA, Artificial, Peptide and Submarine categories all correspond to compositions of matter. Comparison/target corresponds to diagnostics, Alter phenotype corresponds to functional uses, and Probe or primer used in a method corresponds to either diagnostics or functional, depending on the specific claim. (c,d) Profiles of the major categories are depicted based on publication date (c) or filing date (d).
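
The two counting rules described for panels a and b can be sketched as follows. This is an illustrative reading of the caption, not the code used to produce Figure 2: the record layout (patent, sequence, category) and the function names are assumptions, and the breadth order passed in below contains only the two categories whose relative breadth is stated in the caption.

```python
from collections import Counter


def count_by_patent(records):
    """Panel a rule: a patent is counted once in every category it touches."""
    patent_category_pairs = {(patent, category) for patent, _, category in records}
    return Counter(category for _, category in patent_category_pairs)


def count_by_sequence(records, breadth_order):
    """Panel b rule: each sequence is counted once, in its broadest category.

    breadth_order lists categories from broadest to narrowest.
    """
    rank = {category: position for position, category in enumerate(breadth_order)}
    broadest = {}
    for patent, sequence, category in records:
        key = (patent, sequence)
        if key not in broadest or rank[category] < rank[broadest[key]]:
            broadest[key] = category
    return Counter(broadest.values())


CLAIMED = "Sequence claimed"
PRIMER_METHOD = "Probe or primer used in a method claim"

# Hypothetical patent: sequence 1 is both claimed and used in a method claim.
toy_records = [
    ("PATENT-1", 1, PRIMER_METHOD),
    ("PATENT-1", 1, CLAIMED),
    ("PATENT-1", 2, PRIMER_METHOD),
]

per_patent = count_by_patent(toy_records)
per_sequence = count_by_sequence(toy_records, [CLAIMED, PRIMER_METHOD])
```

Here the single patent appears under both categories in the per-patent tally, while sequence 1 is assigned only to 'Sequence claimed' in the per-sequence tally.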

The distribution of sequences per category revealed that only 13% of the examined sequences are actually claimed as sequences (Fig. 2b). Some categories show a relatively higher percentage of sequences referenced because of a small number of patents that each reference (but do not directly claim) a very large number of sequences in the claims. For example, 40% of the patent sequences were categorized as 'Comparison/target,' largely because of the effect of a single patent, US 7,510,834. Claim 1 of this patent references 27,088 sequences, of which 23,820 mapped with 100% homology to the human genome and so were included in our analysis. Similarly, many of the sequences categorized as 'Subpart' and 'Alter phenotype' come from the claims of three related patents: US 6,936,467, US 7,226,785 and US 7,258,854. Each of these three patents references the same series of 4,192 sequences in the claims, of which 4,188 mapped with 100% homology to the human genome and so were included in our analysis.

To illustrate the complex and evolving dynamic in patent sequence claiming, we plotted the major categories as they have risen and fallen over the past 20 years using either publication date (Fig. 2c) or filing date (Fig. 2d). As speculated in the literature23,24,25, we found that most sequences were claimed around the time when the human genome sequencing project was being completed but just before the public release of its complete sequence in GenBank, after which the number of claims made was dramatically reduced.

We were particularly interested in the 'Sequence claimed' category because this category consists of claims that are potentially invalid in view of Myriad. As noted above, in this study we only looked at sequences that map completely to the genome with no gaps, meaning that none of the sequences we looked at fall within the Supreme Court's exception for cDNA that spans an intron. The examined sequences in this category were often claimed as primer sequences or probe sequences, but the putative use of a claimed sequence would not have mattered in an infringement context before Myriad, and now it may not matter for purposes of validity. Similarly, many of the sequences are antisense sequences that are probably ineligible for patenting as the Supreme Court made no distinction between forward-reading and antisense sequences in its analysis.

To provide some context, we plotted claimed sequence counts by applicant and publication date along with events that may have affected genetic sequence patents in the United States across the x axis. Moreover, we examined the legal status of these patents, determined the percentage of those that have already expired, and displayed the information as a timeline (with sequence counts indicated for each year; Fig. 3).

Figure 3: Sequences claimed by applicant and year.

Patent sequences, which were categorized as 'Sequence claimed,' are shown by applicant and by the year in which the patent claiming the sequence was granted by the US Patent and Trademark Office. Some relevant legal and regulatory events affecting the patenting of genetic sequence in the US are also shown across the x-axis, and percent inactive patents per year is also depicted on top of each bar. Patent sequences claimed from 2004–2012 are also shown in the internal chart to allow for differences in scale between the number of claimed sequences before and after 2003.

A comprehensive treatment of these events is beyond the scope of our study, but in the early 1990s, with the controversy over expressed sequence tag (EST) patenting, the number of sequences being claimed remained low until early 1996, when it started to increase slowly. It picked up again in 1999 and reached a peak in 2002.

Once the Human Genome Project (HGP) was announced to be complete in 2003 (ref. 26), the number of sequence claims began to decrease. One interpretation is that the perceived high-value genes and loci had been patented before this time, as they would have been subject to intensive investment and scrutiny. An alternative explanation is that new business values became possible then, because sequence annotation efforts led to increased claiming in other categories, such as altered phenotype, subpart or comparison/target (Fig. 2c,d).

Although all 927 patents we analyzed that contain claimed sequences are probably now invalid under Myriad, we were curious about the value that the owners of these patents previously assigned to them. We looked at failure to pay maintenance fees as a proxy for the patent owner's perception of value27. In the United States, maintenance fees are due at about 4-year intervals after the patent is granted and registered. We examined the percentage of expired US patents at 4, 8 or 12 years (whenever information was available) post granting and found that 30–33% of sequence patents were not maintained for their full potential lifetime. The percentage reported in Figure 3 reflects expirations per year at 4, 8 or 12 years post granting. To investigate the matter further, we grouped the patents on the basis of the applicant type (corporation, government, hospital, university or individual) and used expiry to infer how much value each type of applicant placed on its claimed sequences. Our findings indicate that 47% of the sequences claimed by hospitals and 43% of the sequences claimed by universities are associated with expired patents, whereas only 13% of the sequences claimed by corporations are from expired patents (Table 1).

Table 1 Percentage of claimed sequences in expired patents in the collection of 927 patents that contain claimed sequences and which map with 100% homology onto human genome

These results likely reflect the differing perceptions of patent value and models for their use between public and private institutions. Although the former commonly aspire to direct revenue generation from such patents—either through licensing or spinoff—private enterprises must consider additional values, such as defensive, deterrent, cross-licensing and signaling drivers in their business positioning and partnership development, and thus are more likely to bear the costs of maintaining patents during their whole life cycle.

Isis Pharmaceuticals (Carlsbad, CA) applied for and received patent protection for far more nucleotide sequences than any other entity from 1999 until 2003, accounting for at least 61% of the total claimed sequences, with 2,285 sequences granted in 2002 alone. Unfortunately, until recently the USPTO rules did not obligate public registry of patent assignments; it is therefore very difficult to ascertain who owns what patents and whether all Isis patents, for example, are actually owned by Isis; licensing status is even more obscure. However, initiatives such as USPTO's request for comments for more complete patent assignment information28 and the Executive Order from the US President to render patent ownership more clear29 could substantially change this situation.

PatSeq toolkits

We have developed a suite of evidence-based public tools that will allow any interested party to investigate and navigate patent-disclosed sequences within the context of their metadata and patent claim rights. We have introduced into 'the Lens' several indicators to identify patent documents containing a sequence listing. After a search is carried out and if a patent document contains a sequence listing, we insert a sequence tab within the document portfolio that clarifies the nature of the disclosed sequences and provides information, if available, about their metadata (nature of sequence, length, origin of organism) and potentially their redundancy level, location within the document where the sequence is referenced, and the source from which the sequence was downloaded. For further analyses, we have also created the PatSeq Finder, PatSeq Explorer and PatSeq Analyzer tools.

PatSeq Finder allows users to query any sequence against the PatSeq databases and conduct sequence similarity searches based on BLAST version 2.2.28 (http://www.ncbi.nlm.nih.gov/books/NBK131777/). Search results are aligned based on a score of relatedness to the original query, and sequence information is shown alongside that of the corresponding patent document. Users can view the patent document's attributes, including patent claims, selected alignment views and sequence annotation, if available, and embed or download results in various formats.

PatSeq Explorer enables a multi-level visualization and navigation of patent-disclosed sequences that map under various homology thresholds to a reference genome. At the genome and chromosome levels, users can investigate overall patenting trends, filter, and search sequence and patent attributes, and link to various sets of patent documents in the Lens (Fig. 4). Mapped sequence entries are displayed based on their location in the patent document (Grants in claims, Grants, Applications in claims and Applications) and their type (nucleotide or peptide), along with a summary statistic view for the overall coverage per jurisdiction. All views are embeddable.

Figure 4: Patent sequences, which map to the human genome with various homology thresholds, can now be investigated using PatSeq Explorer-Human genome (http://www.lens.org/lens/bio/patseqexplorer).

In this example, disclosed sequences in patents with applicant “Myriad” that map to the human genome with 90–100% similarity and 100% coverage are displayed in PatSeq Explorer. Under the 'Filters' option, users can view patenting trends based on either publication or filing dates, or filter based on jurisdiction, sequence length, species or document type. Under the 'Search' option, users can interrogate the data based on patent attributes such as claims, applicant, owner, inventor and classification. In the chromosome view, added features include linking to the OMIM database (turquoise panel) for associated disorders and traits at that particular position, viewing the document collection in the Lens at http://www.lens.org (brown panel), and analyzing the data at the loci/gene/sequence regions using PatSeq Analyzer (orange panel).

PatSeq Analyzer allows users to zoom in to the details of a particular sequence entry and enables comparative analysis within the context of a gene region. The tool is a modified genome viewer built and integrated into PatSeq Explorer based on the open source HTML5/SVG genome maps browser by the Computational Medicine Institute, Prince Felipe Research Centre, Valencia, Spain30. In addition to the dedicated patent sequence tracks, PatSeq Analyzer provides feature tracks from public genome annotation datasets (including SNPs and gene/transcripts). As in PatSeq Explorer, all views in PatSeq Analyzer are embeddable.

Global patent offices and sequence listings

Although highly controversial, patents on isolated genomic sequences are still allowed in many countries, including within Europe, Canada, Australia and Japan31. In addition to providing more precise analytic tools, with the increasingly globalized markets and innovation, there is now an urgent need for shared, harmonized data to guide decision-making and to accommodate differences in patent practice and policy in diverse jurisdictions. What are the standards and practices regarding patentability of nucleotide or amino-acid sequences in the various patent offices? And how do they make sequence listings data available?

We have carried out a survey of patent offices (Table 2 and Supplementary Table 1) to ascertain the standards and practices regarding the patentability of nucleotide and amino-acid sequences. Between July and October 2011, we mailed 55 patent offices around the world a series of questions modeled on those from the 2001 survey conducted by the World Intellectual Property Organization (WIPO) Intergovernmental Committee on Intellectual Property and Genetic Resources, Traditional Knowledge and Folklore32, together with some additional queries on the public availability of sequence listings data. Supplementary Table 1 displays the information received from various patent offices in Cambia's 2011 survey, compared with the WIPO 2001 survey results. Table 2 displays the 33 responses received on the public availability of sequence listings data.

Table 2 Survey results on public availability of biological patent sequences from patent offices in 2011

We did not get a direct response from patent offices in the following countries: Argentina, Brazil, China, Colombia, Costa Rica, Cyprus, Dominican Republic, Guatemala, Honduras, India, Italy, Luxembourg, Mexico, Monaco, Korea, Hong Kong, Philippines, Singapore, Sweden and the USA. However, we were able to gather some information on the practices of some of these offices by alternative correspondence and investigations of their public websites.

Almost all patent offices, with the exception of Israel's, indicated that they comply with Standard ST.25, which is the agreed standard for disclosure of sequences associated with patent filings. Unfortunately, this standard does not stipulate machine readability. So even though most offices make sequence listings available as part of the published patent document, these listings are mostly PDFs or images and hence not in a machine-readable form. Only patent offices in the USA, Canada, Germany, Hungary, Japan and to a certain extent Korea provide machine-searchable sequence listings through third-party providers or electronic downloadable files through their own websites, often for a fee.

Although we were able to obtain accurate counts on the total collection of sequence listings from some jurisdictions, it was difficult to get information from many others because they do not keep records on submitted sequence listings in their jurisdiction or they rely on the regional patent offices such as the European Patent Office (EPO; Munich, Germany) and WIPO for that information. Even in the United States, where compliance with sequence rules is more rigorously observed, we found several thousand sequence listings cited in patents published since 1990 that were not included in the GB-PAT database (Table 3), perhaps because of a lack of machine-readable forms, improper standards compliance or errors in sequence processing. Although some commercial entities are presumably able to negotiate access to patent sequence information from some jurisdictions, render it machine searchable and provide it on a fee-based service to the public (http://www.prweb.com/releases/2010/12/prweb4865134.htm; and WIPO with STN online patent information database), such initiatives require subscription fees.

Table 3 Number of patent documents, which contain formatted sequence listings, as extracted from United States Patent and Trademark Office (USPTO) full text and bulk listings data sets and compared to those available at GenBank-Patent division (GB-PAT) from 1982–2013

In conclusion, although Myriad clarified the position of the United States on gene sequence claiming, the court decision also highlighted the pressing need for nuanced and precise analysis of gene patents at the global level. Our survey results confirm that public tools are not yet available in many of the emerging patent offices; thus, biological innovations that rely on genetic sequences can be severely affected when reaching global markets. In this article, we present a carefully designed public platform that can be a valuable alternative to the commercial services that serve only a few elite innovators in biological sciences.