Has the yo‐yo stopped? An assessment of human protein‐coding gene number

Since the identification of ∼ 25 000 proteins from the draft human genome assembly in 2001, estimates of the total have oscillated between 30 000 and 70 000. The recently announced genome closure has not generated a consensus gene count despite this being a key parameter for many areas of biology including drug target discovery and characterization of the human proteome. Contrary to earlier predictions of constitutive under‐detection for eukaryotic genes, the latest model organism updates have produced minor increases in the worm but fly and yeast gene numbers have decreased. The postdraft, precompletion interval has produced large increases in human transcript coverage, continuous improvements in genome assembly and refinements in automated genomic annotation. Notably these enhancements have resulted in an Ensembl human protein‐coding gene number of 22 184, a decrease of 1862 since the first release. Longitudinal database surveys indicate that redundancy‐reduced human mRNA and protein collections are flattening out at ∼ 28 000, although Ensembl maps ∼ 20 000 known sequences. Observations suggest high‐throughput cloning projects are predominantly extending known genes or sampling new splice forms and novel protein discovery has slowed to a trickle. The hypothesis that substantial numbers of short proteins remain experimentally and computationally undetected in mammalian genomes is neither supported by sequence data nor by the extensive homology between mouse and human proteins. Aggregating the independent annotations for complete transcripts from seven completed human chromosomes extrapolates to ∼ 25 000 genes. The inclusion of partial putative genes would increase this to above 30 000 but recent data suggest these represent predominantly nonprotein‐coding transcripts. Mass spectrometry‐based proteomics has already verified more than 10% of human genes but has not identified significant numbers of unpredicted proteins. 
The available data are thus converging to a basal protein‐coding gene number well below 30 000, which could even be as low as 25 000.


Introduction
During the run up to the human genome in the year 2000, there was some surprise when the second, but arguably more complex, metazoan genome sequence from the fly turned out to have some 5000 fewer genes than the worm. This surprise was compounded when both the public and commercial versions of human sequence facilitated the annotation of only ∼ 25 000 high confidence genes with an estimated maximum upper boundary of ∼ 35 000 [1,2]. However, a range of higher estimates with upper boundaries of 70 000 have continued to appear (see [3] for review). A selection of these is shown in Fig. 1.
This review will consider recent data supporting the number of human protein-coding genes, although there is evidence that higher mammals share similar gene numbers [6]. As was pointed out after the initial analysis of the human genome, gene number is a central issue for biological complexity [7]. It is also a key parameter for other aspects of human biology such as defining the upper limits for the number of potential drug targets or therapeutic proteins and setting the baseline for initiatives to characterize the human proteome [8,9].
The Guidelines for Human Gene Nomenclature define a gene as follows: "a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterized by sequence, transcription or homology" [10]. This review will be restricted to assessing numbers of protein-coding genes, defined as chromosome-derived transcripts giving rise to one or more protein forms with shared sequence identity that assign them as products of a single genomic locus and strand orientation. This must be considered as a baseline number because vertebrates produce many protein forms via multiple initiations, alternative splicing, post-translational modifications, constitutive proteolytic processing, and common genetic polymorphisms. The mammalian proteome has therefore been estimated to be at least an order of magnitude higher than the protein gene number [11].

Discovering eukaryotic genes
The delineation of proteins encoded in eukaryotic genomes utilizes the following types of bioinformatic and experimental evidence [12,13]: (i) Ab initio prediction of potential ORFs from genomic DNA; (ii) Detection of known protein identity or homology in genomic DNA; (iii) Matches with ESTs that have coding potential and/or splice sites; (iv) Cross-species comparisons for homologous gene detection; (v) Gene anatomy features associated with ORFs such as CpG islands, core promoters, transcription start sites, splicing signals, polyadenylation signals and the absence of repeat elements; (vi) Cloning …
Figure 1. Estimates of human protein-coding gene number based on different lines of evidence. Taken from [3] with the addition of more recent papers [4,5].

High gene number arguments
The following arguments have been put forward in support of a final human gene number significantly above 30 000. (i) Model eukaryotes will show a postcompletion rise in gene number; (ii) The human genome assembly is incomplete; (iii) Gene prediction programs have a significant false-negative rate; (iv) Automated genome annotation pipelines are conservative; (v) Chromosome curation teams find genes missed by automated pipelines; (vi) Transcript coverage by mRNA and EST entries is incomplete; (vii) Novel proteins continue to be reported; (viii) Human/mouse/rat genomic comparison identifies many conserved sections; (ix) Sampling experiments, by proteomics, RT-PCR, SAGE, MPSS and microarrays, have revealed new genes; and (x) A fraction of rapidly evolving small proteins remain computationally and experimentally undetectable.
Although each of these propositions may be true for specific individual genes, or may have been true in the past, it will be argued in the sections below that only point (v) has the supporting data indicating a potentially significant contribution to the current gene total. The significance of argument (x) needs further explanation because it postulates a large undiscovered set, i.e. of the order of 3000 to 10 000 human proteins, that share the following characteristics: they are prone to being missed by ab initio gene prediction, they have not been sampled in any mammalian mRNA or EST data, and they cannot be detected by homology searching of known proteins against genomic data. In addition they are implied to be single-exon proteins with a low level and/or restricted pattern of tissue expression. The low sequence similarity to known proteins implies they have evolved rapidly and may be clade-specific. It is important to recognize that these characteristics have to be combined, which reduces the probability that large numbers of such genes have evaded detection. The sections below will consider the relationships between the types of evidence and arguments for high gene numbers in the context of current data.

Genome annotation of model eukaryotes
Even before completion of their genomes the yeast, worm and fly were the focus of international efforts to identify and characterize all of their proteins. Essentially all the lines of evidence described in Section 2 have been utilized in this undertaking. The yeast alone has been the subject of hundreds of genome-wide analysis papers including large scale functional profiling [14]. The results, in addition to being published and submitted to the primary databases, have been aggregated in major web portals such as the Saccharomyces Genome Database (SGD), Flybase, and Wormbase [15][16][17].
However, the provision of updated and definitive eukaryotic gene sets remains problematic. Despite being completed some six years ago the yeast protein numbers vary according to source [12,18]. Checking the three major portals for Saccharomyces cerevisiae (in October 2003) gave ORF totals of 6202 from the European Bioinformatics Institute complete proteome set, 5878 from SGD and 6723 from the Comprehensive Yeast Genome Database [19][20][21]. Additional discordance occurs in the literature. An approach focusing on small proteins reported 137 new genes and a total of 6000 proteins in 2002 [22]. An analysis later in the same year proposed a downward revision to 5400 [23]. The most recent analysis, in 2003, was based on comparisons with sequences of three related species, Saccharomyces paradoxus, Saccharomyces mikatae and Saccharomyces bayanus [24]. This resulted in revision of ∼ 15% of all genes and a total reduced to 5726.

Postcompletion gene number changes
The changes in gene totals, between first releases and latest updates, along with more recently sequenced model eukaryotes, are shown in Fig. 2. Taking the latest published revised number, S. cerevisiae has decreased by 9% over seven years [24,25]. The 441 new genes in Caenorhabditis elegans accumulated over five years represent an increase of 2% [26,27]. However, recent experimental evidence indicates the C. elegans ORF collection could still include a substantial proportion of pseudogenes [28]. The Drosophila gene number, initially determined as 13 601, has been revised down to 13 379, a drop of 2%, after a recent major re-annotation [29,30].
The Drosophila re-annotation paper touches on several themes that are relevant to the human genome and individual chromosome reports discussed in Section 5.5. Firstly, community annotation and human curation have made key contributions to the quality of revised gene sets that include changes in approximately 45% of the predicted proteins. Secondly, as chromosome assemblies improve, the discovery of new genes is balanced by a reduction in gene number from the fusion of previously fragmented gene predictions. Thirdly, as more transcript data has accumulated, the average transcript length and total exon count have risen without increasing gene number. Lastly, the latest revision encompasses 1042 additional homology-based candidate genes detected immediately after the initial genome release [31]. Thus, what was initially proposed as a 7% increase has become a 2% decrease.
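The Drosophila percentages quoted above can be reproduced with a short calculation. The following is a minimal sketch that uses only the gene counts given in the text; the function name `percent_change` and the variable names are introduced here for illustration and do not come from any cited source.

```python
def percent_change(old: int, new: int) -> float:
    """Return the signed percentage change from old to new."""
    return 100.0 * (new - old) / old


# Figures quoted in the text for Drosophila.
initial_genes = 13_601    # first annotated release
revised_genes = 13_379    # after the major re-annotation
candidate_genes = 1_042   # homology-based candidates proposed after the release

# Change that would have followed if all candidates had been added.
proposed_change = percent_change(initial_genes, initial_genes + candidate_genes)
# Change actually observed after re-annotation.
observed_change = percent_change(initial_genes, revised_genes)

print(f"proposed change: {proposed_change:+.1f}%")
print(f"observed change: {observed_change:+.1f}%")
```

The computed values fall between 7% and 8% for the proposed increase and between 1% and 2% for the observed decrease, in line with the rounded figures in the text.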
The yeast, worm and fly can now be compared with additional completed organisms that are included in Fig. 2. The second yeast, Schizosaccharomyces pombe, turns out to have 23% fewer genes than S. cerevisiae [32]. The recent completion of a second insect, the Anopheles mosquito, gives a figure of 12 981 and the malarial protozoan Plasmodium falciparum 5268 [33,34]. The first simple chordate, Ciona intestinalis, containing many vertebrate protein families, gives a figure of 15 852 and even this is estimated to be a 5% over-prediction because of fragmentation in the draft assembly [35].

Proteomic sampling in model eukaryotes
The identification of 1484 yeast proteins, i.e. 23% of all predicted genes, by mass spectrometry, did not initially report any novel proteins [36]. However, a later survey, combining expression profiling with mass spectrometry, claimed the addition of 62 new yeast genes [37]. A recent analysis of the P. falciparum genome identified 1289 proteins from selected parasite stages, corresponding to 24% of the predicted proteins [38]. This report included 100 unmatched peptides but the numerical relationship between these orphan peptides and novel proteins remains to be elucidated. Although these reports demonstrate the potential of mass spectrometry-based proteomics to detect eukaryotic proteins that have eluded in silico annotation there is no evidence of a significant impact on gene totals.

The golden path genome assembly
The accuracy of gene prediction is crucially dependent on the quality of the underlying genome data. The early genome assemblies or reference sequences, colloquially known as the Golden Path (GP), were produced at the University of California at Santa Cruz [39]. Subsequently, the International Human Genome Sequencing Consortium has used assemblies generated by the National Center for Biotechnology Information (NCBI). The latest human reference sequence is based on NCBI Build 34 [40]. This covers about 99% of the gene-containing regions in the genome, and has been sequenced to an accuracy of 99.99%. The missing portions consist of several hundred defined gaps representing DNA regions with unusual structures that cannot be reliably sequenced using current technology.

Ensembl gene totals
Ensembl has become the de facto standard for automated annotation of eukaryotic genomes [41]. The end result is a set of transcripts grouped into genes by shared exons and supported by evidence of at least one form of sequence homology. Since the release based on the first draft of the human genome, Ensembl has gone through seven major releases of the GP. The gene numbers associated with these releases are shown below in Fig. 3.
The key feature in Fig. 3 is that the total, currently at 24 037, has decreased by nine genes since the first release at 24 046. If the pseudogene count is factored in, the gene number is reduced to 22 184, representing a decrease of 1862 since the first release. The maximum seen in the January 2002 release may be associated with clone orientation changes between the University of California at Santa Cruz hg8 genome assembly and NCBI GP26 [43].
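The arithmetic behind these Ensembl figures can be laid out explicitly. The sketch below assumes that the pseudogene count being factored in is the 1853 automatically annotated pseudogenes quoted for the latest human Ensembl release in the Pseudogenes section; that assumption is not stated at this point in the text, although it reproduces the quoted totals. Variable names are illustrative.

```python
# Gene totals quoted in the text.
first_release_total = 24_046   # first Ensembl human release
current_total = 24_037         # latest Ensembl human release

# ASSUMPTION: the pseudogene count is the 1853 automatically annotated
# pseudogenes quoted for the latest human Ensembl release.
annotated_pseudogenes = 1_853

# Net change in the raw gene total between first and latest release.
net_change = current_total - first_release_total

# Protein-coding estimate once annotated pseudogenes are removed.
protein_coding = current_total - annotated_pseudogenes

# Decrease relative to the first release.
decrease_since_first = first_release_total - protein_coding

print(net_change, protein_coding, decrease_since_first)
```

Under this assumption the three values are -9, 22 184 and 1862, matching the numbers given in the text and the abstract.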

Underlying trends
A number of important trends can be discerned from the Ensembl release statistics. The first is the detection of a higher proportion of known genes, rising from 90% to over 95%. The second is that the number of novel genes, defined as those less than 95% identical to known human proteins at build time, has fallen from the maximum of … These trends suggest that novel predicted genes are being converted to experimentally confirmed proteins against the background of an essentially static gene number. Genes have also been growing in average length and exon count. Similar effects of reduced genomic fragmentation and improved transcript coverage on gene annotation have been documented for Drosophila and for Arabidopsis [30,44].

Curation of completed chromosomes
The completed human chromosomes, 6, 7, 13, 14, 20, 21 and 22, have been subject to major curation efforts [4,45-51]. These analyses have been performed by large teams of authors from different groups. They include experimental confirmation of putative genes and, in some cases, the use of fish as well as mouse gene predictions to broaden the range of homology detection. They also include a large component of manual curation. Compared with automatic pipelines this provides a more reliable annotation of pseudogenes, splice variants, polyadenylation sites, and incomplete putative transcripts (reviewed in [52]). They therefore produce independent gene counts that can be compared to automated gene annotations. The continuing work of annotation groups for chromosomes 6, 13, 14, 20 and 22 is now included in the Vertebrate Genome Annotation (VEGA) database [52,53]. This process includes a virtuous cycle whereby novel genes are cloned and submitted to public databases, thereby ensuring their incorporation into subsequent genome annotations.
Ensembl uses a similarity cut-off to classify gene products as known or novel. The annotation groups, including those contributing to VEGA, have introduced more graded definitions. Known genes are classified by their presence in the NCBI RefSeqNP collection rather than any identity match within the larger SWISS-PROT/TrEMBL (SPTr) human data set used by Ensembl. Novel transcripts are subdivided into three categories: novel coding sequences (CDSs), where an ORF can be determined; novel transcripts, where none of the alternative potential ORFs can be frame-fixed by homology; and putative transcripts, where spliced ESTs define intron/exon boundaries but are not sufficient to define an ORF. Annotated pseudogenes are defined as similar to known proteins but have frame shifts or premature stop codons which disrupt the ORF. An example comparison of VEGA and Ensembl annotation is shown below in Fig. 4.
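The graded categories used by the curation groups can be read as a simple decision procedure. The following is an illustrative sketch only: the `LocusEvidence` fields, the `classify` function and the order in which the tests are applied are assumptions made here for clarity and are not taken from the VEGA or Ensembl pipelines.

```python
from dataclasses import dataclass


@dataclass
class LocusEvidence:
    """Hypothetical summary of the evidence available for one locus."""
    in_refseq_np: bool = False              # present in the RefSeqNP collection
    similar_to_known_protein: bool = False  # homology to a known protein
    orf_disrupted: bool = False             # frame shift or premature stop codon
    orf_determined: bool = False            # a single ORF can be determined
    has_candidate_orfs: bool = False        # alternative ORFs, none frame-fixed
    spliced_est_support: bool = False       # spliced ESTs define intron/exon boundaries


def classify(locus: LocusEvidence) -> str:
    """Assign a category following the definitions in the text.

    ASSUMPTION: the order of the checks is chosen for illustration.
    """
    if locus.similar_to_known_protein and locus.orf_disrupted:
        return "pseudogene"
    if locus.in_refseq_np:
        return "known"
    if locus.orf_determined:
        return "novel CDS"
    if locus.has_candidate_orfs:
        return "novel transcript"
    if locus.spliced_est_support:
        return "putative transcript"
    return "unclassified"


print(classify(LocusEvidence(spliced_est_support=True)))
```

Making the categories explicit in this way shows why totals differ between reports: whether the "putative transcript" class is counted changes the gene number without any change in the underlying evidence.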

Comparing curated chromosomes with automated genome annotation
The use of graded definitions and categories of supporting evidence by the chromosome curation teams clearly represents the reality of detailed gene annotation. However, comparisons with Ensembl numbers and extrapolations to total gene counts are difficult for the following reasons: (i) Groups reporting on 7, 14 and 21 have used their own assemblies rather than the public GP version; (ii) Publications from these groups have used different supporting evidence and definitions of gene categories. For example, not all publications include putative genes in the total count; (iii) The chromosome reports were made at different times and were therefore compared against different sets of public transcript data; (iv) The two groups reporting approximately simultaneously on chromosome 7 show different gene numbers and use different pseudogene definitions [46,51]; (v) None of the groups have yet published a formal gene cross-mapping to Ensembl to determine the basis for any systematic discrepancies, although at least for 6, 13, 14, 20 and 22, the result sets for Ensembl and VEGA can be visualized in the same browser (Fig. 4); (vi) Only one group has presented a longitudinal update of gene changes [4]. Nevertheless it is informative to attempt a comparison of the numbers between the curated and automated gene build sets. This is presented below in Table 1.
The data in Table 1 represent ∼ 15% of all genes. A surprising observation is that Ensembl records ∼ 20% more known genes. This is likely to be because knowns are defined by identity matches in SPTr in Ensembl rather than the smaller RefSeqNP set used by most of the chromosome reports. The second trend is the high ratio of pseudogenes: total genes of ∼ 1:2.8. The third trend is the impact of putative genes on the total. If these are excluded then Ensembl finds, on average, ∼ 3% fewer genes than the chromosome groups. If the putatives are included the shortfall rises to ∼ 20%.
Table 1. Comparison of curated chromosome and Ensembl gene counts. The first column is the chromosome number; the remaining column headings are not preserved.
6     772   287   213   285   633   1272   1296   1037   259
7     863    71   521   213   144   1455   1269    925   344
13    228   101   132   161   281    461    455    322   133
14    510   119    23   207   296    652    736    572   164
20    448    99    68   258   173    727    704    614    90
21    176    29   …
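The percentages quoted for Table 1 can be approximately re-derived from the five complete rows that survive (chromosomes 6, 7, 13, 14 and 20). The sketch below rests on a loudly stated assumption: the column meanings are inferred here from the surrounding prose and from internal arithmetic, not from preserved headings, so the result is a consistency check rather than a reconstruction of the original table.

```python
# Complete rows of Table 1 as they appear in the text, keyed by chromosome.
# ASSUMPTION (inferred, not stated in the surviving table): after the
# chromosome number, index 0 = curated known genes, 3 = putative genes,
# 4 = pseudogenes, 5 = curated total excluding putatives,
# 6 = Ensembl total, 7 = Ensembl known genes.
rows = {
    6:  [772, 287, 213, 285, 633, 1272, 1296, 1037, 259],
    7:  [863, 71, 521, 213, 144, 1455, 1269, 925, 344],
    13: [228, 101, 132, 161, 281, 461, 455, 322, 133],
    14: [510, 119, 23, 207, 296, 652, 736, 572, 164],
    20: [448, 99, 68, 258, 173, 727, 704, 614, 90],
}


def column_sum(index: int) -> int:
    """Sum one column over the five complete rows."""
    return sum(values[index] for values in rows.values())


curated_known = column_sum(0)
putative = column_sum(3)
curated_total = column_sum(5)
ensembl_total = column_sum(6)
ensembl_known = column_sum(7)

# Ensembl shortfall relative to the curated totals, without and with putatives.
fewer_excluding_putative = 100.0 * (1 - ensembl_total / curated_total)
fewer_including_putative = 100.0 * (1 - ensembl_total / (curated_total + putative))
# Excess of Ensembl known genes over curated known genes.
more_known = 100.0 * (ensembl_known / curated_known - 1)

print(f"{fewer_excluding_putative:.1f}% fewer (putatives excluded)")
print(f"{fewer_including_putative:.1f}% fewer (putatives included)")
print(f"{more_known:.1f}% more known genes in Ensembl")
```

With these five rows the values come out close to the rounded figures in the text. Exact agreement is not expected because the rows for chromosomes 21 and 22 are incomplete or missing here.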

Pseudogenes
A potential source of false-positives in homology-based genome annotation is pseudogenes [55]. Estimates of the total number of these in the human genome have risen to 20 000 [56]. It seems likely that a proportion of these have been counted as genes in automated annotation pipelines, especially those pseudogenes that have only minor disablements that distinguish them from functional paralogues. The mouse genome paper includes a formal estimate of 4000 pseudogenes that could be present in their gene total (see Section 7). The latest human Ensembl release has provided automatic annotation of 1853 pseudogenes and the latest rat release 1592. The situation is made more complex by the existence of transcribed pseudogenes because these would have both homology and supporting evidence of transcription. According to the current LocusLink statistics 115 of 2613 annotated human pseudogenes (4.4%) fall into this category [57].

Postgenomic coverage of protein and transcript data
Locating transcript identities and detecting protein homology in genome data are directly related to coverage in nongenomic data. The exponential growth of the primary nucleotide databases since the first draft human genome has included a massive increase in human mRNA and protein data. The EST data are also used for supporting evidence for predicted genes. Data from other mammals or vertebrates can be used for homology detection. The growth of these sources will be considered below.

Human protein databases
The International Protein Index (IPI) merges a set of experimental and predicted sequences derived from SWISS-PROT, TrEMBL, RefSeqNPs, RefSeqXPs and Ensembl [58]. This provides a resource of complete sets of human, mouse and rat proteins represented by one sequence per transcript with statistics on the overlap between sequences from the different sources. The process also includes a redundancy-reduced set of sequences from the larger SPTr set. The growth of human protein numbers in SPTr is shown in Fig. 5.
The human IPI total shows a surprising rise and fall since its first release of 33 013. The peak of 67 105 in mid-2002 was due to the deposition by the NCBI of large numbers of ab initio predictions (RefSeq XPs) into their protein databases [59]. Continual revision of the NCBI genomic annotation pipeline has reduced this set and consequently the IPI number has now fallen to 39 440.
The main feature in Fig. 5 is that the number of new proteins in the redundancy-reduced version is small (there are technical issues associated with the promotion of sequences from TrEMBL to SWISS-PROT and subfragment merging that cause intermittent drops in these numbers) … activity for cloning human genes over the last 18 months has produced an increase of ∼ 4800 new proteins. In addition this has perceptibly flattened out over the last five releases at ∼ 28 000. This suggests that, although they may be covering more splice forms, extending previously partial sequences or confirming predicted entries, new protein submissions are predominantly resampling previously identified genes. Over the same time period covered in Fig. 5 the Ensembl gene total (Fig. 3) actually fell. Although there are caveats to comparing Ensembl protein sequences with SPTr sequences, for example the persistence of some early incorrect ab initio gene predictions in TrEMBL, the number of novel genes in Ensembl fell by ∼ 8000 over this period. This exceeds the ∼ 5000 new genes appearing in SPTr over the same period but some of the novel genes are likely to have been removed by the correction of previously fragmented genes in earlier assemblies. These trends suggest that new sequences appearing in the protein databases are making a diminishing contribution to the gene total.

Small proteins
As described in Section 3 a key argument for high gene numbers, in quantitative terms, is the hypothetical existence of large numbers of hitherto undetected proteins of fewer than 100 residues, termed small open reading frames (smORFs). These are difficult to detect both experimentally and computationally [60]. The likelihood of this category contributing large numbers of new proteins was assessed by looking for smORF increases in recent and/or novel sequences submitted to the protein databases. A submission date was found (October 2001) that divides SPTr roughly in two. The protein length distributions of the old and new halves of the database were then compared. The most recent data showed 5.5% of proteins below 100 residues, compared to 6.3% in the older data. Selecting entries that were classified by authors as "novel" at the time of submission showed only 3.4% of these sequences were below 100 residues. Thus, the size distribution of experimentally determined human proteins indicates the proportion of smORFs is falling.
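The comparison described above amounts to splitting a protein set at a submission date and measuring the fraction of short sequences in each half. The following is a minimal sketch of that procedure. The toy entries are hypothetical and are not SPTr data, the function names are introduced here for illustration, and the 100-residue threshold and October 2001 cut-off are taken from the text.

```python
from datetime import date


def fraction_below(lengths, threshold=100):
    """Fraction of protein lengths strictly below the residue threshold."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError("no protein lengths supplied")
    return sum(1 for n in lengths if n < threshold) / len(lengths)


def split_by_date(entries, cutoff):
    """Split (submission_date, length) pairs into older and newer lengths."""
    older = [length for submitted, length in entries if submitted < cutoff]
    newer = [length for submitted, length in entries if submitted >= cutoff]
    return older, newer


# Hypothetical toy entries: (submission date, protein length in residues).
entries = [
    (date(1999, 5, 1), 85),
    (date(2000, 3, 12), 420),
    (date(2001, 1, 20), 310),
    (date(2001, 6, 2), 95),
    (date(2002, 2, 1), 640),
    (date(2002, 8, 15), 230),
    (date(2003, 1, 9), 75),
    (date(2003, 4, 30), 512),
]

older, newer = split_by_date(entries, cutoff=date(2001, 10, 1))
print(f"older: {fraction_below(older):.2f}, newer: {fraction_below(newer):.2f}")
```

A falling fraction in the newer half, as reported in the text for the real data, is what would be expected if short proteins were not being discovered at an increasing rate.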

Human mRNA databases
One way of addressing gene sampling in transcript data is to look at the longitudinal statistics of the UniGene system [61]. This includes a count of mRNAs along with EST cluster counts on the basis of shared sequence identity. Plots of the growth of these sets of data are shown in Fig. 6. Figure 6 shows the contrast between the rapid growth of total submissions and the slow increase in the clustered set (the sporadic falls in mRNA occur when bulk preliminary submissions are replaced with revised sets). On average each unique mRNA has been sequenced 3.8 times. The slow increase suggests new entries are predominantly sampling the same set of transcripts.
The data above suggest the redundancy-reduced human mRNA and protein collections are converging to ∼ 28 000 genes. This presents a paradox in that they significantly exceed the 19 921 known protein entries in Ensembl 18.34. However, there are technical caveats associated with the process of redundancy reduction such as increasing numbers of splice variants with significant length differences. The lower numbers in Ensembl are therefore due not only to effective redundancy removal by explicit mapping onto the genome but also, according to the latest release notes, to the fact that a significant proportion of apparently novel mRNA protein-coding genes are artefacts generated from chimeras and spurious ORFs translated from noncoding transcripts or pre-mRNA. Supporting evidence for this is that the current nonredundant RefSeqNP collection from all human mRNAs produces only 18 112 uniquely mapping proteins.

EST coverage of mRNAs
The current dbEST is approaching 10 million mammalian ESTs of which 5.3 million are human [62]. Tissue sampling has expanded to include more specialized sources such as human tumor cell lines, cow foetal libraries and mouse embryo stages. In addition, many subtractive strategies have been employed to enhance the detection of low-abundance transcripts. This has been accompanied by major initiatives in the USA, Japan, Germany and other countries to convert as many human and mouse ESTs as possible to full-length mRNA sequences.
The argument (from Section 3) that a significant fraction of the expressed genome remains unsampled becomes less tenable as the tissue breadth and sequencing depth in dbEST expands. A predicted consequence of incomplete EST coverage would be that targeted cloning, including the confirming of genes predicted from genomic data, should increase the proportion of mRNAs that are not represented in EST data. In fact the UniGene data suggest the opposite trend i.e. EST numbers and mRNA coverage have increased in parallel. The comparison between the five mammals with the most sequence data is shown in Table 2. … than pig. The clusters, which can be considered as a redundancy-reduced set, also rank in the same order.
The last column indicates that over 95% of unique human mRNAs have been sampled by human ESTs. A longitudinal assessment derived from the data sets used to prepare Fig. 6 shows that this figure has climbed slowly from ∼ 92% over the last two years.
Does this indicate that dbEST is approaching saturation for human gene coverage? The fact that 5% of mRNA clusters are still not sampled by ESTs suggests saturation has not been reached but the longitudinal statistics do show a flattening out. Non-sampled genes include classes of proteins, such as seven-transmembrane receptors, that are known to have very low EST coverage. However, they are not lost in the human gene count because they are relatively easy to find by homology searching in genome data [63]. In addition, even if a human gene had no matches to human ESTs there is an increasing likelihood of an orthologous EST match i.e. the aggregate of 10 million mammalian ESTs may well be approaching saturation sampling of mammalian protein-coding transcripts.
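The coverage statistic discussed in this section is the fraction of unique mRNA clusters matched by at least one EST. A minimal sketch of that calculation follows. The cluster identifiers and EST counts are hypothetical toy values, not UniGene data, and the function name `est_coverage` is introduced here for illustration.

```python
def est_coverage(mrna_clusters, est_hits):
    """Fraction of mRNA clusters matched by at least one EST.

    mrna_clusters: iterable of cluster identifiers containing an mRNA.
    est_hits: mapping from cluster identifier to number of matching ESTs.
    """
    clusters = set(mrna_clusters)
    if not clusters:
        raise ValueError("no mRNA clusters supplied")
    sampled = {c for c in clusters if est_hits.get(c, 0) > 0}
    return len(sampled) / len(clusters)


# Hypothetical toy data: five mRNA clusters, one without any EST match.
mrna_clusters = ["c1", "c2", "c3", "c4", "c5"]
est_hits = {"c1": 120, "c2": 3, "c3": 0, "c4": 45, "c5": 1}

coverage = est_coverage(mrna_clusters, est_hits)
print(f"EST coverage of mRNA clusters: {coverage:.0%}")
```

Tracking this fraction across successive database builds gives the longitudinal trend referred to in the text. A value that rises and then levels off is consistent with, but does not by itself prove, an approach to saturation.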

Mammalian protein increases
The discovery of novel proteins uses protein similarity matches as supporting evidence for putative exons in genomic DNA. One of the arguments for high gene number listed in Section 3 is the speculation that the human genome (and by extrapolation other mammals) encodes significant numbers of proteins with similarity matches below the thresholds used for exon detection. The increase in the mammalian protein numbers over a three year postgenomic period is shown in Fig. 7.
The data in Fig. 7 show that mammalian sequence coverage exhibited a ∼ fourfold increase in numbers available for homology searching over the two years since the first human draft. While this could arguably still be biased against very rare transcripts it includes the worldwide output of high-throughput cloning projects and the results of over 30 years of targeted gene discovery. The entirety of mammalian proteins therefore represents ∼ 4.5-fold averaged genome coverage. The recent analysis of the Fugu genome shows that ∼ 75% of fish ORFs have significant similarity scores with human sequences [64]. This suggests at least some of the ∼ 40 000 non-mammalian vertebrate sequences could also contribute to mammalian annotations.

Comparing human with other vertebrate genomes
The human genome has now been joined by substantially complete assemblies for mouse, rat and fish [64][65][66]. The total gene numbers, transcripts and exons per gene are shown below in Table 3.
The mouse gene catalogue was compiled with the Ensembl pipeline and the use of ESTs to support gene predictions. An important conclusion from the mouse paper is that approximately 80% of mouse genes have 1:1 orthologues and the number of genes without human homologues is only ∼ 1% [65]. By aligning the results from human and mouse a separate annotation effort was able to increase the specificity and sensitivity of ab initio gene prediction [67]. This detected approximately 12 000 additional exons which the authors, after selected PCR validation, suggest could add ∼ 1000 new genes. There could also be in the order of 1000 genes in the missing 4% of the mouse assembly. However, these potential false-negatives would be more than balanced by the estimate of ∼ 4000 pseudogenes. Given the prediction of slightly higher gene numbers because of the ∼ 500 rodent odor receptors that have human pseudogene orthologues, the Ensembl gene numbers for both rodents seem low (Table 3). However, this may result from the relative transcript coverage, shown in Fig. 7, that shows the same order of declining gene number i.e. human > mouse > rat.
The mouse genome issue of Nature included a comprehensive analysis of all available mouse mRNAs although these were not directly mapped to genome data [68]. The number breakdown is complex but the team collapsed 60 770 high-quality cDNA sequences, 39 694 of which were new, with 40 106 public mouse mRNAs to produce 37 409 representative transcript units. Of these only 20 489 could be determined as protein-coding.
The Fugu fish has come out higher than both humans and rodents with just over 35 000 genes. However, it is known that teleost (ray-finned) fish contain larger numbers of duplicated genes compared with lobe-finned fish and tetrapods such as mammals [69]. In addition, it has a lower number of exons per gene that suggests there may still be some fragmentation within the draft sequence that has artificially elevated the gene number.
… [71,72]. Such transcripts, termed transcriptionally active regions (TARs), could represent novel genes, novel exons in known genes, extensions of previously undetected 5' or 3' untranslated regions (UTRs), transcribed pseudogenes, or noncoding transcripts of unknown function. The fact that neither publication presented any complete novel ORFs derived from their data argues in favor of the latter category. The combined data from these two publications also suggest that these TARs are expressed at a low level, they include antisense transcripts and many show sequence conservation between human and mouse.
The existence of TARs outside the boundaries of known protein-coding genes is supported by data from other sources. A recent mouse transcript analysis paper reported 4280 transcripts that lacked protein-coding potential but some had poly-A tracts and matches in both mouse and human ESTs [73]. Another paper has reported 2431 overlapping sense-antisense pairs from mouse transcript data [74]. A similar number of approximately 1600 human antisense transcriptional units have been verified by strand-specific microarrays, some of which were also represented in EST data [75]. These antisense transcripts are suggested to have some kind of regulatory function rather than coding for proteins [76].

Increasing the stringency of gene identification
Nearly all known proteins conform to the universally confirmed features of gene anatomy. A diagram of these is shown in Fig. 8.
In the past, merely the detection and/or determination of short sequences of transcribed fragments in cytoplasmic mRNA fractions has been used as sufficient evidence to infer the existence of new protein-coding genes. However, as described above in Section 8 it now seems that a significant proportion of TARs (in coverage rather than quantitative terms) are noncoding. This may explain the previously reported high gene numbers from transcript sampling experiments performed without corroborative full-length cloning [78]. Even the presence of protein homology in transcribed fragments can no longer be relied upon, not only because of expressed pseudogenes but also because of the reported existence of ancient protein "fossil" fragments in intergenic regions of the human genome [79]. This means that claims of novel protein discovery now require not only the determination of a transcript with an ORF that conforms to the universal features of gene anatomy but also submission of that complete sequence to the primary databases.

Diminishing novel gene discovery
Describing a human protein sequence as novel used to imply that it had been experimentally characterized for the first time and that the sequence had no previous record in the databases. More recently the utility of this description has diminished because most targeted gene discovery efforts now usually find a pre-existing sequence from a high-throughput sequencing project or the purported novel protein has already been predicted from genomic data and/or EST-derived virtual transcripts in the public domain [13].
The extent of diminished novelty in the postgenomic era can be illustrated by three examples. The first is an examination of database entries that include the term "novel" both in the description of the gene and in the title of a publication. Restricting the query to the first quarter of 2003 found only 11 unique sequence entries. Only three had neither previous database entries nor Ensembl annotation and even one of these was partially represented as a RefSeq XP entry. Therefore only two of the 11 purportedly novel entries from the first quarter of 2003 had evaded previous independent sequencing or genome annotation, and, perhaps significantly, these were the only two genes without human EST coverage.
The second example illustrates the effectiveness of high-throughput human gene discovery projects. In 2002 two groups independently published a novel cysteine and tyrosine-rich protein on chromosome 21 (AY061853 and AF401639) which they both released to the databases in early June [50,80]. Checks established that the sequence was substantially covered by an EST from two years before and now had no fewer than six full-length database entries. These included a patent publication and three submissions from high-throughput cloning projects from Germany, Japan and the USA. As expected, Ensembl now includes this as a known gene on chromosome 21.
The last example comes from a recent analysis of ESTs from a retinal pigment epithelium project [81]. With over 9000 sequence reads submitted to dbEST this represents one of the deepest samplings of a specialized human tissue. The single reported novel gene oculospanin (AF325213) had been deposited in GenBank in December 2000 and has subsequently been sequenced twice by high-throughput projects. These examples suggest that the rate of targeted novel gene discovery is slowing as the combination of experimental and predicted human gene coverage approaches saturation.

Proteomic sampling of human proteins
Analogous to the experiments described in Section 4.3 for the model eukaryotes, an increasing number of large-scale mass spectrometry-based identifications of human proteins have appeared. These include 615 from the human heart mitochondria, 500 from breast cancer cell membranes, 491 from microsomal fractions, 490 from blood serum, 437 from foetal brain and 311 from the spliceosome [82-87].
Although the total from these publications is approaching ∼ 10% of human genes, there is a paucity of de novo determinations of novel sequences. This may be related to the technical challenges of scoring MS/MS tryptic peptide matches to genomic DNA as opposed to scoring spectral matches against translated and/or predicted protein sequences [88,89]. However, an alternative explanation is that there are simply too few undiscovered genes being sampled by these experiments. Although mass spectrometry-based peptide identification will make a major impact on identifying post-translational modifications and correcting exon assignments, there are no data to suggest this will have a significant impact on gene number.
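The scale of this proteomic coverage can be illustrated by summing the identifications listed above and comparing them with the gene counts discussed in this review. This is a rough sketch only: it assumes the six protein sets do not overlap, which is unlikely to hold, so the summed total should be read as an upper bound on the number of distinct proteins identified. The dictionary names are illustrative.

```python
# Protein identifications per study, as listed in the text [82-87].
identified = {
    "heart mitochondria": 615,
    "breast cancer cell membranes": 500,
    "microsomal fractions": 491,
    "blood serum": 490,
    "foetal brain": 437,
    "spliceosome": 311,
}

# Assumption: no protein is shared between studies (gives an upper bound).
total = sum(identified.values())

# Gene-count denominators quoted elsewhere in this review.
gene_counts = {
    "Ensembl protein-coding genes": 22_184,
    "chromosome-based extrapolation": 25_000,
}

# Fraction of each gene count covered by the summed identifications.
fractions = {name: total / count for name, count in gene_counts.items()}

for name, fraction in fractions.items():
    print(f"{total} proteins / {gene_counts[name]} {name} = {fraction:.1%}")
```

Under the no-overlap assumption the summed total corresponds to roughly 11 to 13% of the gene counts used here; allowing for proteins shared between studies would lower this figure.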

Conclusions
Significantly, and perhaps unexpectedly, postcompletion revisions have produced gene number decreases in yeast and fly and only a minor increase in the worm. This has occurred despite the massive worldwide experimental and computational focus on these organisms. This is neither to suggest that gene discovery in the model eukaryotes is at an end nor that detailed revision of gene structures will not continue. However, it argues strongly against constitutive underdiscovery for eukaryotic genes.
The interval between the first draft assembly and the closure of the human genome announced on 14 April 2003 has seen big increases in human mRNA coverage, EST production and continual refinement of automated genome annotation [90]. These advances, also perhaps unexpectedly, have produced a fall in the Ensembl gene count to just above 22 000. The initial mouse and rat annotations also indicate pseudogene-adjusted numbers close to this figure. In parallel, the public protein collections are showing a diminished rate of novel human gene discovery suggestive of saturation.
So is there any evidence for higher gene numbers? Certainly the total numbers for all categories extrapolated from the aggregated individual chromosome reports could be above 30 000. However, excluding the putative gene category reduces this to ∼ 25 000. The total therefore depends on the future status of the novel and putative transcripts reported by the chromosome annotation groups. Although some of these will be found to have the gene anatomy features necessary for protein expression, it seems likely that the majority will not be verified as protein-coding genes.
The postulate that between 10 and 30% of human genes consist of experimentally and computationally undiscovered small proteins seems untenable for the following reasons. Firstly, the mouse genome paper estimated that only ∼ 1% of mouse genes had no detectable human homology [65]. Secondly, as shown in Section 5.8, there is no proportional increase in small proteins amongst recently discovered genes. Thirdly, proteins shorter than 100 residues may fall below the threshold necessary to fold into functional domains [91]. Fourthly, although small proteins can evolve more rapidly, there is no precedent in the literature on protein evolution for the existence of large numbers of clade-specific mammalian proteins that have nonsynonymous mutation rates (Ka/Ks values) so high that they cannot be detected by cross-species protein sequence similarity [92].
So what would constitute gene number closure? Bioinformatic approaches will increasingly use comparative genomics to refine mammalian gene sets [93]. Experimentally, it could be possible to use MS to verify at least one unique peptide from all proteins isolated from in vivo sources. Although this goal may be too ambitious for current technology it is at least implicit in the aims of the Human Proteome Organisation (HUPO) and there is already a commercial initiative underway that has verified over 14 000 proteins from human cell lines and tissues [94,95]. However, in the short term we are more likely to see "closure-by-expression-in vitro" and some academic centers are already building up the requisite human clone collections [96].
Given the lingering uncertainties for yeast protein number, any short-term expectation of closure for human, with four times as many genes, seems unrealistic. However, we might hope to reach a point beyond which significant numerical changes would be unlikely. The current evidence points to this being well below 30 000 protein-coding genes, possibly as low as 25 000.