Improved annotation of a plant pathogen genome Xanthomonas oryzae pv. oryzae PXO99A

Many bacterial genomes have been sequenced and deposited in public databases, of which Reference Sequence (RefSeq) is the most widely used. However, the annotation in RefSeq is still unsatisfactory. The present analysis focuses on the re-annotation of the genome of an important plant pathogen, Xanthomonas oryzae pv. oryzae PXO99A (Xoo PXO99A), the causal agent of bacterial blight of rice. Using 28 nucleotide frequency parameters and a support vector machine algorithm, 41 originally annotated hypothetical genes were recognized as noncoding sequences, a result further supported by principal component analysis and other evidence. Ten of them were tested by reverse transcription-polymerase chain reaction (RT-PCR) experiments, and all 10 were confirmed to be noncoding sequences. Furthermore, 197 potential new genes not annotated in RefSeq were recognized by both of two ab initio gene finding programs. Most of them have sequence similarity to only part of known genes in other species, so they are unlikely to be protein-coding genes. Twelve potential new genes have high full-length sequence similarity to genes of known function and are very likely to be true protein-coding genes. All 12 were tested with RT-PCR, and 11 of them (92%) were successfully amplified from cDNA template. The RT-PCR experiments confirm that the theoretical prediction has high accuracy. The improved annotation of Xoo PXO99A will help research on the lifestyle, metabolism, and pathogenicity of this important plant pathogen. The improved annotation can be obtained from http://211.69.128.148/Xoo.


Introduction
With the rapid progress of prokaryotic genome sequencing projects, more than 1900 bacterial genomes had been sequenced at the time of writing, providing an unprecedented opportunity to study the genetics, biochemistry, and evolution of bacterial species. Such analyses depend strongly on accurate annotation. Most sequenced bacterial genomes are stored in public databases, including National Center for Biotechnology Information (NCBI) GenBank (Benson, Boguski, Lipman, & Ostell, 1997), European Bioinformatics Institute (EBI, http://www.ebi.ac.uk/genomes/bacteria.html), and J. Craig Venter Institute (JCVI) Comprehensive Microbial Resource (CMR, http://cmr.jcvi.org). Furthermore, NCBI's Reference Sequence (RefSeq) database was created to provide reference standards for genome annotation, offering more curated, nonredundant, and richer annotation for bacterial species (Pruitt, Tatusova, Klimke, & Maglott, 2009). The first step in bacterial genome annotation is to run gene finding programs, which scan the genome and identify potential protein-coding genes and functional RNA products. The identified genes are then compared against public databases to find related sequences. If hits of sufficient similarity are found, information about their function is transferred to the new sequence (Guo, Ou, & Zhang, 2003; Skovgaard, Jensen, Brunak, Ussery, & Krogh, 2001; Stothard & Wishart, 2006). Since most genes are identified by gene finding programs and not verified by experiments, many problems remain in current bacterial genome annotation (Guo et al., 2003; Nielsen & Krogh, 2005; Skovgaard et al., 2001; Stothard & Wishart, 2006). Firstly, many genomes contain false-positive gene identifications, i.e. open reading frames (ORFs) incorrectly annotated as protein-coding genes; most of these are short ORFs (<150 bp) without functional information. Secondly, many annotated genes have incorrect translation initiation sites (TISs). Finally, some actual protein-coding genes are still missed in the current annotation (Guo et al., 2003; Nielsen & Krogh, 2005; Skovgaard et al., 2001; Stothard & Wishart, 2006).
Xanthomonas oryzae pv. oryzae (Xoo), a member of the gamma subdivision of Proteobacteria, is the causal agent of bacterial blight (BB) of rice (Oryza sativa L.) (Lee et al., 2005; Nino-Liu, Ronald, & Bogdanove, 2006; Salzberg et al., 2008). BB is the most serious bacterial disease in tropical Asian countries, where high-yielding rice cultivars are often susceptible to it. In severely infected fields, BB can cause yield losses as high as 50% (Lee et al., 2005). Several factors that contribute to the fitness and virulence of Xoo have been identified. Since rice is a major crop for the Asian population and a model plant for cereal biology, a better understanding of Xoo pathogenesis remains a pressing goal, both for controlling BB and for understanding bacterial-plant interactions (Salzberg et al., 2008).
The Xoo PXO99A genome consists of a single circular chromosome of 5,240,075 bp. The original annotation was submitted to GenBank under accession number CP000967 and contained 5083 protein-coding genes (Salzberg et al., 2008). Subsequently, a comprehensive annotation containing 4988 protein-coding genes was provided in NCBI RefSeq (NC_010717). The annotation of Xoo PXO99A in EBI bacteria genomes also contained 4988 genes. However, the annotation of Xoo PXO99A in CMR contains 5200 genes (http://cmr.jcvi.org/tigr-scripts/CMR/GenomePage.cgi?org=Xoo). Since the Xoo PXO99A genome is annotated quite differently in different databases, it is necessary to re-annotate it and provide a more accurate annotation for researchers working on this organism.

Data collection
The sequence and annotation of the Xoo PXO99A genome were downloaded from NCBI RefSeq, since it provides a comprehensive and relatively precise annotation (Pruitt et al., 2009). The 4988 annotated protein-coding genes can be classified into two groups: the first contains 3452 genes with confirmed functions, and the second contains 1536 genes with "putative," "probable," or "hypothetical" functions. The 3452 genes in the first group have high sequence similarity to genes of known function in public databases and were used as the training data-set. Some of the 1536 genes with "putative," "probable," or "hypothetical" functions might not be protein-coding genes; their coding status was assessed in the current analysis. Furthermore, potential new protein-coding genes not present in the RefSeq annotation were predicted by two ab initio gene finding programs, Prodigal (Hyatt et al., 2010) and FgenesB (http://linux1.softberry.com/berry.phtml?topic=fgenesb&group=programs&subgroup=gfindb).
To train the algorithm, two groups of samples were needed. One was a set of positive samples corresponding to protein-coding genes, and the other was a set of negative samples corresponding to noncoding sequences. The two groups of samples constituted the training set used in the support vector machine (SVM) algorithm described below. Since coding bases account for about 85% of the Xoo PXO99A genome (Salzberg et al., 2008), it was rather difficult to prepare an appropriate set of noncoding sequences. Therefore, each of the 3452 protein-coding sequences was randomly shuffled 100,000 times and then served as a negative sample.
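The shuffling step can be illustrated with a minimal Python sketch. It assumes one negative sample per gene and uses a single uniform shuffle (`random.shuffle`) as a stand-in for the 100,000 shuffling operations mentioned in the text; the function name `shuffled_negatives`, the seed, and the toy sequence are hypothetical and not taken from the original analysis.

```python
import random

def shuffled_negatives(gene, n_shuffles=1, seed=0):
    """Return shuffled copies of a gene that keep its base composition.

    Assumption: a single uniform shuffle is used per copy, as a stand-in
    for the repeated shuffling described in the text.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    negatives = []
    for _ in range(n_shuffles):
        bases = list(gene)
        rng.shuffle(bases)  # uniform random permutation of the bases
        negatives.append("".join(bases))
    return negatives

# Hypothetical toy ORF, not a real Xoo PXO99A gene.
gene = "ATGGCTAAAGGTGAACTGCGTTAA"
negatives = shuffled_negatives(gene, n_shuffles=3)
```

Because shuffling only permutes the bases, each negative sample has exactly the same length and nucleotide composition as the gene it was derived from; only the order of the bases, and hence any codon structure, is destroyed.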

Support vector machine
The general principle of SVM is to perform classification by constructing an n-dimensional hyperplane that optimally separates the positive and negative samples into two groups. SVM minimizes the empirical classification error and maximizes the geometric margin, where the margin is defined as the distance from the separating hyperplane to its nearest sample. SVM is based on the structural risk minimization principle, which allows predictive models to be built even if descriptors are numerous and redundant. The structural risk is a trade-off between the training error and the model complexity. Moreover, SVM is not significantly affected by unbalanced active and inactive data-sets (Crammer & Singer, 2002).
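Radial basis function (RBF) is a commonly used kernel function with one parameter γ. In its standard form,

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2\right).$$

For the selected kernel function, the learning task is to solve the following convex quadratic programming problem, written here in the standard dual form:

$$\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$$

Subject to:

$$0 \le \alpha_i \le C, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0,$$

where the labels $y_i = +1, -1$ stand for the positive label and negative label, respectively, $\alpha_i$ are the Lagrange multipliers, $C$ is the regularization parameter, and $N$ is the number of training samples. The tool package used in the current analysis is LIBSVM (Chang & Lin, 2001).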
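As a rough illustration of how an RBF-kernel SVM scores a sample, the Python sketch below evaluates the standard RBF kernel, exp(-γ‖x − z‖²), and the usual kernel decision function, a weighted sum of kernel values plus a bias. The support vectors, coefficients, bias, and γ value are hypothetical toy numbers, not values obtained from LIBSVM or from the Xoo PXO99A training set.

```python
import math

def rbf_kernel(x, z, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def decision_value(x, support_vectors, coeffs, bias, gamma):
    """f(x) = sum_i coeff_i * K(sv_i, x) + bias, with coeff_i = alpha_i * y_i."""
    total = sum(c * rbf_kernel(sv, x, gamma)
                for sv, c in zip(support_vectors, coeffs))
    return total + bias

# Hypothetical toy model: one "coding" and one "noncoding" support vector.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
coeffs = [1.0, -1.0]   # alpha_i * y_i for each support vector (assumed)
bias = 0.0
gamma = 0.5            # assumed kernel parameter

score = decision_value([0.1, 0.1], support_vectors, coeffs, bias, gamma)
label = 1 if score > 0 else -1  # sign of the score gives the predicted class
```

In practice the coefficients and bias come out of the quadratic programming step solved by the SVM package; the sketch only shows how a trained model would be applied to a new feature vector.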

Method for identifying new functional genes
In this analysis, two ab initio gene finding programs were run to identify new protein-coding genes not annotated in RefSeq. Prodigal (Prokaryotic dynamic programming gene-finding algorithm) is a recently developed microbial gene finding program with high speed, a low false-positive rate, and high accuracy in locating TISs (Hyatt et al., 2010). FgenesB is another accurate ab initio prokaryotic gene prediction program, based on Markov chain models of coding regions and of translation and termination sites. FgenesB also includes a simplified prediction of operons based only on the distances between predicted genes. By combining the predictions of the two ab initio programs, new protein-coding genes not annotated in RefSeq were identified in the Xoo PXO99A genome.

Strain cultivation and nucleic acid isolation
Xoo PXO99A was grown in 5 mL nutrient agar (NA; 5 g polypeptone, 1 g yeast powder, 3 g beef extract, 15 g sucrose per liter) at 28°C for 24-36 h with shaking at 200 rpm. Then, 250 μL of the suspension was added to 25 mL of NA and grown at 28°C to logarithmic growth phase. The sodium dodecyl sulfate (SDS) method was used for DNA extraction. Total RNA was extracted with RiboPure™-Bacteria (Ambion) and treated with DNase I to remove genomic DNA contamination.

PCR and sequence validation
Total DNA, total RNA, and cDNA were used as templates for polymerase chain reaction (PCR) analysis. The 50 μL PCR mixture contained 5 μL 10× PCR Buffer, 0.2 mM dNTP (Takara), 0.02 μM primers, 1 μL total DNA, total RNA, or cDNA, 1 μL Taq DNA polymerase (Takara), and nuclease-free water. The samples were incubated with the following cycles: 94°C for 3 min; 30 cycles of 94°C for 30 s, annealing for 30 s, and 72°C for 1 min; and a final extension at 72°C for 10 min. First-strand cDNA was synthesized using SuperScript™ II RT (Invitrogen) and then amplified following the protocol described above. The PCR primers for the chosen sequences were designed with Primer Premier 5.0 software (Premier Biosoft International, Palo Alto, CA). The PCR conditions for every primer set were optimized with the DNA sample in repeated PCR experiments at 2-10°C below the predicted annealing temperature until a single amplified band was obtained. The 16S rRNA gene was used as a positive control for a multicopy gene. Each PCR product was purified using a PCR product purification kit (QIAGEN). The purified DNA fragments were ligated into the pMD®18-T Vector System (Takara) and transformed into competent cells of Escherichia coli DH5α. The positive clones were sequenced by Beijing AuGCT DNA-SYN Biotechnology Co., Ltd. (Beijing, China).

Identification of 41 noncoding ORFs
In the re-annotation of hypothetical ORFs, the 1536 ORFs with "putative," "probable," or "hypothetical" functions were re-identified based on 28 parameters, including 12 single nucleotide frequencies and 16 dinucleotide frequencies mentioned above. Firstly, the 3452 function-known genes were randomly divided into two equal parts. The former served as a training set to calculate the discrimination parameters, and the latter served as a test set to assess the accuracy of the algorithm. Both the training and the test sets must include positive and negative samples. In the genome of Xoo PXO99A, about 85% of the DNA sequence is coding and the remaining intergenic regions are dominated by structural RNA sequences, so it was difficult to prepare an appropriate set of negative samples. Thus, the following procedure was used to produce negative samples. Each of the known genes was randomly shuffled 100,000 times, so that it was transformed into a random sequence. The shuffled sequences then served as negative samples. The performance of each test was measured by the following benchmark criteria: sensitivity ($S_n$), specificity ($S_p$), accuracy (Ac), and Matthews correlation coefficient (MCC), given here in their standard definitions:

$$S_n = \frac{TP}{TP + FN}, \qquad S_p = \frac{TN}{TN + FP}, \qquad Ac = \frac{TP + TN}{TP + TN + FP + FN},$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$

where TP, TN, FP, and FN are fractions of true positive, true negative, false positive, and false negative predictions, respectively. After performing 30-fold cross-validation tests, the mean sensitivity, specificity, accuracy, and MCC, together with their standard deviations, were obtained (Table 1). The average prediction accuracy was as high as 99.5%. Then the 1536 hypothetical ORFs in Xoo PXO99A were re-identified. A total of 41 hypothetical ORFs were recognized as noncoding in all of the 30-fold cross-validation tests; they are listed in Table 2.
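The four benchmark criteria can be computed from a confusion matrix as in the following Python sketch. It uses the standard textbook definitions of sensitivity, specificity, accuracy, and MCC, which are assumed here to match those used in the analysis; the function name `benchmark` and the counts are hypothetical and are not the values reported in Table 1.

```python
import math

def benchmark(tp, tn, fp, fn):
    """Return (sensitivity, specificity, accuracy, MCC) from raw counts."""
    sn = tp / (tp + fn)                      # fraction of coding samples recovered
    sp = tn / (tn + fp)                      # fraction of noncoding samples rejected
    ac = (tp + tn) / (tp + tn + fp + fn)     # overall fraction correct
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, ac, mcc

# Hypothetical counts for one test split (not from the paper).
sn, sp, ac, mcc = benchmark(tp=90, tn=80, fp=20, fn=10)
```

Averaging these four values over the repeated random splits, and taking their standard deviations, would give a summary of the kind reported in Table 1.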

Evidence of the 41 recognized ORFs as noncoding ORFs
Since the products of protein-coding genes must fold into stable and functional proteins, many constraints are imposed on protein-coding sequences. In most prokaryotic and eukaryotic species, protein-coding genes show similar base usage patterns at the first and second codon positions, while the base usage at the third codon position is species specific (Chiusano et al., 2000; Gupta, Majumdar, Bhattacharya & Ghosh, 2000; Trifonov, 1987). Generally, the base usage pattern is of the $R\bar{G}N$ type, where $R$, $\bar{G}$, and $N$ denote purine, non-guanine, and any base at the first, second, and third codon positions, respectively (Chen & Zhang, 2003). It has been observed that the first and second codon positions are related to the biosynthetic pathway and protein secondary structures (Chiusano et al., 2000; Gupta et al., 2000). On the other hand, the negative samples were shuffled sequences of function-known genes (random sequences with the same nucleotide composition as protein-coding genes), so the base frequencies at each 'codon' position were almost equal. This difference in base usage between protein-coding genes and noncoding sequences forms the basis of the present method for distinguishing the two types of samples.
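A feature vector of the kind described, 12 codon-position-specific single nucleotide frequencies plus 16 dinucleotide frequencies, might be computed as in the Python sketch below. The exact definitions used in the analysis are not given in this section, so two assumptions are made: the single nucleotide frequencies are taken per codon position in the annotated reading frame, and the dinucleotide frequencies are counted over all overlapping positions. The function name `features_28` and the toy sequence are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

def features_28(seq):
    """Return 12 codon-position base frequencies followed by 16 dinucleotide
    frequencies. Assumes seq starts in frame and has at least 3 bases."""
    seq = seq.upper()
    n_codons = len(seq) // 3
    feats = []
    # 12 values: frequency of A, C, G, T at codon positions 1, 2, and 3.
    for pos in range(3):
        column = seq[pos:n_codons * 3:3]
        counts = Counter(column)
        feats.extend(counts[b] / len(column) for b in BASES)
    # 16 values: frequency of each dinucleotide over overlapping windows
    # (assumed definition; a phase-specific count is another possibility).
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pair_counts = Counter(pairs)
    feats.extend(pair_counts[a + b] / len(pairs)
                 for a, b in product(BASES, repeat=2))
    return feats

# Hypothetical toy ORF, not a real Xoo PXO99A gene.
feats = features_28("ATGGCTAAAGGTGAACTGCGTTAA")
```

For a real coding sequence the first 12 values differ markedly between codon positions, whereas for a shuffled sequence the three positions have nearly identical base frequencies, which is the contrast the classifier exploits.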
The difference between coding and noncoding sequences can be viewed intuitively by principal component analysis (PCA). PCA captures the correlation among the variables of the given data. The first derived direction is chosen to maximize the standard deviation of the derived variable, the second to maximize the standard deviation among directions uncorrelated with the first, and so forth (Dillon & Goldstein, 1984). Figure 1 shows the distribution of points on the principal plane spanned by the first two principal components. The coding and noncoding sequences are represented by open circles and triangles, respectively. The first and second principal axes account for 25.7% and 17.4% of the total inertia of the 28-dimensional space. The two principal axes separate the coding and noncoding sequences into two almost nonoverlapping clusters. The recognized noncoding ORFs are represented by filled stars, which are distributed far from the core of function-known genes and close to the random sequences, implying that the 41 ORFs listed in Table 2 are very unlikely to encode proteins.
The Clusters of Orthologous Groups (COGs) of proteins are included in the RefSeq annotation. A COG is a group of three or more proteins that are inferred to be orthologs, i.e. they have evolved from a common ancestor (Tatusov et al., 2003). Analysis of complete bacterial genomes showed that prokaryotic proteins are generally highly conserved, with about 70% of them containing ancient conserved regions shared by homologs from distantly related species (Tatusov et al., 2003). Therefore, an ORF assigned to a COG, and thus having homologs in other species, is highly likely to be a protein-coding gene. Among the 3452 function-known genes, 2875 had COG codes; however, none of the 41 recognized noncoding ORFs had a COG code. Since most over-annotated ORFs are short, the average lengths of the 3452 function-known genes in the first group and of the 41 recognized noncoding ORFs were compared. The average length of the recognized noncoding ORFs (189 bp) was much shorter than that of the function-known genes (1012 bp, Table 3). In addition, the ab initio gene finding programs Prodigal and FgenesB did not identify the 41 ORFs. Based on these facts, it is highly likely that the 41 "hypothetical genes" annotated in the Xoo PXO99A genome are not protein-coding genes but over-annotated short ORFs. To test this theoretical prediction, 10 noncoding ORFs were randomly selected for RT-PCR experiments. Information on the 10 noncoding ORFs and the designed primers is listed in Table 4, and the PCR results are shown in Fig. 2(A)-(C). The PCR using total DNA as template confirmed that all the DNA segments could be amplified precisely (Fig. 2(A)). In Fig. 2(C), water was used as template to detect whether the total RNA sample had DNA contamination, and a negative result was obtained. Fig. 2(B) shows the RT-PCR result with cDNA as template. Except for the positive control, the 16S rRNA gene (a multicopy gene), none of the tested noncoding ORFs was amplified, confirming that they are true noncoding ORFs. The DNA sequencing results confirmed that the PCR products were the correct target sequences (data not shown). The RT-PCR results thus support the high accuracy of the theoretical prediction.

Newly predicted protein coding genes not annotated in NCBI RefSeq
The ab initio programs employed in NCBI RefSeq annotation include Glimmer (Delcher, Bratke, Powers, & Salzberg, 2007), GeneMarkS (Besemer & Borodovsky, 2005), and the recently developed Prodigal (Hyatt et al., 2010). However, the Prodigal predictions have not been incorporated into the RefSeq annotation of this genome. FgenesB is another accurate gene prediction program not used in RefSeq annotation. In this analysis, Prodigal and FgenesB were used to find new protein-coding genes in the Xoo PXO99A genome that are not present in the RefSeq annotation. Prodigal and FgenesB predicted 4980 and 5074 ORFs in the Xoo PXO99A genome, respectively. Since each gene finding program produces some false-positive predictions, ORFs predicted by only one of Prodigal or FgenesB were regarded as false positives. The Venn diagram comparing Prodigal, FgenesB, and the JCVI CMR annotation is shown in Figure 3. The three systems predicted 4543 ORFs in common; 4346 of them were contained in the NCBI RefSeq annotation, and the other 197 ORFs were not. Since all three systems predicted these 197 ORFs, and the method proposed in this paper also recognized them, they are likely to be protein-coding genes missed in the RefSeq annotation.
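The consensus step amounts to a set intersection followed by a set difference, as the Python sketch below illustrates. It assumes that an ORF is identified by its strand and stop coordinate, so that predictions differing only in start site count as the same ORF; that keying choice and all coordinates are hypothetical. Only the counts 4543, 4346, and 197 come from the text.

```python
# Each ORF is keyed by (strand, stop coordinate); an assumed convention so
# that predictions differing only in their start site still match.
prodigal = {("+", 1200), ("+", 3500), ("-", 5100), ("+", 8000)}
fgenesb = {("+", 1200), ("+", 3500), ("-", 5100), ("-", 9100)}
cmr = {("+", 1200), ("+", 3500), ("-", 5100), ("+", 8000)}
refseq = {("+", 1200), ("-", 5100)}

# ORFs predicted by all three systems.
common = prodigal & fgenesb & cmr

# Consensus ORFs that are absent from the RefSeq annotation.
candidates = common - refseq

# The same arithmetic with the counts reported in the text.
n_common, n_in_refseq = 4543, 4346
n_new = n_common - n_in_refseq
```

With the toy sets above, three ORFs are shared by all systems and one of them is missing from RefSeq; with the reported counts, the difference is the 197 candidate ORFs discussed in the text.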
Then a BLAST search was performed to find potential functions for these ORFs, and the prediction results can be classified into several groups. Sixty-nine ORFs were predicted to encode 25 transposases and 44 truncated transposases (Supplementary Table I). The original RefSeq annotation of the Xoo PXO99A genome contained 765 transposases; adding the newly annotated transposases, Xoo PXO99A encodes more than 820 transposases, accounting for about 16% of its genes, which is much higher than the average transposase content of 1.1% in bacterial genomes (Aziz, Breitbart, & Edwards, 2010). Twenty-nine ORFs encoded hypothetical proteins or had no sequence similarity in public databases (Supplementary Table II). Eighty-seven ORFs were predicted with concrete functions (Supplementary Table III). Surprisingly, many ORFs encoded truncated genes, and sometimes several neighboring genes aligned to different parts of the same homologous gene in other bacteria (the aligned part is listed in the "identity" column in Supplementary Table III). Further investigation showed that transposases flank most of the truncated genes. Transposases are enzymes that bind to the ends of transposons and catalyze the movement of transposons to other parts of the genome by a cut-and-paste mechanism or a replicative transposition mechanism, without the need for homology between transposons and the new DNA target sites (Curcio & Derbyshire, 2003). As mentioned above, Xoo PXO99A contains a high proportion of transposases (about 16% of its genes) that might be involved in maintaining genetic diversity through genome rearrangements. Since these genes are truncated functional genes, most of them might not encode proteins. In addition, the other two sequenced Xoo strains, KACC10331 and MAFF311018, contain 662 (16%) and 675 (15%) transposases, respectively. Salzberg et al. (2008) also stated that most of the rearrangements between Xoo PXO99A and MAFF311018 were mediated by a diverse set of transposable elements.

Table 5 lists the remaining 12 potential protein-coding genes, which had high sequence identity and alignment coverage with function-known genes. Thus, they are very likely to be new functional genes in the Xoo PXO99A genome. All 12 protein-coding genes were tested with RT-PCR experiments (Table 6), and the results are shown in Fig. 4(A)-(C). The PCR using total DNA as template confirmed that all the DNA segments could be amplified precisely (Fig. 4(A)). Fig. 4(C) shows that the total RNA sample had no DNA contamination. Fig. 4(B) shows that, with cDNA as template, 11 (92%) of the tested genes were successfully amplified, the exception being PC-2. The RT-PCR results verify that most of the 12 predicted potential new genes are true protein-coding genes.

Figure 1. The distribution of points on the principal plane spanned by the first (x, PCA1) and second (y, PCA2) principal axes using PCA in Xoo PXO99A. The open circles denote function-known genes, the open triangles represent the corresponding negative samples, and the filled stars denote the hypothetical ORFs recognized as noncoding sequences. The first and second principal axes account for 25.7% and 17.4% of the total inertia of the 28-dimensional space, respectively. Note that the distribution of the open circles is well separated from that of the open triangles, indicating that coding and noncoding sequences are well distinguished. Furthermore, most of the identified noncoding ORFs are distributed far from the core of protein-coding genes and close to the random sequences, implying that the 41 recognized noncoding ORFs listed in Table 2 are very unlikely to encode proteins.

Figure 2. (B) In the RT-PCR with cDNA as template, the positive control, the 16S rRNA gene (a multicopy gene), was amplified. The tested samples all gave negative amplification results, as predicted. (C) When total RNA was used as template in the PCR, no products were amplified.

Conclusion
The present analysis focused on the re-annotation of the plant pathogen Xoo PXO99A, the causal agent of bacterial blight of rice. Forty-one originally annotated hypothetical genes were recognized as noncoding sequences, a result supported by PCA, other theoretical evidence, and RT-PCR experiments. In addition, 12 potential protein-coding genes were predicted with functions, and 11 (92%) were verified to be true protein-coding genes by RT-PCR experiments. Another important finding was that diversity among different Xoo strains is mainly caused by the large number of transposases. The improved annotation of the Xoo PXO99A genome will benefit research on the lifestyle, metabolism, and pathogenicity of this important plant pathogen.

Figure 4. … and PC-12 (305 bp). (B) In the RT-PCR with cDNA as template, the positive control of 16S rRNA was amplified. Eleven samples were specifically amplified, the exception being PC-2. (C) When total RNA was used as template in the PCR, the amplification results were all negative.