Computational Prediction and Functional Annotation of Enzymes in the Haloacid Dehalogenase Superfamily for Bioremediation

Halogenated organic compounds are serious environmental pollutants that are difficult to remove from soil and groundwater. Some enzymes in the Haloacid Dehalogenase (HAD) superfamily can detoxify and degrade halogenated compounds. Unfortunately, most members of this superfamily, like most of the 13,000+ Structural Genomics protein structures in the Protein Data Bank (PDB), have an unknown biochemical function or a putative function that is incorrect or uncertain. To turn genomics data into practical benefits for humankind, reliable methods are needed to functionally annotate these proteins and thereby identify potential applications such as bioremediation. We use two computational methods developed at Northeastern University, Partial Order Optimum Likelihood (POOL) and Structurally Aligned Local Sites of Activity (SALSA), to predict biochemical function for HAD superfamily proteins of unknown or uncertain function. POOL is a machine learning method that predicts which residues are catalytically active or otherwise important for protein function. Its inputs are electrostatic features and metrics from THEoretical Microscopic Anomalous TItration Curve Shapes (THEMATICS), geometric features of ligand binding pockets from ConCavity, and phylogenetic tree-based evolutionary scores from INformation-theoretic TREe traversal for Protein functional site IDentification (INTREPID). SALSA takes the functional residue predictions from POOL and assigns function to Structural Genomics proteins according to the local spatial arrangement of the predicted residues at the active site. So far, among the Structural Genomics proteins in the HAD superfamily, SALSA predicts one dehalogenase, eight sugar phosphatases, three NagD-like phosphatases, four P-type ATPases, and four soluble epoxide hydrolases.
These predictions will be experimentally validated by direct biochemical assay, both to establish the function of each protein and to test our computational approach to protein function prediction. The ability to computationally predict the biochemical function of protein structures of unknown or uncertain function adds tremendous value to genomics data.
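
The function-assignment step described above, in which predicted active-site residues are compared across structurally aligned proteins, can be illustrated with a toy sketch. This is not the SALSA implementation: the scoring rule (fraction of signature positions matched), the 0.7 threshold, the names `Signature`, `match_score`, and `assign_function`, and the residue data are all assumptions made up for illustration. It assumes that each functional subgroup is summarized by a consensus set of residues at aligned positions, and that the query protein's predicted residues have already been mapped onto the same alignment columns.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    """Hypothetical consensus signature for one functional subgroup."""
    name: str
    residues: dict[int, str]  # alignment column -> expected residue type


def match_score(query: dict[int, str], sig: Signature) -> float:
    """Fraction of signature positions where the query has the same residue.

    Assumed scoring rule for illustration only.
    """
    if not sig.residues:
        return 0.0
    hits = sum(1 for col, aa in sig.residues.items() if query.get(col) == aa)
    return hits / len(sig.residues)


def assign_function(query: dict[int, str],
                    signatures: list[Signature],
                    threshold: float = 0.7) -> tuple[str | None, float]:
    """Return the best-matching subgroup name, or None if below threshold."""
    if not signatures:
        return None, 0.0
    best = max(signatures, key=lambda s: match_score(query, s))
    score = match_score(query, best)
    return (best.name if score >= threshold else None), score


# Invented example data: these are NOT real HAD superfamily signatures.
signatures = [
    Signature("subgroup_A", {1: "D", 2: "T", 3: "K", 4: "D"}),
    Signature("subgroup_B", {1: "D", 2: "S", 3: "R", 4: "E"}),
]
# Predicted functional residues of a query protein, keyed by alignment column.
query = {1: "D", 2: "T", 3: "K", 4: "D"}

label, score = assign_function(query, signatures)
print(label, score)
```

Returning `None` when no signature reaches the threshold reflects the case where a protein's predicted residues do not resemble any known subgroup, so no function is assigned rather than forcing the nearest label.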