The dictionary was constructed by considering the classification results of a particular term in different articles.
Taken from "Building a protein name dictionary from full text: a machine learning term extraction approach"
BMC Bioinformatics 2005;6:88.
Published online 7 Apr 2005
Copyright © 2005 Shi and Campagne; licensee BioMed Central Ltd.
Step 1: we removed terms that were predicted to be a protein name in fewer than 75% of the articles in which a prediction was made. For example, if term A appears in 4 articles and is classified as a protein name in 3 of them (exactly 75%), term A is accepted into the dictionary. This process collected 61,312 terms.
Step 2: we removed terms of two characters or fewer.
Step 3: to remove ambiguity with protein names that are also common nouns, we removed terms that appear in the Webster's Revised Unabridged Dictionary (G & C. Merriam Co., 1913, edited by Noah Porter, provided by Patrick Cassidy of MICRA, Inc, and retrieved from ). We estimate that this edition contains about 80 common protein names (e.g., amylase).
Step 4: we removed terms that match species names from the NCBI taxonomy database.
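
The four filtering steps can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `build_dictionary`, its parameters, the `(term, article_id, is_protein)` input format, and the case-insensitive matching against the word lists are all assumptions made for illustration. It also assumes one prediction per term per article.

```python
from collections import defaultdict


def build_dictionary(predictions, common_words, species_names,
                     min_fraction=0.75, min_length=3):
    """Hypothetical sketch of the four-step filter.

    predictions: iterable of (term, article_id, is_protein) tuples,
    assumed to hold one prediction per term per article.
    """
    # term -> [articles predicting "protein", articles with any prediction]
    votes = defaultdict(lambda: [0, 0])
    for term, _article_id, is_protein in predictions:
        votes[term][1] += 1
        if is_protein:
            votes[term][0] += 1

    # Step 1: keep terms predicted as protein in at least 75% of articles.
    accepted = {t for t, (pos, total) in votes.items()
                if pos / total >= min_fraction}

    # Step 2: drop terms of two characters or fewer.
    accepted = {t for t in accepted if len(t) >= min_length}

    # Step 3: drop terms found in a common-word list
    # (case-insensitive matching is an assumption).
    common = {w.lower() for w in common_words}
    accepted = {t for t in accepted if t.lower() not in common}

    # Step 4: drop terms that match species names (same assumption).
    species = {s.lower() for s in species_names}
    accepted = {t for t in accepted if t.lower() not in species}

    return accepted
```

With this threshold, a term classified as a protein name in 3 of 4 articles sits exactly at 75% and is kept, matching the term A example, while a term at 2 of 4 is dropped. A common word such as amylase is removed by step 3 even when every article classifies it as a protein name.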