
The dictionary was constructed by considering the classification results of a particular term in different articles


posted on 2011-12-30, 23:43 authored by Lei Shi, Fabien Campagne

Copyright information:

Taken from "Building a protein name dictionary from full text: a machine learning term extraction approach"

BMC Bioinformatics 2005;6:88.

Published online 7 Apr 2005

PMCID:PMC1090555.

Step 1: we filtered out terms that were predicted to be a protein name in fewer than 75% of the articles where a prediction was made. For example, if term A appears in 4 articles and is classified as a protein name in 3 of them (exactly 75%), term A is accepted into the dictionary. This process collected 61,312 terms. Step 2: we removed terms of two characters or fewer. Step 3: to remove ambiguity with protein names that are also common nouns, we filtered the dictionary against Webster's Revised Unabridged Dictionary (G. & C. Merriam Co., 1913, edited by Noah Porter, provided by Patrick Cassidy of MICRA, Inc., and retrieved from ). We estimate that this edition contains about 80 common protein names (e.g., amylase). Step 4: we filtered the dictionary against species names from the NCBI taxonomy database [30].
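
The four filtering steps in the caption can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `build_dictionary`, the input layout (one `(term, article_id, is_protein)` tuple per term per article), and the case-insensitive matching against the common-word and species lists are all assumptions made for the example.

```python
from collections import defaultdict


def build_dictionary(predictions, common_words, species_names,
                     min_fraction=0.75, min_length=3):
    """Return the set of terms that survive the four filters.

    predictions   -- iterable of (term, article_id, is_protein) tuples,
                     assumed to hold one prediction per term per article
    common_words  -- common English words (e.g. from a general dictionary)
    species_names -- species names (e.g. from a taxonomy database)
    """
    # Count, per term, the articles with a positive call and all articles.
    votes = defaultdict(lambda: [0, 0])  # term -> [positive, total]
    for term, _article_id, is_protein in predictions:
        votes[term][1] += 1
        if is_protein:
            votes[term][0] += 1

    # Assumption: list lookups ignore case.
    common = {w.lower() for w in common_words}
    species = {s.lower() for s in species_names}

    accepted = set()
    for term, (positive, total) in votes.items():
        # Step 1: drop terms called a protein in fewer than 75% of articles.
        if positive / total < min_fraction:
            continue
        # Step 2: drop terms of two characters or fewer.
        if len(term) < min_length:
            continue
        # Step 3: drop terms that are also common words.
        if term.lower() in common:
            continue
        # Step 4: drop terms that are species names.
        if term.lower() in species:
            continue
        accepted.add(term)
    return accepted
```

With the caption's own example, a term classified as a protein name in 3 of 4 articles has a fraction of exactly 0.75 and passes step 1, because only fractions strictly below the threshold are rejected.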