Posted on 2025-01-28, 17:35. Authored by Connor O’Ryan, Kevin D. Hayes, Francis G. VanGessel, Ruth M. Doherty, William Wilson, John Fischer, Zois Boukouvalas, Peter W. Chung
In 2020, nearly 3 million scientific and engineering papers were
published worldwide (White, K. Publications Output: U.S. Trends And
International Comparisons). The vastness of the literature that already
exists, the increasing rate of appearance of new publications, and
the timely translation of artificial intelligence methods into scientific
and engineering communities have ushered in the development of automated
methods for mining and extracting information from technical documents.
However, domain-specific approaches for extracting knowledge graph
representations from semantic information remain limited. In this
paper, we develop a natural language processing (NLP) approach that
extracts knowledge graphs from text, yielding a semantically structured
network (SSN) that can be queried. After a detailed exposition of the modeling
method, the approach is demonstrated for the synthetic chemistry of
organic molecules using the text of approximately 100,000 full-length
patents. We focus on characterizing
the knowledge graph to develop insights into the linguistic patterns
and trends within the data and to establish objective graph characteristics
that may enable comparisons among other text-based knowledge graphs
across domains. Graph characterization is performed for network motif
structures, assortativity, and eigenvector centrality. The structural
information provided by the measures reveals language tendencies commonly
employed by authors in the text discourse for chemical reactions.
These include a prevalence of specific compound names in the descriptions,
the use of common solvents and drying agents across large numbers of
chemical synthesis approaches, and power-law trends that clearly
emerge in the limit of larger corpora. The findings
provide important quantitative characterizations of knowledge graphs
for use in validation in large data settings.
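
As a rough illustration of the three graph measures named above (motif counts, assortativity, and eigenvector centrality), the following minimal Python sketch computes them on a small undirected toy graph. It is not the authors' implementation. The node names, the edge list, and all function names are hypothetical and chosen only for illustration; the triangle count stands in for the simplest three-node motif, and the centrality uses a shifted power iteration as one common way to obtain the leading eigenvector.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy graph: node names and edges are illustrative only
# and are not taken from the patent corpus described in the abstract.
edges = [
    ("dichloromethane", "compound_A"),
    ("dichloromethane", "compound_B"),
    ("dichloromethane", "compound_C"),
    ("dichloromethane", "magnesium sulfate"),
    ("magnesium sulfate", "compound_A"),
]


def build_adjacency(edge_list):
    """Build an undirected adjacency map {node: set(neighbours)}."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def count_triangles(adj):
    """Count triangles, the simplest closed three-node motif."""
    total = 0
    for nbrs in adj.values():
        for v, w in combinations(nbrs, 2):
            if w in adj[v]:
                total += 1
    return total // 3  # each triangle is seen once from each of its corners


def degree_assortativity(adj):
    """Pearson correlation of the degrees at the two ends of every edge."""
    # Each undirected edge contributes both orientations (u, v) and (v, u).
    pairs = [(len(adj[u]), len(adj[v])) for u in adj for v in adj[u]]
    n = len(pairs)
    mean = sum(x for x, _ in pairs) / n
    cov = sum((x - mean) * (y - mean) for x, y in pairs) / n
    var = sum((x - mean) ** 2 for x, _ in pairs) / n
    # Undefined when every node has the same degree.
    return cov / var if var > 0 else float("nan")


def eigenvector_centrality(adj, max_iter=1000, tol=1e-12):
    """Leading eigenvector of the adjacency matrix via power iteration.

    Iterating with (A + I) instead of A avoids oscillation on bipartite
    graphs while leaving the leading eigenvector unchanged.
    """
    x = {node: 1.0 for node in adj}
    for _ in range(max_iter):
        new = {node: x[node] + sum(x[nb] for nb in adj[node]) for node in adj}
        norm = sqrt(sum(val * val for val in new.values()))
        new = {node: val / norm for node, val in new.items()}
        if sum(abs(new[node] - x[node]) for node in adj) < tol:
            return new
        x = new
    return x


adj = build_adjacency(edges)
centrality = eigenvector_centrality(adj)
print("triangles:", count_triangles(adj))
print("assortativity:", degree_assortativity(adj))
print("most central node:", max(centrality, key=centrality.get))
```

In this toy graph the solvent node is connected to every other node, so it acts as a hub; a hub linked mostly to low-degree nodes tends to push the degree assortativity toward negative values, which is one way a "cuts across many syntheses" pattern could show up in such measures.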