Toward Semantics Empowered Biomedical Web Services

caGrid has accumulated a repository of biomedical services; however, how a cancer researcher can find the proper services in caGrid when needed remains a major challenge. This research aims to enhance the cyberinfrastructure of caGrid by developing a mechanism that turns caGrid services into semantics-aware, interoperable services. We propose a service semantics model and have developed a technique that automatically extracts semantic metadata from static WSDL service descriptions. Such semantic information is stored as loosely coupled annotations that can be queried using Semantic Web techniques to enhance services discovery and composition. We also propose a two-phase discovery technique that helps users quickly identify service operations of interest. This paper also reports our examination of available techniques and recommends a feasible infrastructure for biomedical service reuse. A prototype system has been developed as a proof of concept.


I. INTRODUCTION
Services computing techniques have enabled scientists to expose data and computational resources as Web services. As shown in Fig. 1, scientists create services (data, code, instructions) and publish them on the Internet using the machine-understandable Web Services Description Language (WSDL). Other scientists may discover the services and decide whether, and which one, to use. They may also compose multiple services to create a new experimental process (a scientific workflow [1]). If successful, the scientific workflow will be published as a new service for other scientists to reuse. Such a virtuous cycle can be envisioned as a new paradigm in science: service-oriented science, or e-Science [2].
Life science is one of the disciplines pioneering the trend of e-Science. When the National Cancer Institute (NCI) launched the cancer Biomedical Informatics Grid (caBIG) initiative, one key strategy was to leverage services computing technology to connect the entire cancer community and accelerate cancer research.
To help life scientists publish and discover scientific services, centralized repositories and registries have been established. One well-known repository is the BioCatalogue [3], which has accumulated more than 1,700 biomedical services to date. The caBIG project has also created caGrid as a service repository. However, it has become a major challenge for life scientists to understand the thousands of available services and select appropriate ones for their own workflow design. Our recent analytical study revealed that only a small number of utility services at BioCatalogue (e.g., Blast, http://xml.nig.ac.jp/wsdl/Blast.wsdl, an application for comparing genome sequences) are frequently used by life scientists to build scientific workflows [4].
Our survey of various caBIG projects exposed three reasons that may lead to this phenomenon. First, life scientists are not aware of the available services. Second, life scientists struggle with how to operate a service (e.g., to conform to its data and operation formats). Third, life scientists are unable to gather all relevant knowledge to best understand the services.
NCI thus formed a project, "Semantic Workflows," aiming to explore a way to help life scientists build workflows from reusable services. The Center for Biomedical Informatics and Information Technology (CBIIT) at NCI decided to adopt the Health Level Seven (HL7) Services Aware Interoperability Framework (SAIF) [5] to enable the development of domain software components as Working Interoperability (WI).
Toward this ultimate goal, our first step is to investigate applying Semantic Web technology [6] to enhance artifact discovery and composition. Here the term artifact refers to either a service or a workflow. Our rationale is as follows. 1) The existing artifact publication techniques (e.g., WSDL files) only provide syntactical invocation interfaces for the artifacts. 2) Semantic information, referring to the meaning of data and functions, should help scientists better understand available artifacts (e.g., what are the best ways to use them; are there any constraints, etc.), and thus help scientists better leverage existing artifacts. 3) The Semantic Web community has created a wealth of methods, standards, and technologies to create, store, analyze, and query machine-understandable metadata (semantics) over Web information [6]. 4) Therefore, we shall investigate how to leverage Semantic Web technologies to improve artifact discovery and composition, in the context of caBIG and life science research. It should be noted that our research approaches may be easily expanded to other areas not
This paper reports our ongoing efforts to automatically extract semantic information from available biomedical services, toward achieving "computable semantic interoperability" (CSI). Our contributions are four-fold. 1) We proposed a service semantics model and built techniques to automatically extract semantic information from published service documents and annotate services. 2) We established a two-phase search technique that leverages structural and semantic information to quickly identify related operations. 3) We suggest a feasible service semantics extraction and generation infrastructure compatible with existing standards and techniques. 4) We built a prototype service search engine as a plugin to a known scientific workflow system.
The remainder of the paper is organized as follows. In Section II, we discuss related work. In Section III, we introduce our service semantics model and an automatic semantics extraction technique. In Section IV, we present our two-phase services discovery technique. In Section V, we present our proposed system infrastructure and implementation details. In Section VI, we present experiments and discussions. In Section VII, we draw conclusions.

II. RELATED WORK
Segev and Zheng [7] propose an ontology bootstrapping method that automatically generates concepts and their relations in a domain from WSDL files. In contrast to their work, we focus on clustering services and service operations to facilitate services discovery.
The Semantic Automated Discovery and Integration (SADI) [8] framework is able to recommend SADI services based on input or output data types. However, a SADI service has to be created manually, built and deployed as a servlet, and then registered with the SADI registry before it can be searched by the SADI search engine. In contrast, our search engine can cover ordinary WSDL services. Furthermore, while SADI search only compares the input/output OWL class URLs in registered SADI services, our work considers more semantic conditions (e.g., functional profile, pre- and post-conditions, and constraints).
Zhang and Li introduce the concept of a service cluster [9] to represent a collection of available services provided by multiple service providers to perform a specific common function. Here we use the concept to represent a collection of functionally relevant services based on a domain-specific ontology. We use clustering algorithms to automatically identify service clusters from their published documents.
The WINGS [10] project adopts AI planning and semantic reasoners to verify whether a workflow complies with the requirements of its constituent components and datasets. The Kepler [11] workflow management system provides ontology-driven search capability for data and actors that have been annotated with formal ontologies. In contrast, we focus on using automatically extracted semantic metadata to enhance services discovery.
There have been many efforts on semantic services discovery, most of which perform profile-based service signature (I/O) matching [12]. OWLS-MX [12] and WSMO-MX [13] propose to combine logic-based reasoning and syntactic concept similarity computations in OWL-S. Sbodio et al. [14] propose to use SPARQL as a formal language to describe the pre- and post-conditions of services. Junghans et al. [15] propose a practical formalism to describe functionalities and service requests. In contrast, we leverage service functional profiles and service operation structures to enhance services discovery.

III. SEMANTICS MODEL AND AUTOMATIC EXTRACTION
As shown in Fig. 2, we propose a service semantics model comprising both static semantics and behavioral semantics. Static semantics describe the functionalities that a service promises to provide, as well as the goals of the service. Behavioral semantics describe the circumstances under which a service can operate, including input and output parameters, pre- and post-conditions, constraints, and historical usage patterns.
The model defines an ontology that describes the semantics of a service. From the service provider's perspective, it guides how to depict a service; from the consumer's perspective, it facilitates effective services discovery. We have developed a technique that can automatically extract the aforementioned semantic metadata from a published WSDL service.

A. Static semantics
The goals of a service are usually implied by embedded comments; therefore, we focus on how to extract the functional profile of a service from its WSDL file. Our hypothesis is that user-defined names used in a WSDL document may depict a functional profile of the corresponding service. The rationale is that service developers tend to follow naming conventions and use meaningful words to name operations and services. Furthermore, when an IDE automatically generates a WSDL document from source code, where naming conventions are typically strictly enforced, the actual method names are used to generate the corresponding WSDL interfaces. As shown by the following example, a method named "add" in a Java class Calculator causes all related WSDL segments to be named after it (WSDL 1.1 generated by the Eclipse IDE for Java), including the portType name, operation name, input/output message names, and part names inside the input/output messages.

    <wsdl:message name="addResponse">
      <wsdl:part element="impl:addResponse" name="parameters"/>
    </wsdl:message>
    <wsdl:message name="addRequest">
      <wsdl:part element="impl:add" name="parameters"/>
    </wsdl:message>
    <wsdl:portType name="Calculator">
      <wsdl:operation name="add">
        <wsdl:input message="impl:addRequest" name="addRequest"/>
        <wsdl:output message="impl:addResponse" name="addResponse"/>
      </wsdl:operation>
    </wsdl:portType>

Therefore, we define the functional profile of a service s as follows. It is a combination of the service name, the portType name, and the information of all constituent operations, each in turn including the name of the operation, the names of the input/output messages, and the names of all constituent parts of the messages.
We thus obtain a service functional profile as a document containing all the names extracted from its WSDL file. The resulting profile may contain duplicate words.
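The extraction just described can be sketched in a few lines. The following is a minimal illustration for WSDL 1.1 only, not the implementation used in our prototype; the function name `functional_profile`, the `urn:calc` namespace, and the sample document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

WSDL_NS = "{http://schemas.xmlsoap.org/wsdl/}"  # WSDL 1.1 namespace


def functional_profile(wsdl_text):
    """Collect user-defined names from a WSDL 1.1 document (duplicates kept)."""
    root = ET.fromstring(wsdl_text)
    # Map each message name to the names of its parts.
    parts = {
        msg.get("name"): [p.get("name") for p in msg.findall(WSDL_NS + "part")]
        for msg in root.findall(WSDL_NS + "message")
    }
    names = [s.get("name") for s in root.findall(WSDL_NS + "service")]
    for port_type in root.findall(WSDL_NS + "portType"):
        names.append(port_type.get("name"))
        for op in port_type.findall(WSDL_NS + "operation"):
            names.append(op.get("name"))
            for tag in ("input", "output"):
                for io in op.findall(WSDL_NS + tag):
                    # Drop the namespace prefix, e.g. "impl:addRequest".
                    msg_name = io.get("message", "").split(":")[-1]
                    names.append(msg_name)
                    names.extend(parts.get(msg_name, []))
    return [n for n in names if n]


# Hypothetical sample mirroring the Calculator example above.
SAMPLE = """<wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
                              xmlns:impl="urn:calc">
  <wsdl:message name="addResponse">
    <wsdl:part element="impl:addResponse" name="parameters"/>
  </wsdl:message>
  <wsdl:message name="addRequest">
    <wsdl:part element="impl:add" name="parameters"/>
  </wsdl:message>
  <wsdl:portType name="Calculator">
    <wsdl:operation name="add">
      <wsdl:input message="impl:addRequest" name="addRequest"/>
      <wsdl:output message="impl:addResponse" name="addResponse"/>
    </wsdl:operation>
  </wsdl:portType>
</wsdl:definitions>"""

profile = functional_profile(SAMPLE)
print(profile)
```

The duplicate "parameters" entries are kept on purpose, since term frequency is computed over the raw profile later.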

B. Behavioral semantics
Our hypothesis is that a service, especially a scientific service, may require a suitable execution context to produce the best result. For example, our past analysis found that the performKNN1 service (a K-Nearest Neighbor machine learning method) usually performs well when integrated with preprocessing services provided by the same research group. Therefore, we highlight the importance of extracting the behavioral semantics of a published service. Such metadata should improve the precision of services discovery, because they can be compared with the contexts in which service requestors reside [16]. In other words, by automatically extracting the behavioral semantics of published services and advertising their expected behaviors, we improve the interoperability of services.
As shown in Fig. 2, we identify four categories of behavioral semantics. Our categorization is compatible with the HL7 Behavioral Framework Metamodel [5], which defines the metadata needed for workflows in NCI CBIIT.
I/O parameters refer to input and output data types. At the current time, we consider exact matching of data types defined in WSDL files or in referenced XML Schema files.
According to XML Schema, all 19 built-in primitive types and 27 built-in derived types can each be associated with a set of constraining facets. Our study yielded a matrix of allowable constraining facets for all XML built-in types [17]. In this project, we leverage our earlier work and search for constraints defined by service providers in the form of constraining facets in WSDL files.
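As an illustration of this search for constraining facets, the sketch below walks an XML Schema fragment and collects the facets declared under each `xs:restriction`. It is a simplified example under stated assumptions: the function name `extract_facets` and the `KValue` type are hypothetical, and only named or anonymous `xs:simpleType` restrictions are handled.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"


def extract_facets(schema_text):
    """Return {simpleType name: (base type, {facet name: [values]})}."""
    root = ET.fromstring(schema_text)
    result = {}
    for simple_type in root.iter(XS + "simpleType"):
        restriction = simple_type.find(XS + "restriction")
        if restriction is None:
            continue
        facets = {}
        for facet in restriction:
            value = facet.get("value")
            if value is not None:  # skip children that are not facets
                facets.setdefault(facet.tag.replace(XS, ""), []).append(value)
        result[simple_type.get("name")] = (restriction.get("base"), facets)
    return result


# Hypothetical schema fragment: an integer parameter limited to 1..50.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="KValue">
    <xs:restriction base="xs:int">
      <xs:minInclusive value="1"/>
      <xs:maxInclusive value="50"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>"""

constraints = extract_facets(SCHEMA)
print(constraints)
```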
Another earlier study of ours revealed that historical usage data may be leveraged to increase the effectiveness and efficiency of services discovery [18]. For example, assume that historical data shows that two services ($s_1$ and $s_2$) are always used together in common workflows. This knowledge indicates that, if a scientist selects one such service (say $s_1$), the other service should be recommended to the scientist. The usage pattern metadata is intended to record such best practices from past experience. Our technique for automatically mining service-oriented past usage patterns is reported in another paper [4].
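To make the $s_1$/$s_2$ example concrete, the sketch below counts how often pairs of services co-occur in past workflows and recommends frequent partners of a selected service. This is only a plain co-occurrence count for illustration, not the mining technique reported in [4]; the function names, the `min_support` parameter, and the sample workflows are assumptions.

```python
from collections import Counter
from itertools import combinations


def co_usage_counts(workflows):
    """Count in how many workflows each unordered pair of services co-occurs."""
    counts = Counter()
    for services in workflows:
        for a, b in combinations(sorted(set(services)), 2):
            counts[(a, b)] += 1
    return counts


def recommend(selected, workflows, min_support=2):
    """Services seen together with `selected` in at least `min_support` workflows."""
    partners = Counter()
    for (a, b), n in co_usage_counts(workflows).items():
        if n < min_support:
            continue
        if a == selected:
            partners[b] = n
        elif b == selected:
            partners[a] = n
    return [service for service, _ in partners.most_common()]


# Hypothetical usage history: each inner list is one workflow.
HISTORY = [["s1", "s2"], ["s1", "s2", "s3"], ["s3", "s4"]]
print(recommend("s1", HISTORY))
```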
Note that behavioral semantics may help to enable the computationally assisted assembly of services using business-level guidance in addition to technical-level guidance (e.g., syntactical service matchmaking). As the first step, here we focus on developing a notion of "just enough semantics."

C. Semantic annotations
Based on our proposed service semantics model, all metadata is automatically generated from service sources (WSDLs). We record such semantic metadata in the form of annotations to further facilitate services discovery.
We face two options for storing the annotations of a service: a tightly coupled way or a loosely coupled one. The first option is to insert annotations into the original WSDL file. The second option is to store annotations in a separate document, while inserting a link into the original WSDL file to refer to the annotation document. Each option has a W3C project supporting it: the SPARQL Annotations in WSDL (SPDL) project [19] for the former and the Web Service Modeling Ontology (WSMO) [20] for the latter.
Fig. 3. Loosely coupled semantic annotation.
In contrast, WSMO introduces Semantic Annotations for WSDL (SAWSDL) for inserting annotation reference links.
The following example shows that an annotation file "elementRef" can be accessed by the URI, using Web Service Modeling Language (WSML), a SAWSDL extension:

    <xs:schema elementFormDefault="qualified"
               targetNamespace="http://www.w3.org/2002/ws/sawsdl/spec/wsdl/order#">
      <xs:element name="add"
                  sawsdl:modelReference="http://www.example-one.org#elementRef"/>
    </xs:schema>

We consider a loosely coupled solution important for achieving higher reusability and flexibility in this project, especially because Web services are hosted and maintained by their respective service providers. As shown in Fig. 3, the WSMO approach allows us to keep the registry of our search engine clean, since we do not have to change the WSML-annotated WSDL files when annotations change, e.g., when new usage patterns are identified and recorded in the annotation file of a service.
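A consumer of such an annotated file only needs to follow the reference links. The sketch below reads the `sawsdl:modelReference` attributes from a schema fragment; it assumes the standard SAWSDL namespace `http://www.w3.org/ns/sawsdl` is bound to the `sawsdl` prefix, and the function name `model_references` is hypothetical.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"
SAWSDL = "{http://www.w3.org/ns/sawsdl}"


def model_references(schema_text):
    """Map each annotated element name to its list of annotation URIs."""
    root = ET.fromstring(schema_text)
    refs = {}
    for element in root.iter(XS + "element"):
        value = element.get(SAWSDL + "modelReference")
        if value:
            # modelReference holds one or more whitespace-separated URIs.
            refs[element.get("name")] = value.split()
    return refs


# The example above, with the namespace declarations it needs to parse.
ANNOTATED = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:sawsdl="http://www.w3.org/ns/sawsdl"
    elementFormDefault="qualified"
    targetNamespace="http://www.w3.org/2002/ws/sawsdl/spec/wsdl/order#">
  <xs:element name="add"
      sawsdl:modelReference="http://www.example-one.org#elementRef"/>
</xs:schema>"""

print(model_references(ANNOTATED))
```

Because only the link lives in the annotated file, the annotation document behind the URI can change without touching this file.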

IV. TWO-PHASE SERVICES DISCOVERY
We propose a two-phase technique for services discovery at the search engine. A published service usually offers a set of operations, providing similar functionalities but serving slightly different scenarios. For instance, two operations in a service may offer exactly the same functionality but require different numbers of input arguments or different input data types. This observation is extremely important for biomedical services discovery. Consider a life scientist who has some KEGG data at hand and intends to find an available service that can take the KEGG data type as input to conduct data analysis. In such a circumstance, we aim to identify proper operations, instead of stopping at the service level.
Meanwhile, the performance of services discovery remains a significant issue because of the size of the search space. For example, the BioCatalogue repository has accumulated over 1,700 biomedical services.
We thus developed a two-phase approach. In phase one, we aim to quickly locate a group of related services (a service cluster comprising relevant services from the original large-scale service set) to substantially narrow down the search space. In phase two, we aim to find proper operations inside the service cluster based on semantic context.

A. Service clustering
To find a proper service cluster in a service repository, we apply information retrieval technology to divide services into clusters based on their functional similarities. In contrast to existing service categorization approaches that depend on keywords provided by service providers, our approach automatically divides services into clusters based on their static semantics (i.e., functional profiles), which can be automatically extracted from their published files. The following pseudo code shows how we cluster services in a registry:
We use their functional profiles to evaluate the similarities among services. As described in Section III.A, the functional profile of a service is constructed as a document containing a list of names (terms).
Each name obtained has to go through a normalization and a weighting process before it can be used for further analysis, as shown in steps 3-6. A name is an identifier that is either a single English term or a composite one. For example, the name of an operation may be preProcessData or get_Array_Data. Such a composite term can be divided into single terms by identifying uppercase characters and by removing separators, respectively, in these two situations. Another issue is term variation. For the same meaning, one may use morphological variants, such as plurals, past-tense suffixes, and gerund forms. We chose to partially solve this problem by substituting names with their respective stems, i.e., the portion of a word left after the removal of its affixes. We adopted the Porter stemming algorithm [21] for suffix removal and WordNet (http://www.wordnet.princeton.edu) for resolving synonyms. Going through all the names in the functional profile of a service, we obtain a normalized document.
Fig. 4. Structure of a service operation.
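The splitting step can be illustrated with two regular expressions. This is a minimal sketch covering only the division of composite names into single terms; stemming and synonym resolution are omitted here (a Porter stemmer from an NLP library could be applied to each resulting term). The function name `split_name` is an assumption.

```python
import re


def split_name(name):
    """Split a composite identifier into lowercase single terms."""
    terms = []
    # First cut at separators such as "_" or "-".
    for chunk in re.split(r"[^A-Za-z0-9]+", name):
        # Then cut at case boundaries, keeping acronyms such as "KNN" whole.
        terms += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", chunk)
    return [term.lower() for term in terms]


print(split_name("preProcessData"))
print(split_name("get_Array_Data"))
```

Keeping acronyms intact is a design choice of this sketch; splitting "KNN" into single letters would produce terms with no lexical meaning.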
In step 5, by applying the term frequency-inverse document frequency (TF-IDF) algorithm [22], we calculate the weight ($w_{j,i}$) of every term ($t_{j,i}$) inside the functional profile of every service ($s_i$). The weight indicates how important the term is in representing the semantics of the service.
$$w_{j,i} = tf_{j,i} \times idf_j$$
where $tf_{j,i}$ is the term frequency (obtained by dividing the number of occurrences of the term $t_{j,i}$ in the functional profile of service $s_i$ by the size of the functional profile of the service, which ensures that $tf_{j,i} \in [0,1]$), and $idf_j$ is the general importance of the term (obtained by dividing the total number of services by the number of services containing the term in their functional profiles, and taking the logarithm of the resulting quotient). Meanwhile, duplicated terms are removed. In other words, the result is a map in which each key is a term and the corresponding value is its importance in representing the functionality of the service.
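The weighting step can be sketched directly from these definitions. The example below is illustrative only: the function name `tf_idf` and the two toy profiles are assumptions, and the natural logarithm is used because the base of the logarithm is not fixed by the definition.

```python
import math
from collections import Counter


def tf_idf(profiles):
    """profiles: {service: [normalized terms]} -> {service: {term: weight}}."""
    n_services = len(profiles)
    # Number of services whose profile contains each term.
    doc_freq = Counter()
    for terms in profiles.values():
        doc_freq.update(set(terms))
    weights = {}
    for service, terms in profiles.items():
        counts = Counter(terms)  # duplicates collapse into one key per term
        weights[service] = {
            term: (count / len(terms)) * math.log(n_services / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


# Hypothetical normalized profiles of two services.
PROFILES = {
    "calculator": ["add", "add", "number"],
    "blast": ["search", "sequence", "number"],
}
weights = tf_idf(PROFILES)
print(weights["calculator"])
```

A term that appears in every profile, such as "number" here, receives a weight of zero, which is the intended effect of the inverse document frequency.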
Although the concept of relatedness between terms is broader than that of similarity [23], we only consider the latter for simplicity. We then leveraged the shortest-path-based LCH algorithm [24] to measure the semantic similarity between two terms based on the lexical database WordNet.
$$sim(t_1 : ST, t_2 : ST) = lch(t_1, t_2)$$
Now we are ready to calculate the similarity between two lists of elements (terms) in step 8. Two lists can be abstracted as two disjoint sets of elements $X$ and $Y$: $\forall x_i \in X, y_j \in Y$, an edge $(x_i, y_j) \in E$ always exists, with a weight of $sim(x_i, y_j)$. Thus, we obtain a weighted complete bipartite graph $G = (V = (X, Y), E)$. Calculating the similarity between the two lists is therefore turned into finding a perfect match (maximum cardinality matching), where the sum of the weights of the edges in the matching reaches a maximal value. The matching is described by two mapping functions, M1 and M2, each selecting $\min(|X|, |Y|)$ elements from the sets $X$ and $Y$, respectively. We applied the Hungarian algorithm [25] to find the optimal match, with a cost of $O(V^2 E)$.
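The matching problem can be illustrated on small lists with an exhaustive search. Note that the sketch below is a brute-force stand-in, not the Hungarian algorithm: it tries every assignment and is only practical for short lists (a polynomial-time routine such as SciPy's `linear_sum_assignment` could replace it). The function name `best_matching` and the similarity values in the example table are assumptions.

```python
from itertools import permutations


def best_matching(xs, ys, sim):
    """Maximum-weight matching of min(|xs|, |ys|) pairs, by exhaustive search."""
    if len(xs) > len(ys):
        # Always enumerate assignments of the shorter list into the longer one.
        total, pairs = best_matching(ys, xs, lambda a, b: sim(b, a))
        return total, [(x, y) for y, x in pairs]
    best_total, best_pairs = 0.0, None
    for chosen in permutations(ys, len(xs)):
        total = sum(sim(x, y) for x, y in zip(xs, chosen))
        if best_pairs is None or total > best_total:
            best_total, best_pairs = total, list(zip(xs, chosen))
    return best_total, best_pairs


# Hypothetical term similarities; unlisted pairs default to 0.1.
TABLE = {("get", "fetch"): 0.9, ("data", "data"): 1.0, ("data", "array"): 0.6}


def table_sim(a, b):
    return TABLE.get((a, b), 0.1)


total, pairs = best_matching(["get", "data"], ["fetch", "array", "data"], table_sim)
print(total, pairs)
```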
We calculate the similarity factor over functional profiles between each pair of services, and obtain a similarity matrix for all services residing in the registry. If the similarity of a pair satisfies $sim(s_1, s_2) \geq \gamma$, where $\gamma$ is a preset threshold, the two services are put into the same service cluster. Without loss of generality, we set $\gamma$ to 0.75 and apply it to the registry to identify service clusters.
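One way to realize this thresholding is to merge services transitively with a union-find structure, as sketched below. The transitive reading (if $s_1$ and $s_2$ are similar and $s_2$ and $s_3$ are similar, all three share a cluster) is an assumption of this sketch, as are the function name `threshold_clusters` and the sample similarity values.

```python
def threshold_clusters(services, sim, gamma=0.75):
    """Group services whose pairwise similarity reaches gamma (transitively)."""
    parent = {s: s for s in services}

    def find(s):
        # Follow parent links to the cluster representative, compressing paths.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for i, a in enumerate(services):
        for b in services[i + 1:]:
            if sim(a, b) >= gamma:
                parent[find(a)] = find(b)

    clusters = {}
    for s in services:
        clusters.setdefault(find(s), []).append(s)
    return list(clusters.values())


# Hypothetical similarity matrix; unlisted pairs default to 0.1.
SIMS = {frozenset(("a", "b")): 0.8, frozenset(("b", "c")): 0.76,
        frozenset(("a", "c")): 0.5}


def pair_sim(x, y):
    return SIMS.get(frozenset((x, y)), 0.1)


print(threshold_clusters(["a", "b", "c", "d"], pair_sim))
```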

B. Service operation clustering
We leverage the behavioral semantics to cluster service operations in a service cluster. Currently, we use the internal structure of service operations to cluster them based on their similarity. Fig. 4 illustrates the building blocks of a service operation as well as the relationships between them, using a UML class diagram. An operation comprises one input message and one output message. Each message may contain multiple parts, each comprising an attribute representing its data type, which can be either an XML built-in type or a user-defined type (a complex type that is defined recursively). As shown in Fig. 4, each constituent element contains an attribute declaring the name of the element.
Based on the structure of a service operation, we have developed a similarity computation algorithm for comparing two operations. Note that we only need to consider two operations belonging to two services residing in the same service cluster. Given two operations $op_1, op_2 \in sc$, where $sc$ denotes a service cluster, their similarity is calculated as a weighted combination of the similarity between their operation names and the similarity between their messages. The coefficients indicate that we may assign different weights for operation names and messages, respectively. Each operation name is normalized into a list of terms using our method discussed in Section III.A.
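In turn, the similarity between two messages (either input or output messages) is a weighted sum of the similarity between their names and the similarity between their part lists:
$$sim(m_1, m_2) = w_1 \cdot sim(name_{m_1}, name_{m_2}) + w_2 \cdot sim(parts_{m_1}, parts_{m_2})$$
where $w_1 + w_2 = 1$. The coefficients indicate that we may assign different weights for message names and parts, respectively. Each message name is normalized into a list of terms using our method discussed in Section III.A. Each message may contain a list of parts.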
Calculating the similarity between two lists of terms (names) or two lists of parts is more challenging because the two lists may contain different numbers of elements. We first consider the similarity between two individual parts.
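The similarity between two parts, belonging to two different messages, is a weighted sum of the similarity between their names and the similarity between their data types:
$$sim(p_1, p_2) = w_1 \cdot sim(name_{p_1}, name_{p_2}) + w_2 \cdot sim(type_{p_1}, type_{p_2})$$
where $w_1 + w_2 = 1$. The coefficients indicate that we may assign different weights for part names and data types, respectively.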
For simplicity, here we only compare the names (strings) of data types. Considering containment relationships between data types will be our future work.
We have discussed the algorithm that compares the similarity between two terms, $sim(t_1 : ST, t_2 : ST)$, in the last section. We have also discussed how to calculate the similarity between two lists of names. Calculating the similarity between two lists of parts is similar, as we formulate the problem as finding a perfect match in a weighted complete bipartite graph $G = (V = (X, Y), E)$, where the sum of the weights of the edges in the matching reaches a maximal value. The only difference is that the importance of each element (i.e., part) here is equal. We use this approach to calculate the similarity between names (operation, message, and part element) and between parts. As a result, we can calculate the similarity between any two operations in one service cluster.
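The layered comparison of parts, messages, and operations can be sketched as nested weighted sums. Everything numeric below is an assumption made for illustration: the weights, the three-way split at the operation level, the normalization of the list similarity by the longer list, and the exact-match `term_sim`, which merely stands in for the WordNet-based LCH measure. Operations are represented as hypothetical tuples.

```python
from itertools import permutations


def term_sim(a, b):
    """Placeholder for the LCH term similarity: 1.0 on exact match, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def list_sim(xs, ys, sim):
    """Best one-to-one matching (exhaustive), normalized by the longer list."""
    if not xs or not ys:
        return 1.0 if not xs and not ys else 0.0
    small, large = (xs, ys) if len(xs) <= len(ys) else (ys, xs)
    best = max(sum(sim(a, b) for a, b in zip(small, chosen))
               for chosen in permutations(large, len(small)))
    return best / len(large)


def part_sim(p, q, w1=0.5, w2=0.5):
    # p, q: (part name terms, data type name); only type names are compared.
    return w1 * list_sim(p[0], q[0], term_sim) + w2 * term_sim(p[1], q[1])


def message_sim(m, n, w1=0.5, w2=0.5):
    # m, n: (message name terms, list of parts)
    return w1 * list_sim(m[0], n[0], term_sim) + w2 * list_sim(m[1], n[1], part_sim)


def operation_sim(o, p, w_name=0.4, w_in=0.3, w_out=0.3):
    # o, p: (operation name terms, input message, output message)
    return (w_name * list_sim(o[0], p[0], term_sim)
            + w_in * message_sim(o[1], p[1])
            + w_out * message_sim(o[2], p[2]))


# Two hypothetical operations that differ only in one name term.
REQUEST = (["get", "data", "request"], [(["id"], "xs:string")])
RESPONSE = (["get", "data", "response"], [(["result"], "xs:string")])
OP_A = (["get", "data"], REQUEST, RESPONSE)
OP_B = (["get", "array"], REQUEST, RESPONSE)
print(operation_sim(OP_A, OP_B))
```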
Then we apply a hierarchical clustering algorithm [26] to organize all operations in the same service cluster into a multi-level cluster. Each cluster represents an abstract operation providing similar semantic services with a similar structure. By searching in such a hierarchical structure, we can help scientists quickly identify proper service operations.
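As a rough illustration of how such a multi-level structure can be built, the sketch below performs bottom-up (agglomerative) clustering and records each merge. Average linkage is an assumption of this sketch, since the specific variant is not fixed here; the function name `agglomerate`, the operation names, and the similarity values are likewise hypothetical.

```python
def agglomerate(items, sim):
    """Average-linkage agglomerative clustering; returns the merge history."""
    clusters = [[item] for item in items]
    merges = []

    def linkage(c1, c2):
        # Mean pairwise similarity between the members of two clusters.
        return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Pick the most similar pair of clusters and merge it.
        i, j = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j], linkage(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical operation similarities; unlisted pairs default to 0.2.
OP_SIMS = {frozenset(("getData", "fetchData")): 0.9}


def op_sim(a, b):
    return OP_SIMS.get(frozenset((a, b)), 0.2)


history = agglomerate(["getData", "fetchData", "addNumber"], op_sim)
for left, right, score in history:
    print(left, right, round(score, 2))
```

Each recorded merge corresponds to one internal node of the hierarchy, which can be read as a candidate abstract operation.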

V. SYSTEM INFRASTRUCTURE AND IMPLEMENTATION
Fig. 5 illustrates the infrastructure of our semantics-empowered services search engine and the control flow among its components. A number of open-source libraries are leveraged in our implementation. For readability, we summarize the libraries in Table 1, with their abbreviations, publishing groups, and either their full names or short descriptions. We integrate our approach into the Taverna workbench [27] (a known workflow management system in life science), and since Taverna is developed in Java, all of our selected open-source libraries are Java-based. Our search engine is implemented as a plugin to the Taverna workbench, where users can conduct SPARQL queries. Upon receiving a query, the search engine ranks available services by evaluating their WSDL files as well as the associated annotation files.
As shown in Fig. 5, WSDL files and their associated annotations (in the format of OWL/RDF) are stored separately for higher flexibility, linked through their intermediate SAWSDL-annotated WSDL files. In fact, all three kinds of files are stored separately, while the original WSDL files are maintained by the corresponding service providers. The mappings among the three categories of files are maintained by our search engine and are kept up to date by monitoring whether changes are made to the original WSDL files.
In case a new WSDL file is published or an existing WSDL file is updated, our system regenerates its annotations. As shown in Fig. 5, one of two paths is selected, depending on whether the service description file is compatible with WSDL 1.1 or WSDL 2.0. Although WSDL 2.0 has been a W3C recommendation since June 2007, we found that many available biomedical services remain WSDL 1.1 compatible. Furthermore, many existing Java development environments (e.g., Eclipse and NetBeans) only support WSDL 1.1. As a result, we have to
Table 1. Summary of open-source libraries used.
performance. Therefore, we intentionally simplified and revised some implementations from the libraries. For example, WSIF's schema parser, org.apache.wsif.schema.Parser, requires a WSDLLocator (an interface) to read in a WSDL URI. The implementation defined in WSIF uses classes that are not necessary for our project (e.g., ClassLoaders); therefore, we implemented our own WSDLLocator.

VII. CONCLUSIONS AND FUTURE WORK
In this paper, we reported our ongoing efforts to build a semantics-aware biomedical services discovery engine. Through automatic semantic metadata extraction from WSDL files and annotations, our approach helps to quickly identify related service operations. Our work also suggests a feasible infrastructure that is compatible with existing standards and techniques. Our prototype system has demonstrated that our semantic infrastructure is able to fulfill the goal of the next-generation caGrid 2.0: make easy things easy, lower the barrier to entry, and support existing users of the present infrastructure.
We plan to continue our research in the following directions. We will explore a notation that formally represents an abstract operation in the operation-level ontology and develop a technique that automatically generates abstract operations in the structure. We will also conduct case studies over real biomedical use cases and evaluate the effectiveness and efficiency of our approach.