GRACE Approach to Information Retrieval


GRACE is a comprehensive information retrieval system tailored to the needs of researchers in any scientific field. It was designed as a unique knowledge platform that offers a wide range of functionality by maximizing the use of existing resources. Rather than requiring additional investment in computing infrastructure, GRACE allows research communities to organize around their common interests and to share their computational, storage, and knowledge resources according to their specific information needs. To that end, GRACE uses cutting-edge technologies and introduces several ground-breaking innovations into information retrieval.

Knowledge Domains

At the heart of GRACE’s innovative approach lies the concept of the Knowledge Domain. A Knowledge Domain is a complete virtualization of multiple content sources by means of a knowledge representation system (an ontology) that reflects the conceptualization of reality specific to a particular group of end-users. Three crucial aspects (information resources, an ontology, and a community of users) are fused in GRACE into a single conceptual frame, the Knowledge Domain.

The concept of the Knowledge Domain allows GRACE to introduce a highly innovative approach to information retrieval from multiple content sources in parallel. This kind of information retrieval is typically referred to as meta-search, and today it is part of the standard offering of all commercial Knowledge Management systems. The main problem here is the integration of search results from content sources that may use differing ranking strategies [1], query syntaxes, meta-data schemas, etc.

The typical solution to this problem is post-retrieval processing: the retrieved documents are normalized to a canonical format (including a common document format, meta-data schema, etc.) and then integrated into a single presentational structure, either by re-ranking them uniformly, by categorizing them into a pre-existing classification schema, or by creating an ad hoc set of cluster labels automatically extracted from the documents.
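As an illustration of the uniform re-ranking option, the following minimal sketch rescales the raw scores of each source to a common range before merging them. It is not taken from any of the systems discussed here; the function names, the source names, and the choice of min-max normalization are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one source's raw scores to [0, 1] so that sources become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All documents scored the same; treat them as equally relevant.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def merge_results(per_source: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Merge per-source result lists into one uniformly ranked list."""
    merged: Dict[str, float] = {}
    for scores in per_source.values():
        for doc, score in min_max_normalize(scores).items():
            # A document returned by several sources keeps its best score.
            merged[doc] = max(merged.get(doc, 0.0), score)
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sources whose ranking scores live on different scales.
results = {
    "source_a": {"doc1": 12.0, "doc2": 3.0, "doc4": 7.5},
    "source_b": {"doc3": 0.9, "doc2": 0.1},
}
ranked = merge_results(results)
for doc, score in ranked:
    print(doc, score)
```

Min-max scaling is only one of several possible normalizations. It ignores how each source actually computes its scores, which is exactly the weakness of post-retrieval integration that the text describes.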

In contrast to existing systems that perform post-retrieval integration of search results, GRACE pre-indexes the documents from multiple content sources by employing ontology engineering. This is accomplished in three steps:

  1. The concepts from the underlying ontology are used as queries that are systematically submitted to the multiple content sources in order to retrieve the relevant documents. This submission is repeated periodically to keep the GRACE index up to date.
  2. The retrieved documents are analyzed and, according to their lexical content, assigned to concepts in the underlying ontology.
  3. The ontology is then used as an index for querying and browsing. It points to the relevant documents regardless of the content source from which they originated. Unlike regular document indices based merely on terminology, the GRACE index is concept-based and allows querying and browsing based on the semantic relationships between the concepts.
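The three steps above can be sketched in a few lines of code. This is a toy illustration under stated assumptions, not GRACE's implementation: the ontology, the in-memory content sources, the substring matching used for the lexical analysis, and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set

# Hypothetical toy ontology: each concept lists the terms that signal it,
# and NARROWER records one kind of semantic relationship between concepts.
CONCEPT_TERMS = {
    "grid computing": ["grid computing"],
    "data grid": ["data grid"],
}
NARROWER = {"grid computing": ["data grid"]}

Search = Callable[[str], List[dict]]  # query -> documents with "id" and "text"


def make_source(corpus: List[dict]) -> Search:
    """Build a hypothetical content source that can only be reached by querying."""
    def search(query: str) -> List[dict]:
        return [d for d in corpus if query.lower() in d["text"].lower()]
    return search


def harvest(concepts: List[str], sources: Dict[str, Search]) -> Dict[str, str]:
    """Step 1: submit every concept as a query to every content source.
    A real system would rerun this periodically to refresh the index."""
    docs: Dict[str, str] = {}
    for concept in concepts:
        for name, search in sources.items():
            for doc in search(concept):
                docs[f"{name}:{doc['id']}"] = doc["text"]
    return docs


def assign(docs: Dict[str, str], concept_terms: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Step 2: assign each document to concepts according to its lexical content."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        lowered = text.lower()
        for concept, terms in concept_terms.items():
            if any(term in lowered for term in terms):
                index[concept].add(doc_id)
    return index


def lookup(concept: str, index: Dict[str, Set[str]], narrower: Dict[str, List[str]]) -> Set[str]:
    """Step 3: query the concept index, following narrower-concept relations."""
    found = set(index.get(concept, set()))
    for child in narrower.get(concept, []):
        found |= lookup(child, index, narrower)
    return found


sources = {
    "a": make_source([
        {"id": "1", "text": "Grid computing shares processing power."},
        {"id": "2", "text": "Cooking with a grill."},
    ]),
    "b": make_source([{"id": "7", "text": "A data grid shares storage space."}]),
}
documents = harvest(list(CONCEPT_TERMS), sources)
index = assign(documents, CONCEPT_TERMS)
print(sorted(lookup("grid computing", index, NARROWER)))
```

In this sketch a query for the broader concept also returns the document filed under the narrower one, even though that document came from a different source and never mentions the broader term. That is the practical difference between a concept-based index and a purely term-based one.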

This use of ontologies for information retrieval is unique to GRACE. There are currently several meta-search engines that are either web-based or offer corporate solutions, such as Copernic [2], Vivisimo [3], WebBrain [4], KartOO [5], Grokker [6], VirtualSelf [7], and Leximancer [8], to mention only the most prominent and interesting of them. However, these meta-search engines access multiple content sources only on demand, by distributing the user’s query to several content sources in parallel. They all integrate the search results through post-retrieval processing rather than pre-indexing, and none of them makes use of ontologies.
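For comparison, the on-demand behaviour of such meta-search engines can be sketched as a simple parallel fan-out of the user's query. The sketch below does not reflect the internals of any of the named products; the in-memory sources and function names are hypothetical, and only the standard library's thread pool is used.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Search = Callable[[str], List[str]]  # query -> matching document titles


def make_source(titles: List[str]) -> Search:
    """Build a hypothetical in-memory source that matches on substrings."""
    def search(query: str) -> List[str]:
        return [t for t in titles if query.lower() in t.lower()]
    return search


def fan_out(query: str, sources: Dict[str, Search]) -> Dict[str, List[str]]:
    """Send one user query to every source in parallel and collect the raw hits."""
    if not sources:
        return {}
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(search, query) for name, search in sources.items()}
        # The hits come back per source and still need post-retrieval integration.
        return {name: future.result() for name, future in futures.items()}


sources = {
    "source_a": make_source(["Grid computing primer", "Cooking basics"]),
    "source_b": make_source(["Data Grid storage", "Ontology design"]),
}
hits = fan_out("grid", sources)
print(hits)
```

Nothing is stored between queries in this model: every request reaches the sources again, and the results arrive as separate per-source lists that must be merged afterwards.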

In contrast, corporate search engines, such as Northern Light [9] or Overture’s [10] FAST and AltaVista, pre-index documents from multiple content sources. In this respect they are similar to GRACE. Nevertheless, these corporate search engines only integrate the local document management system with their own global index. Unlike GRACE, they do not integrate other, external content sources. Furthermore, the local document system is integrated with the global index simply by applying a common ranking strategy and, in contrast to GRACE, without any use of ontologies.

Commercial Enterprise Information Portals (EIPs), such as Verity’s Ultraseek [11], which combines Inktomi’s indexing and search engine with Infoseek’s classification and categorization engine, and Convera’s RetrievalWare [12] (previously Excalibur), categorize documents into pre-defined classification schemas and taxonomies. RetrievalWare’s commercial offering includes various domain-specific taxonomies (e.g., MeSH) as so-called cartridges that can be plugged in according to the user’s specific information needs.

This approach is indeed conceptually close to GRACE’s Knowledge Domains, but it is limited to documents that are directly accessible for pre-indexing. Unlike GRACE, EIPs are designed strictly for use on corporate intranets, not for the integration of multiple external content sources whose content is often not directly accessible and must instead be retrieved by querying them.

Accordingly, GRACE’s unique approach uses ontology engineering to systematically harvest multiple content sources for relevant information, to organize this information following a particular conceptualization of reality captured by a knowledge representation system (an ontology), and to expose it to the end-user for querying and browsing based on the underlying semantics.

Grid Technology

Pre-indexing unstructured text with ontologies is a computationally intensive "text-crunching" task that involves various aspects of natural language processing. This makes Grids, and Data Grids in particular, attractive for information retrieval. After all, this is precisely what Grids offer through resource sharing: computational and storage resources that significantly exceed those contributed by any single user.

This is why GRACE is the first information retrieval system to utilize the emerging Grid technology, allowing communities of researchers in various scientific fields to meet their specific information needs by sharing their computing resources. GRACE is based on the Grid in order to avoid further investment in computing infrastructure and to allow users to maximize the utilization of their existing resources.

A similar approach was taken in the field of database federation by the IBM Masala platform. IBM Masala utilizes DB2 Information Integrator to access distributed databases across a corporate network and to combine data from various sources for use in data warehousing applications.

IBM Masala utilizes a specific type of Grid infrastructure called the Data Grid. Data Grids are designed to share not only computational resources but also storage space. The assumption is that the data stored on the Grid are used by multiple users in a corporate network and that it is, consequently, more efficient to store the data in a central location than to transport them repeatedly over the network.

While IBM Masala focuses on database federation, GRACE is specifically designed for information retrieval from distributed content sources. Information retrieval refers to the retrieval of unstructured, textual information, usually stored in various document formats, as opposed to the numerical data stored in structured databases. Just as IBM Masala integrates structured data from distributed databases, GRACE integrates unstructured information from distributed content sources. Both solutions utilize state-of-the-art Data Grid technology to accomplish this task.

Accordingly, GRACE is currently the only application of its kind. It complements IBM Masala rather than competing with it, since it uses similar technology and offers similar functionality, but for a different type of information.

Information Retrieval on the Grid

From the user’s point of view, the Knowledge Domain is the ultimate virtualization of multiple, distributed content sources. The Knowledge Domain does not merely serve as a single access point to multiple content sources in parallel. By employing cutting-edge ontology engineering, it also allows highly sophisticated querying and browsing based on a particular conceptualization of reality and on the underlying semantics of the retrieved information. For the end-user, a Knowledge Domain is thus the ultimate way to access knowledge.