The SANDRA Ontology Importer transforms ontology concepts into the internal format used by the Search AND Retrieval Application (SANDRA) for concept‑indexing. To use the Ontology Importer effectively, it is first necessary to understand what concept‑indexing is and how SANDRA works.
Concept‑indexing is a text‑processing procedure by which free‑text document content is systematically scanned in order to detect terminology used to express various ontology concepts. If a term expressing an ontology concept is detected in the document text, this document is associated (“indexed”) with that particular ontology concept.
Concept‑indexing differs from regular keyword indexing: keyword indexing records the association of a document with every single word encountered in it, whereas concept‑indexing records only the associations with a predefined set of terms (the so‑called “controlled vocabulary”, which is provided by the ontology).
An ontology consists of concepts and their relations. A concept can be expressed with one or more terms. For example, the concept of a subatomic particle that is not part of the nucleus and has a very small mass and a negative electric charge is expressed in English by the term “electron”. In multilingual ontologies, or in thesaurus‑like ontologies, the same concept may be expressed with multiple terms.
A term expressing an ontology concept can consist of multiple words. For example, the term “shock wave” consists of the words “shock” and “wave”. In this case the difference between keyword indexing and concept‑indexing becomes obvious: it is not enough that a document contains the words “shock” and “wave”; the exact term “shock wave” must occur in it for the document to be associated with the corresponding concept.
Accordingly, ontology concepts are expressed by terms, and terms consist of words.
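The difference between the two kinds of indexing can be illustrated with a minimal Python sketch. The function names and the sample sentences are invented for illustration, and normalization is reduced to lowercasing; this is not a description of how SANDRA itself is implemented.

```python
import re

def words(text):
    """Split text into lowercase words (a crude stand-in for real tokenization)."""
    return re.findall(r"[a-z]+", text.lower())

def keyword_match(text, term):
    """Keyword-style test: every word of the term occurs somewhere in the text."""
    present = set(words(text))
    return all(w in present for w in words(term))

def concept_match(text, term):
    """Concept-style test: the words of the term occur consecutively, in order."""
    doc, target = words(text), words(term)
    n = len(target)
    return any(doc[i:i + n] == target for i in range(len(doc) - n + 1))

scattered = "The shock of the impact sent a wave through the plasma."
exact = "A shock wave forms ahead of the projectile."

print(keyword_match(scattered, "shock wave"), concept_match(scattered, "shock wave"))
print(keyword_match(exact, "shock wave"), concept_match(exact, "shock wave"))
```

The first sentence contains both words but not the term, so only the keyword-style test succeeds; the second sentence satisfies both tests.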
Concept‑indexing cannot be performed as simple string matching. To cater for the enormous morpho‑syntactic diversity of natural language, the original document text must first be normalized to a canonical format consisting of base‑forms rather than the original words. This normalized document format is then compared to the controlled vocabulary, which has been normalized in exactly the same way.
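As a rough illustration of normalization, the following Python sketch replaces each word with a base-form taken from a tiny lexicon. The lexicon entries and the function name are assumptions made for this example; the lexicons and rules actually used by the SANDRA Content Engine are not described here.

```python
import re

# Toy lexicon mapping inflected forms to base-forms (invented for illustration;
# real lexicons are far larger).
LEXICON = {
    "waves": "wave",
    "shocks": "shock",
    "electrons": "electron",
    "were": "be",
    "observed": "observe",
}

def normalize(text):
    """Lowercase the text, split it into words and replace each by its base-form."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Words missing from the lexicon are kept unchanged.
    return [LEXICON.get(tok, tok) for tok in tokens]

document = "Shock waves were observed."
term = "shock wave"

print(normalize(document))
print(normalize(term))
```

Because the document and the term are normalized in the same way, the plural “Shock waves” in the document lines up with the vocabulary entry “shock wave”.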
SANDRA processes the normalized document format by running it through a finite‑state automaton. A finite‑state automaton is a well‑known mechanism that keeps track of its internal state. At each processing step, it performs the desired calculations. Which calculations are performed depends on the current state of the automaton. The next state of the automaton is then set according to the outcome of these calculations, and the entire procedure is repeated. When a certain predefined state is reached (the so‑called “final state”), the results of the calculations are provided, and the automaton is reset to its initial state.
The finite‑state automaton allows SANDRA to efficiently detect complex terms consisting of multiple words. SANDRA loops through the normalized text format, comparing each word to a database containing pre‑processed terminology. If the current word matches one of the words in the database, the expected state recorded in the database is compared to the current state of the automaton. For example, if the first word of a multiple‑word term is encountered, it is expected that the finite‑state automaton will be in its initial state.
Accordingly, the terminology expressing the ontology concepts must be pre‑processed in such a way that each word is associated with a series of possible states, which are expected to be encountered in the original document text. This is achieved with SANDRA Ontology Importer.
This series consists of pairs of integers. The first integer in each pair specifies a state in which the corresponding word is expected to be encountered, and the second specifies the state into which the finite‑state automaton must be set next, but only if the word was indeed encountered in the state specified by the first integer. Such a pair of integers is accordingly called a “state transition”.
For example, if the word “electron” is encountered in the document text while the finite‑state automaton is in its initial state (“0”), and the state transition associated with the word “electron” is “0 -> 1”, then the finite‑state automaton is set to the state “1”.
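The mechanism of state transitions can be sketched in Python as follows. The table is written by hand for the terms “electron” and “shock wave”; the state numbers, the data layout and the function name are assumptions chosen for illustration and do not reproduce SANDRA's actual database format. Terms that extend other terms are not handled in this simple version.

```python
INITIAL = 0

# word -> {expected state: next state}; hand-written for the terms
# "electron" and "shock wave" (state numbers are arbitrary).
TRANSITIONS = {
    "electron": {0: 1},
    "shock": {0: 2},
    "wave": {2: 3},
}

# Final states and the term each one stands for.
FINAL = {1: "electron", 3: "shock wave"}

def detect(words):
    """Run the automaton over normalized words and return the detected terms."""
    found = []
    state = INITIAL
    for word in words:
        options = TRANSITIONS.get(word, {})
        if state not in options:
            # The word is not expected in the current state: start over and
            # check whether it can begin a new term.
            state = INITIAL
        state = options.get(state, INITIAL)
        if state in FINAL:
            found.append(FINAL[state])
            state = INITIAL  # reset after reaching a final state
    return found

print(detect(["the", "shock", "wave", "hit", "an", "electron"]))
print(detect(["shock", "electron", "wave"]))
```

In the second call the words “shock” and “wave” are both present but not adjacent, so only “electron” is detected.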
The last word of every term carries a state transition specifying the final state. This final state also contains the Global Unique Identifier (GUID) of the corresponding ontology concept, as well as that of the corresponding ontology. A GUID is a string of 8 characters for ontologies and 12 characters for ontology concepts, typically consisting of random integers.
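Identifiers of the lengths mentioned above could be produced, for example, with the short Python sketch below. The function name is hypothetical, and the sketch only assumes that a GUID is a string of random decimal digits; it is not the procedure used by the SANDRA team.

```python
import secrets
import string

def random_guid(length):
    """Return a string of `length` random decimal digits."""
    return "".join(secrets.choice(string.digits) for _ in range(length))

ontology_guid = random_guid(8)    # 8 characters for an ontology
concept_guid = random_guid(12)    # 12 characters for an ontology concept
print(ontology_guid, concept_guid)
```

Random digits do not guarantee uniqueness by themselves, so a newly generated identifier should still be checked against those already in use.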
A word may occur in several multiple‑word terms. For example, in the High‑Energy Physics Keyword Index (HEP) the word “double” occurs in the following terms: “double absorption”, “double beam”, “double exchange”, “double scattering”, “double spectral function”, “double-beta decay”. The Ontology Importer ensures that the correct state transitions are specified for all these different cases.
Further, Ontology Importer also caters for the cases in which one concept is a terminological extension of another. For example, HEP contains the term “electron”, but also “electron cooling”, which is a separate term. If the word “electron” is followed in a document text by the word “cooling”, it must be recognized as the term “electron cooling”, but otherwise, it must be recognized as the term “electron”.
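One possible way to assign state transitions to shared words and to prefer the longer term is sketched below in Python. The construction (states shared between terms with a common beginning) and the fallback to the last final state reached are assumptions made for this example; the text does not specify how the Ontology Importer and SANDRA implement these cases. All names are hypothetical.

```python
def build_transitions(terms):
    """Assign state transitions to the words of each term (terms are word lists).

    Returns (transitions, finals): transitions maps word -> {expected state: next
    state}, finals maps a final state -> the term it completes.
    """
    transitions, finals = {}, {}
    next_free = 1  # state 0 is the initial state
    for term in terms:
        state = 0
        for word in term:
            options = transitions.setdefault(word, {})
            if state not in options:
                options[state] = next_free
                next_free += 1
            state = options[state]
        finals[state] = " ".join(term)
    return transitions, finals

def detect(words, transitions, finals):
    """Scan normalized words, preferring the longest term that matches."""
    found = []
    i = 0
    while i < len(words):
        state, j = 0, i
        best = None  # (term, index just past its last word)
        while j < len(words) and state in transitions.get(words[j], {}):
            state = transitions[words[j]][state]
            j += 1
            if state in finals:
                best = (finals[state], j)
        if best:
            found.append(best[0])
            i = best[1]
        else:
            i += 1
    return found

terms = [["electron"], ["electron", "cooling"], ["double", "beam"], ["double", "exchange"]]
transitions, finals = build_transitions(terms)

print(transitions["double"])
print(detect(["the", "electron", "cooling", "system"], transitions, finals))
print(detect(["an", "electron", "beam"], transitions, finals))
```

The word “double” receives a single transition out of the initial state that is shared by both terms beginning with it, and “electron” followed by “cooling” is reported as “electron cooling” rather than as “electron”.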
Before state transitions are assigned to a word, the word must be normalized, as mentioned earlier. For this purpose the SANDRA Content Engine is integrated into the SANDRA Ontology Importer.
SANDRA Ontology Importer does not require any special installation. Just create a separate folder, and copy the following files into it:
ontoedit.exe
msvcr70.dll
sandra.ini
In addition, you will also need the folder resources containing files with lexical resources required for the text normalization. However, this folder can be located anywhere, and not necessarily in the newly created folder.
Open sandra.ini. It is a simple text file with the following entries:
| Key | Value |
|---|---|
| Logging: | Path to the folder in which a log file used for debugging, as well as files containing unrecognized characters and words (i.e. words that do not occur in the lexicons used by the Content Engine), will be stored. If no value is provided, all these files will be stored in the same folder where ontoedit.exe is located. |
| resources: | Path to the folder where the lexical resources required for the Content Engine are located. This value must be specified. |
| lexicon: | The relative path within the resources folder to the file containing the lexicon that will be used by the Content Engine. Multiple entries of this key are allowed, catering for the possibility to use multilingual lexicons. This value must be specified. |
| ontology-ID: | GUID specifying the identity of the ontology that will be processed. It can be any string of any length, although 8‑character strings containing random integers are recommended. This value must be specified. |
| File-out: | Path to the folder in which the results of the processing will be stored. The results file will be named output.txt. If no value is provided, this file will be stored in the same folder where ontoedit.exe is located. If this file already exists, its content will be loaded and used for further processing. This allows you to terminate the processing of an ontology and continue it at some later point in time, using the already processed terminology. |
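The entries can be checked before running the Ontology Importer, for example with the Python sketch below. The exact syntax of sandra.ini is not specified in this text: the sketch assumes one “key: value” pair per line, and all paths, lexicon file names and the ontology-ID value are invented for illustration.

```python
# Hypothetical content of sandra.ini; the "key: value" layout and all values
# are assumptions made for this example.
SAMPLE_INI = """\
Logging: C:/sandra/logs
resources: C:/sandra/resources
lexicon: lexicons/english.lex
lexicon: lexicons/german.lex
ontology-ID: 48151623
File-out: C:/sandra/output
"""

# Keys that must be specified according to the table above.
REQUIRED = ("resources", "lexicon", "ontology-ID")

def parse_settings(text):
    """Parse 'key: value' lines; repeated keys (such as lexicon) collect into lists."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split at the first colon only, so values such as "C:/..." stay intact.
        key, _, value = line.partition(":")
        settings.setdefault(key.strip(), []).append(value.strip())
    return settings

def missing_keys(settings):
    """Return the required keys that are absent or have no value."""
    return [k for k in REQUIRED if not any(settings.get(k, []))]

settings = parse_settings(SAMPLE_INI)
print(settings["lexicon"])
print(missing_keys(settings))
```

With the sample content both lexicon entries are collected under the same key and no required key is reported as missing.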
Set the desired values in sandra.ini, and run the Ontology Importer (by double clicking on its icon).
The Ontology Importer is a console‑based application operated by typing responses at its prompts. Its operation is very simple, and very little typing is actually required.
After initiating the Ontology Importer the following window will be displayed:

Press ENTER, and wait until the resources are loaded. The time required for loading the resources depends mainly on the number and the size of the lexicons specified in sandra.ini. Typically, this time is not expected to exceed several tens of seconds.
When the resources are loaded, the following message will be displayed:

Inspect whether the correct lexicons were loaded, and whether all of the required lexicons were loaded. If not, exit the Ontology Importer, and modify the suitable entries in sandra.ini.
Otherwise, enter the path to the file containing the terms expressing the ontology concepts that need to be processed. Make sure that these terms correspond to the ontology GUID (presented within the parentheses), since all the processed terms will be automatically associated with this GUID. If this GUID is unsuitable, close the Ontology Importer, and modify the corresponding value in sandra.ini.
The path can also be specified relative to the folder in which ontoedit.exe is located, for example “./test.txt” or “test.txt” (both are equivalent). Note that, unlike in the MS Windows file system, slashes (“/”) rather than backslashes (“\”) must be used as delimiters in the path.
The specified file may contain only terms in plain text (ASCII), each on a separate line. The words of a multiple‑word term must be separated by a space character (ASCII #32). The current version of the Ontology Importer does not support conversion of complex formats, such as RDF/XML or OWL. If you want to process terminology stored in one of these formats, you will need to pre‑process it in order to convert it into plain text.
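Such a pre-processing step might look like the following Python sketch, which collects the text of rdfs:label elements from an RDF/XML document and returns one term per line. It assumes that the terms are stored as rdfs:label values; a real ontology may keep its terminology in other properties, and the sample document and function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Qualified name of rdfs:label in ElementTree's {namespace}tag notation.
RDFS_LABEL = "{http://www.w3.org/2000/01/rdf-schema#}label"

SAMPLE_RDF = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://example.org/hep#electron">
    <rdfs:label>electron</rdfs:label>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/hep#shock_wave">
    <rdfs:label>shock   wave</rdfs:label>
  </rdf:Description>
</rdf:RDF>
"""

def extract_terms(rdf_xml):
    """Return the distinct ASCII labels found in the document, in document order."""
    terms = []
    root = ET.fromstring(rdf_xml)
    for label in root.iter(RDFS_LABEL):
        # Collapse runs of whitespace so that words are separated by one space.
        term = " ".join((label.text or "").split())
        if term and term.isascii() and term not in terms:
            terms.append(term)
    return terms

def to_plain_text(terms):
    """Format the terms as plain text, one term per line."""
    return "\n".join(terms) + "\n"

print(to_plain_text(extract_terms(SAMPLE_RDF)), end="")
```

The resulting text can be saved to a file and passed to the Ontology Importer. Labels containing non-ASCII characters are skipped in this sketch and would need separate handling.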
Beware that the specified file will not be preserved in its original format during the processing. Whenever the processing results are saved, only the unprocessed content of this file is saved back into it. This also includes the terms that you decided not to process.
After you have entered the file name, press ENTER again. The specified file will be opened and loaded. After the loading is completed, the following message will be displayed:

The message displays the first line of the specified file. You can decide whether you want to process this line, or ignore it. If ignored, this line will be saved in the modified version of the original file.
Type “y”, and then press ENTER. The following message will be displayed:

The first line in this message presents the processed term in SANDRA’s internal format. Inspect this line carefully to ensure that it faithfully reflects the original term. If for any reason you suspect that the internal format does not reflect the original term, type “n”, and press ENTER, in order to ignore it. The term will then be saved with the other unprocessed terminology. Report such suspect terms, with a suitable explanation, to the SANDRA development team.
If after inspection everything seems to be correct, type “y”, and press ENTER. The following message will be displayed:

After each processed line you will be prompted to save the results of the processing. Whenever you decide to save the results, not only the last processed line, but also all processing results will be saved. Accordingly, you do not need to save every line separately. You can wait until the end of the file, and then save all the results at once.
Type “n”, and then press ENTER. The following message will be displayed:

This message displays the next line in the specified file. Type “y”, and then press ENTER. When prompted whether to continue, type “y” again, and press ENTER. The following message will be displayed:

The word “cooling” has multiple base‑forms retrieved from the lexicon: “cooling” and “cool”. It is of critical importance to select the correct base‑form, since it directly influences the results of the concept‑indexing. For example, if you select the base‑form “cool”, the word sequence “electrons cool” encountered in the sentence: “At this stage electrons cool gradually (…) ”, would be erroneously recognized as the concept “electron cooling”.
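The effect of this choice can be reproduced with a small Python sketch. It assumes that every word in a document is compared against all of its possible base-forms, which is only one plausible reading of how the matching works; the lexicon entries and the function name are invented for illustration.

```python
# Hypothetical lexicon: each word maps to the set of its possible base-forms.
BASE_FORMS = {
    "electron": {"electron"},
    "electrons": {"electron"},
    "cooling": {"cooling", "cool"},
    "cool": {"cool"},
}

def matches(stored_term, text_words):
    """True if the stored base-forms match some run of consecutive text words."""
    n = len(stored_term)
    for i in range(len(text_words) - n + 1):
        window = text_words[i:i + n]
        # Words missing from the lexicon are treated as their own base-form.
        if all(base in BASE_FORMS.get(word, {word})
               for base, word in zip(stored_term, window)):
            return True
    return False

sentence = "at this stage electrons cool gradually".split()
phrase = "electron cooling is used in storage rings".split()

print(matches(("electron", "cooling"), sentence))  # base-form "cooling" selected
print(matches(("electron", "cool"), sentence))     # base-form "cool" selected
print(matches(("electron", "cooling"), phrase))
```

With the base-form “cooling” the sentence about electrons that cool is not matched, while the genuine occurrence of “electron cooling” still is; with the base-form “cool” the sentence is matched erroneously.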
In order to select the correct base form, enter the number in front of the corresponding base‑form (in this example “0”), and press ENTER.
Continue processing the specified file line by line until the end of it. When you reach the end of the file, the following message will be displayed:

If there are any results that were not yet saved, type “y”, and then press ENTER. The following message will be displayed:

You can exit the Ontology Importer now, and open the file output.txt saved in the same folder in which ontoedit.exe is located in order to see the results. The content of that file must be identical to the results displayed in the Ontology Importer. When you complete processing the entire desired ontology, you can copy this file under a different name (for example, HEP.txt) into the ontologies folder located in the resources folder of the Content Engine, and specify this new name as the value of the ontology key in the file sandra.ini of the Content Engine. Next time when you run SANDRA Content Engine, this file will be automatically loaded, and used for the concept‑indexing of the processed documents.
To process an ontology over several sessions, first prepare a file containing all ontology concepts. Delete or move the output.txt file from the folder in which ontoedit.exe is located. Run the Ontology Importer for as long as you want. Before exiting the Ontology Importer, make sure that you have saved the last processed line; all the processing results will be saved with it automatically. Later, run the Ontology Importer again without removing or modifying either output.txt or the file containing the ontology. The lines processed during the previous session have already been removed from the original file, and the processed terminology will be loaded automatically from output.txt together with the other resources.
To process several ontologies into a single output file, set the value of the ontology-ID key in sandra.ini to the GUID of the first ontology that will be processed. Run the Ontology Importer and select the file containing the first ontology. When you have finished processing that file, set the value of the ontology-ID key in sandra.ini to the GUID of the second ontology. Do not remove or modify output.txt. Run the Ontology Importer again, and select the file containing the second ontology. Repeat these steps for all the desired ontologies. When you have finished processing all of them, output.txt will contain all the processed concepts, each with the corresponding ontology GUID.
SANDRA’s internal format may be surprising, but it is still completely acceptable for concept‑indexing. For example, a term containing two alternative words separated by a slash (“/”), such as “input/output”, will be internally presented as two separate words with a tag “<SLASH>” between them: “(<WORD> “input”) <SLASH> (<WORD> “output”)”. On the other hand, some terms containing two alternative words separated by a slash will be recognized as a legal alternate formulation. For example, “I/O” will be internally presented as only one word: “(<ALTER> “I/O”)”.
In both of the above cases the internal presentation can be accepted, since SANDRA will correctly recognize them in the original text.
Nevertheless, whenever in doubt, ignore such potentially problematic terms, and report them, with a suitable explanation, to the SANDRA development team.
An error message may be displayed if the SANDRA Ontology Importer encounters an unforeseen error. In that case, close the Ontology Importer and report the problem to the SANDRA development team, providing the following: onto_debug.log, the file containing the ontology that was being processed when the message was displayed, output.txt, and a short description of the circumstances in which the message was encountered, to help the development team reproduce it.
Be aware that error messages are not necessarily caused by bugs in the Ontology Importer; they may also result from external circumstances. It is therefore recommended to rerun the Ontology Importer with the same ontology file and see whether the problem recurs.