SANDRA Categorization Engine is configured by setting parameter values in the sandra.ini file. This file must be located in the same directory as the SANDRA Categorization Engine executable (cateng.exe). sandra.ini is a simple text file in which each line consists of a key and a corresponding value separated by a tab character. The tab character must follow the key, even if the value is empty. sandra.ini contains the following keys:
The corresponding value specifies the path to the folder containing the normalized document format (NDF) files that will be processed. NDF files are text files created by SANDRA Categorization Engine. Only files with the extension txt will be processed. All other files in the folder will be ignored. A file with the extension txt that does not contain NDF will be opened, but it will fail validation and will also be ignored. Subfolders will also be ignored. If no value is specified, or if the specified folder does not exist, or if it does not contain any files with the extension txt, a warning message will be recorded in the log file, and the processing will be aborted.
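The file selection rules above can be sketched in Python as follows. This is only an illustration, not the engine's own code: the function name list_candidate_files is hypothetical, the case-insensitive comparison of the extension is an assumption, and NDF validation is left out because the NDF format is not described here.

```python
from pathlib import Path


def list_candidate_files(folder):
    """Return the files in `folder` that have the extension txt.

    Subfolders and files with other extensions are skipped. An error is
    raised when the folder is missing or holds no txt files, which are the
    cases in which the manual says processing is aborted.
    """
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"input folder does not exist: {folder}")
    # Assumption: the extension is compared case-insensitively.
    files = sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".txt"
    )
    if not files:
        raise FileNotFoundError(f"no files with the extension txt in: {folder}")
    return files
```

Files returned by this sketch would still have to pass NDF validation before being processed.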
The corresponding value specifies the path (including the file name) to the file in which the results will be stored. The file can have any extension, but it will be saved as a plain text file. If the file already exists, its content will be overwritten. If the file does not yet exist, it will be created. If no value is specified, a warning message will be recorded in the log file, and the processing will be aborted.
The corresponding value specifies the path to the folder in which the log file will be recorded. The path must not include the file name. The log file will be automatically named catg_debug.log. If the file already exists, its content will be overwritten. If the file does not exist, it will be created. If no value is specified, the log file will be created in the same directory in which cateng.exe is located.
The execution of SANDRA Categorization Engine can be conveniently monitored by tailing this log file. There are many tailing applications available as freeware.
The corresponding value specifies the path to the folder containing the resource files required to execute SANDRA Categorization Engine. If no value is specified, or if the specified folder does not exist, a warning message will be recorded in the log file, and the processing will be aborted.
The corresponding value specifies the filename of the file containing stop‑phrases. This file must be located in the resources folder. This file must be a simple text file containing each stop‑phrase on a separate line.
Stop‑phrases are multiple‑word phrases that are considered undesirable as categories, such as “large variety” or “table of contents”. Even if SANDRA Categorization Engine extracts such stop‑phrases, they will still be omitted from the final results.
If the value is not specified, no stop‑phrases will be omitted from the final results.
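The effect of the stop‑phrase file can be illustrated with a short Python sketch. It is not the engine's implementation: the function name remove_stop_phrases is hypothetical, and comparing phrases case-insensitively is an assumption that this manual does not confirm.

```python
def remove_stop_phrases(categories, stop_phrase_lines):
    """Drop every category that exactly equals one of the stop-phrases.

    `stop_phrase_lines` holds the lines of the stop-phrase file, one
    stop-phrase per line. Blank lines are skipped.
    """
    # Assumption: phrases are compared case-insensitively.
    stop_phrases = {
        line.strip().lower() for line in stop_phrase_lines if line.strip()
    }
    return [c for c in categories if c.strip().lower() not in stop_phrases]
```

With an empty list of stop‑phrases the categories are returned unchanged, which corresponds to the case in which no stop‑phrase file is specified.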
The corresponding value specifies the filename of the file containing stop‑templates. This file must be located in the resources folder. This file must be a simple text file containing each stop‑template on a separate line.
A stop‑template is a stop‑phrase formulated in SANDRA Regular Expression Framework (SREF). By using a limited set of regular expressions, SREF allows a more flexible formulation of stop‑phrases (see below). For example, the stop‑template “other *” will instruct SANDRA Categorization Engine to omit from the final results all categories that start with the word “other” followed by none, one or more words (e.g., “other side”, “other people”, “other hand”, etc.).
If the value is not specified, no stop‑templates will be used, and no stop‑phrases formulated in SREF will be omitted from the final results.
The corresponding value specifies the minimal number of documents in which a lexical pattern must occur in order to be considered as a candidate for a category in the final results. The value must be either zero or an integer greater than 2. This value must be specified. A warning message will be recorded in the log file, and the processing will be aborted in the following cases: (1) no value is specified, (2) the specified value is not an integer, (3) the specified value is negative, (4) the specified value is smaller than 3, but not zero. If the specified value is zero (0), the minimal number of documents required for a lexical pattern to be considered as a candidate for a category will be calculated automatically, based on the number and the size of the processed NDF files.
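The four abort cases listed above can be expressed as a small validation routine. The following Python sketch is an illustration only: the function name parse_zero_or_minimum and the wording of the error messages are hypothetical, and raising an exception stands in for the warning and abort behavior of the engine.

```python
def parse_zero_or_minimum(raw_value, minimum=3):
    """Validate a value that must be zero or an integer >= `minimum`.

    Returns the integer. A result of 0 means that the engine is expected
    to calculate the setting automatically.
    """
    text = raw_value.strip()
    if not text:
        raise ValueError("no value is specified")              # case (1)
    try:
        number = int(text)
    except ValueError:
        raise ValueError(f"not an integer: {text!r}") from None  # case (2)
    if number < 0:
        raise ValueError(f"negative value: {number}")          # case (3)
    if 0 < number < minimum:
        raise ValueError(                                       # case (4)
            f"value {number} is smaller than {minimum}, but not zero"
        )
    return number
```

For example, the strings "0" and "3" are accepted, while "", "abc", "-1" and "2" are rejected.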
The corresponding value specifies the number of the display levels. The value must be either 1 or 2. This value must be specified. A warning message will be recorded in the log file, and the processing will be aborted in the following cases: (1) no value is specified, (2) the specified value is not an integer, (3) the specified value is neither 1, nor 2.
The results can be displayed either as a simple list of extracted categories (display‑level value is 1), or as a tree consisting of (1) top‑level categories, each of which is associated with (2) a list of related categories (display‑level value is 2). In both displays, categories are sorted in descending order by the number of documents in which they occur. The capitalization and other orthographic details reflect the most frequent occurrence of the category in the document texts. The following example illustrates the differences between these two displays:
Display-level = 1

user interface
information infrastructure
World Wide Web
national information infrastructure
speech recognition
interface design
computer systems
computer science
state of the art
workshop participants
system design
shopping mall
information systems
RESEARCH ISSUES
communications systems
Web pages
Web browsers
information retrieval
natural language processing
end user
research projects
information technology
information sources
Unclassified Documents

Display-level = 2

user interface
    Web browsers
    shopping mall
    information retrieval
    natural language processing
    system design
    research projects
information infrastructure
    communications systems
    Web browsers
    information systems
    information technology
    computer science
    information sources
World Wide Web
    information technology
    communications systems
    workshop participants
    information systems
    research projects
    computer science
national information infrastructure
    communications systems
    information systems
    computer science
    information sources
    state of the art
    information technology
speech recognition
    natural language processing
    information retrieval
    research projects
    communications systems
    Web pages
    state of the art
interface design
    information technology
    system design
    Web browsers
    information systems
    RESEARCH ISSUES
    computer systems
Unclassified Documents
The corresponding value specifies the maximal number of categories in each level (i.e., in the top‑level categories and in the related categories). The value must be either zero or an integer greater than 3. This value must be specified. A warning message will be recorded in the log file, and the processing will be aborted in the following cases: (1) no value is specified, (2) the specified value is not an integer, (3) the specified value is negative, (4) the specified value is smaller than 4, but not zero. If the specified value is zero (0), the maximal number of categories in each level will be calculated automatically, based on the number and the size of the processed NDF files. If the display‑level value is set to 1, this value will be ignored.
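The sandra.ini line format described at the beginning of this section (a key, a tab character, then a possibly empty value) can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the function name parse_sandra_ini is hypothetical, skipping blank lines is an assumption, and the sketch does not know the actual key names, so it accepts any key.

```python
def parse_sandra_ini(text):
    """Parse tab-separated key/value lines into a dictionary.

    Every non-blank line must contain a tab character after the key,
    even when the value is empty.
    """
    settings = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are skipped
        if "\t" not in line:
            raise ValueError(f"line {line_number}: no tab character after the key")
        # Split on the first tab only, so the value may itself contain tabs.
        key, value = line.split("\t", 1)
        settings[key.strip()] = value.strip()
    return settings
```

A line such as "some_key" followed by a tab and nothing else yields an empty string for that key, so the caller can distinguish an empty value from a missing key.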
SANDRA Regular Expression Framework (SREF) is a small regular expression language for scanning word sequences in search of a particular lexical pattern. Rather than requiring an exact string match, the pattern can be formulated in SREF by using regular expressions. For example, to detect all word sequences that end with the word “will”, the search pattern can be formulated in SREF as “* will”. This pattern matches the following word sequences: “he will”, “the power of will”, “William is typically shortened to Will”, “will”, etc.
G(SANDRA Regular Expression Framework) = {V, T, S, P}
V = {A, B, C …}
T = {*, ^, [, ]}
P = {
    S -> A
    S -> S A
    S -> * S
    S -> S *
    S -> ^A S
    S -> S ^A
    S -> [A B]
    S -> S [A B]
    S -> ^[A B] S
    S -> S ^[A B]
}
Non‑terminals “A”, “B”, “C” … stand for words in natural language.
Symbol “*” indicates none, one or more words.
Symbol “^” indicates none, one or more words, except the word immediately following the symbol (no empty space is allowed between the symbol and the following word).
Square brackets “[” and “]” delimit alternative elements, indicating exclusive disjunction (“XOR”).
Symbol “^” followed by several words enclosed in square brackets “[” and “]” indicates none, one or more words, except the words enclosed in the brackets.
Example of WFF:
“^of will”
Interpretation: none, one or more words, none of which is “of”, followed by the word “will”.
Matches:
“he will”
“William is typically shortened to Will”
“will”
Mismatches:
“the power of will”
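The semantics described above can be approximated by translating an SREF pattern into a standard regular expression over a sequence of words. The following Python sketch is an illustration and not the engine's implementation. The function names sref_to_regex and sref_matches are hypothetical. Matching is assumed to be case-insensitive, because the example “William is typically shortened to Will” is listed as a match for “will”, and words are assumed to be separated by whitespace.

```python
import re


def sref_to_regex(pattern):
    """Translate an SREF pattern into a compiled regular expression.

    The regex is meant to be applied to a string in which every word is
    followed by exactly one space, so each fragment consumes whole words.
    """
    parts = []
    # Tokens are "[A B]" or "^[A B]" groups, or whitespace-free items.
    for token in re.findall(r"\^?\[[^\]]*\]|\S+", pattern.lower()):
        if token == "*":
            # None, one or more arbitrary words.
            parts.append(r"(?:\S+ )*")
        elif token.startswith("^"):
            # None, one or more words, except the excluded ones.
            excluded = token[1:].strip("[]").split()
            alternatives = "|".join(re.escape(w) for w in excluded)
            parts.append(rf"(?:(?!(?:{alternatives}) )\S+ )*")
        elif token.startswith("["):
            # Exactly one of the alternative words.
            words = token.strip("[]").split()
            alternatives = "|".join(re.escape(w) for w in words)
            parts.append(rf"(?:{alternatives}) ")
        else:
            # A literal word.
            parts.append(re.escape(token) + " ")
    return re.compile("".join(parts))


def sref_matches(pattern, text):
    """Return True if the whole word sequence `text` matches `pattern`."""
    subject = "".join(word + " " for word in text.lower().split())
    return sref_to_regex(pattern).fullmatch(subject) is not None
```

Under these assumptions, sref_matches("^of will", "he will") is true and sref_matches("^of will", "the power of will") is false, in line with the matches and mismatches listed above.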