How to Configure SANDRA Categorization Engine

SANDRA Categorization Engine is configured by setting parameter values in the sandra.ini file. This file must be located in the same directory as the SANDRA Categorization Engine executable (cateng.exe). sandra.ini is a plain text file in which each line consists of a key and a corresponding value separated by a tab character. The tab character must follow the key even if the value is empty. sandra.ini contains the following keys:

folder-in

The corresponding value specifies the path to the folder containing the normalized document format (NDF) files that will be processed. NDF files are text files created by SANDRA Categorization Engine. Only files with the extension txt will be processed. All other files in the folder will be ignored. A file with the extension txt that does not contain NDF will be opened, but it will fail validation and will also be ignored. Subfolders will also be ignored. If no value is specified, or if the specified folder does not exist, or if it does not contain any files with the extension txt, a warning message will be recorded in the log file, and the processing will be aborted.
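The file selection rules above can be sketched in Python as follows. This is only an illustration, not the engine's actual code: the function name list_candidate_files is hypothetical, and the case-insensitive comparison of the extension is an assumption. NDF validation is not modelled.

```python
from pathlib import Path


def list_candidate_files(folder_in):
    """Return the files in folder_in that would be considered for processing.

    An empty result corresponds to the cases in which the engine records a
    warning and aborts (no value, missing folder, or no txt files).
    """
    if not folder_in:
        return []  # no value specified
    folder = Path(folder_in)
    if not folder.is_dir():
        return []  # the specified folder does not exist
    # Keep regular files with the extension txt; subfolders and other
    # files are ignored. Ignoring letter case here is an assumption.
    return sorted(
        entry
        for entry in folder.iterdir()
        if entry.is_file() and entry.suffix.lower() == ".txt"
    )
```

A caller would treat an empty list as the abort condition and would still have to validate each returned file as NDF.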

file-out

The corresponding value specifies the path (including the file name) to the file in which the results will be stored. The file can have any extension, but it will be saved as a plain text file. If the file already exists, its content will be overwritten. If the file does not yet exist, it will be created. If no value is specified, a warning message will be recorded in the log file, and the processing will be aborted.

logging

The corresponding value specifies the path to the folder in which the log file will be recorded. The path must not include the file name; the log file is automatically named catg_debug.log. If the file already exists, its content will be overwritten. If the file does not exist, it will be created. If no value is specified, the log file will be created in the same directory in which cateng.exe is located.

The execution of SANDRA Categorization Engine can be conveniently monitored by tailing this log file. There are many tailing applications available as freeware.
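As a minimal sketch of how the location of the log file follows from the logging value, assuming the behaviour described above: the function name resolve_log_path and the exe_dir parameter are hypothetical and only serve the illustration.

```python
from pathlib import Path

LOG_FILE_NAME = "catg_debug.log"  # fixed name given in the description above


def resolve_log_path(logging_value, exe_dir):
    """Return the full path of the log file.

    logging_value is the folder from sandra.ini (may be empty);
    exe_dir is the directory in which cateng.exe is located.
    """
    base = Path(logging_value) if logging_value else Path(exe_dir)
    return base / LOG_FILE_NAME
```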

resources

The corresponding value specifies the path to the folder containing the resource files required to execute SANDRA Categorization Engine. If no value is specified, or if the specified folder does not exist, a warning message will be recorded in the log file, and the processing will be aborted.

stop-phrases

The corresponding value specifies the name of the file containing stop‑phrases. This file must be located in the resources folder. It must be a plain text file with each stop‑phrase on a separate line.

Stop‑phrases are multiple‑word phrases that are undesired as categories, such as “large variety” or “table of contents”. Even if SANDRA Categorization Engine extracts such phrases, they will still be omitted from the final results.

If the value is not specified, no stop‑phrases will be omitted from the final results.
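The effect of a stop-phrase file can be sketched as a simple filter over the extracted categories. This is an illustration under assumptions, not the engine's implementation: the function names are hypothetical, and comparing phrases without regard to letter case is an assumption the description does not confirm.

```python
def load_stop_phrases(text):
    """Read stop-phrases from file content, one phrase per line."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def drop_stop_phrases(categories, stop_phrases):
    """Return the categories that are not listed as stop-phrases."""
    # Case-insensitive comparison is an assumption.
    return [name for name in categories if name.lower() not in stop_phrases]


stop_phrases = load_stop_phrases("large variety\ntable of contents\n")
kept = drop_stop_phrases(
    ["user interface", "Table of Contents", "speech recognition"],
    stop_phrases,
)
```

With an empty stop-phrase set, which corresponds to an unspecified value, every category is kept.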

stop-templates

The corresponding value specifies the name of the file containing stop‑templates. This file must be located in the resources folder. It must be a plain text file with each stop‑template on a separate line.

A stop‑template is a stop‑phrase formulated in SANDRA Regular Expression Framework (SREF). By using a limited set of regular expressions, SREF allows a more flexible formulation of stop‑phrases (see below). For example, the stop‑template “other *” instructs SANDRA Categorization Engine to omit from the final results all categories that start with the word “other” followed by none, one or more words (e.g., “other side”, “other people”, “other hand”, etc.).

If the value is not specified, no stop‑templates will be used, and no stop‑phrases formulated in SREF will be omitted from the final results.

min-doc-num

The corresponding value specifies the minimal number of documents in which a lexical pattern must occur in order to be considered as a candidate for a category in the final results. The value must be either zero or an integer greater than 2. This value must be specified. A warning message will be recorded in the log file, and the processing will be aborted in the following cases: (1) no value is specified, (2) the specified value is not an integer, (3) the specified value is negative, (4) the specified value is smaller than 3, but not zero. If the specified value is zero (0), the minimal number of documents required for a lexical pattern to be considered as a candidate for a category will be calculated automatically, based on the number and the size of the processed NDF files.
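The four abort cases listed above can be sketched as a validation function. The name validate_min_doc_num and the use of ValueError to signal "warn and abort" are assumptions made for this illustration only.

```python
def validate_min_doc_num(raw_value, minimum=3):
    """Parse the min-doc-num value; raise ValueError for the abort cases.

    Returns 0 when the threshold is to be calculated automatically.
    """
    if raw_value is None or raw_value.strip() == "":
        raise ValueError("min-doc-num: no value is specified")  # case 1
    try:
        value = int(raw_value.strip())
    except ValueError:
        raise ValueError("min-doc-num: value is not an integer") from None  # case 2
    if value == 0:
        return 0  # automatic calculation requested
    if value < 0:
        raise ValueError("min-doc-num: value is negative")  # case 3
    if value < minimum:
        raise ValueError(f"min-doc-num: value is smaller than {minimum}")  # case 4
    return value
```

The same structure, called with a minimum of 4, would fit the rules given for cat-per-level.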

display-level

The corresponding value specifies the number of display levels. The value must be either 1 or 2. This value must be specified. A warning message will be recorded in the log file, and the processing will be aborted in the following cases: (1) no value is specified, (2) the specified value is not an integer, (3) the specified value is neither 1 nor 2.

The results can be displayed either as a simple list of extracted categories (display‑level value is 1), or as a tree consisting of (1) top‑level categories, each of which is associated with (2) a list of related categories (display‑level value is 2). In both displays, categories are sorted in descending order by the number of documents in which they occur. The capitalization and other orthographic details reflect the most frequent occurrence of the category in the document texts. The following example shows the differences between these two displays:

Display-level = 1

user interface
information infrastructure
World Wide Web
national information infrastructure
speech recognition
interface design
computer systems
computer science
state of the art
workshop participants
system design
shopping mall
information systems
RESEARCH ISSUES
communications systems
Web pages
Web browsers
information retrieval
natural language processing
end user
research projects
information technology
information sources
Unclassified Documents

Display-level = 2

user interface
    Web browsers
    shopping mall
    information retrieval
    natural language processing
    system design
    research projects
information infrastructure
    communications systems
    Web browsers
    information systems
    information technology
    computer science
    information sources
World Wide Web
    information technology
    communications systems
    workshop participants
    information systems
    research projects
    computer science
national information infrastructure
    communications systems
    information systems
    computer science
    information sources
    state of the art
    information technology
speech recognition
    natural language processing
    information retrieval
    research projects
    communications systems
    Web pages
    state of the art
interface design
    information technology
    system design
    Web browsers
    information systems
    RESEARCH ISSUES
    computer systems
Unclassified Documents

cat-per-level

The corresponding value specifies the maximal number of categories in each level (i.e., in the top‑level categories and in the related categories). The value must be either zero or an integer greater than 3. This value must be specified. A warning message will be recorded in the log file, and the processing will be aborted in the following cases: (1) no value is specified, (2) the specified value is not an integer, (3) the specified value is negative, (4) the specified value is smaller than 4, but not zero. If the specified value is zero (0), the maximal number of categories in each level will be calculated automatically, based on the number and the size of the processed NDF files. If the display‑level value is set to 1, this value will be ignored.
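To illustrate the file format with all of the keys described above, the following Python sketch parses the content of a sandra.ini file into a dictionary. It is not part of SANDRA Categorization Engine: the function name parse_sandra_ini is hypothetical, every path and file name in the sample is invented for the example, and skipping blank lines is an assumption.

```python
def parse_sandra_ini(text):
    """Parse sandra.ini content into a dict mapping each key to its value."""
    settings = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are skipped (an assumption)
        key, tab, value = line.partition("\t")
        if not tab:
            # The tab must follow the key even when the value is empty.
            raise ValueError(f"line {line_number}: no tab after key")
        settings[key.strip()] = value.strip()
    return settings


# Sample content; all paths and file names are invented for illustration.
SAMPLE = "\n".join([
    "folder-in\tC:\\sandra\\ndf",
    "file-out\tC:\\sandra\\out\\categories.txt",
    "logging\tC:\\sandra\\logs",
    "resources\tC:\\sandra\\resources",
    "stop-phrases\tstop_phrases.txt",
    "stop-templates\t",
    "min-doc-num\t0",
    "display-level\t2",
    "cat-per-level\t6",
])

settings = parse_sandra_ini(SAMPLE)
```

In this sample the stop-templates key is followed by a tab and an empty value, which is valid, whereas a key with no tab at all is rejected.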

SANDRA Regular Expression Framework (SREF)

Introduction

SANDRA Regular Expression Framework (SREF) is a small regular expression language for scanning word sequences in search of a particular lexical pattern. Rather than requiring an exact string match, the pattern can be formulated in SREF by means of regular expressions. For example, if all word sequences terminated by the word “will” need to be detected, the search pattern can be formulated in SREF as “* will”. This pattern detects the following word sequences: “he will”, “the power of will”, “William is typically shortened to Will”, “will”, etc.

Language

G(SANDRA Regular Expression Framework) = {V, T, S, P}

V = {A, B, C …}

T = {*, ^, [, ]}

P = {
    S -> A
    S -> S A
    S -> * S
    S -> S *
    S -> ^A S
    S -> S ^A
    S -> [A B]
    S -> S [A B]
    S -> ^[A B] S
    S -> S ^[A B]
}

Non‑terminals “A”, “B”, “C” … stand for words in natural language.

Symbol “*” indicates none, one or more words.

Symbol “^” indicates none, one or more words, except the word immediately following the symbol (no empty space is allowed between the symbol and the following word).

Square brackets “[” and “]” delimit alternative elements, indicating exclusive disjunction (“XOR”).

Symbol “^” followed by several words enclosed in square brackets “[” and “]” indicates none, one or more words, except the words enclosed in the brackets.

Examples

Example of a well‑formed formula (WFF):

“^of will”

Interpretation: none, one or more words, none of which is “of”, followed by the word “will”.

Matches:

“he will”

“William is typically shortened to Will”

“will”

Mismatches:

“the power of will”
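One way to experiment with the symbols described above is to translate an SREF template into an ordinary Python regular expression. The sketch below is not SANDRA's implementation. It assumes that words are separated by whitespace, that comparison ignores letter case (suggested by “Will” matching “will”), and that a template must cover the whole word sequence. The function names sref_to_regex and sref_matches are hypothetical.

```python
import re

# A template token is either a bracket group (optionally negated) or a
# run of non-space characters such as a word, "*" or "^word".
_TOKEN = re.compile(r"\^?\[[^\]]*\]|\S+")


def sref_to_regex(template):
    """Translate an SREF template into a compiled regular expression.

    The expression is meant to be applied to a phrase in which every
    word is followed by exactly one space (see sref_matches).
    """
    parts = []
    for token in _TOKEN.findall(template):
        if token == "*":
            parts.append(r"(?:\S+ )*")  # none, one or more words
            continue
        negated = token.startswith("^")
        body = token[1:] if negated else token
        if body.startswith("[") and body.endswith("]"):
            words = body[1:-1].split()  # alternatives inside the brackets
        else:
            words = [body]
        if not words or not all(words):
            raise ValueError(f"malformed template element: {token!r}")
        alternatives = "|".join(re.escape(word) for word in words)
        if negated:
            # none, one or more words, none of which is a listed word
            parts.append(rf"(?:(?!(?:{alternatives}) )\S+ )*")
        else:
            parts.append(rf"(?:{alternatives}) ")  # exactly one listed word
    return re.compile("".join(parts), re.IGNORECASE)


def sref_matches(template, phrase):
    """Return True if the whole phrase matches the template."""
    normalized = "".join(word + " " for word in phrase.split())
    return sref_to_regex(template).fullmatch(normalized) is not None
```

Under these assumptions, sref_matches("^of will", "he will") is true and sref_matches("^of will", "the power of will") is false, in line with the matches and mismatches listed above. Punctuation attached to a word is treated as part of that word, which may differ from the engine's tokenization.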