How to Configure SANDRA Content Engine

SANDRA Content Engine is configured by setting parameter values in the sandra.ini file. This file must be located in the same directory as the SANDRA Content Engine executable (conteng.exe). sandra.ini is a plain text file in which each line consists of a key and its corresponding value, separated by a tab character. The tab character must follow the key, even if the value is empty. sandra.ini contains the following keys:
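The line format described above can be illustrated with a minimal Python sketch. It is not the parser used by conteng.exe; the function name parse_sandra_ini, the sample paths, and the decision to skip blank lines are assumptions made for illustration only. Values are collected in lists because some keys (lexicon, ontology) may occur more than once.

```python
from collections import defaultdict


def parse_sandra_ini(text):
    """Parse tab-separated key/value lines into {key: [value, ...]}."""
    settings = defaultdict(list)
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are skipped
        if "\t" not in line:
            # The tab must follow the key even when the value is empty.
            raise ValueError(f"line {line_number}: missing tab after key")
        key, value = line.split("\t", 1)
        settings[key.strip()].append(value.strip())
    return dict(settings)


# Hypothetical file content; all paths are placeholders.
sample = (
    "folder-in\tC:\\sandra\\in\n"
    "index-folder-out\t\n"
    "lexicon\tC:\\sandra\\res\\en.lex\n"
    "lexicon\tC:\\sandra\\res\\de.lex\n"
    "max-branch-length\t4\n"
)
config = parse_sandra_ini(sample)
print(config["lexicon"])
```

In this sketch an empty value is kept as an empty string, so a caller can distinguish "key present, no value" from "key missing".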

folder-in

The corresponding value specifies the path to the folder containing plain-text files with the original document text that will be processed. Only files with the extension txt will be processed. All other files in the folder will be ignored. A file with the extension txt that does not contain text will be opened, but will fail the validation and will also be ignored. Subfolders will also be ignored. If no value is specified, or if the specified folder does not exist, or if it does not contain any files with the extension txt, a warning message will be recorded in the log file, and the processing will be aborted.
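The selection rules above can be sketched in Python as follows. This is an illustration, not the engine's actual code: the function name select_input_files is hypothetical, and the sketch assumes that the extension check is case-insensitive, that files are UTF-8 encoded, and that a file containing only whitespace counts as containing no text.

```python
import tempfile
from pathlib import Path


def select_input_files(folder_in):
    """Return the non-empty .txt files directly inside folder_in."""
    if not folder_in or not Path(folder_in).is_dir():
        raise FileNotFoundError(f"folder-in is missing or not a folder: {folder_in!r}")
    selected = []
    for entry in sorted(Path(folder_in).iterdir()):
        if not entry.is_file():
            continue  # subfolders are ignored
        if entry.suffix.lower() != ".txt":
            continue  # only the extension txt is processed
        text = entry.read_text(encoding="utf-8", errors="replace")
        if not text.strip():
            continue  # empty file fails validation (assumed rule)
        selected.append(entry)
    if not selected:
        raise FileNotFoundError(f"no usable txt files in {folder_in!r}")
    return selected


# Demonstration on a temporary folder with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "report.txt").write_text("Some document text.", encoding="utf-8")
    Path(tmp, "empty.txt").write_text("", encoding="utf-8")
    Path(tmp, "notes.doc").write_text("ignored", encoding="utf-8")
    Path(tmp, "archive").mkdir()
    names = [p.name for p in select_input_files(tmp)]
print(names)
```

The two raised exceptions stand in for the "warning in the log file, processing aborted" behaviour described above.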

index-folder-out

The corresponding value specifies the path to the folder in which the results of the keyword indexing will be stored. All output files will be saved as plain text with the extension txt. Each output file will have the same name as the corresponding file containing the original document text in the folder specified as the value of the folder-in key. If a file with the same name already exists, its content will be overwritten. If the file does not yet exist, it will be created. If no value is specified, keyword indexing will be omitted. If the specified folder does not exist, a warning message will be recorded in the log file, and the processing will be aborted.

ne-folder-out

The corresponding value specifies the path to the folder in which the results of the named entity extraction will be stored. All output files will be saved as plain text with the extension txt. Each output file will have the same name as the corresponding file containing the original document text in the folder specified as the value of the folder-in key. If a file with the same name already exists, its content will be overwritten. If the file does not yet exist, it will be created. If no value is specified, named entity extraction will be omitted. If the specified folder does not exist, a warning message will be recorded in the log file, and the processing will be aborted.

ndf-folder-out

The corresponding value specifies the path to the folder in which the files in the normalized document format (NDF) will be stored. NDF is used by SANDRA Categorization Engine for categorization. All output files will be saved as plain text with the extension txt. Each output file will have the same name as the corresponding file containing the original document text in the folder specified as the value of the folder-in key. If a file with the same name already exists, its content will be overwritten. If the file does not yet exist, it will be created. If no value is specified, NDF will not be created. If the specified folder does not exist, a warning message will be recorded in the log file, and the processing will be aborted.

cbi-folder-out

The corresponding value specifies the path to the folder in which the results of the concept indexing will be stored. All output files will be saved as plain text with the extension txt. Each output file will have the same name as the corresponding file containing the original document text in the folder specified as the value of the folder-in key. If a file with the same name already exists, its content will be overwritten. If the file does not yet exist, it will be created. If no value is specified, concept indexing will be omitted. If the specified folder does not exist, a warning message will be recorded in the log file, and the processing will be aborted.
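The four output keys (index-folder-out, ne-folder-out, ndf-folder-out, cbi-folder-out) follow the same rules: an empty value switches the step off, a missing folder aborts the run, and the output file takes the name of the input file. A minimal Python sketch of that decision logic is shown here; the function name plan_outputs and the dictionary layout are assumptions for illustration and do not describe the engine's internals.

```python
import tempfile
from pathlib import Path

# Output keys and the processing step each one controls (from the text above).
OUTPUT_KEYS = {
    "index-folder-out": "keyword indexing",
    "ne-folder-out": "named entity extraction",
    "ndf-folder-out": "NDF creation",
    "cbi-folder-out": "concept indexing",
}


def plan_outputs(settings, input_file):
    """Map each enabled output key to the output path for input_file."""
    planned = {}
    for key in OUTPUT_KEYS:
        folder = settings.get(key, "")
        if not folder:
            continue  # no value specified: this step is omitted
        if not Path(folder).is_dir():
            # Stands in for "warning in the log, processing aborted".
            raise FileNotFoundError(f"{key}: folder does not exist: {folder}")
        # Same file name as the input; an existing file would be overwritten.
        planned[key] = Path(folder) / Path(input_file).name
    return planned


# Demonstration with a temporary folder and hypothetical names.
with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp, "index")
    index_dir.mkdir()
    settings = {"index-folder-out": str(index_dir), "ne-folder-out": ""}
    planned = plan_outputs(settings, Path(tmp, "in", "report.txt"))
    planned_names = {key: path.name for key, path in planned.items()}
print(planned_names)
```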

logging

The corresponding value specifies the path to the folder in which the log file will be recorded. The path must not include the file name. The file containing the log will be automatically named cont_debug.log. If the file already exists, its content will be overwritten. If the file does not exist, it will be created. If no value is specified, a warning message will be recorded in the log file, and the processing will be aborted.

The execution of SANDRA Content Engine can be conveniently monitored by tailing this log file. There are many tailing applications available as freeware.
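Where no tailing application is at hand, a few lines of Python can serve the same purpose. The sketch below is an assumption-laden illustration: the helper read_new_lines is hypothetical, the log is assumed to be UTF-8 text with one message per line, and the log messages in the demonstration are invented. The helper returns the lines added since the previous call and starts again from the beginning if the file has shrunk, because the log is overwritten on each run.

```python
import os
import tempfile


def read_new_lines(log_path, offset=0):
    """Return (complete lines appended since offset, new offset)."""
    if os.path.getsize(log_path) < offset:
        offset = 0  # the log was overwritten by a new run
    with open(log_path, "rb") as log:
        log.seek(offset)
        data = log.read()
    # Consume only up to the last newline so a half-written line is not split.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end


# Demonstration on a temporary file with invented log messages.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cont_debug.log")
    with open(path, "w", encoding="utf-8") as f:
        f.write("first message\n")
    first, position = read_new_lines(path)
    with open(path, "a", encoding="utf-8") as f:
        f.write("second message\npartial")
    second, position = read_new_lines(path, position)
print(first, second)
```

To follow a running job, call read_new_lines in a loop with a short time.sleep between calls, passing the returned offset back in each time.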

resources

The corresponding value specifies the path to the folder containing the resource files required to execute SANDRA Content Engine. If no value is specified, or if the specified folder does not exist, a warning message will be recorded in the log file, and the processing will be aborted.

lexicon

The corresponding value specifies the path to the file containing the lexicon required to execute SANDRA Content Engine. This is a plain text file containing a lexicon in an internal format. Multiple entries of this key are possible, specifying various lexicons that will be loaded (for example, for multilingual processing). If the specified file does not exist, a warning message will be recorded in the log file, and the processing will be aborted.

ontology

The corresponding value specifies the path to the file containing the ontology required for concept indexing. This is a plain text file containing an ontology in an internal format. This internal format is generated by the SANDRA Ontology Importer, an auxiliary application designed to transform ontologies into the format used by SANDRA for concept indexing. Multiple entries of this key are possible, specifying various ontologies that will be loaded. If the specified file does not exist, a warning message will be recorded in the log file, and the concept indexing will be omitted.

max-branch-length

The corresponding value specifies the maximal length of a branch in the normalized document format (NDF), counted in nodes. NDF is a suffix trie consisting of noun phrases of particular types. This value specifies the maximal number of words allowed in such a noun phrase. A warning message will be recorded in the log file, and the processing will be aborted in the following cases: (1) no value is specified, (2) the specified value is not an integer, (3) the specified value is not a positive integer, (4) the specified value is zero (0), (5) the specified value is greater than 6. The last restriction is an arbitrary decision based on experience regarding what is considered a useful lexical pattern, and is designed to minimize the size of NDF.
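The validation cases listed above amount to accepting only the integers 1 through 6. A minimal Python sketch of such a check follows; the function name parse_max_branch_length and the error messages are hypothetical, and raising ValueError stands in for recording a warning in the log file and aborting.

```python
MAX_ALLOWED_BRANCH_LENGTH = 6  # upper limit stated in the text above


def parse_max_branch_length(raw):
    """Return the value of max-branch-length as an integer in 1..6."""
    if raw is None or raw.strip() == "":
        raise ValueError("max-branch-length: no value specified")
    try:
        value = int(raw.strip())
    except ValueError:
        raise ValueError(f"max-branch-length: not an integer: {raw!r}") from None
    if value <= 0:
        # Covers both zero and negative values.
        raise ValueError(f"max-branch-length: must be positive: {value}")
    if value > MAX_ALLOWED_BRANCH_LENGTH:
        raise ValueError(f"max-branch-length: must not exceed 6: {value}")
    return value


print(parse_max_branch_length("4"))
```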

language-recognition-min

The corresponding value specifies the minimal percentage of the most frequently occurring words in a language detected in a document, in order to consider that the document was composed in the corresponding language. SANDRA Content Engine currently supports recognition of English, German and Italian. If none of these languages can be assigned to a document, this document will not be processed. A warning message will be recorded in the log file, and the processing will be aborted in the following cases: (1) no value is specified, (2) the specified value is not an integer, (3) the specified value is not a positive integer, (4) the specified value is zero (0), (5) the specified value is greater than 30, (6) the specified value is less than 10. The last restriction is an arbitrary decision based on experience regarding what is the minimal lexical content that justifies the processing of a document file.
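One plausible reading of this threshold is sketched below in Python: the share of a document's words that belong to a language's list of most frequent words is compared with the configured percentage. This is only an illustration under stated assumptions. The tiny word lists, the tokenization, the function name recognize_language, and the rule of picking the language with the highest share are all invented for the example and are not taken from SANDRA Content Engine.

```python
import re

# Hypothetical, heavily shortened lists of frequent words per language.
FREQUENT_WORDS = {
    "English": {"the", "of", "and", "to", "in", "is", "that", "it"},
    "German": {"der", "die", "und", "in", "den", "von", "zu", "das"},
    "Italian": {"di", "e", "il", "la", "che", "in", "un", "per"},
}


def recognize_language(text, min_percent):
    """Return the best-matching language, or None if below min_percent."""
    if not 10 <= min_percent <= 30:
        raise ValueError("language-recognition-min must be between 10 and 30")
    # Tokens are runs of letters; digits and punctuation are dropped.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return None
    best_language, best_share = None, 0.0
    for language, words in FREQUENT_WORDS.items():
        share = 100.0 * sum(token in words for token in tokens) / len(tokens)
        if share > best_share:
            best_language, best_share = language, share
    return best_language if best_share >= min_percent else None


print(recognize_language("The cat is in the garden and it is happy", 20))
print(recognize_language("Der Hund ist in dem Garten und die Katze auch", 20))
```

A document for which this function returns None corresponds to the case above in which no language can be assigned and the document is not processed.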