wiki:documentation

Documentation

This documentation is preliminary and is not intended to be a complete manual. In order to understand the following, we recommend to read the D-index paper at least. Please, refer to Bibliography.

The implementation of D-index is built on top of the GiST library (as is the M-tree Project from Bologna) and is completely written in C++.

Overview

The D-index structure is static with respect to the number of buckets and levels, so the split functions must be designed before the D-index is instantiated and loaded with data. When this prerequisite is fulfiled, the D-index is capable of storing nearly "unlimited" number of data objects due to the elasticity of individual buckets. In particular, buckets are capable of storing theoretically any amount of data.

The pivot choosing algorithm we used is able to dynamically pick further pivots. But the problem of proper setting of split functions remains (due to dm parameters). We designed special tools for that purpose which help make it more automatic. Please, refer to Tools section below.

Usage

We implemented a system for easy usage of D-index especially good for loading it with data and querying. The system is based on text configuration files called control files. Usually, a separate file is defined for initializing an instance of D-index and populating with data (*.ini). A different file can be used just for printing details about an instance persisted in index and db files on a hard disk (*.load). For querying, files named *.srch.nn or *.srch.range are used. Examples of such files are included in the tools package.

A documentation of configuration files is available here.

Tools

Hash Design

HashDesign simply selects the desired number of pivots using specific selection technique. It produces two files:

  • one containing list of pivots (one pivot per line),
  • the other contains settings of dm with the results of partitioning (S0 & S1 are separable sets and S2 is the exclusion set).

The second file can directly be used to construct a D-index structure (e.g., see text.ini in the Tools package at Download page).

Hash Design Mult

HashDesignMult is a follow-up tool. You already must have the list of pivots. You specify a list of pivots that will be combined in a single split function (used for one level of the D-index). This tool produces "good" settings of dm values. The dm values produced are optimized with regards to the combination of binary split function, so the resulting setting tries to minimize the number of empty buckets and to equally spread data over buckets (balanced split).

The major difference from HashDesign is that HashDesign produces settings for binary split function only. Combining such functions into a single split function usually leads to unbalanced splits and lots of empty (or underfilled) buckets. It can also imply existence of overloaded buckets as well.

Example

The files concerning a text-data experiment contain basically two things. First, a real-life example of configuration files for D-index. Second, outcome from a supporting tool (HashDesign) (it's text.INCREMENTAL.* files).

Data File Format

Data files suitable for D-index must be in format: one data object per line.

For example, five 3D vectors:

0 1 2
3 4 6
2 1 3
5 6 7
1.0 2.1 4.6

Or text-like data:

This is the first line.
Another line.
The sun shines.
It's a beautiful day.

Customizing D-Index for Your Needs

I assume you have a specific data and a distance function (metric). The following tries to help you to ease the process of adapting the source code to new application domains.

The current implementation of D-index and its supporting tools rely on four functions:

  1. distance function;
  2. function to print data objects as a string to output;
  3. function to read data objects from a file;
  4. function returning the maximum distance (can be set to max float, or so).

These functions must be implemented properly to support your application. Currently, there are implementations for vectors with L1 metric, vectors with L2 metric, URLs (set of URLs) with Jaccard coefficient and strings with edit distance. There are implemented in user.cpp and declared in user.h (in D-index sources). You need to define your own versions.

Based on a configuration file (rule-based control file for executing an experiment), the correct implementation is selected by the functions in config.cpp. These functions parses the configuration file. Please, see the function named read_section_init and defines starting TXT_DATATYPE_* and TXT_DISTFUNC_*.

Last modified 4 years ago Last modified on Apr 8, 2014, 11:01:41 AM