Test suite for distance-based/metric space indexing

This page describes a workbench for distance-based (metric space) indexing. Several data types, together with their file formats and distance functions, are discussed.

Available data types are:
Vector
Protein sequence
DNA sequence
Peptide fragmentation spectrum
Image

The test suite has been used to evaluate the MoBIoS Index. The Molecular Biological Information System (MoBIoS) is a next-generation Database Management System (DBMS) aimed at biological applications. It is composed of a storage manager, an extended SQL language named mSQL, and some biological applications. The novelty of the storage manager is its distance-based index structure, the MoBIoS Index. Written in Java, the MoBIoS Index supports similarity queries on any data type that can be abstracted into a general metric space. It has some built-in data types, such as vectors, gene sequences, and images; user-defined data types are also supported. For details and downloads, please see the MoBIoS website. All the data types and distance metrics described here are defined in the MoBIoS Index, and for each data type the corresponding MoBIoS Index data type and distance function are listed. For questions, please contact Rui Mao (rmao AT cs DOT utexas DOT edu).

If you would like to use this test suite, please cite:

Rui Mao, Weijia Xu, Smriti Ramakrishnan, Glen Nuckolls, Daniel P. Miranker. "On Optimizing Distance-Based Similarity Search for Biological Databases". In the Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference, pages 351-361, August 8-11, 2005, Stanford University, California, USA. (BibTeX (from ACM Portal))

or for the image dataset, please cite:

Rui Mao, Qasim Iqbal, Wenguo Liu, Daniel P. Miranker. "Case Study: Distance-Based Image Retrieval in the MoBIoS DBMS". In the Proceedings of the 5th International Conference on Computer and Information Technology (CIT2005), pages 49-55, September 21-23, 2005, Shanghai, China. (BibTeX (from ACM Portal))


Vector

Dataset:
(1) Uniform 20-d vectors (171M)
Synthetic dataset: 1 million uniformly distributed vectors randomly selected from the 20-d [0,1] unit hypercube.

(2) Uniform 5-d vectors (44M)
Synthetic dataset: 1 million uniformly distributed vectors randomly selected from the 5-d [0,1] unit hypercube.

(3) Clustered vectors (36M)
Synthetic dataset: 1 million vectors selected from the 10-d [0,1] unit hypercube, forming 10 clusters.

(4) Clustered vectors (647K)
Synthetic dataset: 100k vectors selected from the 2-d [0,1] unit square, forming 100 clusters.

(5) 3-d geospatial data (647K)
2,034,953 vectors, a small subset of the points from a big bang simulation.

(6) US Cartographic Boundary Files: Hawaii boundary file (123K), Texas boundary file (3.3M), all 50 states and Puerto Rico (61M)
These data were downloaded from the U.S. Census Bureau website on May 9, 2007. They are the "Cartographic Boundary Files" (http://www.census.gov/geo/www/cob/bdy_files.html), specifically the "Census Block Groups" of year 2000 (http://www.census.gov/geo/www/cob/bg2000.html), in the "ARC/INFO Ungenerate (ASCII)" format. The downloaded files are plain text files with some metadata. In our data files, the metadata are removed and only the 2-d coordinates are kept. Each line contains the two coordinates of a point, separated by white space. There may or may not be white space at the beginning or end of a line.

File size (number of points)
----------------------------
texas.txt: 566,150 (194,724 distinct points), boundary file of Texas, from file bg48_d00.dat
hawii.txt: 22,255 (9,290 distinct points), boundary file of Hawaii, from file bg15_d00.dat
gis.vec: 9,685,974 (3,188,005 distinct points), boundary files of all 50 states and Puerto Rico

File format: The first line of each file consists of two numbers, the dimension of the vectors and the number of vectors, separated by white space. Each subsequent line is one vector, with its coordinates separated by white space.
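
The file layout described above can be read with a few lines of Python. This is only a minimal sketch and not part of MoBIoS: the function name `read_vector_file` is hypothetical, and the sketch assumes that blank lines may be skipped and that the declared vector count should be enforced.

```python
import io


def read_vector_file(stream):
    """Read a vector file: a 'dimension count' header, then one vector per line."""
    header = stream.readline().split()
    dim, count = int(header[0]), int(header[1])
    vectors = []
    for line in stream:
        fields = line.split()  # split() tolerates leading and trailing white space
        if not fields:         # assumption: blank lines carry no data
            continue
        if len(fields) != dim:
            raise ValueError(f"expected {dim} values, got {len(fields)}")
        vectors.append([float(x) for x in fields])
        if len(vectors) == count:  # stop once the declared number is reached
            break
    if len(vectors) != count:
        raise ValueError(f"expected {count} vectors, got {len(vectors)}")
    return dim, vectors


# Tiny in-memory example: three 2-d vectors.
sample = io.StringIO("2 3\n0.1 0.2\n 0.5 0.5 \n0.9 0.0\n")
dim, vectors = read_vector_file(sample)
```

For the real datasets, pass an open file object instead of the in-memory `StringIO`.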
Distance metric: Any L-metric
In MoBIoS Index: data type mobios.type.DoubleVector; distance metric mobios.dist.LMetric
Suggested range query radii: 0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3 (for synthetic datasets); 0, 0.02, 0.04, 0.06, 0.08, 0.1 (for boundary data)
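
To make the distance and the radii concrete, the sketch below computes an L-metric (the Minkowski distance with parameter p) and answers a range query by linear scan. It is an illustration only: the names `l_metric` and `range_query` are hypothetical, and the scan is a brute-force baseline for checking results, not how the MoBIoS Index answers queries.

```python
import math


def l_metric(a, b, p=2.0):
    """Minkowski (L_p) distance for p >= 1; p = math.inf gives L-infinity."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    if p < 1:
        raise ValueError("p must be >= 1 for the distance to be a metric")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(p):
        return max(diffs, default=0.0)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def range_query(data, query, radius, p=2.0):
    """Return every vector within `radius` of `query` (brute-force scan)."""
    return [v for v in data if l_metric(v, query, p) <= radius]


data = [[0.0, 0.0], [0.1, 0.1], [0.9, 0.9]]
hits = range_query(data, [0.0, 0.0], 0.2)  # radius taken from the list above
```

A radius of 0 returns only exact matches, which is why 0 appears in each list of suggested radii.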


Protein sequence

Dataset: Protein sequences (370M)
Protein sequences downloaded from GenBank.
File format: FASTA format
Distance metric: Global alignment on 6-mers (fragments of length 6) with the mPAM substitution matrix.
In MoBIoS Index: data type mobios.type.Peptide; distance metric mobios.dist.WHDGlobalSequenceFragmentMetric
Suggested range query radii: 0, 1, 2, 3, 4, 5, 6
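
As a rough illustration of how the FASTA input relates to the 6-mers being compared, the sketch below parses FASTA records and cuts each sequence into fragments of length 6. The function names are hypothetical, and taking overlapping fragments at every offset is an assumption, since the page does not say how fragments are extracted. The mPAM substitution matrix is not reproduced here, so the alignment distance itself is left out.

```python
import io


def read_fasta(stream):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in stream:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def fragments(sequence, k=6):
    """Return all overlapping fragments (k-mers) of length k (assumed scheme)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


sample = io.StringIO(">sp1 example\nMKVLAA\nGT\n>sp2\nMKV\n")
records = list(read_fasta(sample))
six_mers = fragments(records[0][1])
```

Sequences shorter than 6 residues, such as the second record here, yield no fragments.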


DNA sequence

Dataset:
(1) Arabidopsis thaliana genome (34M)
(2) 1 million k-mers of length 5 from the Arabidopsis thaliana genome (4M)
File format: FASTA format
Distance metric: Hamming distance on 18-mers.
In MoBIoS Index: data type mobios.type.DNA; distance metric mobios.dist.WHDGlobalSequenceFragmentMetric
Suggested range query radii: 0, 1, 2, 3, 4, 5, 6
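
The Hamming distance counts the positions at which two equal-length fragments differ, so the integer radii above correspond to the number of mismatches tolerated. A minimal sketch follows, with hypothetical function names and made-up 18-mers rather than fragments from the dataset:

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("fragments must have the same length")
    return sum(x != y for x, y in zip(a, b))


def hamming_range_query(fragments, query, radius):
    """Return the fragments within `radius` mismatches of `query` (linear scan)."""
    return [f for f in fragments if hamming(f, query) <= radius]


# Illustrative 18-mers (invented for this example).
query = "ACGTACGTACGTACGTAC"
candidates = [
    "ACGTACGTACGTACGTAC",  # identical to the query
    "ACGTACGTACGTACGTAA",  # differs in the last position
    "TCGTACGAACGTACGTAA",  # differs in three positions
]
close = hamming_range_query(candidates, query, 1)
```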


Peptide fragmentation spectrum

Dataset: Peptide fragmentation spectra (5.9M)
File format: Each line represents one spectrum in vector format.
Distance metric: The distance function for this data type is not a metric but a semi-metric. As a result, the search algorithm is slightly different for this data type. See:
Ramakrishnan, Smriti R., Rui Mao, Aleksey A. Nakorchevskiy, John T. Prince, Willard S. Willard, Weijia Xu, Edward M. Marcotte, and Daniel P. Miranker. "A fast coarse filtering method for protein identification by mass spectrometry." Bioinformatics, 22(12):1524-1531; doi:10.1093/bioinformatics/btl118. 2006.
In MoBIoS Index: data type mobios.type.TandemSpectra; distance metric mobios.dist.MSMSMetric
Suggested range query radii: 0, 0.03, 0.06, 0.09


Image

Dataset: Feature vectors of images (2.7M, for indexing)
Source images (38M, for viewing)
File format: This image dataset consists of 10,221 images. Each image is represented by 3 vectors corresponding to its properties in structure, color, and texture.
Distance metric: A linear combination of L-metrics on the 3 feature vectors.
In MoBIoS Index: data type mobios.type.Image; distance metric mobios.dist.ImageMetric
Suggested range query radii: 0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3
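
A linear combination of L-metrics on the three feature vectors can be sketched as follows. Everything specific here is an assumption made for illustration: the weights, the use of the same p for all three features, the dictionary layout, and the function names are hypothetical and are not taken from mobios.dist.ImageMetric.

```python
def lp(a, b, p):
    """Minkowski (L_p) distance between two equal-length vectors, p >= 1."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def image_distance(img_a, img_b, weights=(1 / 3, 1 / 3, 1 / 3), p=1.0):
    """Weighted sum of L_p distances over structure, color and texture.

    The weights and p are placeholders; non-negative weights keep the
    triangle inequality intact.
    """
    features = ("structure", "color", "texture")
    return sum(w * lp(img_a[f], img_b[f], p) for f, w in zip(features, weights))


a = {"structure": [0.0, 0.0], "color": [1.0], "texture": [0.5, 0.5]}
b = {"structure": [1.0, 1.0], "color": [0.0], "texture": [0.5, 0.5]}
d = image_distance(a, b, weights=(0.5, 0.25, 0.25), p=1.0)
```

In this example the structure vectors differ by 2 under L1, the color vectors by 1, and the texture vectors are identical, so the combined distance is 0.5 * 2 + 0.25 * 1 = 1.25.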