Available data types are:
Vector
Protein sequence
DNA sequence
Peptide fragmentation spectrum
Image
The test suite has been used to evaluate the MoBIoS Index. The Molecular Biological Information System (MoBIoS) is a next-generation database management system (DBMS) aimed at biological applications. It is composed of a storage manager, an extended SQL language named mSQL, and several biological applications. The novelty of the storage manager is its distance-based index structure, the MoBIoS Index. Written in Java, the MoBIoS Index supports similarity queries on any data type that can be abstracted into a general metric space. It has built-in data types such as vector, gene sequence, and image, and it also supports user-defined data types. For details and downloads, please see the MoBIoS website.
All data types and distance metrics used in this test suite are defined in the MoBIoS Index. For each dataset below, the corresponding MoBIoS Index data type and distance function are listed. Please contact Rui Mao (rmao AT cs DOT utexas DOT edu).
If you would like to use this test suite, please cite:
Rui Mao, Weijia Xu, Smriti Ramakrishnan, Glen Nuckolls, Daniel P. Miranker. "On Optimizing Distance-Based Similarity Search for Biological Databases". In Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference, pages 351-361, August 8-11, 2005, Stanford University, California, USA.
For the image dataset, please cite:
Rui Mao, Qasim Iqbal, Wenguo Liu, Daniel P. Miranker. "Case Study: Distance-Based Image Retrieval in the MoBIoS DBMS". In Proceedings of the 5th International Conference on Computer and Information Technology (CIT2005), pages 49-55, September 21-23, 2005, Shanghai, China.
| Dataset | (1) Uniform 20-d vector (171M): synthetic dataset, 1 million uniformly distributed vectors randomly selected from the 20-d unit hypercube [0,1]^20. (2) Uniform 5-d vector (44M). (3) Clustered vector (36M). (4) Clustered vector (647K). (5) 3-d geospatial data (647K). (6) US Cartographic Boundary Files: Hawaii boundary file (123K), Texas boundary file (3.3M), all 50 states and Puerto Rico (61M). |
| File format | The first line of each file consists of two numbers, the dimension of the vectors and the number of vectors, separated by white space. Each subsequent line is one vector, with its coordinates separated by white space. |
| Distance metric | Any L-metric |
| In MoBIoS Index | Data type: mobios.type.DoubleVector. Distance metric: mobios.dist.LMetric. Suggested range query radii: 0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3 (for synthetic datasets); 0, 0.02, 0.04, 0.06, 0.08, 0.1 (for boundary data). |
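The vector file format and the L-metric described above can be illustrated with a minimal Python sketch. This is not the MoBIoS API: the function names `read_vectors`, `l_metric`, and `range_query` are hypothetical, "L-metric" is assumed to mean the Minkowski L_p family, and the range query is a plain linear scan rather than an index lookup.

```python
import io
import math


def read_vectors(stream):
    """Parse the vector format: a 'dimension count' header, then one vector per line."""
    header = stream.readline().split()
    dim, count = int(header[0]), int(header[1])
    vectors = []
    for line in stream:
        fields = line.split()
        if not fields:  # tolerate blank lines
            continue
        if len(fields) != dim:
            raise ValueError(f"expected {dim} values, got {len(fields)}")
        vectors.append([float(x) for x in fields])
    if len(vectors) != count:
        raise ValueError(f"header promised {count} vectors, found {len(vectors)}")
    return vectors


def l_metric(a, b, p=2.0):
    """Minkowski L_p distance (assumed meaning of 'L-metric'); p=inf gives the max norm."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def range_query(vectors, query, radius, p=2.0):
    """Linear-scan range query: every vector within `radius` of `query`."""
    return [v for v in vectors if l_metric(v, query, p) <= radius]


# Tiny made-up file: 2-d vectors, 3 of them.
sample = io.StringIO("2 3\n0.0 0.0\n0.1 0.0\n0.5 0.5\n")
data = read_vectors(sample)
hits = range_query(data, [0.0, 0.0], 0.15)  # 0.15 is one of the suggested radii
print(hits)
```

A distance-based index answers the same query while evaluating far fewer distances; the linear scan is only a reference for checking results.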
| Dataset | Protein sequence (370M): protein sequences downloaded from GenBank. |
| File format | FASTA format |
| Distance metric | Global alignment on 6-mers (fragments of length 6) with the mPAM substitution matrix. |
| In MoBIoS Index | Data type: mobios.type.Peptide. Distance metric: mobios.dist.WHDGlobalSequenceFragmentMetric. Suggested range query radii: 0, 1, 2, 3, 4, 5, 6. |
| Dataset | Arabidopsis thaliana genome (34M); 1 million k-mers of length 5 from the Arabidopsis thaliana genome (4M) |
| File format | FASTA format |
| Distance metric | Hamming distance on 18-mers. |
| In MoBIoS Index | Data type: mobios.type.DNA. Distance metric: mobios.dist.WHDGlobalSequenceFragmentMetric. Suggested range query radii: 0, 1, 2, 3, 4, 5, 6. |
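Hamming distance on fixed-length fragments can be illustrated with a short Python sketch. The helper names `read_fasta`, `kmers`, and `hamming` are hypothetical and unrelated to the MoBIoS classes, and it is an assumption here that fragments are taken as overlapping windows; the demo sequence is made up.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def kmers(sequence, k):
    """All overlapping fragments of length k (overlap is an assumption)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


# Made-up 22-base record, split over two lines as FASTA allows.
records = list(read_fasta([">demo", "ACGTACGTACGT", "ACGTACGTAC"]))
name, seq = records[0]
frags = kmers(seq, 18)

# Range query by linear scan: fragments within distance 1 of the query.
query = "ACGTACGTACGTACGTAA"
close = [f for f in frags if hamming(f, query) <= 1]
print(len(frags), close)
```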
| Dataset | Peptide fragmentation spectrum (5.9M) |
| File format | Each line represents the vector format of a spectrum. |
| Distance metric | The distance function of this data type is not a metric, but a semi-metric. As a result, the search algorithm is slightly different for this data type. See: Ramakrishnan, Smriti R., Rui Mao, Aleksey A. Nakorchevskiy, John T. Prince, Willard S. Willard, Weijia Xu, Edward M. Marcotte, and Daniel P. Miranker. "A fast coarse filtering method for protein identification by mass spectrometry." Bioinformatics, 22(12):1524-1531, 2006. doi:10.1093/bioinformatics/btl118. |
| In MoBIoS Index | Data type: mobios.type.TandemSpectra. Distance metric: mobios.dist.MSMSMetric. Suggested range query radii: 0, 0.03, 0.06, 0.09. |
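To see why a semi-metric calls for a different search algorithm, the Python sketch below checks the triangle inequality on a toy example. It assumes "semi-metric" is meant in the sense common in the similarity-search literature (non-negative and symmetric, but the triangle inequality may fail). Squared Euclidean distance stands in as a well-known semi-metric; it is not the spectrum distance used by MoBIoS, and `triangle_violations` is a hypothetical helper.

```python
from itertools import permutations


def squared_euclidean(a, b):
    """Squared Euclidean distance: symmetric and non-negative, but not a metric."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triangle_violations(points, dist):
    """Return triples (a, b, c) with dist(a, c) > dist(a, b) + dist(b, c)."""
    return [
        (a, b, c)
        for a, b, c in permutations(points, 3)
        if dist(a, c) > dist(a, b) + dist(b, c)
    ]


points = [(0.0,), (1.0,), (2.0,)]
violations = triangle_violations(points, squared_euclidean)
for a, b, c in violations:
    print(a, b, c, squared_euclidean(a, c), squared_euclidean(a, b) + squared_euclidean(b, c))
```

Index pruning in a metric space relies on the triangle inequality to bound distances without computing them, so a single violating triple like these is enough to make unmodified pruning unsafe.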
| Dataset | Feature vectors of images (2.7M, for indexing); source images (38M, for viewing) |
| File format | This image dataset consists of 10221 images. Each image is represented by 3 vectors corresponding to its structure, color, and texture properties. |
| Distance metric | A linear combination of L-metrics on the 3 feature vectors. |
| In MoBIoS Index | Data type: mobios.type.Image. Distance metric: mobios.dist.ImageMetric. Suggested range query radii: 0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3. |
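A linear combination of L-metrics over the three feature vectors might look like the Python sketch below. The weights, the choice of p, and the names `l_metric` and `image_distance` are illustrative assumptions; the actual coefficients used by mobios.dist.ImageMetric are not given here.

```python
def l_metric(a, b, p=2.0):
    """Minkowski L_p distance between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def image_distance(img_a, img_b, weights=(1 / 3, 1 / 3, 1 / 3), p=2.0):
    """Weighted sum of L_p distances over (structure, color, texture) vectors.

    The weights are hypothetical placeholders, not values from MoBIoS.
    """
    return sum(
        w * l_metric(fa, fb, p)
        for w, fa, fb in zip(weights, img_a, img_b)
    )


# Each image is a tuple of three feature vectors: (structure, color, texture).
img_a = ([0.0, 0.0], [1.0], [0.0, 0.0, 0.0])
img_b = ([3.0, 4.0], [1.0], [0.0, 0.0, 2.0])
d = image_distance(img_a, img_b, weights=(0.5, 0.25, 0.25))
print(d)
```

With strictly positive weights, a weighted sum of metrics is itself a metric on the combined representation, which is what allows a single metric-space index to cover all three features at once.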