bplustreedotnet -- A B+tree implementation for C#, java, and Python with related modules

quickstart
bplusdotnet project page with download links

Executive Summary: The bplusdotnet package is a library of cross compatible data structure implementations in C#, java, and Python which are useful for applications which need to store and retrieve persistent information. The bplusdotnet data structures make it easy to store string keys associated with values permanently. For example you could use bplusdotnet objects to associate names to addresses, or phone numbers to names, or book titles to book content, or names to face images. The bplusdotnet package is an open source indexed file implementation hosted on SourceForge.
Key features of the package include:

Very fast retrieval of values associated to key strings.
Cross platform portable file formats between standard C#, java, and Python implementations.
No practical data file size limitations.
Support for Unicode strings with localized string orderings. (localization for C# only)
Key values as unicode strings with either unlimited length or maximum length fixed at the creation of the index.
Mapping values as unicode strings or byte arrays of arbitrary size or as long integers.
Random access to keys and values by key equality.
Sequential access to "larger" keys and values by localized Unicode key ordering.
Support for commits or aborts for groups of index modifications.
Hash based indexing options for better expected performance in some cases.
Optional automatic serialization and deserialization of serializable objects. (C# only)
Configurable "core memory footprint" usage.
Automatic reclamation of unused file space.
Data structure recovery in some cases of file corruption.

General Technical Notes: the BplusDotNet package provides a variant of the classical BplusTree file indexing structure, which is one of the most beautiful and useful inventions of computer science, with significance to civilization rivalling the invention of the arch, double entry accounting, and arabic numerals. This variant differs from the classical presentation in that it does not maintain "next leaf" pointers in leaf nodes in order to provide a higher resilience to errors in the case of program failures -- in this way a "commit" or "abort" can be made final with the completion of a single disk buffer write. To provide additional transaction resilience no disk buffer is ever modified in place -- new data is always written to new buffers and the old buffers are release for reuse only after the root record has been committed.
For additional discussion of B+trees and related structures please see The Ubiquitous B-Tree (Computing Surveys, Vol. 11, No. 2, June 1979) also available at http://bibin.hio.no/sivbib2000/databaseteori/Ubiqbtre.htm as of this writing.

Project Links and Downloads

Please see the Source forge bplusdotnet project page http://sourceforge.net/projects/bplusdotnet/ for links to downloads and information regarding source code [CVS] access and project membership.

Documentation

The documentation for the modules consist of the remainder of this document in addition to the comments preceding the class, attribute, and method definitions found in the source code (using the standard VisualStudio.NET conventions). The documentation addresses the C# implementation primarily. The java and Python implementations are very similar. See the notes at the bottom additional comments regarding the java and Python implementations.

The tree interface source documentation are kept primarily in the ITreeIndex.cs interface definition file to avoid redundancy in the implementation files.

The BplusTree structure provides the most accessible interface for storing and retrieving string entries associated with string keys. For BplusTree the key length must map to a sequence of bytes (using utf8 encoding) which doesn't exceed the key length defined at the time the tree is created. The xBplusTree structure removes this keylength limitation (see below).

Creating a new BplusTree

There are quite a few ways to create a new BplusTree using the various signatures of the BplusDotNet.BplusTree.Initialize static method.

     // CREATE A NEW BPLUSTREE USING FILE PATHS
     public static BplusTree Initialize(string treefileName, string blockfileName, 
			int KeyLength, int CultureId,
			int nodesize, int buffersize)


     public static BplusTree Initialize(string treefileName, 
			string blockfileName, int KeyLength, int CultureId) 

     public static BplusTree Initialize(string treefileName, string blockfileName, int KeyLength) 

     // CREATE A NEW BPLUSTREE USING AN EXISTING STREAM
     public static BplusTree Initialize(System.IO.Stream treefile, 
			System.IO.Stream blockfile, int KeyLength, int CultureId,
			int nodesize, int buffersize) 

     public static BplusTree Initialize(System.IO.Stream treefile, 
			System.IO.Stream blockfile, int KeyLength, int CultureId) 

     public static BplusTree Initialize(System.IO.Stream treefile, 
			System.IO.Stream blockfile, int KeyLength)

Each of these methods create a BplusTree with an associated tree index stream, an associated block data stream for values, a maximum key size (in bytes), an associated Culture for determining the key ordering, a maximum number of children per leaf nodes (nodesize), and a buffersize for breaking value strings into chunks in the data file.

In the signatures above the files may be provided as prepared streams or may be specified as file system path name strings.

The key size determines how large keys may be in the index -- if it is set too low you will not be able to place large enough keys into the structure, if it is set too high you will waste space in the index file.

The CultureId determines the ordering to use for sorting the key values. The default used is System.Globalization.CultureInfo.InvariantCulture.LCID. (See JAVA/PYTHON NOTE 1.)

The nodesize and buffersize parameters may be useful for optimization and testing, but most users may find the default values acceptible.

Accessing an existing BplusTree

Gaining access to an existing BplusTree is simple than creating a new one because most of the parameters are stored in the file information within the tree. In this case you need only specify the file names or the stream objects to use for the BplusDotNet.BplusTree.ReOpen static method.

     public static BplusTree ReOpen(System.IO.Stream treefile, System.IO.Stream blockfile) 

     public static BplusTree ReOpen(string treefileName, string blockfileName)

Using a BplusTree

To associate one string with another using a Bplustree bpt write

     bpt["one string"] = "another string";
     bpt["a second string"] = "more information";

To get the associated value back write

     Console.WriteLine("the value for 'one string' is "+bpt["one string"])
     // prints "the value for 'one string' is another string\r\n" to the console

(See JAVA NOTE 2.)

If you attempt to retrieve a key which has no associated value the bpt data structure with throw a BplusTreeException. To test whether a key has an associated value use the ContainsKey method.

     if (bpt.ContainsKey("no such key yet")) {
         Console.WriteLine("the value for 'no such key yet' is "+bpt["no such key yet"]);
     } else {
         Console.WriteLine("the key 'no such key yet' is not in the index");
     }

The assignment made above will not be made "permanent" for future use, however until the index bpt is committed, using

     bpt.Commit();

Alternatively, to discard all modifications since the last commit point (or since the index was opened if that was more recent) use

     bpt.Abort();

To change the value associated with a key use the same index assignment notation.

     bpt["one string"] = "an alternate string";

To remove the association of a key with a value use

     bpt.RemoveKey("a second string");

The NextKey method gets the next larger key than the argument or null if there is no larger key. So to list all the keys and values between keys "russia" to the end of the index

     string currentkey = "russia";
     while (currentkey!=null) {
         if (bpt.ContainsKey(currentkey)) {
             Console.WriteLine("key="+currentkey+" value="+bpt[currentkey]);
         }
         currentkey = bpt.NextKey(currentkey);
     }

If you want to start at the first key of the index use

     string currentkey = bpt.FirstKey();

When you are done with an index which you have modified it is good practice to either Commit or Abort the modifications. If you wish to shut down a modified index immediately without going throught the abort procedure you may use the ShutDown method, which is equivalent to aborting but may leave some space in the index and data files unreachable. To attempt to recover unreachable space in the files after reopening an index use the Recover(true) method.

On xBplusTree

As mentioned above the xBplusTree implementation is very similar to the BplusTree implementation, except that the structure permits keys to be arbitrarily large. [The "x" is for "extended".]

The methods to create a xBplusTree have the same signature as those for creating a BplusTree except for the return type. For example here is the most verbose overload for BplusDotNet.xBplusTree.Initialize:

xBplusTree Initialize(string treefileName, string blockfileName, 
                        int PrefixLength, int CultureId,
			int nodesize, int buffersize)

the only difference being that PrefixLength replaces KeyLength. The PrefixLength value determines the number of bytes of each key which is used to locate it within the internal tree structure.

After creation an xBplusTree presents the exact same interface as a BplusTree except that all key values will be permitted in the index.

NOTE: The xBplusTree implementation is more expensive and slower than the BplusTree implementation -- so the BplusTree implementation should be preferred whenever this is acceptible. Also, an xBplusTree may perform poorly when, for example, 100000 strings are identified by the same byte prefix. It is a good idea to pick a PrefixLength large enough so that only a few key strings are expected to have the same prefix for the given application.

On hBplusTrees

If you need unlimited key lengths and do not need to access the elements in key sorted order you should consider using the hBplusTree variant. [In this case the "h" stands for "hashed".]

The hBplusTree variant uses a hashing scheme to allow keys of unlimited length with expected good performance regardless of the key values. However since the hashing mechanism randomizes the placement of the key values in the underlying tree implementation the FirstKey and NextKey methods will not present the keys in dictionary order, but will traverse the keys in an arbitrary order.

On BplusTreeBytes, xBplusTreeBytes, hBplusTreeBytes, and BplusTreeLong

BplusTreeBytes is similar to BplusTree except that the values associated with keys are arrays of bytes instead of unicode strings. xBplusTreeBytes is the analogous variant of BplusTreeBytes which supports unlimited key lengths, and maps unicode string keys to byte array values. hBplusTreeBytes is the analogous variant of BplusTreeBytes which supports unlimited key lengths using a hashing scheme to avoid key prefix collision problems and with no support for key sorted traversal. BplusTreeLong are mainly useful for more advanced uses. All tree index data structures support the ITreeIndex interface but have different set/get datatypes for theValue for the index notation bpt[key] = theValue. The interfaces IByteTree, IStringTree and IObjectTree defined specialized interfaces for trees that store byte arrays, strings, and serialized objects respectively. These interfaces inherit ITreeIndex and are defined in the ITreeIndex.cs source file.

On SerializedTrees

More sophisticated programmers may wish to use SerializedTrees to automatically store and retrieve serializable objects (using the .NET serialization methods). A SerializedTree can be built using any of the IByteTree implementations. The details and complexities involved in object serialization are beyond the scope of this document. (See JAVA/PYTHON NOTE 3.)

Feature Matrix

The following table summarizes the salient differences among the various index implementations. In the table the earlier implementations are simpler and more efficient than later ones, and thus the earlier listed implementations should be preferred when there is no reason to use a later one.

Index	Value type	Key Size	Safe for key prefix Collisions	Dictionary Order Key Traversal
BplusTreeLong	long integers	maximum fixed at creation	not applicable	yes
BplusTreeBytes	byte arrays	maximum fixed at creation	not applicable	yes
BplusTree	unicode strings	maximum fixed at creation	not applicable	yes
xBplusTreeBytes	byte arrays	unlimited	common prefixes in keys will cause collisions	yes
xBplusTree	unicode string	unlimited	common prefixes in keys will cause collisions	yes
hBplusTreeBytes	byte arrays	unlimited	none expected	key traversal order is random
hBplusTree	unicode strings	unlimited	none expected	key traversal order is random
SerializedTree	Serializable objects	same properties as the underlying tree implementation	same properties as the underlying tree implementation	same properties as the underlying tree implementation

Installation

The BplusDotNet distribution includes the dynamically loaded assembly BplusDotNet.dll which may be added as a reference to any .NET based programming project. Programmers with Visual Studio .NET may also open the project file BplusDotNet.csprof with VS.NET and rebuild the module.

Tests

The BplusDotNet distribution also includes the testing module which attempts to hammer the implementation in confusing ways. Execute testing.exe or compile and run the testing project to run the tests on your system. If it completes with no exception thrown then the tests have passed. By default the tests use "in memory structures" to emulate disk based files -- edit the testdirectory static variable to test to real disk files. (See JAVA NOTE 4.)

Deficiencies

At the time of this writing there are no known bugs. It is certain that the code could be refined and sped up a bit, but I don't think the speed difference would be noticable (I've been wrong before, however). Please use the project support procedure if you find bugs or wish to offer other improvements.

Guide to the modules

The package provides three indexed file implementations:

BplusTree.cs provides a bplusTree based file indexing structure which uses an index file to provide quick access to data stored in a separate data file. For BplusTree.cs both the keys and values are maintained as unicode strings, and the values may be arbitrarily large. xBplusTree.cs provides a similar implementation which allows keys to be arbitrarily large as well. hBplusTree.cs provides a similar implementation allowing arbitrarily long keys using a hashing scheme. SerializedTree.cs provides an implementation which automatically serializes and deserializes serializable values.

BplusTreeBytes.cs is similar to BplusTree.cs except the values are maintained as arbitrarily large arrays of bytes (unformatted Binary Large Objects). xBplusTreeBytes.cs provides a similar implementation which allows keys to be arbitrarily large as well. hBplusTreeBytes provides an analogous implementation using the hashing scheme described above.

BplusTreeLong.cs is the most primative bplus tree implementation where the values associated with keys are uninterpreted long integers (which are usually used to represent block numbers or seek positions into a separate data file. The BplusTreeLong.cs objects use a single file to represent the mapping of string keys to long integers.

Each of these tree implementations supports the ITreeIndex.cs interface which defines common methods for traversing, retrieving, modifying, and maintaining the index structures. In particular the ITreeIndex interface defines Commit and Abort methods which permits programmers to make groups of index modifications permanent or forget groups of modifications.

Furthermore there are two additional buffered file implementations

BufferFile.cs defines a wrapper for file streams which presents the file stream as a sequence of byte buffers. This basic wrapper is used by all the other modules.

LinkedFile.cs defines a wrapper over BufferFile which links fixed size buffers into linked lists of arbitrary length and also supports freeing and reallocation of unused buffers.

Caveat regarding Commit/Abort

The Commit and Abort methods will provide atomic transactional changes to indices under the assumption that the underlying Stream.Write operations are atomic. If any Stream.Write fails after making partial modifications to a file then the index structures may be damaged in a way that cannot be repaired.

Notes on the java implementation

Except for the serialized tree implementation all of the index types defined in the C# implementation have compatible java implementations. In particular indexes created using the C# modules can be opened and modified by the java modules and the reverse.

The java implementations reside in the package

	NET.sourceforge.BplusJ.BplusJ

and the test module resides in

	NET.sourceforge.BplusJ.testing;

JAVA/PYTHON NOTE 1: LOCALIZATION NOT YET SUPPORTED. The java implementation only permits the invariant culture id at this time BplusTreeLong.INVARIANTCULTUREID = 127.

JAVA NOTE 2: INDEXING METHODS. The subscripting notation for tree indexing cannot be used in java, so the java implementation uses a method notation instead

	tree.set(key, value); // replaces tree[key] = value;
	value = tree.get(key); // replaces value = tree[key];

JAVA/PYTHON NOTE 3: AUTOMAGIC SERIALIZATION/DESERIALIZATION NOT YET. There is no analogue to the SerializedTree provided in the java package at this time.

JAVA NOTE 4: EDIT THE TEST SUITE TO DEFINE THE TEST DIRECTORY. If you run the java test suite "out of the box" you will see something like this:

	C:\installstuff\BplusDotNet_beta4> javac NET\sourceforge\BplusJ\testing\bplustest.java
	C:\installstuff\BplusDotNet_beta4> java NET.sourceforge.BplusJ.testing.bplusTest
	encode/decode of ints ok
	encode/decode of longs ok
	encode/decode of longs ok
	TESTING BUFFERFILE
	to run these tests you need to edit the source file, adding a String definition
	for tempdirectory
	Exception in thread "main" ....

This is because the java test suite has no "test to memory" option, but must test to disk files. In order to make the tests run, do as requested and modify the definition for tempdirectory in bplusTest.java to point to an existing temporary directory

	static String tempdirectory = "c:\\tmp"; // set to a directory to test storing to/from files

The CompatTest() method (defined in C# and java) is intended to test the portability of the index file formats. To test the portability properly you need to define the tempdirectory variable to point to the same directory in each test driver program and run each test twice (java, C#, java, C#) in order test that one implementation can open a structure created by the other.

Notes on the Python implementation

To install the python implementation as a library use the setup bootstrap provided:

C:\installstuff\BplusDotNet_beta4>setup.py install

The python implementation also includes a testing.py test suite. This program is slower than the other two because it scans the complete contents of the trees very many times. In real use the Python implementation should not be noticably slower than the others. Once installed use the python implementation in a python program or from the command line, for example, as shown below:

C:\anydirectory>python
Python 2.3.4 (#53, May 25 2004, 21:17:02) [MSC v.1200 32 bit (Intel)] on
Type "help", "copyright", "credits" or "license" for more information.
>>> import BplusPy.BplusTree
>>> tree = BplusPy.BplusTree.Initialize("/tmp/x.dat", "/tmp/y.dat", 15)
>>> tree["hello"] = "goodbye"
>>> print tree["hello"]
goodbye
>>> print repr(tree["hello"])
u'goodbye'
>>> tree.Commit()
>>> tree.ShutDown()

Note that all implementations automatically convert keys to unicode strings and the [xh]BplusTree implementations also automatically convert the values to unicode strings. (The [xh]BplusTreeBytes implementations keep values as python strings -- 8bit clean byte sequences).

Copywrite and Licensing

The bplusdotnet package is Copywrite Aaron Watters 2004. the package is licensed under the BSD open source license which means that you can do basically anything with it (except claim that you own it).