bplustreedotnet -- A B+tree implementation for C#, java, and Python with related modules |
quickstart bplusdotnet project page with download links |
Executive Summary: The bplusdotnet package is a library of cross compatible data structure implementations in C#, java, and Python which are useful for applications which need to store and retrieve persistent information. The bplusdotnet data structures make it easy to store string keys associated with values permanently. For example you could use bplusdotnet objects to associate names to addresses, or phone numbers to names, or book titles to book content, or names to face images. The bplusdotnet package is an open source indexed file implementation hosted on SourceForge.Key features of the package include:
Very fast retrieval of values associated to key strings. Cross platform portable file formats between standard C#, java, and Python implementations. No practical data file size limitations. Support for Unicode strings with localized string orderings. (localization for C# only) Key values as unicode strings with either unlimited length or maximum length fixed at the creation of the index. Mapping values as unicode strings or byte arrays of arbitrary size or as long integers. Random access to keys and values by key equality. Sequential access to "larger" keys and values by localized Unicode key ordering. Support for commits or aborts for groups of index modifications. Hash based indexing options for better expected performance in some cases. Optional automatic serialization and deserialization of serializable objects. (C# only) Configurable "core memory footprint" usage. Automatic reclamation of unused file space. Data structure recovery in some cases of file corruption.
General Technical Notes: the BplusDotNet package provides a variant of the classical BplusTree file indexing structure, which is one of the most beautiful and useful inventions of computer science, with significance to civilization rivalling the invention of the arch, double entry accounting, and arabic numerals. This variant differs from the classical presentation in that it does not maintain "next leaf" pointers in leaf nodes in order to provide a higher resilience to errors in the case of program failures -- in this way a "commit" or "abort" can be made final with the completion of a single disk buffer write. To provide additional transaction resilience no disk buffer is ever modified in place -- new data is always written to new buffers and the old buffers are release for reuse only after the root record has been committed.For additional discussion of B+trees and related structures please see The Ubiquitous B-Tree (Computing Surveys, Vol. 11, No. 2, June 1979) also available at http://bibin.hio.no/sivbib2000/databaseteori/Ubiqbtre.htm as of this writing.
The tree interface source documentation are kept primarily in the ITreeIndex.cs interface definition file to avoid redundancy in the implementation files.
The BplusTree structure provides the most accessible interface for storing and retrieving string entries associated with string keys. For BplusTree the key length must map to a sequence of bytes (using utf8 encoding) which doesn't exceed the key length defined at the time the tree is created. The xBplusTree structure removes this keylength limitation (see below).
// CREATE A NEW BPLUSTREE USING FILE PATHS public static BplusTree Initialize(string treefileName, string blockfileName, int KeyLength, int CultureId, int nodesize, int buffersize) public static BplusTree Initialize(string treefileName, string blockfileName, int KeyLength, int CultureId) public static BplusTree Initialize(string treefileName, string blockfileName, int KeyLength) // CREATE A NEW BPLUSTREE USING AN EXISTING STREAM public static BplusTree Initialize(System.IO.Stream treefile, System.IO.Stream blockfile, int KeyLength, int CultureId, int nodesize, int buffersize) public static BplusTree Initialize(System.IO.Stream treefile, System.IO.Stream blockfile, int KeyLength, int CultureId) public static BplusTree Initialize(System.IO.Stream treefile, System.IO.Stream blockfile, int KeyLength)Each of these methods create a BplusTree with an associated tree index stream, an associated block data stream for values, a maximum key size (in bytes), an associated Culture for determining the key ordering, a maximum number of children per leaf nodes (nodesize), and a buffersize for breaking value strings into chunks in the data file.
In the signatures above the files may be provided as prepared streams or may be specified as file system path name strings.
The key size determines how large keys may be in the index -- if it is set too low you will not be able to place large enough keys into the structure, if it is set too high you will waste space in the index file.
The CultureId determines the ordering to use for sorting the key values. The default used is System.Globalization.CultureInfo.InvariantCulture.LCID. (See JAVA/PYTHON NOTE 1.)
The nodesize and buffersize parameters may be useful for optimization and testing, but most users may find the default values acceptible.
public static BplusTree ReOpen(System.IO.Stream treefile, System.IO.Stream blockfile) public static BplusTree ReOpen(string treefileName, string blockfileName)
To associate one string with another using a Bplustree bpt write
bpt["one string"] = "another string"; bpt["a second string"] = "more information";To get the associated value back write
Console.WriteLine("the value for 'one string' is "+bpt["one string"]) // prints "the value for 'one string' is another string\r\n" to the console(See JAVA NOTE 2.)
If you attempt to retrieve a key which has no associated value the bpt data structure with throw a BplusTreeException. To test whether a key has an associated value use the ContainsKey method.
if (bpt.ContainsKey("no such key yet")) { Console.WriteLine("the value for 'no such key yet' is "+bpt["no such key yet"]); } else { Console.WriteLine("the key 'no such key yet' is not in the index"); }
The assignment made above will not be made "permanent" for future use, however until the index bpt is committed, using
bpt.Commit();Alternatively, to discard all modifications since the last commit point (or since the index was opened if that was more recent) use
bpt.Abort();To change the value associated with a key use the same index assignment notation.
bpt["one string"] = "an alternate string";To remove the association of a key with a value use
bpt.RemoveKey("a second string");The NextKey method gets the next larger key than the argument or null if there is no larger key. So to list all the keys and values between keys "russia" to the end of the index
string currentkey = "russia"; while (currentkey!=null) { if (bpt.ContainsKey(currentkey)) { Console.WriteLine("key="+currentkey+" value="+bpt[currentkey]); } currentkey = bpt.NextKey(currentkey); }If you want to start at the first key of the index use
string currentkey = bpt.FirstKey();When you are done with an index which you have modified it is good practice to either Commit or Abort the modifications. If you wish to shut down a modified index immediately without going throught the abort procedure you may use the ShutDown method, which is equivalent to aborting but may leave some space in the index and data files unreachable. To attempt to recover unreachable space in the files after reopening an index use the Recover(true) method.
The methods to create a xBplusTree have the same signature as those for creating a BplusTree except for the return type. For example here is the most verbose overload for BplusDotNet.xBplusTree.Initialize:
xBplusTree Initialize(string treefileName, string blockfileName, int PrefixLength, int CultureId, int nodesize, int buffersize)the only difference being that PrefixLength replaces KeyLength. The PrefixLength value determines the number of bytes of each key which is used to locate it within the internal tree structure.
After creation an xBplusTree presents the exact same interface as a BplusTree except that all key values will be permitted in the index.
NOTE: The xBplusTree implementation is more expensive and slower than the BplusTree implementation -- so the BplusTree implementation should be preferred whenever this is acceptible. Also, an xBplusTree may perform poorly when, for example, 100000 strings are identified by the same byte prefix. It is a good idea to pick a PrefixLength large enough so that only a few key strings are expected to have the same prefix for the given application.
The hBplusTree variant uses a hashing scheme to allow keys of unlimited length with expected good performance regardless of the key values. However since the hashing mechanism randomizes the placement of the key values in the underlying tree implementation the FirstKey and NextKey methods will not present the keys in dictionary order, but will traverse the keys in an arbitrary order.
Index | Value type | Key Size | Safe for key prefix Collisions |
Dictionary Order Key Traversal |
---|---|---|---|---|
BplusTreeLong | long integers | maximum fixed at creation | not applicable | yes |
BplusTreeBytes | byte arrays | maximum fixed at creation | not applicable | yes |
BplusTree | unicode strings | maximum fixed at creation | not applicable | yes |
xBplusTreeBytes | byte arrays | unlimited | common prefixes in keys will cause collisions |
yes |
xBplusTree | unicode string | unlimited | common prefixes in keys will cause collisions |
yes |
hBplusTreeBytes | byte arrays | unlimited | none expected | key traversal order is random |
hBplusTree | unicode strings | unlimited | none expected | key traversal order is random |
SerializedTree | Serializable objects | same properties as the underlying tree implementation |
same properties as the underlying tree implementation |
same properties as the underlying tree implementation |
The BplusDotNet distribution includes the dynamically loaded assembly BplusDotNet.dll which may be added as a reference to any .NET based programming project. Programmers with Visual Studio .NET may also open the project file BplusDotNet.csprof with VS.NET and rebuild the module.
The BplusDotNet distribution also includes the testing module which attempts to hammer the implementation in confusing ways. Execute testing.exe or compile and run the testing project to run the tests on your system. If it completes with no exception thrown then the tests have passed. By default the tests use "in memory structures" to emulate disk based files -- edit the testdirectory static variable to test to real disk files. (See JAVA NOTE 4.)
Each of these tree implementations supports the ITreeIndex.cs interface which defines common methods for traversing, retrieving, modifying, and maintaining the index structures. In particular the ITreeIndex interface defines Commit and Abort methods which permits programmers to make groups of index modifications permanent or forget groups of modifications.
Furthermore there are two additional buffered file implementations
The java implementations reside in the package
NET.sourceforge.BplusJ.BplusJand the test module resides in
NET.sourceforge.BplusJ.testing;
JAVA/PYTHON NOTE 1: LOCALIZATION NOT YET SUPPORTED. The java implementation only permits the invariant culture id at this time BplusTreeLong.INVARIANTCULTUREID = 127.
JAVA NOTE 2: INDEXING METHODS. The subscripting notation for tree indexing cannot be used in java, so the java implementation uses a method notation instead
tree.set(key, value); // replaces tree[key] = value; value = tree.get(key); // replaces value = tree[key];
JAVA/PYTHON NOTE 3: AUTOMAGIC SERIALIZATION/DESERIALIZATION NOT YET. There is no analogue to the SerializedTree provided in the java package at this time.
JAVA NOTE 4: EDIT THE TEST SUITE TO DEFINE THE TEST DIRECTORY. If you run the java test suite "out of the box" you will see something like this:
C:\installstuff\BplusDotNet_beta4> javac NET\sourceforge\BplusJ\testing\bplustest.java C:\installstuff\BplusDotNet_beta4> java NET.sourceforge.BplusJ.testing.bplusTest encode/decode of ints ok encode/decode of longs ok encode/decode of longs ok TESTING BUFFERFILE to run these tests you need to edit the source file, adding a String definition for tempdirectory Exception in thread "main" ....This is because the java test suite has no "test to memory" option, but must test to disk files. In order to make the tests run, do as requested and modify the definition for tempdirectory in bplusTest.java to point to an existing temporary directory
static String tempdirectory = "c:\\tmp"; // set to a directory to test storing to/from filesThe CompatTest() method (defined in C# and java) is intended to test the portability of the index file formats. To test the portability properly you need to define the tempdirectory variable to point to the same directory in each test driver program and run each test twice (java, C#, java, C#) in order test that one implementation can open a structure created by the other.
C:\installstuff\BplusDotNet_beta4>setup.py installThe python implementation also includes a testing.py test suite. This program is slower than the other two because it scans the complete contents of the trees very many times. In real use the Python implementation should not be noticably slower than the others. Once installed use the python implementation in a python program or from the command line, for example, as shown below:
C:\anydirectory>python Python 2.3.4 (#53, May 25 2004, 21:17:02) [MSC v.1200 32 bit (Intel)] on Type "help", "copyright", "credits" or "license" for more information. >>> import BplusPy.BplusTree >>> tree = BplusPy.BplusTree.Initialize("/tmp/x.dat", "/tmp/y.dat", 15) >>> tree["hello"] = "goodbye" >>> print tree["hello"] goodbye >>> print repr(tree["hello"]) u'goodbye' >>> tree.Commit() >>> tree.ShutDown()Note that all implementations automatically convert keys to unicode strings and the [xh]BplusTree implementations also automatically convert the values to unicode strings. (The [xh]BplusTreeBytes implementations keep values as python strings -- 8bit clean byte sequences).
The bplusdotnet package is Copywrite Aaron Watters 2004. the package is licensed under the BSD open source license which means that you can do basically anything with it (except claim that you own it).