.\" .\" (C) Copyright, 1988, 1989 Marcus J. Ranum .\" All rights reserved .\" .\" .\" This software, its documentation, and supporting .\" files are copyrighted material and may only be .\" distributed in accordance with the terms listed in .\" the COPYRIGHT document. .\" .\" $Header: /atreus/mjr/hacks/mud/btlib/RCS/btree.3,v 1.1 90/03/23 15:04:00 mjr Exp $ .\" .TH B\+TREE 3 "30 September 1989" .SH NAME b+tree \- variable length b+tree index management package .SH SYNOPSIS .B #include <sys/types.h> .br .B #include <btree.h> .sp .LP .B "BT_INDEX bt_open(path,flags,mode)" .br .B "char *path;" .br .B "int flags;" .br .B "int mode;" .LP .B #include <varargs.h> .br .B "BT_INDEX bt_optopen(va_alist)" .LP .B "int bt_delete(btree,key,len)" .br .B "BT_INDEX *btree;" .br .B "bt_chrp key;" .br .B "int len;" .LP .B "int bt_find(btree,key,len,rrn)" .br .B "BT_INDEX *btree;" .br .B "bt_chrp key;" .br .B "int len;" .br .B "off_t *rrn;" .LP .B "int bt_goto(btree,d)" .br .B "BT_INDEX *btree;" .br .B "int d; /* BT_EOF or BT_BOF */" .LP .B "int bt_insert(btree,key,len,rrn,dupflg)" .br .B "BT_INDEX *btree;" .br .B "bt_chrp key;" .br .B "int len;" .br .B "off_t rrn;" .br .B "int dupflg;" .LP .B "int bt_traverse(btree,d,kbuf,maxlen,retlen,retrrn)" .br .B "BT_INDEX *btree;" .br .B "int d; /* BT_EOF or BT_BOF */" .br .B "bt_chrp kbuf;" .br .B "int maxlen;" .br .B "int *retlen;" .br .B "off_t *retrrn;" .LP .B "void bt_perror(btree,s)" .br .B "BT_INDEX *btree;" .br .B "char *s;" .LP .B "int bt_rlabel(btree,buf,siz)" .br .B "BT_INDEX *btree;" .br .B "bt_chrp buf;" .br .B "int siz;" .LP .B "int bt_wlabel(btree,buf,siz)" .br .B "BT_INDEX *btree;" .br .B "bt_chrp buf;" .br .B "int siz;" .LP .B "int bt_sync(btree)" .br .B "BT_INDEX *btree;" .LP .B "int bt_close(btree)" .br .B "BT_INDEX *btree;" .LP .B "int bt_load(btree,key,len,rrn,flag)" .br .B "BT_INDEX *btree;" .br .B "bt_chrp key;" .br .B "int len;" .br .B "off_t rrn;" .br .B "int flag; /* either BT_BOF, BT_OK, or BT_EOF */" .LP .B "int bt_zap(btree)" .br .B "BT_INDEX *btree;" .LP .B "The following are macros:" .LP .B "(int) bt_fileno(btree)" .br .B "BT_INDEX *btree;" .LP .B "(int) bt_pagesiz(btree)" .br .B "BT_INDEX *btree;" .LP .B "(int) bt_dtype(btree)" .br .B "BT_INDEX *btree;" .LP .B "(*int)() bt_cmpfn(btree)" .br .B "BT_INDEX *btree;" .LP .B "(int) bt_errno(btree)" .br .B "BT_INDEX *btree;" .LP .B "(void) bt_clearerr(btree)" .br .B "BT_INDEX *btree;" .LP .B "(int) BT_MAXK(btree)" .br .B "BT_INDEX *btree;" .LP .B "(int) BT_LABSIZ(btree)" .br .B "BT_INDEX *btree;" .SH DESCRIPTION .LP The b+tree library implements a variable-length key variable page size b+tree. A variety of options are supported, including user-settable cache modes, cache sizes, user defined data types, and several system defined data types. Support for variable length string keys with automatic prefixing is built in. The library does \fINOT\fR support any means of data storage for anything more than keys - for that the user must have access to a record management library (such as .B "store(3)" or .B "btdbm(3)" \). .LP The b+tree library stores all keys as variable length blocks of unsigned chars, and associates a \fIRemote Record Number (RRN)\fP with each key. This is defined as the typedef .B off_t and is typically used to store a disk address of a record someplace else in secondary storage. There is no provision for storing more associated data than the \fIRRN\fP in the index. The typedef .B bt_chrp is provided to increase portability and should be used in preference to a char pointer, though they may really be the same type. .SH "INITIALIZING A B+TREE" .LP When a b+tree is opened, an attempt is made to read a block of information at the beginning of the file which contains parameters such as page size, etc. If no bytes are read, indicating the file is empty, this is taken as an indication that the file should be initialized, and a new b+tree is created. To prevent confusion, a variety of checks are made, including verifying that a magic number is correct, and that any parameters provided make sense. If the open succeeds, a pointer to a dynamically allocated b+tree index is returned. In the event of an error, a NULL pointer is returned. .LP If an existing b+tree is re-opened, the information in the header block may override any options from the user, if appropriate. Thus, if a b+tree has already been created with one page size, attempts to use a different page size are politely ignored. .SH "OPENING A B+TREE" .LP There are two functions provided for opening a b+tree. The first, .B "bt_open()" is intended to look like the .B "open(2)" system call, and assumes that the caller wishes to open a b+tree with all the default options. These defaults are built into the library, and set the page size to some reasonable value in view of the system buffer size, cache size to some reasonable value, cache modes to something conservative, and the b+tree data type to string. .LP The second function for opening a b+tree, .B "bt_optopen()" allows a user to set a larger variety of options. The interface to the function is through a variable length argument list, terminated by a zero. Each of the arguments is interpreted in sequence as a request for a specific parameter in the b+tree to be set. Each request may be followed by one or more arguments, depending on the specific request. Omission of an argument, or the terminating zero will result in disaster. Not all of the request are required, and if the request is omitted, the default will be taken. Those marked as mandatory must be provided. .LP .B "Options to bt_optopen()" .LP .I BT_PATH: This option must be followed by a null-terminated string specifying the path name of the file to open. \fIThis value must be provided\fR. .LP .I BT_PSIZE: This option must be followed by an integer value indicating the desired page size to use. The default is to use a value based on the system buffer size. .LP .I BT_CACHE: This option must be followed by an integer value specifying the number of cache buffers to use in addition to the minimum number of cache buffers required for scratch pages. The default is to use a reasonable sized cache, which should be adequate. .LP .I BT_CFLAG: This option must be followed by a flag value which specifies the manner in which cache buffers are to be handled. The permitted values are defined as .I BT_NOCACHE .I BT_ROCACHE or .I BT_RWCACHE standing for (respectively) no cache, read-only cache, and read-write cache. The effects of these values is as follows: When no cache is indicated, pages are read from disk every time a read request is issued, and are written to disk with every write request. The minimal existing cache buffers are still required for splitting and insertion, but data is never taken from them. When read-only cache is in effect, reads may be returned directly out of cache, but all page write requests are still flushed immediately to disk. This means that programs which exit suddenly for some reason or another need be less concerned about whether or not the b+tree index has been flushed. The read-write cache option allows pages to be written directly into cache, where they are only flushed to disk when expired, or the tree is closed. This mode should save some time when building an index in batch mode. At any time, the .B "bt_sync()" function can be called to flush any modified pages to disk. Closing a b+tree also flushes the cache. .LP .I BT_OMODE: This option must be followed by an integer value which is used as flags to the .B "open(2)" system call. .LP .I BT_OPERM: This option must be followed by an integer value which is passed to the .B "open(2)" system call as the permissions information. .LP .I BT_MAGIC: This option must be followed by a long value which is used as the magic number for the b+tree. This replaces the default value. One side-effect of setting this value is that the correct value \fImust\fR be provided again if the tree is re-opened. .LP .I BT_DTYPE: This option must be followed by a flag value indicating the type of data that is being stored in the index. The permitted values are defined as .I BT_STRING .I BT_INT .I BT_LONG .I BT_FLOAT and .I BT_USRDEF meaning either string, integer, long, float, or user-defined. The string data type is the default, and there are functions built into the tree that improve string performance. Depending on the system architecture, the compiler, etc, you may have trouble with floats, ints, and longs, as well as user-defined data, due to alignment problems. This can be gotten around only in system-specific ways, so no solution is offered here. .LP If the .I BT_USRDEF flag is passed, another value must be passed as well, being a pointer to a function returning an integer - the comparison function. This function is expected to be called as: .br .B "(*cmpf)(key1,length1,key2,length1);" .br where \fIkey1\fR is the first key, and \fIlength1\fR is its length, and \fIkey2\fR is the second key, with \fIlength2\fR as its length. The function is expected to return zero if the two keys are equal in sort order, a value less than zero if the first key is less than the second key, and a value greater than zero if the first key is greater than the second. when using the user-defined data structures, since the keys may be variable-length, the user is responsible for structure alignment and other nasties. As long as the keys are a multiple of the system word size, there should be no alignment problems. .LP .I BT_LABEL: This option must be followed by a character pointer and an integer value. The character pointer is used as a label, and is stored in some spare space at the end of the b+tree file header. The integer value must contain the length of the label being provided. If there is insufficient room in the b+tree file header, an error will result. The label can also be read and set using the .B "bt_rlabel()" and .B "bt_wlabel()" functions. .SH "BT_OPTOPEN EXAMPLES" .nf .na .ft C BT_INDEX *p; p = bt_optopen( BT_PATH, "test.dat", BT_MODE, 0600, BT_OMODE, O_CREAT|O_TRUNC, BT_CFLAG, BT_RWCACHE, 0); .ft R .fi .ad .LP The above would open \fItest.dat\fR with mode 0600, and would create or truncate an existing file. Since the file will be truncated, a new b+tree index will be initialized. The data type of the index will be the default, as will the cache size. The cache, however, has been flagged to not immediately flush pages to disk on write. .br .sp .nf .na .ft C BT_INDEX *p; extern int mycompare(); p = bt_optopen( BT_PATH, "test2.dat", BT_DTYPE, BT_USRDEF, mycompare, BT_PSIZE, (BUFSIZ * 2), BT_LABEL, "foo", 3, 0); .ft R .fi .ad .LP The above would open \fItest2.dat\fR with page size of double the system buffer size. The data type stored is indicated to be user-defined, and a comparison function .B "mycompare()" is provided to compare keys. The b+tree index is labeled \fIfoo\fR for mysterious reasons. .SH "OTHER B+TREE FUNCTIONS" .LP .B "int bt_delete(btree,key,len)" .br .B "BT_INDEX *btree;" .br .B "bt_chrp key;" .br .B "int len;" .br This function will search for the key and delete it from the index. If the key is not present in the index, the value .B BT_NF is returned. .LP .B "int bt_find(btree,key,len,rrn)" .br .B "BT_INDEX *btree;" .br .B "bt_chrp key;" .br .B "int len;" .br .B "off_t *rrn;" .br This function will search for the key and store the data pointer in .B rrn, returning .B BT_NF if the key is not present in the index. The current location in the index is stored to "point" to the located key, or to the key where it should have been if it was not present. .LP .B "int bt_goto(btree,d)" .br .B "BT_INDEX *btree;" .br .B "int d; /* BT_EOF or BT_BOF */" .br This function will move the current b+tree location to either the highest key in the tree or the lowest, depending on the value in .B d. If the value is .B BT_EOF the locator will be moved to the highest key in the tree, if it is .B BT_BOF the locator moves to the lowest key. The current location in the index is stored if the call is successful. .LP .B "int bt_insert(btree,key,len,rrn,dupflg)" .br .B "BT_INDEX *btree;" .br .B "bt_chrp key;" .br .B "int len;" .br .B "off_t rrn;" .br .B "int dupflg;" .br This function inserts the key in an index, with the data pointer value in .B rrn. Duplicate keys are not supported, and .B BT_ERR will be returned if an attempt is made to insert a duplicate value, unless .B dupflg is non-zero, in which case the value in .B rrn is used to replace the value currently stored in the index as a data pointer. This operation is more efficient than deleting a key and re-inserting it, and should be used where changing the data pointer is desired. When a new key is inserted in the index, the current location in the index is invalidated. Thus, inserting a key and calling .B bt_traverse will cause a seek to the beginning or the end of the index. .LP .B "int bt_traverse(btree,d,kbuf,maxlen,retlen,retrrn)" .br .B "BT_INDEX *btree;" .br .B "int d; /* BT_EOF or BT_BOF */" .br .B "bt_chrp kbuf;" .br .B "int maxlen;" .br .B "int *retlen;" .br .B "off_t *retrrn;" .br This function will move forward or backwards across the keys in the tree, depending on the value in .B d. If the value in .B d is .B BT_EOF the traverse will move towards the high end of the index, if it is .B BT_BOF the traverse will move towards the low end. If the traverse is successful, the next (or previous) key is placed in .B kbuf, the data pointer in .B retrrn, and the length of the key is stored in .B retlen. .B maxlen is checked to ensure that there is sufficient room in the user-provided buffer for the key, to prevent overruns. If an overrun would occur, the function returns .B BT_ERR and .B bt_errno for the index is set to .B BT_BTOOSMALL. If the end of the index is encountered, and the traverse can proceed no further, the value .B BT_EOF is returned by the function, or .B BT_BOF is returned if the beginning of the index is reached. If the current location in the index is undefined at the time of the call to .B bt_traverse a .B bt_goto is performed to the \fIopposite\fR end of the index from the direction in which the traverse is being performed. In this way, a tree can be opened and dumped by simply performing successive calls to .B bt_traverse. .LP .B "void bt_perror(btree,s)" .br .B "BT_INDEX *btree;" .br .B "char *s;" .br This function prints out any error messages currently associated with the index, in the manner of the .B "perror(3)" function. If there is no current error detected in the index, and a system error is present in .B errno, .B perror is called with the string .B s. .LP .B "int bt_rlabel(btree,buf,siz)" .br .B "BT_INDEX *btree;" .br .B "bt_chrp buf;" .br .B "int siz;" .br This function reads the label from the index and places it in .B buf, checking the value of .B siz to ensure there is enough room. Note that the software keeps no count of how many bytes of data are in the label. .LP .B "int bt_wlabel(btree,buf,siz)" .br .B "BT_INDEX *btree;" .br .B "bt_chrp buf;" .br .B "int siz;" .br This function writes the contents of .B buf, to the index label, consisting of .B siz bytes. .LP .B "int bt_sync(btree)" .br .B "BT_INDEX *btree;" .br This function causes any dirty cache buffers in the index buffer cache to be flushed to disk. .LP .B "int bt_close(btree)" .br This function closes a b+tree index and frees all dynamically allocated memory. .LP .B "int bt_load(btree,key,len,rrn,flag)" .br This function performs an optimal in-order load of a set of keys. The keys must be presented in \fBREVERSE\fP sort order. The advantages this function confers are several, and it should be used any time an index is being bulk-loaded, or is being rebuilt. When an index is built with .B bt_load the pages are filled in descending order, rather than through random access search as in .B bt_insert. Additionally, each page is filled to capacity, which means that less space is taken up for the index, and less time is required to search the index. For large indices, the benefit of occasionally optimizing the index by rebuilding it with .B bt_load cannot be sufficiently emphasized. Storage savings up up to 50% can be realized, and search efficiency can be improved considerably. .LP To use .B bt_load, the function must first be called on an empty index, or one that is disposable, with the .B key, .B len, and .B rrn values equal to zero. The .B flag argument must be .B BT_BOF indicating that the file is to be initialized. This causes the tree to be re-initialized through a call to .B bt_zap. Once the first call the .B bt_load has been made, any number of subsequent calls can be made, with .B key, .B len, and .B rrn values to insert in the index. The .B flag argument should be set to .B BT_OK to indicate that the keys are valid keys. After all the keys have been inserted, a final call to .B bt_load must be made, with the .B key, .B len, and .B rrn equal to zero again, and the .B flag argument equal to .B BT_EOF indicating that the end of the load has been reached. At that time, further clean-up is performed, and the new b+tree index can be used normally. .LP For purposes of performance in running .B bt_load having a reasonable sized cache (about 3 spare pages) and the cache flags set to .B BT_RWCACHE will reduce load times. Increasing the cache by more than a moderate amount will not drastically improve load times, since the index is not searched during inserts. .LP During any of the calls to .B bt_load if an error condition is encountered, the usual .B BT_ERR will be returned. If the error is anything more serious than an oversized key, or zero-length key, the index will be unusable. .LP .B "int bt_zap(btree)" .br This function resets the b+tree index to an empty state, while retaining any set values for page size, etc. Note that this function will only free disk space that has been allocated on systems that have the .B "ftruncate(2)" system call. In order to free disk space on systems that do not have .B ftruncate the index file must be unlinked and re-created. .SH "MACROS" .LP Since these values are all macros, they should be used only with caution, to avoid side-effects. Mostly these should not be used by user-level code, but providing a common interface seemed better than the alternative. .LP .B "bt_fileno(btree)" points to the file descriptor (integer value) of the index. Users should not fiddle with this unless they know what they're about. .LP .B "bt_pagesiz(btree)" points to the size (integer value) of the b+tree index pages. .LP .B "bt_dtype(btree)" points to the data type if the index. It will be one of the values discussed in opening a b+tree. .LP .B "bt_cmpfn(btree)" points to the comparison function used by user-defined data type b+trees. It is possible to reset this on the fly, though the results are not under warranty. .LP .B "bt_errno(btree)" points to the integer error number value. Definitions of the error codes are in .B "btree.h". .LP .B "bt_clearerr(btree)" resets the index error value to indicate there is no error. This is performed automatically after every successful function call. .LP .B "BT_MAXK(btree)" yields an integer value that gives the largest possible size that a key can be, given the page size of the b+tree. This size is actually smaller than the largest size of a key that can really be fit in a page, but is calculated to reliably permit even splitting and promotion. .LP .B "BT_LABSIZ(btree)" yields an integer value that gives the largest possible amount of space that can be used for an index label. .SH "SEE ALSO" .SH "INTERNAL FUNCTIONS" .LP The following functions are internal functions used by the b+tree library. Care should be taken to avoid declaring functions with names that clash: .B bt_delpg, .B bt_dump, .B bt_freepage, .B bt_genprx, .B bt_inspg, .B bt_keyof, .B bt_newpage, .B bt_requeue, .B bt_rpage, .B bt_seekdown, .B bt_splpg, .B bt_srchpg, .B bt_wpage, .B bt_wsuper .LP In general, all the b+tree functions and macros are prefixed with .B bt_... and constants with .B BT_... .SH DIAGNOSTICS .LP The value .B BT_OK is returned whenever a function is successful. .LP The value .SM .B BT_ERR is returned to indicate some form of failure in any operation performed on a valid .B BT_INDEX. All valid b+tree indices contain their own error number that is set to indicate the cause of a failure, and can be accessed with the macro .B "bt_errno(btindex)" (where .B btindex is a valid b+tree index). A function .B "bt_perror(btindex,string)" (where .B string is a character pointer and .B btindex is a valid b+tree index) is provided, which prints an appropriate error message on the standard error. If the error is not b+tree-related, any existing system messages apropos existing conditions are printed by calling .B "perror()" Additionally, access to the error strings is available through the external array .B "bt_errs[]". Constant value codes for each error are defined in .B btree.h for symbolic reference. .LP Errors are cleared after every successful call to a function. They can also be cleared using the .B "bt_clearerr()" macro. .LP The value .SM .B NULL is returned to indicate that a .SM .B BT_INDEX pointer has not been initialized properly. Since no valid .B BT_INDEX is returned, .B "bt_perror()" cannot be used to determine the nature of an error. .LP The values .B BT_EOF and .B BT_BOF are returned if a call to .B "bt_traverse()" reaches the end or beginning of the index. .LP The value .B BT_NF is returned if a call to .B "bt_find()" fails to find the requested key, but encounters no error. .SH BUGS .LP There may be problems with pointer alignment on some architectures, especially if arbitrary structure alignment is not supported. .LP Due to the alignment problem, users defining their own data types must be careful to keep them of such a size that they align well, depending on architecture. Fixed-size user-defined structures will probably benefit by being padded out to some alignment boundary. Variable-size user-defined structures should perform thier own padding, even if it requires wasting some space. .SH LIMITATIONS .LP Every effort has been made to eliminate gratuitous limitations. There is no built-in limit to the depth allowed a tree, as an internal stack is maintained dynamically. There is no built-in limit to page sizes, numbers of keys, etc, except those inherent in the architecture. There is no fixed size to the internal cache, though there is a fixed minimum that will always be allocated for use as scratch buffers for page splitting, etc. When not being used as such, they are used to cache disk pages. Assume that at least 4 buffers will always be allocated. .SH ALGORITHMS .LP The algorithms used are basically the standard b+tree algorithm as described in countless texts. Some shortcuts are made. Since the keys can be variable length, the order of the tree is, perforce, variable. Splitting tries to fill roughly 1/2 of each page during a split, but with a deliberate bias towards the lower page in the split, since that is the one which may give up a key for promotion if the page is an internal one. Unlike some b+trees that have two different page structures for internal and leaf pages, this library uses the same structure for both, since no data is stored in the leaf. There is minor wastage as a result, but the size of the object code is kept down. .LP Deletes are not implemented in accordance with the b+tree algorithm, in that pages are not combined as they empty, nor are keys redistributed. A result is that search performance does not improve much as a tree empties, though it does not get worse either. Another side-effect is that while pages may empty out, if the index is re-filled, the inserts will be more efficient, since the likelihood of having to split a page drops. These assumptions hold true as long as the data is roughly randomly distributed across its range. .SH AUTHOR .LP Marcus J. Ranum