30 Jan, 2011, JohnnyStarr wrote in the 1st comment:
Votes: 0
Recently there was a discussion about whether std::map is a hash table or a BST. I don't remember the exact thread,
but someone clarified that it uses a tree. After a little research it appears to use a red-black tree, which is a
self-balancing BST. I've enjoyed researching these concepts.

My long term goal is to develop my own language in ANSI C. I am considering implementing my own associative
array for obvious reasons. I am trying to weigh the benefits of an RBT vs. a hash table.

Any thoughts? Or, are there any resources out there that you have seen? Such as available libraries with liberal licenses?
30 Jan, 2011, JohnnyStarr wrote in the 2nd comment:
Votes: 0
By the way, if anyone is interested in learning a bit about red-black trees, this lecture is pretty straightforward.
http://www.youtube.com/watch?v=JRsN4Oz36...
30 Jan, 2011, jurdendurden wrote in the 3rd comment:
Votes: 0
Thanks as always for sharing Johnny, intriguing business there, I'm very interested to have a go using your language once it's in a workable state. By the way, diggin the new avatar. :)
30 Jan, 2011, Scandum wrote in the 4th comment:
Votes: 0
Sorted arrays are an option as well for log N lookups, there may be others.
30 Jan, 2011, David Haley wrote in the 5th comment:
Votes: 0
Sorted arrays are, most of the time, a rather poor choice for implementing associative containers. You get less overhead than a tree for the same logarithmic performance, but having to resize and resort the array really hurts. It's a fine implementation for a static (i.e. unchanging) container.

A problem with the hash table implementation is that you need a decent hash function in order to get good performance. With a poor enough hash function, you might as well be using a linked list to implement your container. But, with a good hash function, you will get better performance with a hash table (assuming that you have enough hash buckets) than you would with a tree.

I guess it's not surprising that the answer here is that it depends on how exactly you implement these things. :smile:
31 Jan, 2011, JohnnyStarr wrote in the 6th comment:
Votes: 0
I still haven't figured out how the hash tables in MERC / ROM work yet. It seems they use the % operator and such.
Do they use a hash function as well? Are there measures to avoid collisions?

I want a hash/map structure mainly to have (string, string) pairs.
31 Jan, 2011, David Haley wrote in the 7th comment:
Votes: 0
The way (almost) all hash tables work is basically the following:

hash bucket = hash_function(element) % num_buckets

this is because you need to make sure that the numeric value returned by the hash function is in fact within your bucket range.

If memory serves, one scheme (open addressing) handles collisions by moving on to the next bucket. But it's been a while so don't quote me on that (as it were). A more common way of handling collisions is chaining: each bucket is a list of the elements that hash into it.

I'm pretty sure you could find all sorts of good string hashing functions out there. Of course it depends on what exactly you're storing in those strings, but you shouldn't have much trouble with hashing strings.
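For what it's worth, one widely circulated string hash is djb2 (Dan Bernstein's). A minimal sketch, with 101 buckets chosen purely as an example:

```c
/* djb2, a well-known string hash -- one option among many */
unsigned long djb2_hash(const char *s)
{
    unsigned long hash = 5381;
    int c;

    while ((c = (unsigned char)*s++) != 0)
        hash = hash * 33 + c;   /* i.e. ((hash << 5) + hash) + c */
    return hash;
}

/* map the raw hash into the bucket range, as described above */
unsigned long djb2_bucket(const char *s, unsigned long num_buckets)
{
    return djb2_hash(s) % num_buckets;
}
```

The magic constants 5381 and 33 aren't special beyond having been found to distribute typical string data well in practice.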
31 Jan, 2011, Tyche wrote in the 8th comment:
Votes: 0
JohnnyStarr said:
I still haven't figured out how the hash tables in MERC / ROM work yet. It seems they use the % operator and such.
Do they use a hash function as well? Are there measures to avoid collisions?

I want a hash/map structure mainly to have (string, string) pairs.


The ROM room hash is an array of single linked lists.
The hash function is the vnum modulo the array size.
Collision resolution is done by searching through the linked list.
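The scheme Tyche describes boils down to something like this sketch (the names and the array size are illustrative, not ROM's actual identifiers):

```c
#include <stddef.h>

#define ROOM_HASH 1024   /* illustrative size; ROM uses its own constant */

typedef struct room_data {
    struct room_data *next;   /* singly linked collision chain */
    int vnum;
} ROOM_DATA;

/* static storage, so every chain head starts out NULL */
static ROOM_DATA *room_index_hash[ROOM_HASH];

/* insert at the head of the chain for vnum % ROOM_HASH */
void add_room(ROOM_DATA *room)
{
    int bucket = room->vnum % ROOM_HASH;
    room->next = room_index_hash[bucket];
    room_index_hash[bucket] = room;
}

/* "collision detection" is just walking the chain comparing vnums */
ROOM_DATA *get_room(int vnum)
{
    ROOM_DATA *room;
    for (room = room_index_hash[vnum % ROOM_HASH]; room != NULL; room = room->next)
        if (room->vnum == vnum)
            return room;
    return NULL;
}
```

Two vnums that differ by a multiple of the array size land in the same bucket, which is exactly the collision case the chain walk handles.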
01 Feb, 2011, Scandum wrote in the 9th comment:
Votes: 0
Using a two dimensional sorted array you get O(sqrt n) insertion, which will give good performance for lists up to about a million elements.

I'd say the biggest problem with using a hash table is that resizing a hash table is a taxing operation.

The biggest downside of an rb tree is the complexity, the internet is littered with half finished example implementations. If you go that route it's probably best to use a library.

A skip list is easier to implement and the balancing algorithm is pretty straight forward if you keep track of node lengths.
01 Feb, 2011, David Haley wrote in the 10th comment:
Votes: 0
How would a two dimensional sorted array work?
What are measurements that establish one million elements as a breaking point?
How does such a data structure compare in performance to other implementations, like a hash table?

Resizing a hash table is not really any more taxing than resizing a normal array, unless your hash function is expensive (and it shouldn't be).

RB trees are indeed non-trivial to implement.
01 Feb, 2011, Scandum wrote in the 11th comment:
Votes: 0
One sorted array would function as a table of buckets, its size being the square root of the total number of elements. Each bucket would contain a sorted array. If a bucket exceeds 1.5 times the optimal size it is split; if it falls below 0.5 times the optimal size, two buckets are merged. The optimal size is (of course) the square root of the total number of elements.
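To make the lookup side of this concrete, here is a rough sketch of how such a structure could be searched, based on my reading of the description above. All names are made up, and insertion plus the split/merge logic are omitted:

```c
/* Top-level array of buckets (ordered by each bucket's first element),
 * each bucket holding a sorted array.  Lookup = two binary searches. */
typedef struct bucket {
    int *elems;   /* sorted ascending; count >= 1 */
    int  count;
} Bucket;

typedef struct sarray2d {
    Bucket *buckets;   /* sorted by elems[0]; nbuckets >= 1 */
    int     nbuckets;
} SArray2D;

/* binary search for the last bucket whose first element is <= key */
static int find_bucket(const SArray2D *a, int key)
{
    int lo = 0, hi = a->nbuckets - 1;
    while (lo < hi) {
        int mid = lo + (hi - lo + 1) / 2;
        if (a->buckets[mid].elems[0] <= key)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}

/* ordinary binary search within the chosen bucket */
int sarray2d_contains(const SArray2D *a, int key)
{
    const Bucket *b = &a->buckets[find_bucket(a, key)];
    int lo = 0, hi = b->count - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (b->elems[mid] == key)
            return 1;
        if (b->elems[mid] < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return 0;
}
```

With sqrt(n) buckets of about sqrt(n) elements each, both searches are O(log sqrt(n)) and an insert's memmove touches only one bucket of about sqrt(n) elements, which matches the O(sqrt n) insertion claim.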

Due to the lower memory requirement (a third less), ease of caching, and lower overhead (you'll be able to use memmove when inserting and deleting elements), operations on a 2d sorted array are going to be faster than on an rb tree, let's say 10 times faster (though a benchmark would be required to establish whether it's more or less than this).

An rb tree with 1M elements will require 20 operations, but due to overhead this might be equivalent to 200 operations for a 2d array. A 2d sorted array would require 1000 operations, making it 5 times slower. (A 3d sorted array would require 100 operations; though with a larger coefficient).

In comparison, an rb tree with 256 elements will require 8 operations, increased to an equivalent of 80, while a 2d sorted array would require 16 operations, making it 5 times faster.

Resizing a hash table is far more taxing than a single memmove operation on an array of pointers. Big O doesn't take this into consideration, but the coefficient does matter in practice.
02 Feb, 2011, David Haley wrote in the 12th comment:
Votes: 0
Out of curiosity, have you implemented such a data structure? There are a number of red flags that seem to make it a rather impractical solution. For instance, it seems difficult to merge buckets – how do you know where to find things later?

What is this structure, exactly? It sounds like you're making a hash table out of a 2d array. How does this work? How do you look something up? Why are you sorting?

Do you have references for this data structure? It seems like a rather unusual one so if you have further reading material it would be nice.

Do you have benchmarks for these performance claims? I think it would be easiest, since we are talking about actual implementations after all, to see some hard numbers if we're going to discuss performance metrics.

Why is resizing your 2d array just a question of moving around pointers with memmove? (What pointers?)
03 Feb, 2011, Scandum wrote in the 13th comment:
Votes: 0
I'll probably implement this for fun sometime and see how well it works.
04 Feb, 2011, David Haley wrote in the 14th comment:
Votes: 0
So this is not a data structure that you've seen before? I'd never heard of it, so I was just curious where the suggestion came from… if you do end up implementing it (or have references to it) I'd be happy to take a look.
04 Feb, 2011, Scandum wrote in the 15th comment:
Votes: 0
Haven't seen it before, nor have I been able to find anything remotely like it on Google. I'm not sure why the data structure hasn't been further explored; possibly because students are taught that sorted arrays are bad, and because of the common belief that writing to RAM is slower than reading from RAM (only twice as slow, I think). As far as Wikipedia is concerned, a sorted array doesn't exist.

As the data structure needs to scale it can't be implemented inside an actual two or three dimensional array. The best name I can come up with is an exponential search array, which gives zero hits.

There's probably an implementation detail I'm overlooking that makes the data structure impractical, but a bare-bones implementation shouldn't take too long.
04 Feb, 2011, JohnnyStarr wrote in the 16th comment:
Votes: 0
Not to interrupt here, but I have a question:

I have developed a hash table structure for my micro language project "pug".
There is more code than the following, but I think you should be able to figure out the problem.

It contains an array of list pointers:
typedef struct phash_list {
    struct phash_list *next;
    struct phash_list *prev;
    char *key;
    void *val;
} PHashList;

typedef struct phash {
    unsigned int count;
    PHashList *tbl[MAX_HASH];
} PHash;


My main struct "PHash" is allocated with pugH_new():
PHash* pugH_new() {
    PHash* h = malloc(sizeof(PHash));
    h->count = 0;
    return h;
}


Now, we have two functions, "insert" and "rawlookup".
The error shows up in rawlookup, because on lookup we iterate through all the PHashList pointers at that
array index.
/* main hash function */
static unsigned int gethash(char* str) {
    unsigned int hash = 1315423911;
    unsigned int i = 0;
    unsigned int len = strlen(str);
    for (i = 0; i < len; str++, i++)
        hash ^= ((hash << 5) + (*str) + (hash >> 2));
    return hash % MAX_HASH;
}

void pugH_insert(PHash *h, char *k, void *v) {
    unsigned int hash = gethash(k);
    PHashList *l = PHashList_new(k, v);
    l->next = h->tbl[hash];
    h->tbl[hash] = l;
    h->count++;
}

void* pugH_rawlookup(PHash* h, char* k) {
    PHashList *l;
    for (l = h->tbl[gethash(k)]; l != NULL; l = l->next) {
        if (!strcmp(k, l->key)) /* error! */
            return l->val;
    }
    return NULL;
}


The code here works as long as the PHash contains the entry you look up. However, if you look up a key
whose bucket was never written to, dereferencing the PHashList* gives an access violation. I am not certain how to avoid this.
04 Feb, 2011, JohnnyStarr wrote in the 17th comment:
Votes: 0
* Supplemental:

MAX_HASH is currently 101.
I understand that the array of pointers (PHash->tbl[MAX_HASH]) isn't initialized.
Would I have to initialize the head pointer of each array index? I guess because PHash is allocated with malloc(),
anything within it would need to be initialized before use?
04 Feb, 2011, David Haley wrote in the 18th comment:
Votes: 0
You should be allocating buckets dynamically, not with a fixed size. You should be able to grow the number of buckets.

Yes, you need to initialize the array of pointers. Each list head should be NULL initially. After the malloc, you need to zero out the entire array of pointers.

Any reason you're using C and not C++ here?
04 Feb, 2011, JohnnyStarr wrote in the 19th comment:
Votes: 0
ANSI C as a challenge. :biggrin:

Buckets refer to the PHashList pointers, right?
When you say initialize the array of pointers, you mean iterate through each index in the array, right?

Seems like initialization and deletion is going to be costly.
05 Feb, 2011, David Haley wrote in the 20th comment:
Votes: 0
You can either iterate through the array and initialize each pointer individually, or you can use a function like memset to set everything to zero in one go.

By the way, your insertion function has a bug: if you insert a key that already exists, you should not be doing the same thing as when the key is new. E.g., do not increment the count.
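Putting the two fixes from this comment and the one before it together against the structs posted earlier, a corrected constructor and insert might look like the sketch below. PHashList_new wasn't shown in the original post, so the version here is an assumption about how nodes get built:

```c
#include <stdlib.h>
#include <string.h>

#define MAX_HASH 101

typedef struct phash_list {
    struct phash_list *next;
    struct phash_list *prev;
    char *key;
    void *val;
} PHashList;

typedef struct phash {
    unsigned int count;
    PHashList *tbl[MAX_HASH];
} PHash;

static unsigned int gethash(char *str) {
    unsigned int hash = 1315423911;
    for (; *str != '\0'; str++)
        hash ^= ((hash << 5) + (*str) + (hash >> 2));
    return hash % MAX_HASH;
}

/* assumed node constructor: copies the key, stores the value pointer
 * (the original PHashList_new wasn't shown) */
static PHashList *PHashList_new(char *k, void *v) {
    PHashList *l = malloc(sizeof(PHashList));
    l->next = l->prev = NULL;
    l->key = malloc(strlen(k) + 1);
    strcpy(l->key, k);
    l->val = v;
    return l;
}

/* fix #1: zero the bucket array so every chain starts out empty (NULL) */
PHash *pugH_new(void) {
    PHash *h = malloc(sizeof(PHash));
    h->count = 0;
    memset(h->tbl, 0, sizeof(h->tbl));
    return h;
}

/* fix #2: an existing key is updated in place instead of being inserted
 * again, so count is only bumped for genuinely new keys */
void pugH_insert(PHash *h, char *k, void *v) {
    unsigned int hash = gethash(k);
    PHashList *l;
    for (l = h->tbl[hash]; l != NULL; l = l->next) {
        if (strcmp(k, l->key) == 0) {
            l->val = v;
            return;
        }
    }
    l = PHashList_new(k, v);
    l->next = h->tbl[hash];
    h->tbl[hash] = l;
    h->count++;
}

void *pugH_rawlookup(PHash *h, char *k) {
    PHashList *l;
    for (l = h->tbl[gethash(k)]; l != NULL; l = l->next)
        if (strcmp(k, l->key) == 0)
            return l->val;
    return NULL;   /* safe now: empty buckets are NULL, not garbage */
}
```

Using calloc(1, sizeof(PHash)) instead of malloc plus memset would achieve the same zeroing in a single call.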