YetAnotherForum
Welcome Guest Search | Active Topics | Log In | Register

String hash questions
gage
#1 Posted : Wednesday, January 05, 2011 3:41:04 PM(UTC)
Rank: Advanced Member

Groups: Registered
Joined: 11/4/2010(UTC)
Posts: 30

Thanks: 0 times
Was thanked: 1 time(s) in 1 post(s)
Not sure which forum is the most appropriate so ill try here :)

Was there a common algorithm used for the hashing of strings fagiano? I ask because I'm interested in a string pooling structure and would like to be able to have my strings hash to the same values they would in squirrel so I only have to hash once, and then use a function like sq_getstringhash() to pull the hash out. Not sure if SQ 2.2 has a function like that but i dont have squirrel.h in front of me atm.

My current string hashing is done via CRC32 encoding and I'm not sure what method you are using in the API, have you had any issues with collisions?
cue
#2 Posted : Wednesday, January 05, 2011 6:11:55 PM(UTC)
Rank: Advanced Member

Groups: Registered
Joined: 1/3/2011(UTC)
Posts: 60
Man

Thanks: 0 times
Was thanked: 4 time(s) in 4 post(s)
I'm not Alberto, but the hash function used can be found in sqstring.h:

Code:

inline SQHash _hashstr (const SQChar *s, size_t l)
{
        SQHash h = (SQHash)l;  /* seed */
        size_t step = (l>>5)|1;  /* if string is too long, don't hash all its chars */
        for (; l>=step; l-=step)
            h = h ^ ((h<<5)+(h>>2)+(unsigned short)*(s++));
        return h;
}


The collision strategy can be found in sqtable.cpp within the NewSlot method. From the first look it seems like a simple linear search for the next free position.
gage
#3 Posted : Wednesday, January 05, 2011 7:31:41 PM(UTC)
Rank: Advanced Member

Groups: Registered
Joined: 11/4/2010(UTC)
Posts: 30

Thanks: 0 times
Was thanked: 1 time(s) in 1 post(s)
So does that mean its not unique (or at least unique enough?) that the hash is just used as a basic index, and if a collision is found you chain it? If so I was hoping the hash would have low chances of collision but no worries, i have a second strategy ill ponder, where ill go ahead and hash the string myself(CRC32) but store a squirrel string table reference as well as the C++ string itself.

In this fashion if i need to push a string into squirrel i just hash it in the table, find the SQOBJECT and push that, should be much faster than sq_pushstring i think. I'll have to investigate being able to convert the squirrel string into my hash without having to compute it every time i pop from the stack, but ill take a look... :P
fagiano
#4 Posted : Thursday, January 06, 2011 12:49:44 AM(UTC)
Rank: Advanced Member

Groups: Registered, Administrators
Joined: 6/11/2005(UTC)
Posts: 825

Thanks: 0 times
Was thanked: 36 time(s) in 29 post(s)
The hashes are unique enough :)... but they are not unique. if you don't have a 4 billion slots hash table you can't really do much better with a CRC. Immagine if you have a 4 slots table even if you have a unique 32 bits hash you still have 25% collision chance.

Saving an SQOBJECT somewere instead of sq_pushstring is much faster, sq_pushstring is pricy. I cache most of the strings I use for lookup.

ciao
Alberto
gage
#5 Posted : Thursday, January 06, 2011 3:50:04 AM(UTC)
Rank: Advanced Member

Groups: Registered
Joined: 11/4/2010(UTC)
Posts: 30

Thanks: 0 times
Was thanked: 1 time(s) in 1 post(s)
Im not too concerned with collisions as long as they are reported somewhere. what algorithm are you using? looks like the justin sobel hash. On the most recent title I shipped we actually had to jump to CRC64 due to some collisions we were getting with 32bit crc hash
fagiano
#6 Posted : Friday, January 07, 2011 1:32:47 AM(UTC)
Rank: Advanced Member

Groups: Registered, Administrators
Joined: 6/11/2005(UTC)
Posts: 825

Thanks: 0 times
Was thanked: 36 time(s) in 29 post(s)
I'm not sure what's the algorithm name, is the same as Lua 4.1 my hash table code started from that. What do you mean with collisions should be reported?

Alberto
gage
#7 Posted : Friday, January 07, 2011 4:50:06 AM(UTC)
Rank: Advanced Member

Groups: Registered
Joined: 11/4/2010(UTC)
Posts: 30

Thanks: 0 times
Was thanked: 1 time(s) in 1 post(s)
only reported if two different strings hashed to the same value and therefore some code would end up working incorrectly, say if the string == operator just compared the hash number instead of the char* array. I haven't dug into your strings much so I may just be talking about a problem that doesn't exist, just thinking out loud earlier :)
fagiano
#8 Posted : Friday, January 07, 2011 11:44:55 PM(UTC)
Rank: Advanced Member

Groups: Registered, Administrators
Joined: 6/11/2005(UTC)
Posts: 825

Thanks: 0 times
Was thanked: 36 time(s) in 29 post(s)
strings are unique, this is enforced at creation time. At creation time if a hahs collision occurs I also compare the content byte by byte. At runtime because they are unique, I simply compare the pointer, it the pointer is the same then the string is the same. (see SQStringTable) This approach is very common, squirrel, lua, python and many other languages approach strings this way. It implies slow creation but very fast look up and comparison.

I hope this clarifies
ciao
Alberto
gage
#9 Posted : Saturday, January 08, 2011 8:44:27 AM(UTC)
Rank: Advanced Member

Groups: Registered
Joined: 11/4/2010(UTC)
Posts: 30

Thanks: 0 times
Was thanked: 1 time(s) in 1 post(s)
actually thats perfect then, i don't mind slower creation time as long as the runtime use has consistent results
Users browsing this topic
Guest
Forum Jump  
You cannot post new topics in this forum.
You cannot reply to topics in this forum.
You cannot delete your posts in this forum.
You cannot edit your posts in this forum.
You cannot create polls in this forum.
You cannot vote in polls in this forum.

Clean Slate theme by Jaben Cargman (Tiny Gecko)
Powered by YAF 1.9.4 | YAF © 2003-2010, Yet Another Forum.NET
This page was generated in 0.104 seconds.