There's no such thing as a perfect hashing algorithm, my elders and betters tell me.
This is the fairly simple one I use:
function StrToHash(const s: AnsiString): Cardinal;
var
  i: integer;
  u: AnsiString;
begin
  u := UpperCase(s); // can't assign to a const parameter, so work on a copy
  result := Length(u) * 111;
  if u = '' then exit;
  for i := Length(u) downto 1 do
    result := result + (Ord(u[i]) * 2049); // a plain sum, so anagrams hash alike
end;
This produces an unsigned number which we can then MOD by the size of the hash table (in my case a big file of fixed-size records) to get a starting slot. If the slot is empty we simply put the data in. If it's not, we have to follow the chain off this point until we find an empty slot, then link the previous entry down to this index.
It's quite complex to code the insertion:
function AddStudentRecord(hfile: TFileStream; thiskey: AnsiString; staffkey: AnsiString; thisini: AnsiString): boolean;
var
  initial, index, traverse, j, previous: cardinal;
  recfind, rectraverse, recnew: THashRecord;
begin
  result := false;
  if not assigned(hfile) then exit;
  FillChar(recnew, HashRecordSize, #0);
  for j := 1 to Length(thiskey) do
    recnew.key[j - 1] := thiskey[j];
  for j := 1 to Length(staffkey) do
    recnew.staffid[j - 1] := staffkey[j];
  if Length(thisini) > 0 then
  begin
    for j := 1 to Length(thisini) do
      recnew.ini[j - 1] := thisini[j];
    recnew.hasIni := 1;
  end;
  recnew.isActive := 1;
  WLog(thiskey);
  // compute the primary hash key
  initial := StrToHash(thiskey);
  index := initial mod HashTableMax;
  hfile.Seek(index * HashRecordSize, soBeginning);
  hfile.Read(recfind, HashRecordSize);
  // do we have a collision?
  if recfind.isActive = 1 then
  begin
    inc(collisions);
    previous := index;
    repeat
      traverse := (previous + 1) mod HashTableMax;
      WLog('bucket: ' + IntToStr(traverse));
      hfile.Seek(traverse * HashRecordSize, soBeginning);
      hfile.Read(rectraverse, HashRecordSize);
      if rectraverse.isActive = 1 then previous := traverse;
    until rectraverse.isActive <> 1;
    // now put the new data in and update the pointer in the last traversed record
    WLog(format('previous = %d current = %d', [previous, traverse]));
    hfile.Seek(traverse * HashRecordSize, soBeginning);
    hfile.Write(recnew, HashRecordSize);
    hfile.Seek(previous * HashRecordSize, soBeginning); // should now point at the previous active record
    hfile.Read(rectraverse, HashRecordSize);
    rectraverse.uplink := traverse;
    hfile.Seek(previous * HashRecordSize, soBeginning); // could wind back instead, but this builder is not time-critical
    hfile.Write(rectraverse, HashRecordSize);
  end
  else
  begin
    // simple insertion: the home slot is free
    hfile.Seek(index * HashRecordSize, soBeginning);
    hfile.Write(recnew, HashRecordSize);
    WLog('simple: ' + IntToStr(index));
  end;
  result := true;
end;
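Since the post only shows insertion, here's a hedged sketch of the same scheme in Python for anyone who wants to play with it: hash, probe forward on a collision, link the chain down to the new slot. The lookup probes from the home slot until the key or an empty slot turns up (it assumes records are never deleted). Names like `HASH_TABLE_MAX`, `is_active` and `uplink` just mirror my Pascal identifiers, and an in-memory list stands in for the file:

```python
HASH_TABLE_MAX = 8  # tiny table for illustration; the real file is far larger

def str_to_hash(s):
    # Python port of StrToHash: length * 111 plus 2049 * each character code
    s = s.upper()
    return len(s) * 111 + sum(ord(c) * 2049 for c in s)

def make_table():
    return [{"key": None, "is_active": 0, "uplink": None}
            for _ in range(HASH_TABLE_MAX)]

def add_record(table, key):
    index = str_to_hash(key) % HASH_TABLE_MAX
    if table[index]["is_active"] != 1:
        table[index] = {"key": key, "is_active": 1, "uplink": None}
        return index
    # collision: probe forward for a free slot (no full-table guard in this
    # sketch), remembering the last active slot we passed
    previous = index
    traverse = (previous + 1) % HASH_TABLE_MAX
    while table[traverse]["is_active"] == 1:
        previous = traverse
        traverse = (previous + 1) % HASH_TABLE_MAX
    table[traverse] = {"key": key, "is_active": 1, "uplink": None}
    table[previous]["uplink"] = traverse  # link the chain down to the new slot
    return traverse

def find_record(table, key):
    # probe from the home slot; an empty slot means the key isn't there
    index = str_to_hash(key) % HASH_TABLE_MAX
    for _ in range(HASH_TABLE_MAX):
        rec = table[index]
        if rec["is_active"] != 1:
            return None
        if rec["key"] == key:
            return index
        index = (index + 1) % HASH_TABLE_MAX
    return None
```

Note that because StrToHash is a plain sum, anagrams like "AB" and "BA" always collide, which makes them handy for testing the probe path.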
It's absolutely not worth going to all this trouble for small lists. If the data is sorted, a binary search is very fast. If it's coming in randomly, a tree can be good, but if the input is already sorted a plain binary tree just degenerates into a list.
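For the sorted-array case, binary search is a few lines; a sketch in Python using the stdlib `bisect` module (the Pascal equivalent is the usual low/high/mid loop):

```python
from bisect import bisect_left

def find_sorted(items, key):
    # binary search: O(log n) comparisons into an already-sorted list
    i = bisect_left(items, key)
    if i < len(items) and items[i] == key:
        return i
    return None
```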
Comparing an unsigned number is always massively faster than comparing strings, so for fewer than (say) 1000 items you can keep two parallel arrays - the strings and their hash codes (unsigned integers) - in sync and:
* Compute the hash code for the string you're searching for.
* Traverse the integer array. Only when a number matches do you compare the strings.
* If you found it - fine. If not, either add it or report that it wasn't found.
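The steps above can be sketched like this (Python rather than Pascal, and the helper names are mine; the point is that the expensive string comparison only happens on a hash-code match):

```python
def str_to_hash(s):
    # same shape as the Pascal StrToHash: case-insensitive sum of char codes
    s = s.upper()
    return len(s) * 111 + sum(ord(c) * 2049 for c in s)

def add_synced(strings, hashcodes, s):
    # keep the two parallel arrays in sync
    strings.append(s)
    hashcodes.append(str_to_hash(s))

def find_with_hashcodes(strings, hashcodes, target):
    h = str_to_hash(target)
    for i, code in enumerate(hashcodes):
        # cheap integer compare first; string compare only on a match
        if code == h and strings[i].upper() == target.upper():
            return i
    return -1  # not found; caller can add it or report failure
```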
A huge amount depends upon how many data items you want to compare, so there's no easy answer to which algorithm to use. The same thing applies to sorting. If you have <50 items, a Bubble Sort may actually be simpler and quicker than a Quicksort.
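And for completeness, the small-list case really is just a few lines - a bubble sort sketch in Python (in practice you'd reach for the library sort, but this shows how little code the "simpler" option needs):

```python
def bubble_sort(a):
    # in-place bubble sort with early exit when a pass makes no swaps
    n = len(a)
    for end in range(n - 1, 0, -1):
        swapped = False
        for i in range(end):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                swapped = True
        if not swapped:
            break  # already sorted
    return a
```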
-- Jim - When is there going to be a release?