Sunday, April 20, 2008

Trie !!

Trie is a special kind of a tree (not necessarily binary, although it could be) that, given a finite set of alphabet, stores the words of its language in a very efficient way. It first begin by an empty node and then adds new nodes to this initial node. The maximum child nodes size is the size of alphabet and the maximum level of the tree is the length of the longest word in the language. For example, if the alphabet contains 26 letters and the longest word in it is of size 3, the Trie will have 3 levels with each of its nodes has at maximum 26 child nodes. This gives us a total of 26*26*26 nodes to store all possible combinations of 3 letters from letter0 to letter25. This is very efficient as there is no duplication in the Trie. If the language has these two words "for" & "form", they won't take extra space. Instead of storing them as a separate words, they will share all the common letters, and then a flag or something will be set to 1 when we reach the 'r' to indicate an end of a word. Then continue the word "form" till you reach the 'm' and set the flag to be 1 to indicate an end of another possible word. What if we want to add another word like "format", we will just add another child 'a' to that 'm' node we obtained in the previous example and add a new child 't' to it and set its flag to be 1 and so on.

Why is Trie so efficient ??!!
Imagine in the previous example that we have a language that has 3 words "for", "form" and "format", a typical way to store these words is to build a hash table or an array of strings or a set or any other data structure that will store each of these words in a separate place. There will be 3 strings of lengths 3, 4 and 6 which will sum up to 13 byte memory space. The concept of a Trie is not to "repeat" already existed nodes in the tree. There can't be 2 children of one parent having the same value as we saw in the example. So, total used memory for the words above is actually 6 bytes which is the length of the longest of them (of course we can use extra memory indicating end of words and so on). If we would like to build a dictionary for a complete language with all its words, we will find so many common prefixes for the words. Using a Trie in this case will save a lot of memory and a lot of time when searching.
Another example to make you feel the importance of using Trie, consider a language of only one alphabet which is letter 'a'. Words of this language can have length of at most 1000 letters (all 'a's of course). A Trie can represent such a language using 1000 nodes --> 1000c byte (c is the size of the node which is about 8 bytes in this case, 4 bytes to point to the node child and another 4 bytes to hold how many times this word is present in the dictionary or 0 if it's not a word). Any of the other possible ways that will represent each word on its own will need 1 + 2 + 3 + ... + 1000 bytes --> (1000 * 1001) / 2 bytes = 500500 bytes. Using simple math, 8000 / 500500 = 0.016 which means that Tries use 0.016 of the space used in a hash table or an array.

Tries also sorts the words in it & it actually doesn't need to do extra work to do so. If the alphabet is {a, b} letters then you make sure that when you try to add a word beginning with 'a' that you put it in the first child node and when you try to add a word beginning with b you put it in the second child node and so on. Hence, when searching the Trie after adding all required words (in no particular order) the search results will explore the words in their order without having to spend extra time sorting them.

For more information about Tries, you can visit Wikipedia, or find a c++ implementation for a Trie here. It already has the implementation of adding and deleting from the Trie. You can try to do the Search function as a practice, and here is a couple of problems to exercise.
902 - Password Search (You will never be able to solve this problem without tries :D. Anything else will get you Time Limit Exceeded)
10999 - Crabbles
(Never tried to do it without Tries actually. Don't know if it can be solved in any other way)

There is no wonder they always call it Data Structure & Algorithms as the development of some very efficient algorithms totally depend on how the data is structured in memory.

1 comment:

Anonymous said...

Hi.
i am a student like you.(i love acm and..)
i start to solve acm Problems.but
902 password Search!. i try to solve it. after so many CompE and RE (because gnu c++ compiler)limit...! i know that must use hash or map. but i have no any data about them.(i search net without any result)please help me.
arfn1366@yahoo.com