C++ Tip Of The Day: How does std::unordered

What are the differences between std::set and std::unordered_set ?

Here is a graphical and quantitative explanation of

how unordered_set works,
how fast it can access data,
how much memory it occypies, and
how it compares to std::set.

Exactly how many steps does it take unordered_set to access its data ?

I filled an unordered_set with random numbers. I've plotted a graph of how the data are distributed in the unordered_set.

I counted the number entries in each bucket, while entering random values in the unordered_set.

empty and filled buckets, as a percentage of number of entries in the hash table

I used this code to get the above graph.
You can see that:

50% of buckets contain 1 entry
17% of buckets contain 2 entries
3% of buckets contain 3 entries
0.4% of buckets contain 4 entries

In other words,

40%-60% of values in a hash table is in a bucket with 1 entry
30%-40% of values in a hash table is in a bucket with 2 entries
9% of values in a hash table is in a bucket with 3 entries
1.8% of values in a hash table is in a bucket with 4 entries
0.3% of values in a hash table is in a bucket with 5 entries

So, for the 97% of the values in a hash table it takes (on average) 1.3 steps to access them.

calculation method:

50% of values can be accessed in 1 step (avg = 1)

35% of values can be accessed in 1-2 steps (avg = 1.5)

9% of values can be accessed in 1-3 steps (avg = 2)

therefore avg number of steps = 50%*1 + 35%*1.5 + 9%*2 = 1.295

Why are there peaks in the graph?

Each time the number of entries in the set reached the bucket count, the number of buckets are doubled and the table is rehashed. Note, however, that rehashing does not copy the actual entries of the unordered_set, it only redistibutes the nodes of the linked lists of the bucket.

Why are there more than 100% empty buckets?

The number of buckets varies 1-2 times the number of actual entries that the set stored. About half the buckets in the table are generally empty. Having a lot of empty buckets is important because it increases the probability of a value not entering a full bucket.

How much space does std::unordered_set occupy?

The bucket count varies from 1-2 times the actual entries.

Each bucket contains a forward_list.

The size of a forward list is just one pointer to its first node.

The size of each node in the list is:

- the size of data itself, plus

- the size of a pointer pointing to the next list node.

Summung up, an unordered_set of n-entries may occupy

#bytes = bucket_count * sizeof(forward_list) +
n * (sizeof(data) + sizeof(forward_list_node))

therefore # of bytes may be

from n*(16+sizeof(data))
to n*(32+sizeof(data))

depending on the current bucket count (n to 2n). Assuming sizeof(void*) is 8 bytes.

How much space does a std::set occupy?

Each node of a std::set has a left, rught, and parent pointer. Plus the data. Thus:

#bytes = n * (3*sizeof(tree_node) + sizeof(data)

therefore # of bytes in a set is

exactly n*(24+sizeof(data))

Assuming sizeof(void*) is 8 bytes.

How does memory and performance compare between a std::set and a std::unordered_set ?

On average, you can expect same memory footprint for both std::set and std::unordered_set. Although unordered_set may vary, if you know beforehand the number of entries you're going to store,, you can reserve() the appropriate size, and you'll be better off than using a std::set.

Performance-wise,

std::unordered_set has O(1) complexity for insert and search, while
std::set has O(logN) complexity.

C++ Tip Of The Day

Thursday, November 7, 2013

How does std::unordered_set compare to std::set ?