What is the difference between an array, an ordered collection of integer-indexed values, and a hash, a collection of key-value pairs, where every key is an index that starts from zero?
arr = [1,2,3,4]
hash = { 0 => 1, 1 => 2, 2 => 3, 3 => 4}
I know that it would be stupid to implement a hash like that, but what is the difference behind the hood? Are they stored in different ways in memory? What data structure makes your program faster when retrieving data from it?
CodePudding user response:
Speed
Arrays are a better for iterating over a collection.
Hashes have almost constant lookup time O(log n)
, while Arrays have linear time O(n)
.
This means that you should prefer Hashes when you need to use #include?
.
Hashes work great as dictionaries.
There is a third underutilised data structure in Ruby which behaves like a unique Array
but uses a Hash
underneath. This is Set
. It works great when you don't care about the order of elements, need to check if something belongs to a group or need a unique collection of elements.
Implemetation
Yes, they are two completely different data structures.
In this particular example you could access these elements in a similar manner by calling arr[0]
and hash[0]
.
The line arr[0]
is in fact sugar syntax for arr.[](0)
. Same goes for arr[0] = 5
which is the same as arr.[]=(0, 5)
. Both Array
and Hash
have methods called []
and []=
, but under the hood completely different things would happen, because these are two separate methods with the same name.
Arrays are always indexed with integers starting with 0 and the order of items is important.
Array
is a simple list of elements backed by a dynamic C array, while a Hash
is an implementation of a dictionary/hashmap, a much more sophisticated data structure.
Hashes store both keys and values, and possess the ability to retrieve a value based on the received key. Every single Ruby object that implements #eql?
and #hash
can become a Hash
key.
The order of elements in a Hash
can change and is not stable.
Every Hash
is in fact an array of slots called buckets. These buckets hold individual key-value pairs. To determine which key should correspond to a certain bucket, Ruby uses a hashing function, provided by the method called #hash
on the object passed as the key. If two objects return the same value from their #hash
method, they are considered the same key.
This goes both ways, Ruby uses #hash
to save the key-value pair in the right bucket, and to find the right value
for the passed key
.
For example "string".hash
will always return the same value, even though these are different objects.
This value is calculated based on the content of the String
. This is why String
s with the same content (even though they are different objects) are considered the same key in a Hash
.
Sometimes two different objects (of separate classes) end up having the same #hash
. Ruby handles these collisions by implementing individual buckets as linked lists of key-value pairs. So a few key-value pairs may be stored in the same bucket.
To decide which value should be returned for a given key, when there are a few key-value pairs in the same bucket, Ruby utilises the eql?
method.
It compares each key stored in the bucket with the passed key like so passed_key.eql?(key_in_bucket)
. When they are equal Ruby considers them a match.
Here's a great article on Hashes.
CodePudding user response:
Almost every object in Ruby has a hash
method. This method calculates some number which is unique-ish. That number is used to retrieve a key in a Hash object. If you want to get the value for 2
then the hash value for 2 is calculated (for an integer this is really fast) and looked up in the Hash object. So it looks like a hash is a way to over-complicate things - but if you want to be sure, benchmark your use-case is the way to go.