What is the difference between an array, an ordered collection of integer-indexed values, and a hash, a collection of key-value pairs, where every key is an index that starts from zero?

arr = [1,2,3,4]
hash = { 0 => 1, 1 => 2, 2 => 3, 3 => 4}

I know that it would be stupid to implement a hash like that, but what is the difference behind the hood? Are they stored in different ways in memory? What data structure makes your program faster when retrieving data from it?

CodePudding user response：

Speed

Arrays are a better for iterating over a collection.

Hashes have almost constant lookup time O(log n), while Arrays have linear time O(n). This means that you should prefer Hashes when you need to use #include?.

Hashes work great as dictionaries.

There is a third underutilised data structure in Ruby which behaves like a unique Array but uses a Hash underneath. This is Set. It works great when you don't care about the order of elements, need to check if something belongs to a group or need a unique collection of elements.

Implemetation

Yes, they are two completely different data structures.

In this particular example you could access these elements in a similar manner by calling arr[0] and hash[0].

The line arr[0] is in fact sugar syntax for arr.[](0). Same goes for arr[0] = 5 which is the same as arr.[]=(0, 5). Both Array and Hash have methods called [] and []=, but under the hood completely different things would happen, because these are two separate methods with the same name.

Arrays are always indexed with integers starting with 0 and the order of items is important.

Array is a simple list of elements backed by a dynamic C array, while a Hash is an implementation of a dictionary/hashmap, a much more sophisticated data structure.

Hashes store both keys and values, and possess the ability to retrieve a value based on the received key. Every single Ruby object that implements #eql? and #hash can become a Hash key. The order of elements in a Hash can change and is not stable.

Every Hash is in fact an array of slots called buckets. These buckets hold individual key-value pairs. To determine which key should correspond to a certain bucket, Ruby uses a hashing function, provided by the method called #hash on the object passed as the key. If two objects return the same value from their #hash method, they are considered the same key. This goes both ways, Ruby uses #hash to save the key-value pair in the right bucket, and to find the right value for the passed key.

For example "string".hash will always return the same value, even though these are different objects. This value is calculated based on the content of the String. This is why Strings with the same content (even though they are different objects) are considered the same key in a Hash.

Sometimes two different objects (of separate classes) end up having the same #hash. Ruby handles these collisions by implementing individual buckets as linked lists of key-value pairs. So a few key-value pairs may be stored in the same bucket.

To decide which value should be returned for a given key, when there are a few key-value pairs in the same bucket, Ruby utilises the eql? method. It compares each key stored in the bucket with the passed key like so passed_key.eql?(key_in_bucket). When they are equal Ruby considers them a match.

Here's a great article on Hashes.

CodePudding user response：

Almost every object in Ruby has a hash method. This method calculates some number which is unique-ish. That number is used to retrieve a key in a Hash object. If you want to get the value for 2 then the hash value for 2 is calculated (for an integer this is really fast) and looked up in the Hash object. So it looks like a hash is a way to over-complicate things - but if you want to be sure, benchmark your use-case is the way to go.