Why this BST (Binary Search Tree) implemented in Ruby is extremely slow?-CodePudding

I have created this simple BST in Ruby:

class Node
  attr_accessor :data, :left, :right
  
  def initialize data
    @data = data
  end
end

def insert root, data
  current = root
  
  loop do
    node = current.data < data ? current.right : current.left
    break if node.nil?
    current = node
  end
  
  if current.data < data
    current.right = Node.new(data)
  else
    current.left = Node.new(data)
  end
end

Note: I don't use recursion because I get a "stack too deep" exception when the BST becomes large.

Then I try to insert about 6M records:

root = Node.new('root')
insert root, "something"
insert root, "another"
# ...

It takes about 30 minutes to complete on my MacOS, which seems really too much!

Why it takes so long (with SQLite I can index in 30 seconds the same data)?

Can you suggest any optimization to the above code?

CodePudding user response：

Is Ruby slow?

Here's a StackOverflow question with highly detailed answers. In short, comparing Ruby's performance with SQLite's is pointless; only one of these tools is optimised to work with data.

Optimizing the approach

Since you're using a BST, each insertion takes O(n) time in the worst case (which is a skewed tree). Inserting n nodes would have O(n^2) complexity, which might explain the high running time. You could replace the BST with a height balanced BST (also called AVL tree), where all insertion/deletion/search operations can be done in O(log n)

CodePudding user response：

I'm not sure if you strictly need a BST, but in case you can consider an alternative, having your data stored in an array and having it sorted (hopefully only once) would be way faster.

The following takes 2 seconds on my machine, whereas I stopped your BST implementation after 30 seconds.

data = []
6_000_000.times{|i|
  data << rand(1_000_000)
}
# data << 3000 #force 3000 to be there... 
data.sort! # only need to sort once
p data.bsearch{ |x| 3000 <=> x }

If you need to keep adding data, maybe re-sorting all the time would still be better.