Create Random Tree in R-CodePudding

Suppose that I want to create in R a binary tree on the interval (0,1) with maximum depth 3 in the following way:

First we have a pool of potential cut-offs for the tree t=c(0.1,0.2,0.3,0.4,0.5,0.6,0.7), a cut-off means that if we randomly choose the value 0.4 then we split the interval (0,1) to (0,0.4) and (0.4,1).

The steps that I want to do are:

1)Start with the whole interval (0,1)

2)Randomly choose a cut-off from t, denoted as t_1

3)Split the interval (0,1) based on the chosen cut-off i.e. to subintervals (0,t_1) and (t_1,1)

4)Then randomly choose one of the intervals (0,t_1) and (t_1,1)

5)For the chosen interval randomly sample from the cut-offs a point t_2 that makes sense, i.e. a point that it is not outside of the interval

6)Continue the procedure up to the point where we reach the maximum depth.

I'm totally clueless on where how to start. Is this the right forum to post such a question?

CodePudding user response：

Creating a tree structure like this requires a recursive function (i.e. a function that calls itself). The following function creates a list of nodes, where each branch node contains a split value, and two daughter nodes called left and right. The leaf nodes contains the final range encompassed within the leaf.

make_node <- function(min = 0, max = 1, desired_depth = 3, depth = 0) {
  
  if (depth < desired_depth) {
    split <- runif(1, min, max)
  list(split = split, 
       left = make_node(min = min, max = split, depth = depth   1),
       right = make_node(min = split, max = max, depth = depth   1))
  } else {
    list(range = c(min, max))
  }
}

It works like this. Let's create a reproducible tree:

set.seed(1)

tree <- make_node()

To get the initial splitting value, we do:

tree$split
#> [1] 0.2655087

So the right branch deals with all values between 0.2655087 and 1. To see where it splits this range, we do

tree$right$split
#> [1] 0.4136423

So this branch splits into values between [0.2655087, 0.4136423] on the left and [0.4136423, 1] on the right. Let's examine the left node:

tree$right$left$split
#> [1] 0.3985904

This has now split the [0.2655087, 0.4136423] branch into a left [0.2655087, 0.3985904] branch and a right [0.3985904, 0.4136423] branch.

If we take this right branch, have now reached depth 3, so we get the final range of this leaf and confirm its range:

tree$right$left$right
#> $range
#> [1] 0.3985904 0.4136423

Of course, to make all this easier you probably want some kind of function to walk the tree to classify a particular number.

walk_tree <- function(value, tree) {
  result <- paste("Value:", value, "\n")
  while(is.null(tree$range)) {
    if(value >= tree$split) {
      result <- paste(result, "\nGreater than split of", tree$split)
      tree <- tree$right
    } else {
      result <- paste(result, "\nLess than split of", tree$split)
      tree <- tree$left
    }
  }
  result <- paste0(result, "\nValue falls into leaf node with range [",
                  tree$range[1], ",", tree$range[2], "]\n")
  cat(result)
}

So, for example, we get

walk_tree(value = 0.4, tree)
#> Value: 0.4 
#>  
#> Greater than split of 0.2655086631421 
#> Less than split of 0.413642294289884 
#> Greater than split of 0.398590389362078
#> Value falls into leaf node with range [0.398590389362078,0.413642294289884]

You may prefer that this function returns a vector of 0s and 1s, or you may be looking for it to draw the tree, which is trickier to do, but possible.

^{Created on 2022-03-09 by the reprex package (v2.0.1)}

CodePudding user response：

Perhaps we can use Reduce to generate intervals in the binary-tree manner

Reduce(
  function(interval, k) {
    lb <- min(interval)
    ub <- max(interval)
    x <- v[v > lb & v < ub]
    if (!length(x)) {
      return(c(NA, NA))
    }
    p <- sample(x, 1)
    list(c(lb, p), c(p, ub))[[sample(1:2, 1)]]
  },
  1:3,
  init = c(0, 1),
  accumulate = TRUE
)

and you will see the result like

[[1]]
[1] 0 1

[[2]]
[1] 0.0 0.6

[[3]]
[1] 0.0 0.2

[[4]]
[1] 0.0 0.1

which indicates the selected interval in each iteration from top to bottom.