I recently came across a problem involving a decision tree that uses a single continuous variable to split the data at multiple thresholds, where some of the splits lead to the same decision. The decision tree is as follows:
So I have two questions:
- Does it make sense to use one variable multiple times for dividing the predictor?
- Is it possible that two splits result in the same decision?
I didn't share any code or data set because I just want to know whether this situation is possible and why we would prefer it.
CodePudding user response:
In general, yes, it does make sense to divide on a variable multiple times. By dividing multiple times, you can get more than just a binary partition.
For example, if you wish to decide YES if 100 < balance < 500 and NO otherwise, you could do this with multiple partitions on the same variable:
      balance < 100
       /        \
     NO     balance < 500
              /      \
            YES      NO
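The tree above can be read as two nested tests on the same variable. Here is a minimal sketch, assuming the left branch is taken when the condition is true; the function name `decide` is made up for illustration:

```python
def decide(balance: float) -> str:
    """Walk the two-level tree drawn above; both splits test `balance`."""
    if balance < 100:      # first split: left branch
        return "NO"
    if balance < 500:      # second split on the same variable
        return "YES"
    return "NO"


# One value from each of the three intervals the two splits create.
for value in (50, 300, 700):
    print(value, decide(value))
```

Note that with strict `<` tests, a balance of exactly 100 lands on the YES side, so the boundaries would need adjusting if that matters.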
For your example above, it does not make sense for every leaf below a split to give the same categorical decision. Your tree is equivalent to:
    balance < 1890.64
      /        \
    NO         YES
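To see why the extra splits are redundant, here is a rough sketch that collapses any split whose two children are leaves with the same label. The tuple representation `(threshold, left, right)` and the inner thresholds 1000.0 and 2500.0 are assumptions made up for illustration, since the original figure is not available; only 1890.64 comes from the question.

```python
def collapse(node):
    """Merge splits whose children are identical leaves.

    A node is either a label string (leaf) or a tuple
    (threshold, left, right), where left means balance < threshold.
    """
    if isinstance(node, str):
        return node
    threshold, left, right = node
    left, right = collapse(left), collapse(right)
    if isinstance(left, str) and left == right:
        return left  # both branches agree, so the split adds nothing
    return (threshold, left, right)


# Hypothetical tree: the inner thresholds are illustrative only.
tree = (1890.64, (1000.0, "NO", "NO"), (2500.0, "YES", "YES"))
print(collapse(tree))
```

After collapsing, only the split at 1890.64 is left, matching the simplified tree above.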
In practice, each leaf may sometimes carry more information than just the category (YES or NO), such as the probability of YES/NO. In that case, a split whose leaves share the same category would make sense.
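As a small illustration of that last point, here are two leaves that predict the same category but with different confidence. The class counts are invented for the example and do not come from the question's data:

```python
def summarize(counts):
    """Return (majority label, estimated probability of that label)."""
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, counts[label] / total


# Hypothetical training counts in two sibling leaves.
leaf_a = {"YES": 60, "NO": 40}
leaf_b = {"YES": 95, "NO": 5}

print(summarize(leaf_a))
print(summarize(leaf_b))
```

Both leaves say YES, but the estimated probabilities differ, so merging them would lose information even though the predicted category stays the same.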
CodePudding user response:
Does it make sense to use one variable multiple times for dividing the predictor?
Yes, it absolutely makes sense for the same variable to appear multiple times in your tree. For example, suppose we want a NO when balance is either greater than 2000 or less than 1500, and a YES when balance is between 1500 and 2000: a single split cannot express that, but two splits on balance can.
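A greedy tree learner arrives at this on its own. Below is a minimal sketch of a one-variable tree grown with Gini impurity. It is a toy stand-in, not any particular library's algorithm, and the synthetic data (balances in steps of 100, with the 1500 to 2000 range treated as inclusive) are assumptions for illustration:

```python
def gini(labels):
    """Gini impurity of a list of YES/NO labels."""
    if not labels:
        return 0.0
    p = sum(1 for label in labels if label == "YES") / len(labels)
    return 2 * p * (1 - p)


def best_split(points):
    """Try every midpoint between sorted neighbours; keep the purest split."""
    best = None
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue
        threshold = (points[i - 1][0] + points[i][0]) / 2
        left = [label for _, label in points[:i]]
        right = [label for _, label in points[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(points)
        if best is None or score < best[0]:
            best = (score, threshold, i)
    return best


def grow(points):
    """Grow a tree on (balance, label) pairs sorted by balance."""
    labels = [label for _, label in points]
    if len(set(labels)) == 1:
        return labels[0]
    split = best_split(points)
    if split is None:  # identical balances with mixed labels: majority vote
        return max(set(labels), key=labels.count)
    _, threshold, i = split
    return (threshold, grow(points[:i]), grow(points[i:]))


def thresholds(node):
    """Collect every threshold used in the tree."""
    if isinstance(node, str):
        return []
    threshold, left, right = node
    return [threshold] + thresholds(left) + thresholds(right)


# Synthetic data: YES only for balances from 1500 to 2000 (assumed inclusive).
data = [(b, "YES" if 1500 <= b <= 2000 else "NO") for b in range(0, 3100, 100)]
tree = grow(data)
print(tree)
print(sorted(thresholds(tree)))
```

With only one variable available, the learner has to split on balance twice to isolate the middle interval.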
Is it possible that two splits result in the same decision?
There is really no need for such a split, since both branches give the same result!