How can I change the feature numbers listed below, as output by my algorithm, to their real feature names? I want the feature names listed in the array. My algorithm is this:
Input :
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, max_depth=3)
rf.fit(X, y)  # fitted on my training data

n_nodes = rf.estimators_[0].tree_.node_count
children_left = rf.estimators_[0].tree_.children_left
children_right = rf.estimators_[0].tree_.children_right
feature = rf.estimators_[0].tree_.feature
threshold = rf.estimators_[0].tree_.threshold

node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
is_leaves = np.zeros(shape=n_nodes, dtype=bool)
stack = [(0, -1)]  # seed is the root node id and its parent depth
while len(stack) > 0:
    node_id, parent_depth = stack.pop()
    node_depth[node_id] = parent_depth + 1
    # If we have a test node
    if children_left[node_id] != children_right[node_id]:
        stack.append((children_left[node_id], parent_depth + 1))
        stack.append((children_right[node_id], parent_depth + 1))
    else:
        is_leaves[node_id] = True
Out:
For feature:
array([41, 0, 0, -2, -2, 55, -2, -2, 40, 45, -2, -2, 44, -2, -2], dtype=int64)
CodePudding user response:
You might use the feature_names_in_ property of your fitted random forest estimator to access the feature names:

    feature_names_in_ : ndarray of shape (n_features_in_,)
        Names of features seen during fit. Defined only when X has feature names that are all strings.

together with your feature variable, namely rf.feature_names_in_[feature].
Of course, you should consider that those -2 values correspond to the case where a leaf is reached, while indexing the rf.feature_names_in_ array with negative numbers won't take that into account (NumPy treats them as indices counted from the end of the array). However, you can overcome the issue by first finding the indices where feature equals that sentinel value,

    leaves = np.where(feature == -2)[0]

and using them to overwrite the corresponding entries of the resulting array:

    attr = rf.feature_names_in_[feature]
    attr[leaves] = 'leaf'
Here's a complete example:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, max_depth=3)
rf.fit(X_train, y_train)

n_nodes = rf.estimators_[0].tree_.node_count
children_left = rf.estimators_[0].tree_.children_left
children_right = rf.estimators_[0].tree_.children_right
feature = rf.estimators_[0].tree_.feature
threshold = rf.estimators_[0].tree_.threshold

node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
is_leaves = np.zeros(shape=n_nodes, dtype=bool)
stack = [(0, -1)]  # seed is the root node id and its parent depth
while len(stack) > 0:
    node_id, parent_depth = stack.pop()
    node_depth[node_id] = parent_depth + 1
    # If we have a test node
    if children_left[node_id] != children_right[node_id]:
        stack.append((children_left[node_id], parent_depth + 1))
        stack.append((children_right[node_id], parent_depth + 1))
    else:
        is_leaves[node_id] = True

leaves = np.where(feature == -2)[0]
attr = rf.feature_names_in_[feature]
attr[leaves] = 'leaf'
print(attr)  # per-node split feature names, with 'leaf' at leaf nodes
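The masking step itself can be checked in isolation with plain NumPy, without fitting a forest. This is a minimal sketch using a made-up feature array (shaped like tree_.feature, with -2 marking leaves) and hypothetical iris-style feature names:

```python
import numpy as np

# Hypothetical feature names, standing in for rf.feature_names_in_
feature_names = np.array(['sepal length (cm)', 'sepal width (cm)',
                          'petal length (cm)', 'petal width (cm)'])

# Made-up per-node split features; -2 marks leaf nodes, as in tree_.feature
feature = np.array([2, 3, -2, -2, 0, -2, -2])

# Naive fancy indexing: the -2 entries silently pick the second-to-last name
attr = feature_names[feature]

# Overwrite the leaf positions explicitly
leaves = np.where(feature == -2)[0]
attr[leaves] = 'leaf'

print(attr.tolist())
# ['petal length (cm)', 'petal width (cm)', 'leaf', 'leaf',
#  'sepal length (cm)', 'leaf', 'leaf']
```

Note that 'leaf' only fits because the array's string dtype is wide enough here; if your feature names are shorter than the replacement string, cast with attr.astype(object) first so the assignment isn't truncated.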