Tensorflow decision forest custom metric vs. number of trees


I have created a classification model using tensorflow decision forests. I'm struggling to evaluate how the performance changes vs. number of trees for a non-default metric (in this case PR-AUC).

Below is some code with my attempts.

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
import pandas as pd
import tensorflow as tf
import tensorflow_decision_forests as tfdf

# Build a binary classification problem from the diabetes regression dataset.
train = load_diabetes()
X = pd.DataFrame(train['data'])
X['target'] = (pd.Series(train['target']) > 100).astype(int)
X_train, X_test = train_test_split(X)

# Convert the pandas DataFrames to TensorFlow datasets.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(X_train, label="target")
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(X_test, label="target")

# Train a gradient boosted trees model with PR-AUC as an extra metric.
pr_auc = tf.keras.metrics.AUC(curve='PR')
tfdf_clf = tfdf.keras.GradientBoostedTreesModel()
tfdf_clf.compile(metrics=[pr_auc])
tfdf_clf.fit(train_ds, validation_data=test_ds)

Now I get very useful training logs using

tfdf_clf.make_inspector().training_logs()
#[TrainLog(num_trees=1, evaluation=Evaluation(num_examples=None, accuracy=0.9005518555641174, loss=0.6005926132202148, rmse=None, ndcg=None, aucs=None)),
#TrainLog(num_trees=2, evaluation=Evaluation(num_examples=None, accuracy=0.9005518555641174, loss=0.5672071576118469, rmse=None, ndcg=None, aucs=None)),

But these logs don't contain any information about PR-AUC vs. the number of trees.
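For reference, plotting the metrics that the logs do expose (accuracy and loss per tree count) is straightforward; a minimal sketch, assuming matplotlib is available:

import matplotlib.pyplot as plt

logs = tfdf_clf.make_inspector().training_logs()
num_trees = [log.num_trees for log in logs]
accuracy = [log.evaluation.accuracy for log in logs]

plt.plot(num_trees, accuracy)
plt.xlabel("Number of trees")
plt.ylabel("Validation accuracy")
plt.show()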

If I evaluate the model, it only reports PR-AUC at the end of training, although it seems to log some intermediate info.

tfdf_clf.evaluate(test_ds)

1180/1180 [==============================] - 10s 8ms/step - loss: 0.0000e+00 - auc: 0.6832

How can I find how test-data PR-AUC changes vs. the number of trees? I specifically need to use the TensorFlow Decision Forests library.

CodePudding user response:

Plot the AUPRC: the area under the interpolated precision-recall curve, obtained by plotting (recall, precision) points for different values of the classification threshold. Depending on how it is calculated, PR-AUC may be equivalent to the model's average precision. It looks like the precision is relatively high, but the recall and the area under the ROC curve (AUC) aren't as high as you might like. Classifiers often face challenges when trying to maximize both precision and recall, especially when working with imbalanced datasets. It is important to consider the costs of different types of errors in the context of the problem you care about. In fraud detection, for example, a false negative (a fraudulent transaction that is missed) has a financial cost, while a false positive (a transaction incorrectly flagged as fraudulent) decreases user happiness.
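As a rough sketch of how the PR curve and average precision for the trained model could be plotted, using the question's X_test and test_ds (this assumes tfdf_clf.predict() returns positive-class probabilities and that the dataset preserves X_test's row order):

from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

# Positive-class scores from the trained TF-DF model (assumption: predict()
# returns an (n, 1) array of probabilities for the positive class).
y_score = tfdf_clf.predict(test_ds).ravel()
y_true = X_test["target"].values

precision, recall, _ = precision_recall_curve(y_true, y_score)
print("Average precision:", average_precision_score(y_true, y_score))

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()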

In general, the more trees you use, the better the results get. However, the improvement diminishes as the number of trees increases: at some point the gain in prediction performance from learning more trees is lower than the cost in computation time of learning those additional trees. Random forests are ensemble methods, and you average over many trees. Similarly, if you want to estimate the average of a real-valued random variable (e.g. the average height of a citizen of your country), you can take a sample. The standard error of the estimate decreases with the square root of the sample size, and at some point the cost of collecting a larger sample outweighs the accuracy gained from it. If, in a single experiment on a single test set, you observe that a forest of 10 trees performs better than a forest of 500 trees, this may be due to statistical variance; if it happened systematically, I would hypothesize that something is wrong with the implementation. Typical values for the number of trees are 10, 30 or 100. In only very few practical cases do more than 300 trees outweigh the cost of learning them (except, perhaps, if you have a really huge dataset). A brute-force way to see this trade-off for your own metric is sketched below.
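If the goal is specifically test-set PR-AUC as a function of forest size, one simple (if expensive) option is to retrain with different values of the documented num_trees hyperparameter, reusing the question's train_ds and test_ds. A sketch, not the only approach:

import tensorflow as tf
import tensorflow_decision_forests as tfdf

results = {}
for n in [10, 30, 100, 300]:
    model = tfdf.keras.GradientBoostedTreesModel(num_trees=n)
    model.compile(metrics=[tf.keras.metrics.AUC(curve='PR', name='pr_auc')])
    # Note: GBT may stop adding trees early (validation-based early stopping),
    # so the trained forest can contain fewer than n trees.
    model.fit(train_ds)
    # evaluate() returns [loss, pr_auc] given the metric compiled above.
    _, pr_auc = model.evaluate(test_ds, verbose=0)
    results[n] = pr_auc

print(results)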

CodePudding user response:

The PR-AUC metric is not supported for Gradient Boosted Trees. However, all of the metrics are available for Random Forest. You'd need to convert your training data to a format with the same structure as the test data, then run it through a gradient boosted trees model trained on train_ds and evaluated against test_ds via train_ds.eval().

The reason Gradient Boosted Trees don't have the PR-AUC metric is that they are trained differently than Random Forest. They are not regressors, so it wouldn't make sense to return a probability estimate of being positive. Instead, they return just an average class-label prediction across all the trees for each test example, along with a ranking of labels. These rankings are used to calculate the aggregated metrics via the AggregatedMetrics API. Note that all predictions are averaged across all trees during training, so there is no parameter to control how many samples are used for evaluation purposes.

A better way to evaluate these kinds of models is not with a human-chosen metric like PR-AUC, but with the automatic metrics built into TensorFlow. They account for model size (smaller models can sometimes be statistically significant yet overfit heavily due to their small size), and they also let you choose how many samples are used in the evaluation (which can differ from the training set).
