I am generating a confusion matrix to compare my text classifier's predictions against the ground truth. The goal is to see which intents are being predicted as other intents. The problem is that I have too many classes (more than 160), so the matrix is sparse: most of the cells are zero. The diagonal elements are, of course, likely to be non-zero, since they indicate correct predictions.
Given that, I want to generate a simpler version of it. We only care about non-zero elements when they are off the diagonal, so I want to remove the rows and columns where all elements are zero (ignoring the diagonal entries), so that the plot becomes much smaller and easier to read. How can I do that?
Below is the code I have so far. It produces a plot covering all intents, i.e. with dimensions (#intents, #intents).
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import DataFrame
import seaborn as sns
%matplotlib inline
sns.set(rc={'figure.figsize':(64,64)})
# rows: ground-truth intents, columns: predicted intents
confusion_matrix = pd.crosstab(df['ground_truth_intent_name'], df['predicted_intent_name'])
# reindex so rows and columns share the same ordered label set
variables = sorted(list(set(df['ground_truth_intent_name'])))
temp = DataFrame(confusion_matrix, index=variables, columns=variables)
sns.heatmap(temp, annot=True)
TL;DR
Here temp is a pandas DataFrame. I need to remove all rows and columns where all elements are zero (ignoring the diagonal elements, even if they are non-zero).
CodePudding user response:
You can use any on the comparison with zero, but first you need to fill the diagonal of the resulting mask with False:
# boolean mask of non-zero cells; for float data also consider
# a = ~np.isclose(confusion_matrix.to_numpy(), 0)
a = confusion_matrix.to_numpy() != 0
# ignore the diagonal (correct predictions)
np.fill_diagonal(a, False)
# columns with at least one non-zero off-diagonal entry
cols = a.any(axis=0)
# rows with at least one non-zero off-diagonal entry
rows = a.any(axis=1)
# boolean indexing keeps only those rows and columns
confusion_matrix.loc[rows, cols]
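One caveat worth hedging: np.fill_diagonal works by position, so the mask only ignores the correct-prediction cells if rows and columns carry the same labels in the same order. pd.crosstab only includes labels it has actually seen, so its result is not guaranteed to be square. The following is a minimal sketch, not the original poster's code: it uses the column names from the question on a small made-up df, and reindexes both axes to the union of labels before masking.

```python
import numpy as np
import pandas as pd

# Made-up toy data for illustration; only the column names come from the question.
df = pd.DataFrame({
    "ground_truth_intent_name": ["a", "a", "b", "c", "c"],
    "predicted_intent_name":    ["a", "b", "b", "c", "d"],
})

# Not square here: "d" is predicted but never appears as ground truth.
cm = pd.crosstab(df["ground_truth_intent_name"], df["predicted_intent_name"])

# Union of the labels seen on either axis, in a fixed order.
labels = sorted(set(cm.index) | set(cm.columns))

# Square matrix; rows/columns that were missing are filled with 0, not NaN.
square = cm.reindex(index=labels, columns=labels, fill_value=0)

mask = square.to_numpy() != 0
np.fill_diagonal(mask, False)  # ignore correct predictions

# Keep rows/columns that have at least one off-diagonal non-zero cell.
reduced = square.loc[mask.any(axis=1), mask.any(axis=0)]
print(reduced)
```

Filling with 0 rather than leaving NaN matters here because NaN != 0 evaluates to True and would keep otherwise empty rows and columns.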
Let's take an example:
# reproducible random 5x5 matrix of 0s and 1s
np.random.seed(1)
a = np.random.randint(0,2, (5,5))
# make row 2 all zeros
a[2] = 0
# zero the last column except its diagonal entry
a[:-1,-1] = 0
confusion_matrix = pd.DataFrame(a)
So the data would be:
   0  1  2  3  4
0  1  1  0  0  0
1  1  1  1  1  0
2  0  0  0  0  0
3  0  0  1  0  0
4  0  1  0  0  1
and the code outputs the following (notice that the row labeled 2 and the column labeled 4 are gone):
   0  1  2  3
0  1  1  0  0
1  1  1  1  1
3  0  0  1  0
4  0  1  0  0
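For reuse, the steps above could be wrapped in a small function. This is only a sketch: the name drop_unconfused and the keep_square option are hypothetical, and it assumes the input is square with rows and columns in the same label order. The example matrix is the one printed above, typed in by hand rather than regenerated from the random seed.

```python
import numpy as np
import pandas as pd

def drop_unconfused(cm: pd.DataFrame, keep_square: bool = False) -> pd.DataFrame:
    """Drop rows/columns whose off-diagonal entries are all zero.

    Hypothetical helper; assumes `cm` is square with rows and columns
    in the same label order.
    """
    mask = cm.to_numpy() != 0
    np.fill_diagonal(mask, False)  # ignore correct predictions
    rows = mask.any(axis=1)
    cols = mask.any(axis=0)
    if keep_square:
        # Keep a label if it is confused in either direction, so the
        # result stays square and the diagonal stays aligned.
        keep = rows | cols
        return cm.loc[keep, keep]
    return cm.loc[rows, cols]

# The 5x5 example matrix from the answer, entered by hand.
example = pd.DataFrame([
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1],
])

reduced = drop_unconfused(example)
reduced_square = drop_unconfused(example, keep_square=True)
print(reduced)
print(reduced_square.shape)
```

With the default, rows and columns are filtered independently, so the result may not be square. With keep_square=True a label survives if either its row or its column shows a confusion; on this particular example every label does, so nothing is dropped in that mode.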