Removing rows and columns if all zeros in non-diagonal entries-CodePudding

I am generating a confusion matrix to get an idea on my text-classifier's prediction vs ground-truth. The purpose is to understand which intents are being predicted as some another intents. But the problem is I have too many classes (more than 160), so the matrix is sparse, where most of the fields are zeros. Obviously, the diagonal elements are likely to be non-zero, as it is basically the indication of correct prediction.

That being the case, I want to generate a simpler version of it, as we only care non-zero elements if they are non-diagonal, hence, I want to remove the rows and columns where all the elements are zeros (ignoring the diagonal entries), such that the graph becomes much smaller and manageable to view. How to do that?

Following is the code snippet that I have done so far, it will produce mapping for all the intents i.e, (#intent, #intent) dimensional plot.

import matplotlib.pyplot as plt
import numpy as np 
from pandas import DataFrame
import seaborn as sns
%matplotlib inline
sns.set(rc={'figure.figsize':(64,64)})

confusion_matrix = pd.crosstab(df['ground_truth_intent_name'], df['predicted_intent_name'])

variables = sorted(list(set(df['ground_truth_intent_name'])))
temp = DataFrame(confusion_matrix, index=variables, columns=variables)

sns.heatmap(temp, annot=True)

TL;DR

Here temp is a pandas dataframe. I need to remove all rows and columns where all elements are zeros (ignoring the diagonal elements, even if they are not zero).

CodePudding user response：

You can use any on the comparison, but first you need to fill the diagonal with 0:

# also consider using
# a = np.isclose(confusion_matrix.to_numpy(), 0)
a = confusion_matrix.to_numpy() != 0

# fill diagonal
np.fill_diagonal(a, False)

# columns with at least one non-zero
cols = a.any(axis=0)

# rows with at least one non-zero
rows = a.any(axis=1)

# boolean indexing
confusion_matrix.loc[rows, cols]

Let's take an example:

# random data
np.random.seed(1)
# this would agree with the above
a = np.random.randint(0,2, (5,5))
a[2] = 0
a[:-1,-1] = 0
confusion_matrix = pd.DataFrame(a)

So the data would be:

   0  1  2  3  4
0  1  1  0  0  0
1  1  1  1  1  0
2  0  0  0  0  0
3  0  0  1  0  0
4  0  1  0  0  1

and the code outputs (notice the 2nd row and 4th column are gone):

   0  1  2  3
0  1  1  0  0
1  1  1  1  1
3  0  0  1  0
4  0  1  0  0