Home > Enterprise >  What does .nonzero()[0] mean when we want to compute the sparsity of a matrix?
What does .nonzero()[0] mean when we want to compute the sparsity of a matrix?

Time:10-19

I am trying to learn about recommender systems in Python by reading a blog that contains a great example of creating a recommender system of repositories in GitHub.

Once the dataset is loaded with read_csv(), the person that wrote the code decided to convert that data into a pivot_table pandas for visualizing the data in a more simple way. Here, I left you an image of that part of the code for simplicity:

enter image description here

In that table, rows are the users and columns are the repositories. The cross section between a row and a column is the punctuation that a user gives to a particular repository.

Due to the fact that many of the elements of that table are null (we can say that we are having a sparse matrix, very typical in machine learning), he decided to study the level of sparsity of the matrix by means of this code:

ratings = df_matrix.values
sparsity = float(len(ratings.nonzero()[0]))
sparsity /= (ratings.shape[0] * ratings.shape[1])
sparsity *= 100
print('Sparsity: {:4.2f}%'.format(sparsity))

Could anyone help me to know what the second line of code means? I think I understand that ratings.nonzero() returns a list with the indexes of all the elements that are different from zero and, as I interested in the total numbers and not the indexes, it is necessary to use len(ratings.nonzero()), but my problem is that it is impossible to me to know what the [0] means in the code.

Thank you very much and sorry for the inconvenience!

CodePudding user response:

By default, nonzero will return a tuple of the form (row_idxs, col_idxs). If you hand it a one-dimensional array (like a pandas series), then it will still return a tuple, (row_idxs,). To access this first array, we still must index ratings.nonzero()[0] to get the first-dimension index of nonzero elements.

More info available on the numpy page for nonzero here, as both pandas and numpy use the same implementation.

  • Related