I am trying to learn about recommender systems in Python by reading a blog that contains a great example of creating a recommender system of repositories in GitHub.
Once the dataset is loaded with read_csv(), the person that wrote the code decided to convert that data into a pivot_table pandas for visualizing the data in a more simple way. Here, I left you an image of that part of the code for simplicity:
In that table, rows are the users and columns are the repositories. The cross section between a row and a column is the punctuation that a user gives to a particular repository.
Due to the fact that many of the elements of that table are null (we can say that we are having a sparse matrix, very typical in machine learning), he decided to study the level of sparsity of the matrix by means of this code:
ratings = df_matrix.values
sparsity = float(len(ratings.nonzero()[0]))
sparsity /= (ratings.shape[0] * ratings.shape[1])
sparsity *= 100
print('Sparsity: {:4.2f}%'.format(sparsity))
Could anyone help me to know what the second line of code means? I think I understand that ratings.nonzero() returns a list with the indexes of all the elements that are different from zero and, as I interested in the total numbers and not the indexes, it is necessary to use len(ratings.nonzero()), but my problem is that it is impossible to me to know what the [0] means in the code.
Thank you very much and sorry for the inconvenience!
CodePudding user response:
By default, nonzero
will return a tuple of the form (row_idxs, col_idxs)
. If you hand it a one-dimensional array (like a pandas series), then it will still return a tuple, (row_idxs,)
. To access this first array, we still must index ratings.nonzero()[0]
to get the first-dimension index of nonzero elements.
More info available on the numpy page for nonzero
here, as both pandas and numpy use the same implementation.