I'm trying to construct a co-occurrence matrix from my dataframe on Databricks using the pyspark.pandas API.
I tried the method from "Constructing a co-occurrence matrix in python pandas" to build the matrix.
The code works fine in pandas, but throws an error with pyspark.pandas:
coocc = psdf.T.dot(psdf)
coocc
I get this error:
TypeError: Unsupported type DataFrame
I checked the docs: https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.dot.html
pyspark.pandas.DataFrame.dot() takes a Series as input.
I tried converting the dataframe to a series with psdf.squeeze(), but that does not produce a series because my dataframe has multiple columns.
Is there any way to convert a pyspark.pandas.DataFrame to a pyspark.pandas.Series?
Or is there a different method in pyspark.pandas to construct a co-occurrence matrix?
CodePudding user response:
I solved it using csr_matrix from scipy.sparse, since the dataframe contains only 1 and 0 as values:
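import pyspark.pandas as ps
import scipy.sparse as sp
# .values collects the data to the driver as a NumPy array
psdfx = sp.csr_matrix(psdf.astype(int).values)
# X^T * X counts, for each pair of columns, the rows where both are 1
psdfc = psdfx.T * psdfx
# zero the diagonal so a column is not counted as co-occurring with itself
psdfc.setdiag(0)
coocc = ps.DataFrame(psdfc.todense(), columns=psdf.columns, index=psdf.columns)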
coocc
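For anyone who wants to check the logic locally before running it on a cluster, here is a minimal sketch of the same sparse-matrix approach on a plain pandas DataFrame. The function name cooccurrence and the toy data are made up for illustration, and the sketch assumes the frame holds only 0/1 values and fits in memory.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp


def cooccurrence(df: pd.DataFrame) -> pd.DataFrame:
    """Return column-by-column co-occurrence counts for a 0/1 DataFrame."""
    x = sp.csr_matrix(df.astype(int).to_numpy())
    # counts[i, j] = number of rows where columns i and j are both 1
    counts = (x.T @ x).toarray()
    # zero the diagonal so a column is not counted against itself
    np.fill_diagonal(counts, 0)
    return pd.DataFrame(counts, index=df.columns, columns=df.columns)


# Hypothetical toy data: 4 rows, 3 indicator columns
toy = pd.DataFrame({"a": [1, 0, 1, 1], "b": [1, 1, 0, 0], "c": [0, 1, 1, 1]})
result = cooccurrence(toy)
print(result)
```

With a pyspark.pandas DataFrame, the same function should apply after bringing the data to the driver, for example with psdf.to_pandas(), as long as the data fits in driver memory.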