pyspark.pandas API: construct co-occurrence matrix, .dot() does not support dataframe as input

Time:10-17

I'm trying to construct a co-occurrence matrix from my dataframe on Databricks using the pyspark.pandas API.

I followed the method described in "Constructing a co-occurrence matrix in python pandas".

The code works fine in pandas, but throws an error with pyspark.pandas:

coocc = psdf.T.dot(psdf)
coocc

I'm getting this error:

TypeError: Unsupported type DataFrame

I checked the docs: https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.dot.html

pyspark.pandas.DataFrame.dot() only takes a Series as input.

I tried converting the dataframe to a Series using psdf.squeeze(), but it does not convert it, because my dataframe has multiple columns.

Is there any way to convert a pyspark.pandas.DataFrame to a pyspark.pandas.Series? Or is there a different method in pyspark.pandas to construct a co-occurrence matrix?

CodePudding user response:

I solved it using scipy's csr_matrix, since the dataframe only contains '1' and '0' values.

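import pyspark.pandas as ps
import scipy.sparse as sp

# .values collects the data to the driver, so this assumes it fits in memory
psdfx = sp.csr_matrix(psdf.astype(int).values)
# X^T * X counts, for each pair of columns, the rows where both are 1
psdfc = psdfx.T * psdfx
# zero the diagonal (each column's co-occurrence with itself)
psdfc.setdiag(0)
coocc = ps.DataFrame(psdfc.toarray(), columns=psdf.columns, index=psdf.columns)
coocc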

Ref: https://stackoverflow.com/a/37840528/19642283
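
To sanity-check what the sparse product computes, here is a minimal sketch on a tiny made-up 0/1 matrix, using only NumPy and SciPy (no Spark). The data and the labels a, b, c are hypothetical and not from the question. It assumes the whole matrix fits in memory, as the approach above does too.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical toy data: 4 rows x 3 binary columns (a, b, c).
labels = ["a", "b", "c"]
X = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
])

# Sparse version of X^T X: entry (i, j) counts rows where columns i and j are both 1.
Xs = sp.csr_matrix(X)
cooc = (Xs.T @ Xs).toarray()

# The diagonal holds each column's own count; zero it to keep only pair counts.
np.fill_diagonal(cooc, 0)

for label, row in zip(labels, cooc):
    print(label, row.tolist())
```

Here a and b co-occur in two rows, a and c in two rows, and b and c in one row, so the result should be symmetric with those counts off the diagonal.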
