I already have a normalized dataset:
df = spark.createDataFrame([
    ('red apple', 'ripe banana', 0.3),
    ('red apple', 'hot pepper', 0.4),
    ('red apple', 'sweet kiwi', 0.5),
    ('ripe banana', 'hot pepper', 0.6),
    ('ripe banana', 'sweet kiwi', 0.7),
    ('hot pepper', 'sweet kiwi', 0.8),
], ['phrase1', 'phrase2', 'common_persent'])
df.show()
+-----------+-----------+--------------+
|    phrase1|    phrase2|common_persent|
+-----------+-----------+--------------+
|  red apple|ripe banana|           0.3|
|  red apple| hot pepper|           0.4|
|  red apple| sweet kiwi|           0.5|
|ripe banana| hot pepper|           0.6|
|ripe banana| sweet kiwi|           0.7|
| hot pepper| sweet kiwi|           0.8|
+-----------+-----------+--------------+
I want to create a similarity matrix using PySpark. The result should look like this:
+-----------+---------+-----------+----------+----------+
|    phrases|red apple|ripe banana|hot pepper|sweet kiwi|
+-----------+---------+-----------+----------+----------+
|  red apple|      1.0|        0.3|       0.4|       0.5|
|ripe banana|      0.3|        1.0|       0.6|       0.7|
| hot pepper|      0.4|        0.6|       1.0|       0.8|
| sweet kiwi|      0.5|        0.7|       0.8|       1.0|
+-----------+---------+-----------+----------+----------+
So there should be ones on the diagonal, and above and below it the values from the common_persent column, mirrored. My problem is that I don't quite understand how to get the ones on the diagonal.
CodePudding user response:
You can simply pivot your dataframe.
from pyspark.sql import functions as f
df.groupBy('phrase1').pivot('phrase2').agg(f.first('common_persent')).show()
+-----------+----------+-----------+----------+
|    phrase1|hot pepper|ripe banana|sweet kiwi|
+-----------+----------+-----------+----------+
|  red apple|       0.4|        0.3|       0.5|
|ripe banana|       0.6|       null|       0.7|
| hot pepper|      null|       null|       0.8|
+-----------+----------+-----------+----------+
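This gives only one triangle of the matrix, with nulls in the missing cells. If you want the full symmetric matrix with ones on the diagonal from this approach, one option (a minimal sketch, not the only way) is to union the swapped pairs and the self-pairs before pivoting:

from pyspark.sql import functions as f

# Make the data symmetric: union the swapped pairs
# (union matches columns by position, not by name).
sym = df.union(df.select('phrase2', 'phrase1', 'common_persent'))

# Add the diagonal: every distinct phrase paired with itself at 1.0.
diag = (sym.select(f.col('phrase1').alias('p')).distinct()
        .select(f.col('p').alias('phrase1'),
                f.col('p').alias('phrase2'),
                f.lit(1.0).alias('common_persent')))

sym.union(diag).groupBy('phrase1').pivot('phrase2') \
    .agg(f.first('common_persent')).orderBy('phrase1').show()

Because every cell, including the diagonal, exists before the pivot, no fillna is needed afterwards.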
CodePudding user response:
To transform your dataframe into a similarity matrix, you have to follow several steps:
- first, create the list of permutations of phrase1/phrase2 by building an array and then using explode on this array
- then, pivot your dataframe as explained in Lamanus' answer
- finally, fill null values with 1.0 using fillna
- optionally, reorder your phrases column
The complete code is as follows:
from pyspark.sql import functions as F

result = (
    df
    # Build both orderings of each pair so every phrase appears on both axes.
    .withColumn('permutations', F.array(
        F.struct(F.col('phrase1').alias('first'), F.col('phrase2').alias('second')),
        F.struct(F.col('phrase2').alias('first'), F.col('phrase1').alias('second'))
    ))
    .drop('phrase1', 'phrase2')
    # explode yields one row per ordering, in a struct column named 'col'.
    .select(F.explode('permutations'), F.col('common_persent'))
    .select('col.*', 'common_persent')  # unpack 'first' and 'second'
    .groupBy(F.col('first').alias('phrases'))
    .pivot('second')
    .agg(F.first('common_persent'))
    .fillna(1.0)  # the missing diagonal cells become 1.0
    .orderBy('phrases')
)
With your df dataframe, it gives you the following result:
+-----------+----------+---------+-----------+----------+
|phrases    |hot pepper|red apple|ripe banana|sweet kiwi|
+-----------+----------+---------+-----------+----------+
|hot pepper |1.0       |0.4      |0.6        |0.8       |
|red apple  |0.4       |1.0      |0.3        |0.5       |
|ripe banana|0.6       |0.3      |1.0        |0.7       |
|sweet kiwi |0.8       |0.5      |0.7        |1.0       |
+-----------+----------+---------+-----------+----------+
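If you also want the columns in the same order as the rows in the question (rather than the alphabetical order Spark derives from the distinct values), pivot accepts an explicit list of values. A sketch, assuming the four phrases are known up front (the order list here is an assumption, not part of the original code):

from pyspark.sql import functions as F

# Assumed: the desired column order, listed explicitly.
order = ['red apple', 'ripe banana', 'hot pepper', 'sweet kiwi']

result_ordered = (
    df.withColumn('permutations', F.array(
        F.struct(F.col('phrase1').alias('first'), F.col('phrase2').alias('second')),
        F.struct(F.col('phrase2').alias('first'), F.col('phrase1').alias('second'))))
    .select(F.explode('permutations'), 'common_persent')
    .select('col.*', 'common_persent')
    .groupBy(F.col('first').alias('phrases'))
    .pivot('second', order)  # explicit values: fixed column order
    .agg(F.first('common_persent'))
    .fillna(1.0)
)

Passing the values to pivot also saves Spark an extra job to compute the distinct pivot values.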