How to make a table from given pets to aggregate the count of the other pets?-CodePudding

I am writing a PySpark code from the following table.

user	pets
aa	[["dog"], ["cat"], ["lizard"]]
bb	[["dog"], ["spider"]]
cc	[["dog"], ["cat"], ["monkey"]]
...	...

Using explode_outer, I unnested the table to below.

user	pets
aa	"dog"
aa	"cat"
aa	"lizard"
bb	"dog"
bb	"spider"
cc	"dog"
cc	"cat"
cc	"monkey"
...	...

I would like to make another table that contains these data below.

pet	dog	cat	lizard	spider	monkey
dog	0	2	1	1	1
cat	2	0	1	0	1
lizard	1	1	0	0	0
spider	1	0	0	0	0
monkey	1	1	0	0	0

For instance, in terms of spider, it only lives with dog in the user 'bb'. So that if the data is meaningfully large, I can answer the quote, "Find top 3 pets that are the best suits with your dog."

However, I have no idea how to design appropriate code from the second table to the third one.

CodePudding user response：

Here's a solution. Hope this helps to solve your case:

import pandas as pd
import numpy as np
import itertools

df1 = pd.DataFrame({'user':['aa', 'bb', 'cc'], 'pets':[['dog', 'cat', 'lizard'], ['dog', 'spider'], ['dog', 'cat', 'monkey']]}).explode('pets')
combinations = []
for g in df1.groupby('user'):
    combinations  = [x for x in itertools.combinations(g[1].pets, 2)]
df2 = pd.DataFrame(np.zeros((5, 5), dtype=int), columns=df1.pets.unique(), index=df1.pets.unique())
for x in combinations:
    df2.at[x[0], x[1]]  = 1
    df2.at[x[1], x[0]]  = 1
print(df2)

Result:

        dog  cat  lizard  spider  monkey
dog       0    2       1       1       1
cat       2    0       1       0       1
lizard    1    1       0       0       0
spider    1    0       0       0       0
monkey    1    1       0       0       0

CodePudding user response：

The solution based only on the Spark engine. Finding all the combinations, separating them into columns and doing a crosstab. The result omits repeating entries.

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('aa', [["dog"], ["cat"], ["lizard"]]),
     ('bb', [["dog"], ["spider"]]),
     ('cc', [["dog"], ["cat"], ["monkey"]])],
    ['user', 'pets'])

df = df.withColumn('pets', F.flatten('pets'))
combinations = (F.filter(
    F.transform(
        F.flatten(F.transform(
            'pets',
            lambda x: F.arrays_zip(F.array_repeat(x, F.size('pets')), 'pets')
        )),
        lambda x: F.array(x['0'], x['pets'])
    ),
    lambda x: x[0] < x[1]
))
df = df.withColumn('pets', F.explode(combinations))
df = df.withColumn('pet0', F.col('pets')[0])
df = df.withColumn('pet1', F.col('pets')[1])

df = df.crosstab("pet0", "pet1")

df.show()
#  --------- --- ------ ------ ------ 
# |pet0_pet1|dog|lizard|monkey|spider|
#  --------- --- ------ ------ ------ 
# |      dog|  0|     1|     1|     1|
# |      cat|  2|     1|     1|     0|
#  --------- --- ------ ------ ------