I have a PySpark dataframe with a column listing the genres of each movie. It looks like this:

| Movie Name | Genres | Review |
|------------|--------|--------|
| X          | Y, Z   | 10     |

I need to find the top N genres based on user reviews, which is a column for each individual movie. I have exploded the genres column into its own dataframe along with the review column, like this:
```python
splitDf = df.withColumn("genre", explode(split(col("genre"), "[,]")))
```
This works to separate each genre from each movie listing. However, I now need to rank the genres by review, and I am left with duplicate rows for each genre (one from each row in the original movie dataframe). I have tried

```python
specifiedDf = splitDf.select("genre", "user_review").groupBy("genre").avg("user_review")
```
and I have tried pivoting the table, but nothing seems to group the genres together so I can average the reviews.
Following the advice given, I could make it work in pandas with

```python
splitDf = df.to_pandas_on_spark()
splitDf['genre'] = splitDf['genre'].str.split(r',\s*')
resultDf = splitDf.explode('genre')[['genre', 'user_review']].groupby('genre').agg("avg")
resultDf = resultDf.sort_values(by="user_review", ascending=False)
```
However, I still can't convert this to PySpark. This is the code I've mainly tinkered with:

```python
splitArrayDf = df.select(split('genre', ',').alias("genre"), "user_review")
splitArrayDf = splitArrayDf.select(explode("genre").alias("genre"), "user_review") \
    .groupBy("genre").agg({"user_review": "avg"})
```

This creates duplicate genre fields, which the pandas version does not.
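A quick check in plain Python (outside Spark) shows what splitting on `,` alone produces; the sample string below is just an illustration of one value from the genres column, not my real data:

```python
# Plain-Python illustration: splitting on "," alone keeps the space,
# so " Z" and "Z" would be distinct group keys when grouping by genre.
raw = "Y, Z"

parts = raw.split(",")
print(parts)  # ['Y', ' Z'] -- note the leading space on ' Z'

cleaned = [p.strip() for p in parts]
print(cleaned)  # ['Y', 'Z']
```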
CodePudding user response:
As per the example you gave in your question, you probably get duplicate genres because of the spaces before and/or after the comma delimiter `,`. To handle this, you can replace them with an empty string before splitting, or simply split using the regex `\s*,\s*`:
```python
import pyspark.sql.functions as F

data = [("X", "Y, Z", 10), ("Y", "Z, W", 7)]
df = spark.createDataFrame(data, ["movie_name", "genre", "user_review"])

df1 = df.withColumn(
    "genre",
    F.explode(F.split("genre", r"\s*,\s*"))
).groupBy("genre").agg(
    F.avg("user_review").alias("user_review")
)

df1.show()
# +-----+-----------+
# |genre|user_review|
# +-----+-----------+
# |    Y|       10.0|
# |    Z|        8.5|
# |    W|        7.0|
# +-----+-----------+
```
CodePudding user response:
I suggest splitting `Genres` by comma, but assigning the output back to the same column. You can then explode that column and do a groupby that sums `Review` per genre:
```python
import io
import pandas as pd

data = '''Movie Name| Genres | Review
X | Y, Z | 10
Y | W, Z | 7'''

df = pd.read_csv(io.StringIO(data), sep=r'\s*\|\s*', engine='python')
df['Genres'] = df['Genres'].str.split(r',\s*')
result = df.explode('Genres')[['Genres', 'Review']].groupby('Genres').agg('sum')
```
Output:

| Genres | Review |
|--------|--------|
| W      | 7      |
| Y      | 10     |
| Z      | 17     |
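If you only need the top N genres rather than the whole table, sort the aggregated result by `Review` descending and take the head. This is a sketch layered on the answer above; `N = 2` is an illustrative choice, not something from the question:

```python
import io
import pandas as pd

data = '''Movie Name| Genres | Review
X | Y, Z | 10
Y | W, Z | 7'''

# Same parsing, split, explode, and groupby-sum as above.
df = pd.read_csv(io.StringIO(data), sep=r'\s*\|\s*', engine='python')
df['Genres'] = df['Genres'].str.split(r',\s*')
result = df.explode('Genres')[['Genres', 'Review']].groupby('Genres').agg('sum')

# Sort by summed review, descending, and keep the top N genres.
N = 2
top_n = result.sort_values('Review', ascending=False).head(N)
print(top_n)
```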