Grouping sub-fields from a PySpark dataframe into a new dataframe and sorting by a linked column


I have a PySpark dataframe with a column containing the genres of each movie. It looks like this:

|Movie Name| Genres | Review |
|----------|--------|--------|
|X         | Y, Z   | 10     |

I need to find the top N genres based on user reviews, which is a column for each individual movie. I have exploded the genres column into its own dataframe together with the review column, like this:

from pyspark.sql.functions import col, explode, split

splitDf = df.withColumn("genre", explode(split(col("genre"), "[,]")))

This works to separate each genre from each movie listing. However, now I need to rank them by review, and I am left with duplicate rows for each genre (one from each row in the original movie df). I have tried

specifiedDf = splitDf.select("genre","user_review").groupBy("genre").avg("user_review")

and I have tried pivoting the table, but nothing seems to group the genres together so I can average the reviews.

Following the advice given, I could make it work in pandas with

splitDf = df.to_pandas_on_spark()
splitDf['genre'] = splitDf['genre'].str.split(r',\s*')
resultDf = splitDf.explode('genre')[['genre', 'user_review']].groupby('genre').agg("avg")
resultDf = resultDf.sort_values(by="user_review", ascending=False)

However, I still can't convert this to PySpark. This is the code I've mainly tinkered with:

splitArrayDf = df.select(split('genre', ',').alias("genre"), "user_review")
splitArrayDf = splitArrayDf.select(explode("genre").alias("genre"), "user_review") \
    .groupBy("genre").agg({"user_review": "avg"})

This creates duplicate genre fields, which the pandas version does not.

CodePudding user response:

As per the example you gave in your question, you probably get duplicate genres because of the spaces before and/or after the comma delimiter.

To handle this, you can replace them with an empty string before splitting, or simply split using the regex \s*,\s*:

import pyspark.sql.functions as F

data = [("X", "Y, Z", 10), ("Y", "Z, W", 7)]
df = spark.createDataFrame(data, ["movie_name", "genre", "user_review"])

df1 = df.withColumn(
    "genre",
    F.explode(F.split("genre", r"\s*,\s*"))
).groupBy("genre").agg(
    F.avg("user_review").alias("user_review")
)

df1.show()
# +-----+-----------+
# |genre|user_review|
# +-----+-----------+
# |    Y|       10.0|
# |    Z|        8.5|
# |    W|        7.0|
# +-----+-----------+
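
If you then need the top N genres by average review (which was the goal in the question), you should be able to order the aggregated dataframe and limit it. A minimal sketch on top of df1 above (N = 2 is just an arbitrary choice for illustration):

# keep only the N genres with the highest average review
topN = df1.orderBy(F.desc("user_review")).limit(2)
topN.show()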

CodePudding user response:

I suggest splitting Genres by comma, but assigning the output to the same column. You can then explode that column and do a groupby that computes the sum of Review per genre:

import io
import pandas as pd

data = '''Movie Name| Genres | Review
X         | Y, Z   | 10
Y         | W, Z   | 7'''

df = pd.read_csv(io.StringIO(data), sep=r'\s*\|\s*', engine='python')
df['Genres'] = df['Genres'].str.split(r',\s*')
result = df.explode('Genres')[['Genres', 'Review']].groupby('Genres').agg(sum)

Output:

Genres Review
W 7
Y 10
Z 17
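
If you also want the genres sorted by Review and trimmed to the top N, as asked in the question, something like this should work on the aggregated result (head(2) is just an arbitrary N for illustration):

# sort by the aggregated Review and keep only the top N genres
top_genres = result.sort_values('Review', ascending=False).head(2)
print(top_genres)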