I have a PySpark dataframe with a column listing the genres of each movie. It looks like this:

| Movie Name | Genres | Review |
|------------|--------|--------|
| X          | Y, Z   | 10     |

I need to find the top N genres based on user reviews, which is a column for each individual movie. I have exploded the genres column into its own dataframe along with the review column, like this:
```python
splitDf = df.withColumn("genre", explode(split(col("genre"), "[,]")))
```
This works to separate each genre from each movie listing. However, I now need to rank the genres by review, and I am left with duplicate rows for each genre (one from each row in the original movie dataframe). I have tried

```python
specifiedDf = splitDf.select("genre", "user_review").groupBy("genre").avg("user_review")
```
and I have tried pivoting the table, but nothing seems to group the genres together so I can average the reviews.
Following the advice given, I could make it work in pandas with

```python
splitDf = df.to_pandas_on_spark()
splitDf['genre'] = splitDf['genre'].str.split(r',\s*')
resultDf = splitDf.explode('genre')[['genre', 'user_review']].groupby('genre').agg("avg")
resultDf = resultDf.sort_values(by="user_review", ascending=False)
```
However, I still can't convert this to PySpark. This is the code I've mainly tinkered with:

```python
splitArrayDf = df.select(split('genre', ',').alias("genre"), "user_review")
splitArrayDf = splitArrayDf.select(explode("genre").alias("genre"), "user_review") \
    .groupBy("genre").agg({"user_review": "avg"})
```

This creates duplicate genre fields, which the pandas version does not.
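A quick check in plain Python (outside Spark) shows what splitting on `,` alone produces; the sample string below is just an illustration of one value from the genres column, not my real data:

```python
# Plain-Python illustration: splitting on "," alone keeps the space,
# so " Z" and "Z" would be distinct group keys when grouping by genre.
raw = "Y, Z"

parts = raw.split(",")
print(parts)  # ['Y', ' Z'] -- note the leading space on ' Z'

cleaned = [p.strip() for p in parts]
print(cleaned)  # ['Y', 'Z']
```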
CodePudding user response:
As per the example you gave in your question, you probably get duplicate genres because of the spaces before and/or after the comma delimiter `,`. To handle this, you can replace them with an empty string before splitting, or simply split using the regex `\s*,\s*`:
```python
import pyspark.sql.functions as F

data = [("X", "Y, Z", 10), ("Y", "Z, W", 7)]
df = spark.createDataFrame(data, ["movie_name", "genre", "user_review"])

df1 = df.withColumn(
    "genre",
    F.explode(F.split("genre", r"\s*,\s*"))
).groupBy("genre").agg(
    F.avg("user_review").alias("user_review")
)

df1.show()
# +-----+-----------+
# |genre|user_review|
# +-----+-----------+
# |    Y|       10.0|
# |    Z|        8.5|
# |    W|        7.0|
# +-----+-----------+
```
CodePudding user response:
I suggest splitting `Genres` by comma, but assigning the output back to the same column. You can then explode that column and do a groupby that sums `Review` per genre:
```python
import io
import pandas as pd

data = '''Movie Name| Genres | Review
X | Y, Z | 10
Y | W, Z | 7'''

df = pd.read_csv(io.StringIO(data), sep=r'\s*\|\s*', engine='python')
df['Genres'] = df['Genres'].str.split(r',\s*')
result = df.explode('Genres')[['Genres', 'Review']].groupby('Genres').agg('sum')
```
Output:

| Genres | Review |
|--------|--------|
| W      | 7      |
| Y      | 10     |
| Z      | 17     |
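If you only need the top N genres rather than the whole table, sort the aggregated result by `Review` descending and take the head. This is a sketch layered on the answer above; `N = 2` is an illustrative choice, not something from the question:

```python
import io
import pandas as pd

data = '''Movie Name| Genres | Review
X | Y, Z | 10
Y | W, Z | 7'''

# Same parsing, split, explode, and groupby-sum as above.
df = pd.read_csv(io.StringIO(data), sep=r'\s*\|\s*', engine='python')
df['Genres'] = df['Genres'].str.split(r',\s*')
result = df.explode('Genres')[['Genres', 'Review']].groupby('Genres').agg('sum')

# Sort by summed review, descending, and keep the top N genres.
N = 2
top_n = result.sort_values('Review', ascending=False).head(N)
print(top_n)
```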