I have a dataset and was asked to write PySpark code for the following question:
List of winners of each World Champions Trophy. Hint: the total result of all rounds of a tournament for a player is considered that player's score/result.
Result attributes: winner, tournament_name
I wrote this code:
game_info = spark.read.load("/content/chess/chess_wc_history_game_info.csv",
format="csv", sep=",", inferSchema="true", header="true")
game_info.groupBy('winner').show()
But on execution I got this error:
AttributeError: 'GroupedData' object has no attribute 'show'
CodePudding user response:
This error occurs because groupBy() returns a GroupedData object, not a DataFrame, so it has no show() method. You have to apply an aggregation first; the result is a DataFrame again, and that one can be shown. GroupedData provides, among others, the following functions:
count() - Returns the count of rows for each group.
mean() - Returns the mean of values for each group.
max() - Returns the maximum of values for each group.
min() - Returns the minimum of values for each group.
sum() - Returns the total of values for each group.
avg() - Returns the average of values for each group.
agg() - Using the agg() function, we can calculate more than one aggregation at a time.
pivot() - This function is used to pivot the DataFrame.
CodePudding user response:
I want to add another useful function to @numb's list:
collect_list - Collects all the values of a specific column for each group.
I guess this would help to "see" the groups.
Side note: truncate=False in the show method prints the table without truncating long text, so you can actually see all the values.
from pyspark.sql.functions import collect_list
game_info.groupBy('winner').agg(collect_list("<column you want to fetch>").alias('group_values')).show(truncate=False)