Total Result of all rounds of Tournament for that player is considered as that player's Score/Result.
Schema:
|-- game_id: string (nullable = true)
|-- game_order: integer (nullable = true)
|-- event: string (nullable = true)
|-- site: string (nullable = true)
|-- date_played: string (nullable = true)
|-- round: double (nullable = true)
|-- white: string (nullable = true)
|-- black: string (nullable = true)
|-- result: string (nullable = true)
|-- white_elo: integer (nullable = true)
|-- black_elo: integer (nullable = true)
|-- white_title: string (nullable = true)
|-- black_title: string (nullable = true)
|-- winner: string (nullable = true)
|-- winner_elo: integer (nullable = true)
|-- loser: string (nullable = true)
|-- loser_elo: integer (nullable = true)
|-- winner_loser_elo_diff: integer (nullable = true)
|-- eco: string (nullable = true)
|-- date_created: string (nullable = true)
|-- tournament_name: string (nullable = true)
Sample DaraFrame:
-------------------- ---------- -------- ---------- ----------- ----- ---------------- ---------------- ------- --------- --------- ----------- ----------- --------- ---------- ---------------- --------- --------------------- --- -------------------- ---------------
| game_id|game_order| event| site|date_played|round| white| black| result|white_elo|black_elo|white_title|black_title| winner|winner_elo| loser|loser_elo|winner_loser_elo_diff|eco| date_created|tournament_name|
-------------------- ---------- -------- ---------- ----------- ----- ---------------- ---------------- ------- --------- --------- ----------- ----------- --------- ---------- ---------------- --------- --------------------- --- -------------------- ---------------
|86e0b7f5-7b94-4ae...| 1|WCh 2021| Dubai UAE| 2021.11.26| 1.0|Nepomniachtchi,I| Carlsen,M|1/2-1/2| 2782| 2855| null| null| draw| null| draw| null| 0|C88|2022-07-22T22:33:...| WorldChamp2021|
|dc4a10ab-54cf-49d...| 2|WCh 2021| Dubai UAE| 2021.11.27| 2.0| Carlsen,M|Nepomniachtchi,I|1/2-1/2| 2855| 2782| null| null| draw| null| draw| null| 0|E06|2022-07-22T22:33:...| WorldChamp2021|
|f042ca37-8899-488...| 3|WCh 2021| Dubai UAE| 2021.11.28| 3.0|Nepomniachtchi,I| Carlsen,M|1/2-1/2| 2782| 2855| null| null| draw| null| draw| null| 0|C88|2022-07-22T22:33:...| WorldChamp2021|
|f70e4bbc-21e3-46f...| 4|WCh 2021| Dubai UAE| 2021.11.30| 4.0| Carlsen,M|Nepomniachtchi,I|1/2-1/2| 2855| 2782| null| null| draw| null| draw| null| 0|C42|2022-07-22T22:33:...| WorldChamp2021|
|c941c323-308a-4c8...| 5|WCh 2021| Dubai UAE| 2021.12.01| 5.0|Nepomniachtchi,I| Carlsen,M|1/2-1/2| 2782| 2855| null| null| draw| null| draw| null| 0|C88|2022-07-22T22:33:...| WorldChamp2021|
|58e83255-93bb-4d5...| 6|WCh 2021| Dubai UAE| 2021.12.03| 6.0| Carlsen,M|Nepomniachtchi,I| 1-0| 2855| 2782| null| null|Carlsen,M| 2855|Nepomniachtchi,I| 2782| 73|D02|2022-07-22T22:33:...| WorldChamp2021|
|29181d93-73f4-4fb...| 7|WCh 2021| Dubai UAE| 2021.12.04| 7.0|Nepomniachtchi,I| Carlsen,M|1/2-1/2| 2782| 2855| null| null| draw| null| draw| null| 0|C88|2022-07-22T22:33:...| WorldChamp2021|
|8a4ccd8c-d437-429...| 8|WCh 2021| Dubai UAE| 2021.12.05| 8.0| Carlsen,M|Nepomniachtchi,I| 1-0| 2855| 2782| null| null|Carlsen,M| 2855|Nepomniachtchi,I| 2782| 73|C43|2022-07-22T22:33:...| WorldChamp2021|
|55a122db-27d1-495...| 9|WCh 2021| Dubai UAE| 2021.12.07| 9.0|Nepomniachtchi,I| Carlsen,M| 0-1| 2782| 2855| null| null|Carlsen,M| 2855|Nepomniachtchi,I| 2782| 73|A13|2022-07-22T22:33:...| WorldChamp2021|
|1f900d18-5ea3-4f4...| 10|WCh 2021| Dubai UAE| 2021.12.08| 10.0| Carlsen,M|Nepomniachtchi,I|1/2-1/2| 2855| 2782| null| null| draw| null| draw| null| 0|C42|2022-07-22T22:33:...| WorldChamp2021|
My code looks like this. I think it's messed up. Am I supposed to do sum somewhere?
winners = df_history_info.filter(df_history_info['winner'] != "draw").groupBy("tournament_name").agg({"winner":"max"}).show()
I'm getting this result but it is incorrect in many cases.
--------------- --------------------
|tournament_name| max(winner)|
--------------- --------------------
| WorldChamp2004| Leko,P|
| WorldChamp1894| Steinitz, William|
| WorldChamp2013| Carlsen, Magnus|
| FideChamp2000| Yermolinsky,A|
| WorldChamp2007| Svidler,P|
| FideChamp1993| Timman, Jan H|
|WorldChamp1910b| Lasker, Emanuel|
| WorldChamp1921|Capablanca, Jose ...|
| WorldChamp1958| Smyslov, Vassily|
| WorldChamp1981| Kortschnoj, Viktor|
| WorldChamp1961| Tal, Mihail|
| WorldChamp1978| Kortschnoj, Viktor|
| WorldChamp1960| Tal, Mihail|
| WorldChamp1948| Smyslov, Vassily|
| WorldChamp1929| Bogoljubow, Efim|
| WorldChamp1934| Bogoljubow, Efim|
| WorldChamp1986| Kasparov, Gary|
| PCAChamp1995| Kasparov, Gary|
| WorldChamp1886|Zukertort, Johann...|
| WorldChamp1907| Lasker, Emanuel|
--------------- --------------------
CodePudding user response:
Since the winner
column contains either the winning player's name or the word "draw"
which you've filtered out, then this means the operation .agg({"winner":"max"})
will return the max of a string. This is why Zukertort, Johann...
appears as the winner of WorldChamp1886
instead of Steinitz...
, and Yermolinksky,A
appears as in the winner in the 128 person field in the FideChamp2000
.
Here is an example of something you could try with a spark dataframe that looks like the following:
df = spark.createDataFrame(
[
("WC1", "A"),
("WC1", "B"),
("WC1", "A"),
("WC1", "A"),
("WC1", "A"),
("WC1", "B"),
("WC1", "A"),
("WC1", "B"),
("WC2", "F"),
("WC2", "F"),
("WC2", "F"),
("WC2", "D"),
("WC2", "D"),
("WC2", "E"),
("WC2", "F"),
("WC2", "F"),
],
["tournament_name", "winner"] # add your column names here
)
And you have a situation like this where you want to determine who wins each tournament by the most number of times their name appears in the winner column.
--------------- ------
|tournament_name|winner|
--------------- ------
| WC1| A|
| WC1| B|
| WC1| A|
| WC1| A|
| WC1| A|
| WC1| B|
| WC1| A|
| WC1| B|
| WC2| F|
| WC2| F|
| WC2| F|
| WC2| D|
| WC2| D|
| WC2| E|
| WC2| F|
| WC2| F|
--------------- ------
You can do a groupby count on tournament_name
and winner
:
d = df.groupby(["tournament_name","winner"]).count()
And that gives you this pyspark dataframe:
--------------- ------ -----
|tournament_name|winner|count|
--------------- ------ -----
| WC1| B| 3|
| WC1| A| 5|
| WC2| F| 5|
| WC2| D| 2|
| WC2| E| 1|
--------------- ------ -----
Then following this example, you could create a WindowSpec
object that partitions by tournament_name
, and sorts by in descending order of the count
column, and apply it to d:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
windowDept = Window.partitionBy("tournament_name").orderBy(col("count").desc())
d.withColumn("row",row_number().over(windowDept)) \
.filter(col("row") == 1).drop("row") \
.show()
Final result:
--------------- ------ -----
|tournament_name|winner|count|
--------------- ------ -----
| WC1| A| 5|
| WC2| F| 5|
--------------- ------ -----