I'm writing a function to put data from a chess website into a dataframe. The code I was able to write almost got the job done but not quite.
import chessdotcom
import pandas as pd
import regex as re
import json
from io import StringIO
#
def cleandata(datacall):
data = datacall.text
x = json.loads(data)
df = pd.read_json(StringIO(json.dumps(x)))
df2 = pd.json_normalize(df['games'])
df2.to_csv(r'chessdataframegamesv3.csv')
#
datacall = chessdotcom.client.get_player_games_by_month("Player1", 2022, 7)
cleandata(datacall)
This produced a csv file that had the columns url, pgn, time_control, etc. The pgn column is the one that I'm having trouble working with. It has a lot of information for each row. Just one row has Event: Live Chess, Site "Chess.com", Date "2022.07.01", Round "-", White "Player2",Black "Player1",Result "1-0",CurrentPosition "6Q1/5K1k/8/2p5/1p6/p7/1PN5/8 b - -",Timezone "UTC", ECO "B18", ECOUrl "https://www.chess.com/openings/Caro-Kann-Defense-Classical-Variation-5.Ng3-Bg6-6.Nf3-Nd7-7.Bd3", UTCDate "2022.07.01", UTCTime "13:37:42", WhiteElo "1611", BlackElo "1616", TimeControl "180 2", Termination "Player2 won on time", StartTime "13:37:42", EndDate "2022.07.01", EndTime "13:47:01", Link "https://www.chess.com/game/live/50421603881". These are each in square brackets. These are also followed by the moves played in the game
Ideally, this information should for the most part be separate columns. So there should be a column "Event" with value "Live Chess" for example. Also, it would be good to have a separate row for each move, in a "moves" column. However, having all the moves in the moves column would also be acceptable. Does anyone know how to do this? This is my first question here. So I hope I'm clear enough. Thanks.
CodePudding user response:
I suggest to use python-chess for pgn parsing:
pip install chess
Then you can use it as follows:
import json
from io import StringIO
import chess.pgn
from chessdotcom import client
import pandas as pd
def read_pgn(pgn_str):
game = chess.pgn.read_game(StringIO(pgn_str))
headers = game.headers
headers['moves'] = game.mainline_moves()
return headers
# "erik" is taken as an example from here: https://www.chess.com/news/view/published-data-api
player, year, month = "erik", 2022, 7
r = client.get_player_games_by_month(player, year, month)
games = json.loads(r.text)['games']
df1 = pd.json_normalize(games)
df2 = pd.DataFrame(list(df1.pgn.apply(read_pgn)))
pd.concat([df1, df2], axis=1).drop(columns='pgn')
This will give you the following dataframe:
url time_control end_time rated ... Variant White WhiteElo moves
0 https://www.chess.com/game/daily/410529597 1/259200 1656675306 False ... NaN TheMsquare 1592 3. Nc3 { [%clk 69:37:31] } 3... Bg7 { [%clk 71...
1 https://www.chess.com/game/daily/387673189 1/604800 1656675605 True ... Chess960 erik 1434 1. b3 { [%clk 167:59:49] } 1... Ne6 { [%clk 16...
2 https://www.chess.com/game/daily/410866161 1/259200 1656679900 True ... NaN Lee 1356 1. e4 { [%clk 62:04:50] } 1... d6 { [%clk 71:3...
3 https://www.chess.com/game/live/50449822647 60 1656711104 True ... NaN rexzs 1744 1. e3 { [%clk 0:01:00] } 1... d6 { [%clk 0:01:...
4 https://www.chess.com/game/live/50450421549 60 1656711671 True ... NaN erik 1823 1. e4 { [%clk 0:01:00] } 1... c6 { [%clk 0:01:...
.. ... ... ... ... ... ... ... ... ...
214 https://www.chess.com/game/live/52974588839 60 1659235731 True ... NaN erik 1741 1. e4 { [%clk 0:01:00] } 1... e6 { [%clk 0:01:...
215 https://www.chess.com/game/live/52975772965 60 1659236887 True ... NaN GeneralCoin 1672 1. d4 { [%clk 0:01:00] } 1... d6 { [%clk 0:01:...
216 https://www.chess.com/game/live/52976977887 60 1659238132 True ... NaN MarcosP99 1772 1. e4 { [%clk 0:01:00] } 1... d6 { [%clk 0:01:...
217 https://www.chess.com/game/live/52977007497 60 1659238338 True ... NaN erik 1730 1. e4 { [%clk 0:01:00] } 1... c5 { [%clk 0:01:...
218 https://www.chess.com/game/live/53060949585 60 1659321932 True ... NaN erik 1724 1. e4 { [%clk 0:01:00] } 1... c5 { [%clk 0:01:...
[219 rows x 50 columns]