Home > front end >  Parsing unusual data (chess data) in a dataframe column
Parsing unusual data (chess data) in a dataframe column

Time:08-13

I'm writing a function to put data from a chess website into a dataframe. The code I was able to write almost got the job done but not quite.

import chessdotcom
import pandas as pd
import regex as re
import json
from io import StringIO
#
def cleandata(datacall):
  data = datacall.text
  x = json.loads(data)
  df = pd.read_json(StringIO(json.dumps(x)))
  df2 = pd.json_normalize(df['games'])
  df2.to_csv(r'chessdataframegamesv3.csv')
#
datacall = chessdotcom.client.get_player_games_by_month("Player1", 2022, 7)
cleandata(datacall)

This produced a csv file that had the columns url, pgn, time_control, etc. The pgn column is the one that I'm having trouble working with. It has a lot of information for each row. Just one row has Event: Live Chess, Site "Chess.com", Date "2022.07.01", Round "-", White "Player2",Black "Player1",Result "1-0",CurrentPosition "6Q1/5K1k/8/2p5/1p6/p7/1PN5/8 b - -",Timezone "UTC", ECO "B18", ECOUrl "https://www.chess.com/openings/Caro-Kann-Defense-Classical-Variation-5.Ng3-Bg6-6.Nf3-Nd7-7.Bd3", UTCDate "2022.07.01", UTCTime "13:37:42", WhiteElo "1611", BlackElo "1616", TimeControl "180 2", Termination "Player2 won on time", StartTime "13:37:42", EndDate "2022.07.01", EndTime "13:47:01", Link "https://www.chess.com/game/live/50421603881". These are each in square brackets. These are also followed by the moves played in the game

Ideally, this information should for the most part be separate columns. So there should be a column "Event" with value "Live Chess" for example. Also, it would be good to have a separate row for each move, in a "moves" column. However, having all the moves in the moves column would also be acceptable. Does anyone know how to do this? This is my first question here. So I hope I'm clear enough. Thanks.

CodePudding user response:

I suggest to use python-chess for pgn parsing:

pip install chess

Then you can use it as follows:

import json
from io import StringIO

import chess.pgn
from chessdotcom import client
import pandas as pd

def read_pgn(pgn_str):
    game = chess.pgn.read_game(StringIO(pgn_str))
    headers = game.headers
    headers['moves'] = game.mainline_moves()
    return headers

# "erik" is taken as an example from here: https://www.chess.com/news/view/published-data-api
player, year, month = "erik", 2022, 7
r = client.get_player_games_by_month(player, year, month)

games = json.loads(r.text)['games']
df1 = pd.json_normalize(games)
df2 = pd.DataFrame(list(df1.pgn.apply(read_pgn)))
pd.concat([df1, df2], axis=1).drop(columns='pgn')

This will give you the following dataframe:

                                             url time_control    end_time  rated  ...   Variant        White WhiteElo                                              moves
0     https://www.chess.com/game/daily/410529597     1/259200  1656675306  False  ...       NaN   TheMsquare     1592  3. Nc3 { [%clk 69:37:31] } 3... Bg7 { [%clk 71...
1     https://www.chess.com/game/daily/387673189     1/604800  1656675605   True  ...  Chess960         erik     1434  1. b3 { [%clk 167:59:49] } 1... Ne6 { [%clk 16...
2     https://www.chess.com/game/daily/410866161     1/259200  1656679900   True  ...       NaN          Lee     1356  1. e4 { [%clk 62:04:50] } 1... d6 { [%clk 71:3...
3    https://www.chess.com/game/live/50449822647           60  1656711104   True  ...       NaN        rexzs     1744  1. e3 { [%clk 0:01:00] } 1... d6 { [%clk 0:01:...
4    https://www.chess.com/game/live/50450421549           60  1656711671   True  ...       NaN         erik     1823  1. e4 { [%clk 0:01:00] } 1... c6 { [%clk 0:01:...
..                                           ...          ...         ...    ...  ...       ...          ...      ...                                                ...
214  https://www.chess.com/game/live/52974588839           60  1659235731   True  ...       NaN         erik     1741  1. e4 { [%clk 0:01:00] } 1... e6 { [%clk 0:01:...
215  https://www.chess.com/game/live/52975772965           60  1659236887   True  ...       NaN  GeneralCoin     1672  1. d4 { [%clk 0:01:00] } 1... d6 { [%clk 0:01:...
216  https://www.chess.com/game/live/52976977887           60  1659238132   True  ...       NaN    MarcosP99     1772  1. e4 { [%clk 0:01:00] } 1... d6 { [%clk 0:01:...
217  https://www.chess.com/game/live/52977007497           60  1659238338   True  ...       NaN         erik     1730  1. e4 { [%clk 0:01:00] } 1... c5 { [%clk 0:01:...
218  https://www.chess.com/game/live/53060949585           60  1659321932   True  ...       NaN         erik     1724  1. e4 { [%clk 0:01:00] } 1... c5 { [%clk 0:01:...

[219 rows x 50 columns]
  • Related