How to transform a sequential string of chess moves into a vertical dataframe?

Time:06-16

I have a dataframe of chess games with two columns as shown below

dd <- data.frame(
  game_id = c(101,102),
  moves = c("1.e4 c5 2.Nf3 d6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 Nc6 6.Bc4 e6 7.Be3 Be7","1.e3 c5 2.Nf3 Nc6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 e5 6.Ndb5 d6")  
)

Here each row is a separate game, uniquely identified by the game id. The moves column contains all the moves of a game in sequential order from left to right. The serial number of a move is the number just before each dot ".". Each move has two parts: the first part is always the move by the white player, followed by the second part, which is the move by the black player. The two parts are separated by a single space. As shown in the above data, two consecutive moves are also separated by a single space; however, there is no gap between the dot of the serial number and the first character of the white player's move. The total number of moves in a game is arbitrary, as some games end in a few moves while others may have many moves.

Question: As we can see all the moves of a game are present in one single cell of the dataframe which is not very easy for analysis. I want to convert this to a dataframe with a better structure as shown below:

game_id  | move_no | white | black
----------------------------------
    101  | 1       | e4    | c5
    101  | 2       | Nf3   | d6
    101  | 3       | d4    | cxd4
    101  | 4       | Nxd4  | Nf6 

How can this be done in R?

CodePudding user response:

We can split the move string with a regular expression. Here I've used stringr::str_match_all to capture each part of the moves.

dd$moves |>
  stringr::str_match_all(r"{(\d+)\.(\S+) (\S+)}") |>
  lapply(function(x) data.frame(move_id=as.numeric(x[,2]), white=x[,3], black=x[,4])) |> 
  Map(cbind.data.frame, game_id=dd$game_id, m=_) |>
  do.call("rbind", args=_)

which will return

   game_id m.move_id m.white m.black
1      101         1      e4      c5
2      101         2     Nf3      d6
3      101         3      d4    cxd4
4      101         4    Nxd4     Nf6
5      101         5     Nc3     Nc6
6      101         6     Bc4      e6
7      101         7     Be3     Be7
8      102         1      e3      c5
9      102         2     Nf3     Nc6

The main part is the regular expression r"{(\d+)\.(\S+) (\S+)}", which finds a number followed by a period, then captures two runs of non-space characters (the white move and the black move) separated by a single space.
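
To see what the three capture groups pick up, here is a minimal base R sketch that applies the same pattern to a shortened move string from the question. It assumes R 4.1 or later (for gregexec()); the names moves, m, and parsed are only illustrative and are not part of the answer above.

```r
# One game's moves, shortened from the question's sample data
moves <- "1.e4 c5 2.Nf3 d6 3.d4 cxd4"

# (\d+) = move number, first (\S+) = white's move, second (\S+) = black's move
pattern <- "(\\d+)\\.(\\S+) (\\S+)"

# gregexec() finds every match together with its capture groups;
# regmatches() then returns one matrix per input string, one column per match
# (row 1 is the full match, rows 2 to 4 are the capture groups)
m <- regmatches(moves, gregexec(pattern, moves))[[1]]

parsed <- data.frame(
  move_no = as.integer(m[2, ]),
  white   = m[3, ],
  black   = m[4, ]
)
print(parsed)
```

Each column of m corresponds to one numbered move, which is why the rows of the matrix map directly onto the move_no, white, and black columns.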

CodePudding user response:

Here's a very inelegant answer using the stringr, dplyr, and tidyr libraries.

First we split each string of moves on spaces using str_split_fixed from stringr, which gives one column per token. Then we append these columns to the existing data.frame.

Next we reshape the data using dplyr and tidyr. First, we pivot the "move" columns from wide to long form. Then we infer the color of each "move" from its position (tokens alternate white, black), and the number of the move from the number of movement columns.

Then we pivot the newly created movement data wider by color. From there, we modify how the moves are recorded so that one "move" corresponds to one "turn" (that is, one move of each color). Finally we group by game id and move number, and clean up the leftover NA values from our pivot.

Not the most elegant answer in the world but it works?

#load libs
library(stringr)
library(dplyr)
library(tidyr)

#create sample data
chess = data.frame(
  game_id = c(101,102),
  moves = c("1.e4 2Nf3 d6 Nf6","1.e3 2Nf4 nc6 cxd4")
)

#split characters vectors to columns
moves = data.frame(str_split_fixed(chess$moves," ",Inf)) #returns a matrix by default so coerce to data.frame to make binding easier
chess = bind_cols(chess,moves) #bind the new data

#melt the data into long form
chess %>%
  select(game_id,X1:X4) %>% #here we avoid selecting the default col we don't need
  pivot_longer(cols = X1:X4,names_to="move_no") %>% #melt/pivot split columns to one column
  mutate(color = rep(c("white","black"),ncol(chess)-2)) %>% #infer color by every other move, infer length by n of movement columns
  pivot_wider(names_from=color,values_from=value) %>% #pivot moves by color wider
  group_by(game_id) %>% #number the moves within each game, not across the whole table
  mutate(move_no = ceiling(row_number()/2)) %>% #every two consecutive rows (white, black) form one move
  group_by(game_id,move_no) %>% 
  summarise(
    white = last(na.omit(white)), #last will return the last non NA value, getting us to a valid data.frame
    black = last(na.omit(black))
  )
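
The positional idea described above (odd tokens are white's moves, even tokens are black's) can also be sketched in base R without any reshaping. This is only an illustration under the assumption that tokens are separated by spaces and that only white's tokens carry the "N." prefix, as in the question's data; split_game is a hypothetical helper name, not something from the answer.

```r
# Sample data copied from the question
dd <- data.frame(
  game_id = c(101, 102),
  moves = c(
    "1.e4 c5 2.Nf3 d6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 Nc6 6.Bc4 e6 7.Be3 Be7",
    "1.e3 c5 2.Nf3 Nc6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 e5 6.Ndb5 d6"
  )
)

# Hypothetical helper: turn one game's move string into a long data.frame
split_game <- function(game_id, moves) {
  tokens <- strsplit(trimws(moves), " +")[[1]]
  tokens <- sub("^\\d+\\.", "", tokens)        # drop the "1." style prefix
  is_white <- seq_along(tokens) %% 2 == 1      # tokens alternate white, black
  white <- tokens[is_white]
  black <- tokens[!is_white]
  length(black) <- length(white)               # pad with NA if white moved last
  data.frame(game_id = game_id, move_no = seq_along(white),
             white = white, black = black)
}

long <- do.call(rbind, Map(split_game, dd$game_id, dd$moves))
print(long)
```

Because the move number is derived from the token position, this sketch does not depend on the numbers in the string being correct; whether that is desirable depends on how clean the input is.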

CodePudding user response:

with base R

Data

x <- "
game_id  | moves
101      | 1.e4 c5 2.Nf3 d6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 Nc6 6.Bc4 e6 7.Be3 Be7 
102      | 1.e3 c5 2.Nf3 Nc6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 e5 6.Ndb5 d6 
"
df <- read.table(textConnection(x) , header = T , sep = "|")

using fn function


fn <- function(df) {
  lst <- list()
  id <- 1 ; L <- 1
  clmn <- strsplit(trimws(df$moves) , "[. ]")
  for (i in clmn) {
    for (j in 1:(length(i) / 3)) {
      j <- 3*j - 2
      lst[[id]] <- c(df$game_id[[L]] , clmn[[L]][j:(j + 2)])
      id <- id + 1
    }
    L <- L + 1
  }
  lst
}
#===================================

df <- data.frame(do.call(rbind , fn(df)))
colnames(df) <- c("game_id" , "move_no" , "white" , "black")

Output

df
#>    game_id move_no white black
#> 1      101       1    e4    c5
#> 2      101       2   Nf3    d6
#> 3      101       3    d4  cxd4
#> 4      101       4  Nxd4   Nf6
#> 5      101       5   Nc3   Nc6
#> 6      101       6   Bc4    e6
#> 7      101       7   Be3   Be7
#> 8      102       1    e3    c5
#> 9      102       2   Nf3   Nc6
#> 10     102       3    d4  cxd4
#> 11     102       4  Nxd4   Nf6
#> 12     102       5   Nc3    e5
#> 13     102       6  Ndb5    d6

Created on 2022-06-16 by the reprex package (v2.0.1)
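
One thing to keep in mind with this approach: since each row is built with c() from a number and several strings, every column of the final data.frame ends up as character. A small sketch of converting the numeric columns back, using base R's type.convert(); the rows and out objects here are made-up stand-ins for the output of fn, not part of the answer.

```r
# Stand-in for the list returned by fn(): c() coerces everything to character
rows <- list(
  c(101, "1", "e4", "c5"),
  c(101, "2", "Nf3", "d6")
)

out <- data.frame(do.call(rbind, rows))
colnames(out) <- c("game_id", "move_no", "white", "black")

# as.is = TRUE keeps non-numeric columns as character instead of factor
out <- type.convert(out, as.is = TRUE)
str(out)
```

After the conversion, game_id and move_no can be sorted and compared as numbers, while white and black stay as text.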

CodePudding user response:

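library(dplyr)   # tibble(), mutate(), select()
library(stringr) # str_extract_all(), str_replace()
library(tidyr)   # unnest()

dd <- tibble(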
  game_id = c(101, 102),
  moves = c(
    "1.e4 c5 2.Nf3 d6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 Nc6 6.Bc4 e6 7.Be3 Be7",
    "1.e3 c5 2.Nf3 Nc6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 e5 6.Ndb5 d6"
  )
)

pattern <- r"{(\d+)\.(\S+) (\S+)}"
dd |>
  mutate(moves = str_extract_all(moves, pattern)) |>
  unnest(moves) |>
  mutate(
    move_no = str_replace(moves, pattern, "\\1"),
    white = str_replace(moves, pattern, "\\2"),
    black = str_replace(moves, pattern, "\\3")
  ) |>
  select(-moves)