I have a dataframe of chess games with two columns, as shown below:
dd <- data.frame(
game_id = c(101,102),
moves = c("1.e4 c5 2.Nf3 d6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 Nc6 6.Bc4 e6 7.Be3 Be7","1.e3 c5 2.Nf3 Nc6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 e5 6.Ndb5 d6")
)
Each row is a separate game, uniquely identified by its game id. The moves column contains all the moves of a game in sequential order from left to right. The serial number of a move is the number just before each dot ".". Each move has two parts: the white player's move, followed by the black player's move. The two parts are separated by a single space. As the data above shows, consecutive moves are also separated by a single space, but there is no gap between the dot of the serial number and the first character of the white player's move. The total number of moves in a game is arbitrary: some games end in a few moves while others have many.
Question: As we can see all the moves of a game are present in one single cell of the dataframe which is not very easy for analysis. I want to convert this to a dataframe with a better structure as shown below:
game_id | move_no | white | black
----------------------------------
101 | 1 | e4 | c5
101 | 2 | Nf3 | d6
101 | 3 | d4 | cxd4
101 | 4 | Nxd4 | Nf6
How can this be done in R?
CodePudding user response:
We can split the move string with a regular expression. Here I've used stringr::str_match_all
to capture each part of the moves.
dd$moves |>
stringr::str_match_all(r"{(\d+)\.(\S+) (\S+)}") |>
lapply(function(x) data.frame(move_id=as.numeric(x[,2]), white=x[,3], black=x[,4])) |>
Map(cbind.data.frame, game_id=dd$game_id, m=_) |>
do.call("rbind", args=_)
which will return
game_id m.move_id m.white m.black
1 101 1 e4 c5
2 101 2 Nf3 d6
3 101 3 d4 cxd4
4 101 4 Nxd4 Nf6
5 101 5 Nc3 Nc6
6 101 6 Bc4 e6
7 101 7 Be3 Be7
8 102 1 e3 c5
9 102 2 Nf3 Nc6
The main part is the regular expression r"{(\d+)\.(\S+) (\S+)}"
which finds a number followed by a period, then captures two runs of non-space characters (the white move and the black move) with a single space between them.
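To make the capture groups concrete, here is a minimal base R sketch on a single game string. It is not the code from this answer: it assumes R >= 4.1 for gregexec(), and it writes the pattern with bracket classes ([0-9], [^ ]) as an equivalent of \d and \S for this data. The names moves, m and parsed are only illustrative.

```r
# One game's move string, shortened from the question's sample data.
moves <- "1.e4 c5 2.Nf3 d6 3.d4 cxd4"

# (move number) . (white move) space (black move)
pattern <- "([0-9]+)\\.([^ ]+) ([^ ]+)"

# gregexec() locates every match together with its capture groups (R >= 4.1).
# regmatches() returns a list with one character matrix per input string.
m <- regmatches(moves, gregexec(pattern, moves))[[1]]

# Each column of m is one match: row 1 is the full match,
# rows 2 to 4 are the three capture groups.
parsed <- data.frame(
  move_no = as.integer(m[2, ]),
  white   = m[3, ],
  black   = m[4, ]
)
print(parsed)
```

stringr::str_match_all returns the same information transposed (one row per match, with the full match in column 1), which is why the answer's code reads columns 2 to 4.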
CodePudding user response:
Here's a very inelegant answer using the stringr, dplyr, and tidyr libraries.
First we split each moves string into one column per space-separated token using str_split_fixed from stringr, then append these columns to the existing data.frame.
Next we reshape the data using dplyr and tidyr. First, we pivot the "move" columns from wide to long form. Then we infer the color of each move from its position (white, black, white, ...), and the move number from the number of move columns.
Then we pivot the newly created move data wider by color. From there, we change how the moves are recorded so that one "move" corresponds to one "turn" (one move by each color). Finally we group by game id and move number, and collapse the leftover NA entries from the pivot.
Not the most elegant answer in the world but it works?
#load libs
library(stringr)
library(dplyr)
library(tidyr)
#create sample data
chess = data.frame(
game_id = c(101,102),
moves = c("1.e4 2Nf3 d6 Nf6","1.e3 2Nf4 nc6 cxd4")
)
#split characters vectors to columns
moves = data.frame(str_split_fixed(chess$moves," ",Inf)) #returns a matrix by default so coerce to data.frame to make binding easier
chess = bind_cols(chess,moves) #bind the new data
#melt the data into long form
chess %>%
select(game_id,X1:X4) %>% #here we avoid selecting the default col we don't need
pivot_longer(cols = X1:X4,names_to="move_no") %>% #melt/pivot split columns to one column
mutate(color = rep(c("white","black"),ncol(chess)-2)) %>% #infer color by every other move, infer length by n of movement columns
pivot_wider(names_from=color,values_from=value) %>% #pivot moves by color wider
mutate(move_no = sort(rep(c(1:(ncol(chess)-2)),2))) %>% #update the move_no column to be accurate; note we use sort to get the needed order of moves
group_by(game_id,move_no) %>%
summarise(
white = last(na.omit(white)), #last will return the last non NA value, getting us to a valid data.frame
black = last(na.omit(black))
)
CodePudding user response:
with base R
Data
x <- "
game_id | moves
101 | 1.e4 c5 2.Nf3 d6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 Nc6 6.Bc4 e6 7.Be3 Be7
102 | 1.e3 c5 2.Nf3 Nc6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 e5 6.Ndb5 d6
"
df <- read.table(textConnection(x) , header = T , sep = "|")
using fn
function
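fn <- function(df) {
  lst <- list()
  id <- 1 ; L <- 1
  # split each game on dots and spaces: number, white, black, number, ...
  clmn <- strsplit(trimws(df$moves) , "[. ]")
  for (i in clmn) {
    for (j in 1:(length(i) / 3)) {
      j <- 3*j - 2   # index of the move number within this game's tokens
      lst[[id]] <- c(df$game_id[[L]] , clmn[[L]][j:(j + 2)])
      id <- id + 1
    }
    L <- L + 1
  }
  lst
}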
#===================================
df <- data.frame(do.call(rbind , fn(df)))
colnames(df) <- c("game_id" , "move_no" , "white" , "black")
Output
df
#> game_id move_no white black
#> 1 101 1 e4 c5
#> 2 101 2 Nf3 d6
#> 3 101 3 d4 cxd4
#> 4 101 4 Nxd4 Nf6
#> 5 101 5 Nc3 Nc6
#> 6 101 6 Bc4 e6
#> 7 101 7 Be3 Be7
#> 8 102 1 e3 c5
#> 9 102 2 Nf3 Nc6
#> 10 102 3 d4 cxd4
#> 11 102 4 Nxd4 Nf6
#> 12 102 5 Nc3 e5
#> 13 102 6 Ndb5 d6
Created on 2022-06-16 by the reprex package (v2.0.1)
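One thing to keep in mind with this approach: because each row is built with c() from a game id and character tokens, every column of the result is character, including game_id and move_no. A minimal sketch of converting the two numeric columns, using a small stand-in data frame (res is a hypothetical name, not part of the answer above) and assuming R >= 4.0 so that strings are not turned into factors:

```r
# Stand-in for the result above: rbind() on character vectors
# yields a character matrix, hence all-character columns.
res <- data.frame(rbind(
  c("101", "1", "e4",  "c5"),
  c("101", "2", "Nf3", "d6")
))
colnames(res) <- c("game_id", "move_no", "white", "black")

# Convert the id and move number columns to integer; leave the moves as text.
num_cols <- c("game_id", "move_no")
res[num_cols] <- lapply(res[num_cols], as.integer)

str(res)
```

After the conversion, sorting or filtering on move_no behaves numerically rather than alphabetically.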
CodePudding user response:
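library(dplyr)   # mutate(), select(); also re-exports tibble()
library(tidyr)   # unnest()
library(stringr) # str_extract_all(), str_replace()
dd <- tibble(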
game_id = c(101, 102),
moves = c(
"1.e4 c5 2.Nf3 d6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 Nc6 6.Bc4 e6 7.Be3 Be7",
"1.e3 c5 2.Nf3 Nc6 3.d4 cxd4 4.Nxd4 Nf6 5.Nc3 e5 6.Ndb5 d6"
)
)
pattern <- r"{(\d+)\.(\S+) (\S+)}"
dd |>
mutate(moves = str_extract_all(moves, pattern)) |>
unnest(moves) |>
mutate(
move_no = str_replace(moves, pattern, "\\1"),
white = str_replace(moves, pattern, "\\2"),
black = str_replace(moves, pattern, "\\3")
) |>
select(-moves)