Home > Enterprise >  Count consecutive event ocurrence of string row value A per row value B in R
Count consecutive event ocurrence of string row value A per row value B in R

Time:09-28

I have a big data set with multiple football matches. Right now the format is wide and I want to count winning streaks, drawing streaks and loosing streaks and teams by their match.

In this case I have the following variables:

  1. Home_team: a string with a name of a country (Example: England, Spain, etc)
  2. Away_team: a string with a name of a country (Example: France, Germany, etc)
  3. Results: a string with three categories (HW: Home win, AW: Away win, D: Draw)

To give a simple example, my data looks something like this:

Home_team <- c("Peru","France","England","Senegal", "Chile", "Colombia","France","Spain","Colombia", "Angola", "Ecuador", "France",
               "Peru")
Away_team <- c("Brasil","Germany","Togo","Egypt", "Ecuador", "Argentina","Netherlands","Burkina Faso","New Zealand", "Venezuela", "Portugal", "Canada",
           "United States")
Results <- c("HW","HW","AW","D","AW","HW","HW","AW","HW","D","D","HW","D")

df_example <- data.frame(Home_team,Away_team,Results)
df_example

So in this example the following things happened:

  • Peru (row 1) had a winning streak of 1 entering the game against the United States and drew the match.
  • France entered a winning streak of 2 matches to their game against Canada.
  • Colombia would be also entering a 2 game winning streak.
  • Ecuador lost against Chile (1 game losing streak) and then drew against Portugal in their next match.

I was thinking that an easier way to do this is put everything in long format and count "Wins", "Losses" and "Draws". And every time the streak stops the counting starts again. But I am not sure if that is the best approach.

The big picture is that I would like to know if streaks (winning, loosing or even drawing) has an effect on the result of the next match.

Any help would be highly appreciated.

CodePudding user response:

Here is one approach to try.

First, put data into long form. Using case_when you can determine the outcome for each each (e.g., the home team with "HW" game gets a "Win", while the away team gets a "Loss").

For each team, you can group the streaks using rleid from data.table. Every time the outcome changes, it moves on to a new streak.

Then, you can count up the number of games in a streak for a given team and outcome. It will be the row_number() for a given streak, team, and outcome.

Finally, you put back into wider form if desired. The new columns indicate the streak (number of games, and outcome) for the home and away teams, coming into the current game.

library(tidyverse)
library(data.table)

df_example %>%
  mutate(Game = row_number()) %>%
  pivot_longer(cols = c(Home_team, Away_team), names_to = "Location", values_to = "Team") %>%
  mutate(Outcome = case_when(
    Results == "HW" & Location == "Home_team" ~ "Win",
    Results == "HW" & Location == "Away_team" ~ "Loss",
    Results == "AW" & Location == "Home_team" ~ "Loss",
    Results == "AW" & Location == "Away_team" ~ "Win",
    Results == "D" ~ "Draw",
    TRUE ~ NA_character_
  )) %>%
  group_by(Team) %>%
  mutate(Change = rleid(Outcome)) %>%
  group_by(Change, .add = T) %>%
  mutate(Streak = row_number()) %>%
  group_by(Team) %>%
  mutate(Last = paste(lag(Streak, default = 0), lag(Outcome, default = "-"))) %>%
  pivot_wider(id_cols = c(Game, Results), names_from = Location, values_from = c(Team, Last)) 

Outcome

    Game Results Team_Home_team Team_Away_team Last_Home_team Last_Away_team
   <int> <chr>   <chr>          <chr>          <chr>          <chr>         
 1     1 HW      Peru           Brasil         0 -            0 -           
 2     2 HW      France         Germany        0 -            0 -           
 3     3 AW      England        Togo           0 -            0 -           
 4     4 D       Senegal        Egypt          0 -            0 -           
 5     5 AW      Chile          Ecuador        0 -            0 -           
 6     6 HW      Colombia       Argentina      0 -            0 -           
 7     7 HW      France         Netherlands    1 Win          0 -           
 8     8 AW      Spain          Burkina Faso   0 -            0 -           
 9     9 HW      Colombia       New Zealand    1 Win          0 -           
10    10 D       Angola         Venezuela      0 -            0 -           
11    11 D       Ecuador        Portugal       1 Win          0 -           
12    12 HW      France         Canada         2 Win          0 -           
13    13 D       Peru           United States  1 Win          0 - 
  • Related