How should I count recurrent string elements in a column? In cells that describes scorers in a match-CodePudding

I have this data that have these columns: goal for, goal against, and scorers. In one observation, cell in scorers column might have multiple scorers such as: "Tammy Abraham (34), Zaniolo (90)". How should I count each scorers goal at every match for the whole dataframe? To illustrate:

#create reprex to ask about goalscorer tally


gf <- c(5,4,0,0,1)
ga <- c(3,3,0,2,0)
scorer <- c("a,b,c,d,e","a,b,c,d", NA, NA,"e")
dat <- data.frame(gf,ga,scorer)

This is what the sample looks like:

  gf ga    scorer
1  5  3 a,b,c,d,e
2  4  3   a,b,c,d
3  0  0      <NA>
4  0  2      <NA>
5  1  0         e

Would anyone help? Also, is this reprex, enough?

I only put the goal for and goal against here in the example, but in the actual data, there are many more observations like formation, weather, competition etc. Is it possible to say that scorer a tends to score in a low scoring match in wet condition on 4-4-2 formation? Something like that. But the problem is, the scorer a is found together with other scorers in the scorer column.

CodePudding user response：

A tidyverse approach. You may want to strsplit the scorer string into a list and then unnest the list:

library(dplyr)
library(tidyr)

dat |>
    mutate(scorer = strsplit(scorer, ",")) |>
    unnest(cols = c(scorer))

# A tibble: 12 × 3
      gf    ga scorer
   <dbl> <dbl> <chr> 
 1     5     3 a     
 2     5     3 b     
 3     5     3 c     
 4     5     3 d     
 5     5     3 e     
 6     4     3 a     
 7     4     3 b     
 8     4     3 c     
 9     4     3 d     
10     0     0 NA    
11     0     2 NA    
12     1     0 e

CodePudding user response：

Counting the number of goals per scorer and game:

library(dplyr)
library(tidyr)
dat %>%
  # make match ID:
  mutate(match_ID = row_number()) %>%
  # split `scorer`:
  separate_rows(scorer) %>%
  # remove rows where `scorer` is `NA`:
  filter(!is.na(scorer)) %>%
  # group by:
  group_by(match_ID,scorer) %>%
  # count number of goals per match:
  summarise(N_goals = n())
# A tibble: 10 × 3
# Groups:   match_ID [3]
   match_ID scorer N_goals
      <int> <chr>    <int>
 1        1 a            1
 2        1 b            1
 3        1 c            1
 4        1 d            1
 5        1 e            1
 6        2 a            1
 7        2 b            1
 8        2 c            1
 9        2 d            1
10        5 e            1