first-time poster, so I hope I'm giving enough information, and I hope that this question isn't an obvious duplicate (I ran a search and couldn't find exactly what I needed). If I can clarify anything, please let me know!
I'm an R novice working on a group project using NBA play-by-play data, and my group is looking at the effect of the shot clock on certain possession outcomes. Right now, we have dummy variables for every second elapsed on the shot clock in each row. For example, a possession that lasted 10 seconds would have a value of 1 for the dummies 24, 23, 22, 21, 20, 19, 18, 17, 16, and 15 and a zero for the remaining dummies 14-1.
Here's a simplified sample of what the current dataset would look like:
PossessionID | OffensiveTeam | DefensiveTeam | Points | 24 | 23 | 22 | 21 | 20 |
---|---|---|---|---|---|---|---|---|
12345 | CHI | IND | 2 | 1 | 1 | 1 | 0 | 0 |
12346 | IND | CHI | 0 | 1 | 1 | 0 | 0 | 0 |
12347 | CHI | IND | 3 | 1 | 1 | 1 | 1 | 1 |
And here's what we need it to look like for those same 3 entries from above:
PossessionID | OffensiveTeam | DefensiveTeam | Points | 24 | 23 | 22 | 21 | 20 |
---|---|---|---|---|---|---|---|---|
12345 | CHI | IND | 2 | 1 | 0 | 0 | 0 | 0 |
12345 | CHI | IND | 2 | 0 | 1 | 0 | 0 | 0 |
12345 | CHI | IND | 2 | 0 | 0 | 1 | 0 | 0 |
12346 | IND | CHI | 0 | 1 | 0 | 0 | 0 | 0 |
12346 | IND | CHI | 0 | 0 | 1 | 0 | 0 | 0 |
12347 | CHI | IND | 3 | 1 | 0 | 0 | 0 | 0 |
12347 | CHI | IND | 3 | 0 | 1 | 0 | 0 | 0 |
12347 | CHI | IND | 3 | 0 | 0 | 1 | 0 | 0 |
12347 | CHI | IND | 3 | 0 | 0 | 0 | 1 | 0 |
12347 | CHI | IND | 3 | 0 | 0 | 0 | 0 | 1 |
In case the difference isn't clear, we need to create rows such that each row has only one of the dummies equal 1 and hold the rest of the values in the row constant. Note how the 1s "cascade" down with each new row, but the rest of the values are the same for each row from the original dataset.
Does anyone have a suggestion for how best to create these new rows? Ideally would be one that uses nothing beyond base R, tidyverse, and dplyr. Thanks so much!
CodePudding user response:
df %>%
pivot_longer(-(1:4))%>%
filter(value>0) %>%
mutate(nm = name) %>%
pivot_wider(values_fill = 0) %>%
select(-nm)
# A tibble: 10 x 9
PossessionID OffensiveTeam DefensiveTeam Points `24` `23` `22` `21` `20`
<int> <chr> <chr> <int> <int> <int> <int> <int> <int>
1 12345 CHI IND 2 1 0 0 0 0
2 12345 CHI IND 2 0 1 0 0 0
3 12345 CHI IND 2 0 0 1 0 0
4 12346 IND CHI 0 1 0 0 0 0
5 12346 IND CHI 0 0 1 0 0 0
6 12347 CHI IND 3 1 0 0 0 0
7 12347 CHI IND 3 0 1 0 0 0
8 12347 CHI IND 3 0 0 1 0 0
9 12347 CHI IND 3 0 0 0 1 0
10 12347 CHI IND 3 0 0 0 0 1
CodePudding user response:
Here is another approach:
- Small function
f(n)
return a tibble of sizen
rows by 24 columns, with 1's on the diagonal
f <- function(n) {
m=matrix(rep(0,n*24), nrow=n,ncol=24)
diag(m) <- 1
setNames(as_tibble(m), paste0("v",24:1))
}
- Apply the function to the original data and unnest
k %>%
rowwise() %>%
mutate(vars = list(f(sum(c_across(`24`:`20`))))) %>%
select(-(5:9)) %>%
unnest(vars)
Output:
PossessionID OffensiveTeam DefensiveTeam Points v24 v23 v22 v21 v20 v19 v18 v17 v16 v15 v14 v13 v12 v11 v10 v9 v8 v7 v6
<int> <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12345 CHI IND 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 12345 CHI IND 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 12345 CHI IND 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 12346 IND CHI 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 12346 IND CHI 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 12347 CHI IND 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 12347 CHI IND 3 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
8 12347 CHI IND 3 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 12347 CHI IND 3 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10 12347 CHI IND 3 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# … with 5 more variables: v5 <dbl>, v4 <dbl>, v3 <dbl>, v2 <dbl>, v1 <dbl>
Input:
k = structure(list(PossessionID = 12345:12347, OffensiveTeam = c("CHI",
"IND", "CHI"), DefensiveTeam = c("IND", "CHI", "IND"), Points = c(2L,
0L, 3L), `24` = c(1L, 1L, 1L), `23` = c(1L, 1L, 1L), `22` = c(1L,
0L, 1L), `21` = c(0L, 0L, 1L), `20` = c(0L, 0L, 1L)), row.names = c(NA,
-3L), class = "data.frame")