Home > front end >  What's the easiest way to duplicate observations in a dataset while making a small change to ea
What's the easiest way to duplicate observations in a dataset while making a small change to ea

Time:05-18

first-time poster, so I hope I'm giving enough information, and I hope that this question isn't an obvious duplicate (I ran a search and couldn't find exactly what I needed). If I can clarify anything, please let me know!

I'm an R novice working on a group project using NBA play-by-play data, and my group is looking at the effect of the shot clock on certain possession outcomes. Right now, we have dummy variables for every second elapsed on the shot clock in each row. For example, a possession that lasted 10 seconds would have a value of 1 for the dummies 24, 23, 22, 21, 20, 19, 18, 17, 16, and 15 and a zero for the remaining dummies 14-1.

Here's a simplified sample of what the current dataset would look like:

PossessionID OffensiveTeam DefensiveTeam Points 24 23 22 21 20
12345 CHI IND 2 1 1 1 0 0
12346 IND CHI 0 1 1 0 0 0
12347 CHI IND 3 1 1 1 1 1

And here's what we need it to look like for those same 3 entries from above:

PossessionID OffensiveTeam DefensiveTeam Points 24 23 22 21 20
12345 CHI IND 2 1 0 0 0 0
12345 CHI IND 2 0 1 0 0 0
12345 CHI IND 2 0 0 1 0 0
12346 IND CHI 0 1 0 0 0 0
12346 IND CHI 0 0 1 0 0 0
12347 CHI IND 3 1 0 0 0 0
12347 CHI IND 3 0 1 0 0 0
12347 CHI IND 3 0 0 1 0 0
12347 CHI IND 3 0 0 0 1 0
12347 CHI IND 3 0 0 0 0 1

In case the difference isn't clear, we need to create rows such that each row has only one of the dummies equal 1 and hold the rest of the values in the row constant. Note how the 1s "cascade" down with each new row, but the rest of the values are the same for each row from the original dataset.

Does anyone have a suggestion for how best to create these new rows? Ideally would be one that uses nothing beyond base R, tidyverse, and dplyr. Thanks so much!

CodePudding user response:

df %>%
  pivot_longer(-(1:4))%>%
  filter(value>0) %>%
  mutate(nm = name) %>%
  pivot_wider(values_fill = 0) %>%
  select(-nm)

# A tibble: 10 x 9
   PossessionID OffensiveTeam DefensiveTeam Points  `24`  `23`  `22`  `21`  `20`
          <int> <chr>         <chr>          <int> <int> <int> <int> <int> <int>
 1        12345 CHI           IND                2     1     0     0     0     0
 2        12345 CHI           IND                2     0     1     0     0     0
 3        12345 CHI           IND                2     0     0     1     0     0
 4        12346 IND           CHI                0     1     0     0     0     0
 5        12346 IND           CHI                0     0     1     0     0     0
 6        12347 CHI           IND                3     1     0     0     0     0
 7        12347 CHI           IND                3     0     1     0     0     0
 8        12347 CHI           IND                3     0     0     1     0     0
 9        12347 CHI           IND                3     0     0     0     1     0
10        12347 CHI           IND                3     0     0     0     0     1

CodePudding user response:

Here is another approach:

  1. Small function f(n) return a tibble of size n rows by 24 columns, with 1's on the diagonal
f <- function(n) {
 m=matrix(rep(0,n*24), nrow=n,ncol=24)
 diag(m) <- 1
 setNames(as_tibble(m), paste0("v",24:1))
}
  1. Apply the function to the original data and unnest
k %>% 
  rowwise() %>%
  mutate(vars = list(f(sum(c_across(`24`:`20`))))) %>% 
  select(-(5:9)) %>% 
  unnest(vars)

Output:

   PossessionID OffensiveTeam DefensiveTeam Points   v24   v23   v22   v21   v20   v19   v18   v17   v16   v15   v14   v13   v12   v11   v10    v9    v8    v7    v6
          <int> <chr>         <chr>          <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1        12345 CHI           IND                2     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
 2        12345 CHI           IND                2     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
 3        12345 CHI           IND                2     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
 4        12346 IND           CHI                0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
 5        12346 IND           CHI                0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
 6        12347 CHI           IND                3     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
 7        12347 CHI           IND                3     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
 8        12347 CHI           IND                3     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
 9        12347 CHI           IND                3     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
10        12347 CHI           IND                3     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0
# … with 5 more variables: v5 <dbl>, v4 <dbl>, v3 <dbl>, v2 <dbl>, v1 <dbl>

Input:

k = structure(list(PossessionID = 12345:12347, OffensiveTeam = c("CHI", 
"IND", "CHI"), DefensiveTeam = c("IND", "CHI", "IND"), Points = c(2L, 
0L, 3L), `24` = c(1L, 1L, 1L), `23` = c(1L, 1L, 1L), `22` = c(1L, 
0L, 1L), `21` = c(0L, 0L, 1L), `20` = c(0L, 0L, 1L)), row.names = c(NA, 
-3L), class = "data.frame")
  •  Tags:  
  • r
  • Related