I have a sports dataset that reads as follows:
season team tm region
2015 sharks shk north
2015 dogs dgs south
2015 bears brs south
2015 cats cts north
2015 cows cws north
2014 sharks shk north
2014 dogs dgs south
2014 bears brs south
2014 cats cts north
2014 cows cws north
I want to shuffle the region column, which I know how to do. However, for each year (2015 and 2014), there should be 3 "north" and 2 "south". In addition, I want 2015 and 2014 to have the same random region for a specific team. So, 2015 sharks and 2014 sharks should have the same region even after being randomized. This is an example of how the randomization might look:
season team tm region
2015 sharks shk south
2015 dogs dgs south
2015 bears brs north
2015 cats cts north
2015 cows cws north
2014 sharks shk south
2014 dogs dgs south
2014 bears brs north
2014 cats cts north
2014 cows cws north
Thank you for the help!
CodePudding user response:
I would do it the following way:
data = data.frame(
stringsAsFactors = FALSE,
season = c(2015L,2015L,2015L,2015L,
2015L,2014L,2014L,2014L,2014L,2014L),
team = c("sharks","dogs","bears",
"cats","cows","sharks","dogs","bears","cats","cows"),
tm = c("shk","dgs","brs","cts",
"cws","shk","dgs","brs","cts","cws"),
region = c("north","south","south",
"north","north","north","south","south","north","north")
)
head(data)
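# Permute exactly three "north" and two "south" labels, so the 3/2 split is guaranteed
# (sampling with replace = TRUE and prob only gives that split on average)
north_south = sample(rep(c("north", "south"), times = c(3, 2)))
# Repeating the vector works because both seasons list the teams in the same order
data2 = data.frame(data, Region2 = rep(north_south, 2))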
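Whatever way the shuffle is done, it may be worth checking that it meets both requirements from the question: the same number of "north" and "south" per season, and one region per team across seasons. The sketch below is only an illustration; `check_shuffle` is a hypothetical helper name, and the small data frame `d` is rebuilt here so the block runs on its own.

```r
# Hypothetical helper (not part of the answer above): TRUE when the shuffled
# column keeps the per-season region counts and is constant within each team.
check_shuffle <- function(d, original = "region", shuffled = "Region2") {
  # Per-season counts of each region must match the original counts
  same_counts <- identical(
    table(d$season, d[[original]]),
    table(d$season, d[[shuffled]])
  )
  # Each team must map to exactly one shuffled region across seasons
  regions_per_team <- tapply(d[[shuffled]], d$team, function(x) length(unique(x)))
  same_counts && all(regions_per_team == 1)
}

teams <- c("sharks", "dogs", "bears", "cats", "cows")
d <- data.frame(
  season = rep(c(2015L, 2014L), each = 5),
  team = rep(teams, times = 2),
  region = rep(c("north", "south", "south", "north", "north"), times = 2),
  stringsAsFactors = FALSE
)

# One permutation of the five original labels, repeated for both seasons
d$Region2 <- rep(sample(d$region[1:5]), times = 2)
check_shuffle(d)
```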
CodePudding user response:
Given the restriction that the region should be replicated in each year, this problem reduces to randomly assigning each team a region and then replicating this assignment throughout the rest of your data frame.
Let's start with a toy data set that has three teams (I drop sharks and bears here to make the example a bit smaller).
df1 <- data.frame(
season = c(rep(2015, 3), rep(2014, 3)),
team = c("dogs", "cats", "cows", "dogs", "cats", "cows"),
tm = c("dgs", "cts", "cws", "dgs", "cts", "cws")
)
df1
#> season team tm
#> 1 2015 dogs dgs
#> 2 2015 cats cts
#> 3 2015 cows cws
#> 4 2014 dogs dgs
#> 5 2014 cats cts
#> 6 2014 cows cws
Now we can make a data frame that contains the distinct teams:
df2 <- data.frame(
team = c("dogs", "cats", "cows")
)
df2
#> team
#> 1 dogs
#> 2 cats
#> 3 cows
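With more teams, typing the names by hand gets tedious. As a minimal sketch, the distinct teams could instead be derived from the data with `unique()`; the data frame is rebuilt here (without the tm column) only so the block runs on its own.

```r
df1 <- data.frame(
  season = c(rep(2015, 3), rep(2014, 3)),
  team = c("dogs", "cats", "cows", "dogs", "cats", "cows"),
  stringsAsFactors = FALSE
)

# unique() keeps the first occurrence of each team, in order of appearance
df2 <- data.frame(team = unique(df1$team), stringsAsFactors = FALSE)
df2
```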
We can add a variable called region to df2, with two values equal to "north" and one equal to "south", assigned in random order (your example calls for three and two, respectively). We do this using $ assignment (a tidyverse equivalent using dplyr would be mutate()):
df2$region <- sample(c("north", "north", "south"))
df2
#> team region
#> 1 dogs north
#> 2 cats south
#> 3 cows north
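For the full five-team case from the question, the label vector can be built with `rep()` rather than typed out. This is a sketch under the assumption that the split stays at three "north" and two "south"; the seed is arbitrary and only there to make the run repeatable. Called without a size argument, `sample()` returns a permutation of its input, so the counts are preserved exactly.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

teams <- data.frame(
  team = c("sharks", "dogs", "bears", "cats", "cows"),
  stringsAsFactors = FALSE
)

# Three "north" and two "south" labels, then a random permutation of them
labels <- rep(c("north", "south"), times = c(3, 2))
teams$region <- sample(labels)

# The counts are always 3 and 2; only the assignment to teams changes
table(teams$region)
```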
The last step is to link df2 back up with df1 using merge() (a tidyverse equivalent using dplyr would be left_join()):
df <- merge(df1, df2)
df
#> team season tm region
#> 1 cats 2015 cts south
#> 2 cats 2014 cts south
#> 3 cows 2015 cws north
#> 4 cows 2014 cws north
#> 5 dogs 2015 dgs north
#> 6 dogs 2014 dgs north
Note that merge() here reorders your rows and columns, whereas dplyr::left_join() preserves the row and column order of the first argument:
dplyr::left_join(df1, df2)
#> Joining, by = "team"
#> season team tm region
#> 1 2015 dogs dgs north
#> 2 2015 cats cts south
#> 3 2015 cows cws north
#> 4 2014 dogs dgs north
#> 5 2014 cats cts south
#> 6 2014 cows cws north
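If you would rather stay in base R and still keep the original row order, a lookup with `match()` is one possible alternative to a join. The sketch below hard-codes one region assignment purely for illustration; in practice it would come from the `sample()` step.

```r
df1 <- data.frame(
  season = c(rep(2015, 3), rep(2014, 3)),
  team = c("dogs", "cats", "cows", "dogs", "cats", "cows"),
  stringsAsFactors = FALSE
)
# Illustrative assignment only; normally produced by sample()
df2 <- data.frame(
  team = c("dogs", "cats", "cows"),
  region = c("north", "south", "north"),
  stringsAsFactors = FALSE
)

# match() gives, for each row of df1, the position of its team in df2,
# so indexing df2$region by it copies the region without reordering df1
df1$region <- df2$region[match(df1$team, df2$team)]
df1
```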
For more on joins with a focus on the tidyverse approach (dplyr::left_join() and others), see this link.