How to shuffle one column of a dataframe based on a row value in R?

I have a sports dataset that reads as follows:

season  team   tm  region 
2015    sharks shk  north
2015    dogs   dgs  south
2015    bears  brs  south
2015    cats   cts  north
2015    cows   cws  north
2014    sharks shk  north
2014    dogs   dgs  south
2014    bears  brs  south
2014    cats   cts  north
2014    cows   cws  north

I want to shuffle the region column, which I know how to do. However, for each year (2015 and 2014), there should be 3 "north" and 2 "south". In addition, I want 2015 and 2014 to have the same random region for a specific team. So, 2015 sharks and 2014 sharks should have the same region even after being randomized. This is an example of how the randomization might look:

season  team   tm  region 
2015    sharks shk  south
2015    dogs   dgs  south
2015    bears  brs  north
2015    cats   cts  north
2015    cows   cws  north
2014    sharks shk  south
2014    dogs   dgs  south
2014    bears  brs  north
2014    cats   cts  north
2014    cows   cws  north

Thank you for the help!

CodePudding user response：

I would do it the following way:

data = data.frame(
  stringsAsFactors = FALSE,
  season = c(2015L, 2015L, 2015L, 2015L, 2015L,
             2014L, 2014L, 2014L, 2014L, 2014L),
  team = c("sharks", "dogs", "bears", "cats", "cows",
           "sharks", "dogs", "bears", "cats", "cows"),
  tm = c("shk", "dgs", "brs", "cts", "cws",
         "shk", "dgs", "brs", "cts", "cws"),
  region = c("north", "south", "south", "north", "north",
             "north", "south", "south", "north", "north")
)
head(data)

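# Permute a fixed vector of labels: this guarantees exactly 3 "north" and 2 "south"
north_south = sample(c("north", "north", "north", "south", "south"))

# Repeat the assignment for both seasons (teams are listed in the same order in each)
data2 = data.frame(data, Region2 = rep(north_south, 2))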
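
To check that a shuffled column meets both requirements from the question (three "north" and two "south" per season, and the same region for a team in both seasons), something like the following could be used. This is only a minimal sketch: the `shuffled` column name, the `pool` vector, and the `set.seed()` call are illustrative assumptions, and it relies on the teams appearing in the same order within each season.

```r
# Rebuild the question's data (region omitted; it is reassigned below)
data <- data.frame(
  season = rep(c(2015L, 2014L), each = 5),
  team   = rep(c("sharks", "dogs", "bears", "cats", "cows"), times = 2),
  stringsAsFactors = FALSE
)

set.seed(1)  # assumption: fixed seed only so the run is reproducible

# Permute a vector with exactly 3 "north" and 2 "south", then repeat it
# for the second season (teams are in the same order in both seasons)
pool <- sample(rep(c("north", "south"), times = c(3, 2)))
data$shuffled <- rep(pool, times = 2)

# Requirement 1: each season has 3 "north" and 2 "south"
counts <- table(data$season, data$shuffled)
print(counts)

# Requirement 2: each team has a single region across seasons
regions_per_team <- tapply(data$shuffled, data$team,
                           function(x) length(unique(x)))
print(all(regions_per_team == 1))
```

Both checks should hold for any seed, since the counts are fixed by the contents of `pool` and the per-team consistency comes from repeating the same vector for each season.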

CodePudding user response：

Given the restriction that the region should be replicated in each year, this problem reduces to randomly assigning each team a region and then replicating this assignment throughout the rest of your data frame.

Let's start with a toy data set that has three teams (I drop sharks and bears here to make the example a bit smaller).

df1 <- data.frame(
  season = c(rep(2015, 3), rep(2014, 3)),
  team = c("dogs", "cats", "cows", "dogs", "cats", "cows"),
  tm = c("dgs", "cts", "cws", "dgs", "cts", "cows")
)
df1
#>   season team   tm
#> 1   2015 dogs  dgs
#> 2   2015 cats  cts
#> 3   2015 cows  cws
#> 4   2014 dogs  dgs
#> 5   2014 cats  cts
#> 6   2014 cows cows

Now we can make a data frame that contains the distinct teams:

df2 <- data.frame(
  team = c("dogs", "cats", "cows")
)
df2
#>   team
#> 1 dogs
#> 2 cats
#> 3 cows

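We can add a variable to df2 called region, which must contain two "north" values and one "south" value in random order (your full example calls for three and two, respectively). Calling sample() on a vector without a size argument returns a random permutation of that vector, so the counts stay fixed and only the order changes. We do the assignment using $ (a tidyverse equivalent using dplyr would be mutate()):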

df2$region <- sample(c("north", "north", "south"))
df2
#>   team region
#> 1 dogs  north
#> 2 cats  south
#> 3 cows  north
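
If the original data already has a region column, as in the question, the pool of labels does not need to be typed by hand. One possible sketch, assuming each team has a single region in the original data: take the distinct team/region pairs with unique() and permute that region column, which keeps the original north/south counts automatically. The `orig` and `teams` names are made up for illustration.

```r
# Assumed input: the question's layout, reduced to the relevant columns
orig <- data.frame(
  season = rep(c(2015L, 2014L), each = 5),
  team   = rep(c("sharks", "dogs", "bears", "cats", "cows"), times = 2),
  region = rep(c("north", "south", "south", "north", "north"), times = 2),
  stringsAsFactors = FALSE
)

# One row per team, carrying that team's original region
teams <- unique(orig[c("team", "region")])

# Permute the existing labels: the counts stay at 3 "north" and 2 "south"
teams$region <- sample(teams$region)
teams
```

The resulting `teams` table plays the same role as df2 above and can be joined back onto the season-level data in the same way.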

The last step is to link df2 back up with df1 using merge() (a tidyverse equivalent using dplyr would be left_join()):

df <- merge(df1, df2)
df
#>   team season   tm region
#> 1 cats   2015  cts  south
#> 2 cats   2014  cts  south
#> 3 cows   2015  cws  north
#> 4 cows   2014 cows  north
#> 5 dogs   2015  dgs  north
#> 6 dogs   2014  dgs  north

Note that merge() here reorders your rows and columns, whereas dplyr::left_join() preserves the row and column order of the first argument:

dplyr::left_join(df1, df2)
#> Joining, by = "team"
#>   season team   tm region
#> 1   2015 dogs  dgs  north
#> 2   2015 cats  cts  south
#> 3   2015 cows  cws  north
#> 4   2014 dogs  dgs  north
#> 5   2014 cats  cts  south
#> 6   2014 cows cows  north
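
If adding dplyr as a dependency is not desirable, a base R lookup with match() should also keep the original row order. Here is a small sketch using the same toy shapes as above. The objects are rebuilt, with a fixed region assignment instead of a random one, so the snippet runs on its own:

```r
# Toy data with the same shape as df1 and df2
df1 <- data.frame(
  season = c(rep(2015, 3), rep(2014, 3)),
  team = c("dogs", "cats", "cows", "dogs", "cats", "cows")
)
df2 <- data.frame(
  team = c("dogs", "cats", "cows"),
  region = c("north", "south", "north")  # fixed for illustration; use sample() in practice
)

# match() returns, for each row of df1, the position of its team in df2;
# indexing df2$region by those positions looks up the region row by row
df1$region <- df2$region[match(df1$team, df2$team)]
df1
```

Because nothing is sorted or merged, df1 keeps its rows and existing columns in place and only gains the new region column.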

For more on joins with a focus on the tidyverse approach (dplyr::left_join() and others), see this link.