Permute labels in a dataframe but for pairs of observations

Time:02-23

I'm not sure the title is clear. I want to shuffle a column in a dataframe, not row by row (which is simple to do with sample()), but for pairs of observations that belong to the same sample.

For instance, I have the following dataframe df1:

> df1
sampleID groupID A B C D E F
     438       1 1 0 0 0 0 0
     438       1 0 0 0 0 1 1
     386       1 1 1 1 0 0 0
     386       1 0 0 0 1 0 0
     438       2 1 0 0 0 1 1
     438       2 0 1 1 0 0 0
     582       2 0 0 0 0 0 0
     582       2 1 0 0 0 1 0
     597       1 0 1 0 0 0 1
     597       1 0 0 0 0 0 0

I want to randomly shuffle the groupID labels per sample rather than per observation, so that both observations of a sample always receive the same label. One possible result would look like:

> df2
sampleID groupID A B C D E F
     438       1 1 0 0 0 0 0
     438       1 0 0 0 0 1 1
     386       2 1 1 1 0 0 0
     386       2 0 0 0 1 0 0
     438       1 1 0 0 0 1 1
     438       1 0 1 1 0 0 0
     582       1 0 0 0 0 0 0
     582       1 1 0 0 0 1 0
     597       2 0 1 0 0 0 1
     597       2 0 0 0 0 0 0

Notice that in column 2 (groupID), sample 386 is now 2 (for both observations).

I have searched around but haven't found anything that works the way I want. What I have now is just shuffling the second column. I tried to use dplyr as follows:

df2 <- df1 %>%
  group_by(sampleID) %>%
  mutate(groupID = sample(df1$groupID, size=2))

But of course that just draws 2 values at random from the whole groupID column for each sample, so the two observations of a sample can end up with different labels.

Any tips or suggestions would be appreciated!

CodePudding user response:

One technique is to extract the unique sampleID/groupID combinations so that each block of observations becomes a single row, shuffle the labels there, and then join the shuffled labels back to the main table. Here's what that would look like:

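library(dplyr)

df1 %>%
  distinct(sampleID, groupID) %>%                  # one row per sampleID/groupID combination
  mutate(shuffle_groupID = sample(groupID)) %>%    # permute the labels across those rows
  inner_join(df1, by = c("sampleID", "groupID"))   # attach the shuffled label to every observation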
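
For readers without dplyr, the same shuffle-then-map-back idea can be sketched in base R. This is only an illustration under a few assumptions: the data frame is rebuilt from the question with columns A and B only (for brevity), the seed is fixed purely for reproducibility, and the helper names blocks, key_rows and key_blocks are hypothetical.

```r
set.seed(1)  # assumption: fixed seed only so the sketch is reproducible

# Data from the question, reduced to columns A and B for brevity
df1 <- data.frame(
  sampleID = c(438, 438, 386, 386, 438, 438, 582, 582, 597, 597),
  groupID  = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1),
  A        = c(1, 0, 1, 0, 1, 0, 0, 1, 0, 0),
  B        = c(0, 0, 1, 0, 0, 1, 0, 0, 1, 0)
)

# One row per sampleID/groupID block (hypothetical helper table)
blocks <- unique(df1[, c("sampleID", "groupID")])

# Permute the labels across blocks; indexing with sample.int() stays
# correct even if there is only a single block
blocks$shuffle_groupID <- blocks$groupID[sample.int(nrow(blocks))]

# Map every observation to its block through a composite key
key_rows   <- paste(df1$sampleID, df1$groupID)
key_blocks <- paste(blocks$sampleID, blocks$groupID)

df2 <- df1
df2$shuffle_groupID <- blocks$shuffle_groupID[match(key_rows, key_blocks)]
df2
```

Because the labels are permuted rather than redrawn, the number of blocks carrying each label stays the same; only which block carries which label changes.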

CodePudding user response:

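Using dplyr's nest_by() and tidyr's unnest():

library(dplyr)
library(tidyr)   # unnest() comes from tidyr, not dplyr

df1 |>
    nest_by(sampleID, groupID) |>            # one row per sampleID/groupID, observations in a list-column
    ungroup() |>                             # nest_by() returns a rowwise data frame; drop that first
    mutate(groupID = sample(groupID)) |>     # permute the labels across the nested rows
    unnest(cols = c(data))                   # expand back to one row per observation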


# A tibble: 10 x 3
# Groups:   sampleID, groupID [4]
   sampleID groupID     A
      <dbl>   <int> <dbl>
 1      386       1     1
 2      386       1     0
 3      438       1     0
 4      438       1     0
 5      438       1     0
 6      438       1     1
 7      582       2     0
 8      582       2     0
 9      597       1     1
10      597       1     0
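
One pitfall worth keeping in mind with any of these approaches: when sample() is called on a single numeric label (as happens when it runs one row at a time, for example on the rowwise data frame that nest_by() returns), R treats a length-one number x >= 1 as the range 1:x rather than as a single label. The short base R sketch below illustrates this; the seed and the variable names draws, labels and safe are assumptions made only for the illustration.

```r
set.seed(42)  # assumption: fixed seed only so the sketch is reproducible

# A length-one numeric x >= 1 is interpreted as 1:x,
# so these draws come from 1:2 rather than always being 2
draws <- replicate(200, sample(2, 1))
table(draws)

# Indexing with sample.int() always returns one of the actual labels
labels <- 2
safe <- replicate(200, labels[sample.int(length(labels), 1)])
table(safe)
```

Permuting the labels across all sampleID/groupID rows at once, rather than one row at a time, avoids the issue as long as there is more than one such row.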