How to shuffle data frame entries in R

Time:11-28

I have a data frame with dimensions 24,523 x 3,468, and I want to shuffle its entries. For example, take this simple data frame:

df <- data.frame(c1=c(1, 1.5, 2, 4), c2=c(1.1, 1.6, 3, 3.2), c3=c(2.1, 2.4, 1.4, 1.7)) 
df_shuffled = transform(df, c2 = sample(c2))

This works for one column, but I want to shuffle all columns, or all rows. I tried:

col = colnames(df)
for (i in 1:ncol(df)){
  df2 = transform(df, col[i] = sample(col[i]))
}
df2

This produces an error:

[image: error message]

I also tried the following, but it only reorders whole rows and whole columns, not the individual entries:

df_shuf = df[sample(rownames(df), nrow(df)), sample(colnames(df), ncol(df))]
df_shuf

[image: output of df_shuf]

How can I shuffle the entries of the data frame df using a for loop over i, by rows and columns?

CodePudding user response:

One way to solve your problem:

df[] = lapply(df, sample)
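As a minimal sketch of what that one-liner does, using the example frame from the question: lapply() permutes each column independently, and assigning into df_shuffled[] keeps the data frame class and the column names. The seed value and the names df_shuffled and same_values are only illustrative choices, not part of the original answer.

```r
# Example frame from the question.
df <- data.frame(c1 = c(1, 1.5, 2, 4),
                 c2 = c(1.1, 1.6, 3, 3.2),
                 c3 = c(2.1, 2.4, 1.4, 1.7))

set.seed(42)  # arbitrary seed, only so the run is repeatable

df_shuffled <- df
# lapply() returns a list with one permuted vector per column;
# assigning into df_shuffled[] replaces the columns but keeps the data.frame.
df_shuffled[] <- lapply(df_shuffled, sample)

# Each column still holds the same values, possibly in a different order.
same_values <- mapply(function(a, b) identical(sort(a), sort(b)),
                      df, df_shuffled)
print(df_shuffled)
print(same_values)
```

Because every column is sampled separately, values that shared a row in df will generally not share a row in df_shuffled.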

CodePudding user response:

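While I find the lapply(df, sample) method the most straightforward (and canonical), a literal fix for your for loop starts from recognizing that transform cannot take col[i] on the left-hand side of an assignment. The loop also needs to update a copy in place, because df2 = transform(df, ...) starts over from the unshuffled df on every iteration. You can use df2[[ col[i] ]] instead: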

df2 <- df
col <- colnames(df)
for (i in seq_along(col)) {
  # [[ ]] accepts a column name held in a variable, unlike transform()
  df2[[ col[i] ]] <- sample(df2[[ col[i] ]])
}

We don't really need the names, though; you can use indices instead:

df2 <- df
for (i in seq_along(df)) {
  df2[[ i ]] <- sample(df2[[ i ]])
}

This assumes, of course, that you intend to discard the correlation between values in a row. For instance, the minimum values of columns c1 and c2 occur together in row 1; after sampling, however, they may end up in different rows.
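The loop above permutes values within each column. Since the question also mentions shuffling "all rows" and the entries as a whole, here is a hedged sketch of two further variants: permuting within each row, and permuting every cell across the whole frame. Both go through as.matrix(), so they assume all columns share one type (numeric here). The names df_rows, df_all and m are illustrative only.

```r
# Example frame from the question.
df <- data.frame(c1 = c(1, 1.5, 2, 4),
                 c2 = c(1.1, 1.6, 3, 3.2),
                 c3 = c(2.1, 2.4, 1.4, 1.7))

set.seed(1)  # arbitrary seed, only so the run is repeatable

m <- as.matrix(df)  # assumes every column has the same type

# Variant 1: shuffle within each row.
# apply(m, 1, ...) returns each permuted row as a column, so t() flips it back.
df_rows <- df
df_rows[] <- t(apply(m, 1, function(r) sample(unname(r))))

# Variant 2: shuffle every cell across the whole frame.
# sample(m) permutes all values; matrix() restores the original shape.
df_all <- df
df_all[] <- matrix(sample(m), nrow = nrow(m))

print(df_rows)
print(df_all)
```

With mixed column types, as.matrix() would coerce everything to a common type (often character), so these variants are only a reasonable fit for homogeneous frames.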

If your intent is to keep each row together, then we would just need to sample the rows, preserving the "observation" quality of the frame:

df2 <- df[sample(nrow(df)),]
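A small sketch of that row shuffle on the question's example frame, with a check that each row stays intact. The permutation is stored in a variable named perm (an illustrative name) so the result can be compared against the original; the row-name reset at the end is optional.

```r
# Example frame from the question.
df <- data.frame(c1 = c(1, 1.5, 2, 4),
                 c2 = c(1.1, 1.6, 3, 3.2),
                 c3 = c(2.1, 2.4, 1.4, 1.7))

set.seed(7)  # arbitrary seed, only so the run is repeatable

perm <- sample(nrow(df))  # a random ordering of the row indices
df2 <- df[perm, ]         # rows move as whole units

# The row names of df2 still record each row's original position.
print(rownames(df2))

# Every row of df2 matches the corresponding original row.
rows_intact <- all(as.matrix(df2) == as.matrix(df)[perm, ])
print(rows_intact)

# Optional: reset the row names to 1..n after shuffling.
df2_reset <- df2
rownames(df2_reset) <- NULL
```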