R - How to create multiple datasets based on levels of factor in multiple columns?

Time:08-10

I'm kinda new to R and still looking for ways to make my code more elegant. I want to create multiple datasets more efficiently, each one based on a particular level of a factor in one of several columns.

This is my dataset:

df<-data.frame(A=c(1,2,2,3,4,5,1,1,2,3),
               B=c(4,4,2,3,4,2,1,5,2,2),
               C=c(3,3,3,3,4,2,5,1,2,3),
               D=c(1,2,5,5,5,4,5,5,2,3),
               E=c(1,4,2,3,4,2,5,1,2,3),
               dummy1=c("yes","yes","no","no","no","no","yes","no","yes","yes"),
               dummy2=c("high","low","low","low","high","high","high","low","low","high"))

And I need each column to be a factor:

df[colnames(df)] <- lapply(df[colnames(df)], factor)

Now, what I want is one data frame called "Likert_rank_yes" containing all the observations with "yes" in the column "dummy1", one called "Likert_rank_no" containing all the observations with "no" in "dummy1", one called "Likert_rank_high" containing all the observations with "high" in the column "dummy2", and so on for all my other dummies. I want to loop or otherwise streamline the process so that only a few commands are needed to get all the datasets.

The first two dataframes should look something like this:

[Image: data frame "Likert_rank_yes", containing all observations with "yes" in column "dummy1"]

[Image: data frame "Likert_rank_no", containing all observations with "no" in column "dummy1"]

I have to do this with several dummies with multiple levels and would like to automate/loop the process or make it more efficient, so that I don't have to subset and rename every data frame for each dummy level. Ideally, I would also like to drop the last column in each data frame created (the one containing the dummy considered).

I tried splitting as shown below, but split() does not seem to work this way with multiple columns: I just get 4 data frames, one per combination of levels (yes AND high, yes AND low, no AND high, no AND low):

list_df <- split(df[1:5], list(df$dummy1, df$dummy2), sep = ".")
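
For context, this behaviour is documented: when split() is given a list of factors, it groups by their interaction, so every combination of levels becomes one group. A minimal sketch using only the two dummy columns from the data above (the object name combos is purely illustrative):

```r
# Only the two dummy columns are needed to show the grouping behaviour.
dummy1 <- factor(c("yes","yes","no","no","no","no","yes","no","yes","yes"))
dummy2 <- factor(c("high","low","low","low","high","high","high","low","low","high"))

# A list of factors is combined into their interaction before splitting,
# so the groups are level combinations, not the levels of each factor.
combos <- split(seq_along(dummy1), list(dummy1, dummy2), sep = ".")

names(combos)    # four names of the form "<dummy1 level>.<dummy2 level>"
lengths(combos)  # number of rows falling into each combination
```

Getting one data frame per level of each dummy separately therefore needs one split() call per dummy column rather than a single call with a list.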

Can you help? Thanks in advance!

CodePudding user response:

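You need two lapply() calls: the first pairs the value columns with each dummy column in turn, and the second splits each of those data frames by the levels of its dummy column: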

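# Value columns (A to E) and dummy columns (everything else)
vals <- colnames(df)[1:5]
dummies <- colnames(df)[-(1:5)]
# Step 1: one data frame per dummy, holding the value columns plus that dummy
step1 <- lapply(dummies, function(x) df[, c(vals, x)])
# Step 2: split each data frame by its dummy, which is always column 6 here
step2 <- lapply(step1, function(x) split(x, x[, 6]))
names(step2) <- dummies
step2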
# $dummy1
# $dummy1$no
#   A B C D E dummy1
# 3 2 2 3 5 2     no
# 4 3 3 3 5 3     no
# 5 4 4 4 5 4     no
# 6 5 2 2 4 2     no
# 8 1 5 1 5 1     no
# 
# $dummy1$yes
#    A B C D E dummy1
# 1  1 4 3 1 1    yes
# 2  2 4 3 2 4    yes
# 7  1 1 5 5 5    yes
# 9  2 2 2 2 2    yes
# 10 3 2 3 3 3    yes
# 
# 
# $dummy2
# $dummy2$high
#    A B C D E dummy2
# 1  1 4 3 1 1   high
# 5  4 4 4 5 4   high
# 6  5 2 2 4 2   high
# 7  1 1 5 5 5   high
# 10 3 2 3 3 3   high
# 
# $dummy2$low
#   A B C D E dummy2
# 2 2 4 3 2 4    low
# 3 2 2 3 5 2    low
# 4 3 3 3 5 3    low
# 8 1 5 1 5 1    low
# 9 2 2 2 2 2    low

To access the first data set (dummy1 equal to "no"), use step2$dummy1$no, step2[["dummy1"]][["no"]], or step2[[1]][[1]].
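
The question also asks for the dummy column to be dropped and for names such as Likert_rank_yes. One possible way to get there is sketched below, under the assumption that no two dummy columns share a level label (otherwise the generated names would collide). The object names by_dummy and likert are illustrative and not part of the answer above:

```r
# Rebuild the example data from the question.
df <- data.frame(A = c(1,2,2,3,4,5,1,1,2,3),
                 B = c(4,4,2,3,4,2,1,5,2,2),
                 C = c(3,3,3,3,4,2,5,1,2,3),
                 D = c(1,2,5,5,5,4,5,5,2,3),
                 E = c(1,4,2,3,4,2,5,1,2,3),
                 dummy1 = c("yes","yes","no","no","no","no","yes","no","yes","yes"),
                 dummy2 = c("high","low","low","low","high","high","high","low","low","high"))
df[] <- lapply(df, factor)

vals    <- c("A", "B", "C", "D", "E")
dummies <- setdiff(colnames(df), vals)

# Split only the value columns by each dummy in turn; because the dummy
# column is never passed as data, it does not appear in the results.
by_dummy <- lapply(dummies, function(d) split(df[vals], df[[d]]))

# Flatten the nested list by one level and prefix the level names.
# Assumption: level labels are unique across all dummy columns.
likert <- unlist(by_dummy, recursive = FALSE)
names(likert) <- paste0("Likert_rank_", names(likert))

names(likert)
likert$Likert_rank_yes
```

If standalone objects are preferred over a named list, list2env(likert, envir = globalenv()) would create one variable per element, although keeping the data frames in a list is usually easier to loop over.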
