Split dataframe by column and keep a set of common columns in each subset

Time:01-15

I have a dataframe that has too many variables for me to work with in Stata, so I'm trying to partition it vertically in R so I can work with the smaller sets in Stata. However, I need to keep 5-10 id variables (id, age, wave, weight, etc.) in each set so I can work with them individually or merge easily if needed.

For reference, there are about 5,000 variables and I need them to be roughly in groups of 500-1000 variables, so like 5 or 10 separate dfs that all have the same id variables in them.

If anyone can explain how to split it into just two, that would get me somewhere. I will take anything at this point.

CodePudding user response:

Use split(); the full documentation is on its help page (?split). A quick example of how to divide a data frame into 4 parts by rows:

x <- split(your_data_frame, rep(1:4, length.out = nrow(your_data_frame), each = ceiling(nrow(your_data_frame)/4)))

After that you can extract each part of x as a data frame. Each list element already is one, so use [[ ]] rather than as.data.frame(x[1]), which would prefix the column names:

x1_4 <- x[[1]]
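
To see what this call does, here is a minimal sketch on made-up data (the data frame toy and its columns are assumptions for illustration only). Note that split() with a grouping vector of length nrow() partitions the rows, so every part keeps all columns:

```r
# Made-up data: 10 rows, 3 columns
toy <- data.frame(id = 1:10, a = rnorm(10), b = rnorm(10))
n <- nrow(toy)

# Grouping vector 1,1,1,2,2,2,3,3,3,4 (one entry per row)
grp <- rep(1:4, each = ceiling(n / 4), length.out = n)
x <- split(toy, grp)

rows_per_part <- sapply(x, nrow)  # rows are divided among the parts
cols_per_part <- sapply(x, ncol)  # every part keeps all 3 columns
first_part <- x[[1]]              # each element is already a data.frame
```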

Hope it helps!
Bart

CodePudding user response:

If I understand correctly, the OP wants to split the data.frame vertically, i.e., by column. The id columns must appear in each part.

For example, if a data.frame consisting of 3 id columns and 17 variable columns is to be split into 3 parts the resulting data.frames would consist of 3 id columns and 5 to 6 variable columns each.
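
The group sizes in this example can be checked with a small sketch that applies cut() to the column positions only, the same way the code further down does (n_vars and sizes are names chosen here for illustration):

```r
n_vars  <- 17L  # number of variable columns in the example
n_parts <- 3L

# Assign each column position 1..17 to one of 3 equal-width intervals
grp <- cut(seq_len(n_vars), n_parts, labels = FALSE)

# Number of variable columns that land in each part
sizes <- as.vector(table(grp))
sizes
# 6 5 6
```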

This can be achieved using base R (version 4.1 or later, which provides the native pipe |> and the \(v) function shorthand) by

id_cols <-  c("id1", "id2", "id3")
n_parts <- 3L
var_cols <- setdiff(colnames(df0), id_cols)
df_parts <- split(var_cols, 
      cut(seq_along(var_cols), n_parts, labels = FALSE)) |>
  lapply(\(v) df0[, c(id_cols, v)])
df_parts
$`1`
     id1    id2    id3   V1   V2   V3   V4   V5   V6
1 id1_01 id2_01 id3_01 1.02 2.02 3.02 4.02 5.02 6.02
2 id1_02 id2_02 id3_02 1.04 2.04 3.04 4.04 5.04 6.04
3 id1_03 id2_03 id3_03 1.06 2.06 3.06 4.06 5.06 6.06
4 id1_04 id2_04 id3_04 1.08 2.08 3.08 4.08 5.08 6.08
5 id1_05 id2_05 id3_05 1.10 2.10 3.10 4.10 5.10 6.10

$`2`
     id1    id2    id3   V7   V8   V9   V10   V11
1 id1_01 id2_01 id3_01 7.02 8.02 9.02 10.02 11.02
2 id1_02 id2_02 id3_02 7.04 8.04 9.04 10.04 11.04
3 id1_03 id2_03 id3_03 7.06 8.06 9.06 10.06 11.06
4 id1_04 id2_04 id3_04 7.08 8.08 9.08 10.08 11.08
5 id1_05 id2_05 id3_05 7.10 8.10 9.10 10.10 11.10

$`3`
     id1    id2    id3   V12   V13   V14   V15   V16   V17
1 id1_01 id2_01 id3_01 12.02 13.02 14.02 15.02 16.02 17.02
2 id1_02 id2_02 id3_02 12.04 13.04 14.04 15.04 16.04 17.04
3 id1_03 id2_03 id3_03 12.06 13.06 14.06 15.06 16.06 17.06
4 id1_04 id2_04 id3_04 12.08 13.08 14.08 15.08 16.08 17.08
5 id1_05 id2_05 id3_05 12.10 13.10 14.10 15.10 16.10 17.10

The result df_parts is a list which contains 3 data.frames as list elements as shown.
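
If the goal is a cap on the number of variables per part rather than a fixed number of parts (the question mentions groups of 500-1000 variables), the same idea can be wrapped in a function. This is only a sketch: split_by_columns, max_vars, and the toy data are hypothetical names introduced here, not part of the answer above.

```r
# Hypothetical helper: split the variable columns into chunks of at most
# `max_vars` columns and repeat the id columns in every chunk.
split_by_columns <- function(df, id_cols, max_vars) {
  var_cols <- setdiff(colnames(df), id_cols)
  # Chunk index per variable column: 1, 1, ..., 2, 2, ...
  chunk <- ceiling(seq_along(var_cols) / max_vars)
  lapply(split(var_cols, chunk),
         function(v) df[, c(id_cols, v), drop = FALSE])
}

# Made-up example: 2 id columns, 7 variable columns, 4 rows
toy <- cbind(
  data.frame(id = 1:4, wave = c(1L, 1L, 2L, 2L)),
  setNames(as.data.frame(matrix(seq_len(28), nrow = 4)), paste0("x", 1:7))
)

parts <- split_by_columns(toy, id_cols = c("id", "wave"), max_vars = 3)
sapply(parts, ncol)  # 3 + 2, 3 + 2, and 1 + 2 columns

# Merging the parts on the id columns brings all columns back together
merged <- Reduce(function(a, b) merge(a, b, by = c("id", "wave")), parts)
```

The merge step assumes the id columns together identify each row uniquely; if they do not, merge() will return more rows than the original.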

Data

A reproducible sample dataset is created by

nr <- 5L
ni <- 3L
nc <- 17L
df0 <- cbind(
  outer(seq(ni), seq(nr), sprintf, fmt = "id%i_%02i") |>
    t() |>
    as.data.frame() |>
    setNames(sprintf("id%i", seq(ni))),
  outer(seq(nr) / nr / 10, seq(nc), `+`) |>
    as.data.frame()
)
df0
     id1    id2    id3   V1   V2   V3   V4   V5   V6   V7   V8   V9   V10   V11   V12   V13   V14   V15   V16   V17
1 id1_01 id2_01 id3_01 1.02 2.02 3.02 4.02 5.02 6.02 7.02 8.02 9.02 10.02 11.02 12.02 13.02 14.02 15.02 16.02 17.02
2 id1_02 id2_02 id3_02 1.04 2.04 3.04 4.04 5.04 6.04 7.04 8.04 9.04 10.04 11.04 12.04 13.04 14.04 15.04 16.04 17.04
3 id1_03 id2_03 id3_03 1.06 2.06 3.06 4.06 5.06 6.06 7.06 8.06 9.06 10.06 11.06 12.06 13.06 14.06 15.06 16.06 17.06
4 id1_04 id2_04 id3_04 1.08 2.08 3.08 4.08 5.08 6.08 7.08 8.08 9.08 10.08 11.08 12.08 13.08 14.08 15.08 16.08 17.08
5 id1_05 id2_05 id3_05 1.10 2.10 3.10 4.10 5.10 6.10 7.10 8.10 9.10 10.10 11.10 12.10 13.10 14.10 15.10 16.10 17.10