Creating a list with column-wise partitions of a data.frame

Time:07-20

I have a data.frame with a single "identifier" column and many additional columns. I am interested in turning this data.frame into a list of length K, whose elements are sets of columns partitioning the data.frame.

For example, given the below data.frame:

# Example data.frame
df <- data.frame(id = 1:10,
           x1 = rnorm(10),
           x2 = rnorm(10),
           x3 = rnorm(10),
           x4 = rnorm(10))

I'd like to have some function that converts it into this:

# Partitioning function
foo(df, partitions = 3)

# Expected output 
list(data.frame(id = df$id, x1 = df[ ,2]), 
     data.frame(id = df$id, x2 = df[ ,3]), 
     data.frame(id = df$id, x3 = df[ ,4], x4 = df[ ,5]))

Bonus points if you can extend this so that you can specify how many non-id columns each element of the list should contain by passing a numeric vector. Imagine the same output with an input that looks like this or equivalent.

columns_per_element <- c(1,1,2)
foo(df, columns_per_element) 

CodePudding user response:

It is actually easier to define a function that takes the splitting sequence, i.e. the number of non-id columns in each element. The key functions here are rep and split.default:

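f2 <- function(df, n, split){
  # group label for each non-id column, e.g. rep(1:3, c(1, 1, 2)) gives 1 2 3 3
  i1 <- rep(seq(n), split)
  # split.default treats the data.frame as a list of columns
  res_list <- split.default(df[-1], i1)
  # put the id column back in front of every piece
  return(lapply(res_list, function(i)cbind.data.frame(ID = df$id, i)))
}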

f2(df, 3, c(1, 1, 2))
$`1`
   ID          x1
1   1  1.54960977
2   2 -1.59144017
3   3  0.02853548
4   4 -0.14231391
5   5  1.26989801
6   6  0.87495876
7   7  0.27373774
8   8 -0.75600983
9   9  0.32216493
10 10 -1.05113771

$`2`
   ID         x2
1   1  0.8529416
2   2  0.4555094
3   3 -0.3620756
4   4  1.4779813
5   5 -1.6484066
6   6 -0.5697431
7   7 -0.2139384
8   8  0.1619074
9   9 -0.5390306
10 10 -0.2228809

$`3`
   ID         x3           x4
1   1 -0.2579865  1.185526074
2   2 -0.0519554 -0.388179976
3   3  2.5350092 -0.675504829
4   4 -1.7051955  0.073448252
5   5  0.6207733 -0.637220508
6   6  0.3015831 -1.324024114
7   7 -0.5647717  0.969025962
8   8  0.1404714 -1.575383604
9   9  1.3049560 -1.846413101
10 10 -0.6716643  0.008675125



f2(df, 3, c(1, 2, 1))
$`1`
   ID          x1
1   1  1.54960977
2   2 -1.59144017
3   3  0.02853548
4   4 -0.14231391
5   5  1.26989801
6   6  0.87495876
7   7  0.27373774
8   8 -0.75600983
9   9  0.32216493
10 10 -1.05113771

$`2`
   ID         x2         x3
1   1  0.8529416 -0.2579865
2   2  0.4555094 -0.0519554
3   3 -0.3620756  2.5350092
4   4  1.4779813 -1.7051955
5   5 -1.6484066  0.6207733
6   6 -0.5697431  0.3015831
7   7 -0.2139384 -0.5647717
8   8  0.1619074  0.1404714
9   9 -0.5390306  1.3049560
10 10 -0.2228809 -0.6716643

$`3`
   ID           x4
1   1  1.185526074
2   2 -0.388179976
3   3 -0.675504829
4   4  0.073448252
5   5 -0.637220508
6   6 -1.324024114
7   7  0.969025962
8   8 -1.575383604
9   9 -1.846413101
10 10  0.008675125
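
The question also asks for a call of the form foo(df, partitions = 3), where only the number of groups is given. A minimal sketch of one way to get there is to derive the group sizes from K and then reuse the rep and split.default idea. The names sizes_from_k and foo are hypothetical, and the choice to give the leftover columns to the last groups is an assumption made to match the expected output in the question.

```r
# Hypothetical helper: spread p columns over k groups as evenly as possible.
# Assumption: any leftover columns go to the last groups, so p = 4, k = 3 gives 1 1 2.
sizes_from_k <- function(p, k) {
  stopifnot(k >= 1, k <= p)
  base  <- p %/% k   # minimum number of columns per group
  extra <- p %% k    # number of groups that receive one more column
  rep(base, k) + rep(c(0, 1), times = c(k - extra, extra))
}

# Hypothetical wrapper: assumes the identifier is the first column and is named id.
foo <- function(df, partitions) {
  sizes  <- sizes_from_k(ncol(df) - 1, partitions)
  groups <- rep(seq_along(sizes), sizes)      # group label per non-id column
  parts  <- split.default(df[-1], groups)     # split the columns by label
  lapply(parts, function(cols) cbind.data.frame(id = df$id, cols))
}

set.seed(1)  # only to make the example reproducible
df <- data.frame(id = 1:10,
                 x1 = rnorm(10), x2 = rnorm(10),
                 x3 = rnorm(10), x4 = rnorm(10))

out <- foo(df, partitions = 3)
lapply(out, names)
```

With this data.frame the three elements hold the columns (id, x1), (id, x2) and (id, x3, x4). If the leftover columns should go to the first groups instead, swap the order inside the second rep call.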

CodePudding user response:

Here is a solution whose function takes two parameters, the data.frame and a vector of group sizes, and selects the columns by index. Note that it assumes the first column is the identifier and is named id. Also, if the sum of the vector is greater than ncol(df) - 1 (where df is your input data.frame), it will throw an error.

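f2 <- function(x,y){
  #keep id
  id <- x[,"id" , drop = FALSE]
  #keep all other variables (drop = FALSE keeps a data.frame even with one column)
  df2 <- x[,-1, drop = FALSE]
  #cumulative column indices, e.g. 1, 1:2, 1:4 for y = c(1,1,2)
  y2 <- lapply(cumsum(y), function(x){sequence(x)})
  
  #grab correct columns: remove from each sequence the indices of the previous one
  #base setdiff is enough; SIMPLIFY = FALSE keeps a list when all groups have equal size
  y3 <- c(y2[1],mapply(setdiff ,y2[-1],y2[-length(y2)], SIMPLIFY = FALSE))
  #recreate df
  lapply(y3,
        function(x){
         cbind.data.frame(id, df2[,x, drop = FALSE])
         })
}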


f2(df, c(1,1,2)) 
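
The column positions for each group can also be computed directly from cumsum, with the size check made explicit rather than left to the column subsetting. This is only a sketch under the same assumptions as the answer: column_ranges is a hypothetical name, and the positions refer to the non-id columns.

```r
# Hypothetical helper: positions of the non-id columns that belong to each group.
# sizes  : number of columns per group, e.g. c(1, 1, 2)
# n_cols : number of non-id columns available, e.g. ncol(df) - 1
column_ranges <- function(sizes, n_cols) {
  # every group needs at least one column, and the groups must fit
  stopifnot(all(sizes >= 1), sum(sizes) <= n_cols)
  ends   <- cumsum(sizes)       # last position of each group
  starts <- ends - sizes + 1    # first position of each group
  Map(seq, starts, ends)        # one index vector per group
}

ranges <- column_ranges(c(1, 1, 2), n_cols = 4)
ranges
```

Each element of ranges can then be used as the column index in df2[, x, drop = FALSE]. A vector of sizes that asks for more columns than exist stops at the stopifnot check before any subsetting happens.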