How to combine columns based on column name in R?

Similar questions have been asked before, but I can't get those solutions to work for my problem.

I'm trying to combine columns based on their names and then create a matrix/dataframe of every pair of variables. Hopefully my example will explain more clearly.

For example, imagine we have a data frame like the one below:

# create some data
set.seed(100)
dfOG <- data.frame(
  day = sample(c('1', '2'), 3, replace = TRUE),
  rain = sample(c('yes', 'no'), 3, replace = TRUE),
  val1 = runif(3)
)

I am applying a process (that is outside of my control) that splits the categorical variables into dummy variables (one dummy for each level). In the end, I get a matrix with one column for every pair of variables. The output looks something like this:

# create matrix of all pairs
name2 <- c('day.1', 'day.2',
           'rain.yes', 'rain.no', 'val1')

nam2 <- expand.grid(name2, name2)
# paste0() is vectorised, so no loop is needed to build the pair names
newName2 <- paste0(nam2$Var2, ":", nam2$Var1)
set.seed(100)
newMat2 <- matrix(rexp(75, rate=.1), nrow = 3, ncol = length(newName2))
colnames(newMat2) <- newName2
> newMat2
     day.1:day.1 day.1:day.2 day.1:rain.yes day.1:rain.no day.1:val1 day.2:day.1
[1,]    9.242116   30.973623      0.9311719      1.943265   20.23192   3.8058106
[2,]    7.238372    6.248052     17.4839077      5.251022   11.23247   0.7162231
[3,]    1.046449   11.744293      2.4999295      3.380434   11.31048   4.2160769
     day.2:day.2 day.2:rain.yes day.2:rain.no day.2:val1 rain.yes:day.1
[1,]    0.766974      17.576561      9.348420  0.9030936       2.066487
[2,]    4.979445       5.406032      3.905483  5.5516888       8.371235
[3,]   13.735530       1.925034      1.250488  6.5460690       9.214908
     rain.yes:day.2 rain.yes:rain.yes rain.yes:rain.no rain.yes:val1
[1,]       10.15267          7.067098         3.527963     13.420953
[2,]       19.75727         22.259788         9.411371     10.040507
[3,]       15.76831         11.416835         8.630324      1.451295
     rain.no:day.1 rain.no:day.2 rain.no:rain.yes rain.no:rain.no rain.no:val1
[1,]      4.330075    25.4360600        15.317283        0.349195     12.51062
[2,]      5.495578    11.0861832        11.256991       13.882071     30.86277
[3,]      6.680542     0.2620275         9.630859       36.926827     11.38734
     val1:day.1 val1:day.2 val1:rain.yes val1:rain.no val1:val1
[1,] 1.41956168   6.731429     14.124068    12.797966 41.294648
[2,] 3.69760484   6.137335      1.391675    12.639562  1.033024
[3,] 0.08002734  23.743638      6.804015     9.374034 27.107049

As shown above, newMat2 contains every pair of variables after the categorical variables have been split into dummies.

What I'm trying to do is recombine those dummy variables back into a single variable by summing the appropriate columns row-wise. My final output would be a matrix/dataframe of every pair of recombined variables.

For example, take the variable day, which has been split into day.1 and day.2. If I recombined this variable for every pair, we would have a column each for day.day, day.rain, and day.val1. Doing this manually could look like this:

day.day  = apply(newMat2[,c(1,2,6,7)], 1, sum)
day.rain = apply(newMat2[,c(3,4,8,9)], 1, sum)
day.val1 = apply(newMat2[,c(5,10)], 1, sum)

In the above code, I'm summing (row-wise) the columns that should be combined.

Desired output:

More explicitly, if I were to recombine all of newMat2 manually, it would look like this:

dfNew <- data.frame(
            day.day  = apply(newMat2[,c(1,2,6,7)], 1, sum),
            day.rain = apply(newMat2[,c(3,4,8,9)], 1, sum),
            day.val1 = apply(newMat2[,c(5,10)], 1, sum),
            rain.day = apply(newMat2[,c(11,12,16,17)], 1, sum),
            rain.rain = apply(newMat2[,c(13,14,18,19)], 1, sum),
            rain.val1 = apply(newMat2[,c(15,20)], 1, sum),
            val1.day = apply(newMat2[,c(21,22)], 1, sum),
            val1.rain = apply(newMat2[,c(23,24)], 1, sum), 
            val1.val1 = newMat2[,c(25)] 
)
> dfNew  
   day.day  day.rain day.val1 rain.day rain.rain rain.val1  val1.day val1.rain
1 44.78852 29.799418 21.13501 41.98529  26.26154  25.93157  8.150991  26.92203
2 19.18209 32.046444 16.78415 44.71027  56.81022  40.90328  9.834940  14.03124
3 30.74235  9.055886 17.85655 31.92579  66.60485  12.83863 23.823666  16.17805
  val1.val1
1 41.294648
2  1.033024
3 27.107049

However, my real data has over 1000 columns, some with many different factor levels, so combining them manually would take a long time. Is there any way to automate this process?

CodePudding user response：

With tidyverse functions:

library(tidyverse)

newMat2 %>% 
  as_tibble(rownames = "id") %>% 
  pivot_longer(-id) %>% 
  mutate(name = map_chr(str_extract_all(name, paste(c(colnames(dfOG), ":"), collapse = "|")),
                        paste0, collapse = "")) %>% 
  group_by(id, name) %>% 
  summarise(value = sum(value)) %>% 
  pivot_wider()

  id    `day:day` `day:rain` `day:val1` `rain:day` `rain:rain` `rain:val1` `val1:day` `val1:rain` `val1:val1`
1 1          44.8      29.8        21.1       42.0        26.3        25.9       8.15        26.9       41.3 
2 2          19.2      32.0        16.8       44.7        56.8        40.9       9.83        14.0        1.03
3 3          30.7       9.06       17.9       31.9        66.6        12.8      23.8         16.2       27.1
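
The step doing the real work above is the renaming inside mutate(): every variable name (or ":") found in a column name is kept, and everything else (the level suffixes) is dropped. Here is a small base R sketch of just that step, so you can check it against your own column names. The vectors vars and nm are made up for illustration; in the pipeline they correspond to colnames(dfOG) and the name column. Note that this assumes no factor level contains one of the variable names (a level called "day", say, would be matched too).

```r
# Hypothetical inputs standing in for colnames(dfOG) and the pair names
vars <- c("day", "rain", "val1")
nm   <- c("day.1:rain.yes", "val1:day.2", "rain.no:val1")

# Alternation of the variable names plus the ":" separator: "day|rain|val1|:"
pattern <- paste(c(vars, ":"), collapse = "|")

# Keep only the matched pieces of each name and glue them back together
pieces    <- regmatches(nm, gregexpr(pattern, nm))
collapsed <- vapply(pieces, paste0, character(1), collapse = "")

print(collapsed)
```

Columns that end up with the same collapsed name are the ones that get summed by group_by() and summarise().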

CodePudding user response：

library(rlist)
df <- as.data.frame(newMat2)
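# Reduce each name such as "day.1:rain.yes" to its variable pair ("day.rain") and split the columns by it
L <- split.default(df, f = gsub("(^[a-z0-9]+).*:([a-z0-9]+).*$", "\\1.\\2", colnames(df)))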
rlist::list.cbind(lapply(L, rowSums))

      day.day  day.rain day.val1 rain.day rain.rain rain.val1  val1.day val1.rain val1.val1
[1,] 44.78852 29.799418 21.13501 41.98529  26.26154  25.93157  8.150991  26.92203 41.294648
[2,] 19.18209 32.046444 16.78415 44.71027  56.81022  40.90328  9.834940  14.03124  1.033024
[3,] 30.74235  9.055886 17.85655 31.92579  66.60485  12.83863 23.823666  16.17805 27.107049
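
If you would rather avoid the extra package, the same idea can probably be done in base R with rowsum(), which sums the rows of a matrix by group. This is only a sketch under one assumption: the level suffix is everything after the first "." in each half of the column name, so variable names themselves must not contain a dot. The names group and combined are my own.

```r
# Rebuild the example matrix from the question
set.seed(100)
name2 <- c("day.1", "day.2", "rain.yes", "rain.no", "val1")
grid  <- expand.grid(name2, name2)
newMat2 <- matrix(rexp(75, rate = 0.1), nrow = 3,
                  dimnames = list(NULL, paste0(grid$Var2, ":", grid$Var1)))

# For each column, drop the level suffix from both halves of the pair name,
# e.g. "day.1:rain.yes" becomes "day.rain"
parts <- strsplit(colnames(newMat2), ":", fixed = TRUE)
group <- vapply(parts,
                function(p) paste(sub("\\..*$", "", p), collapse = "."),
                character(1))

# rowsum() groups rows, so transpose, sum by group, and transpose back;
# reorder = FALSE keeps the groups in the order they first appear
combined <- t(rowsum(t(newMat2), group, reorder = FALSE))

print(colnames(combined))
```

Because everything stays a numeric matrix, this should hold up reasonably well with 1000+ columns, but I have only tried it on the toy data above.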