Similar questions have been asked here, here, and here. However, I can't get those solutions to work for my problem.
I'm trying to combine columns based on their names and then create a matrix/dataframe of every pair of variables. Hopefully my example will explain more clearly.
For example, imagine we have a data frame like the one below:
# create some data
set.seed(100)
dfOG <- data.frame(
day = sample(c('1', '2'), 3, replace = TRUE),
rain = sample(c('yes', 'no'), 3, replace = TRUE),
val1 = runif(3)
)
I am applying a process (that is outside of my control) that splits the categorical variables into dummy variables (with a dummy for each level). In the end, I get a matrix where the columns are every pairwise variable. The output looks something like this:
# create matrix of all pairs
name2 <- c('day.1', 'day.2',
'rain.yes', 'rain.no', 'val1')
nam2 <- expand.grid(name2, name2)
# paste0 is vectorised, so no loop is needed to build the pair names
newName2 <- paste0(nam2$Var2, ":", nam2$Var1)
set.seed(100)
newMat2 <- matrix(rexp(75, rate=.1), nrow = 3, ncol = length(newName2))
colnames(newMat2) <- newName2
> newMat2
day.1:day.1 day.1:day.2 day.1:rain.yes day.1:rain.no day.1:val1 day.2:day.1
[1,] 9.242116 30.973623 0.9311719 1.943265 20.23192 3.8058106
[2,] 7.238372 6.248052 17.4839077 5.251022 11.23247 0.7162231
[3,] 1.046449 11.744293 2.4999295 3.380434 11.31048 4.2160769
day.2:day.2 day.2:rain.yes day.2:rain.no day.2:val1 rain.yes:day.1
[1,] 0.766974 17.576561 9.348420 0.9030936 2.066487
[2,] 4.979445 5.406032 3.905483 5.5516888 8.371235
[3,] 13.735530 1.925034 1.250488 6.5460690 9.214908
rain.yes:day.2 rain.yes:rain.yes rain.yes:rain.no rain.yes:val1
[1,] 10.15267 7.067098 3.527963 13.420953
[2,] 19.75727 22.259788 9.411371 10.040507
[3,] 15.76831 11.416835 8.630324 1.451295
rain.no:day.1 rain.no:day.2 rain.no:rain.yes rain.no:rain.no rain.no:val1
[1,] 4.330075 25.4360600 15.317283 0.349195 12.51062
[2,] 5.495578 11.0861832 11.256991 13.882071 30.86277
[3,] 6.680542 0.2620275 9.630859 36.926827 11.38734
val1:day.1 val1:day.2 val1:rain.yes val1:rain.no val1:val1
[1,] 1.41956168 6.731429 14.124068 12.797966 41.294648
[2,] 3.69760484 6.137335 1.391675 12.639562 1.033024
[3,] 0.08002734 23.743638 6.804015 9.374034 27.107049
We can see above that newMat2 contains every pair of variables, after the categorical variables have been split into dummies.
What I'm trying to do is recombine those dummy variables back into a single variable by summing the appropriate columns row-wise. My final output would be a matrix/dataframe of every pair of recombined variables.
For example, take the variable day. It has been split into day.1 and day.2. If I recombined this variable for every pair, we would have columns for day.day, day.rain, and day.val1. Doing this manually could look like this:
day.day = apply(newMat2[,c(1,2,6,7)], 1, sum)
day.rain = apply(newMat2[,c(3,4,8,9)], 1, sum)
day.val1 = apply(newMat2[,c(5,10)], 1, sum)
In the above code, I'm summing (row-wise) the columns that should be combined.
Desired output:
More explicitly, if I were to recombine all of newMat2 manually, it would look like this:
dfNew <- data.frame(
day.day = apply(newMat2[,c(1,2,6,7)], 1, sum),
day.rain = apply(newMat2[,c(3,4,8,9)], 1, sum),
day.val1 = apply(newMat2[,c(5,10)], 1, sum),
rain.day = apply(newMat2[,c(11,12,16,17)], 1, sum),
rain.rain = apply(newMat2[,c(13,14,18,19)], 1, sum),
rain.val1 = apply(newMat2[,c(15,20)], 1, sum),
val1.day = apply(newMat2[,c(21,22)], 1, sum),
val1.rain = apply(newMat2[,c(23,24)], 1, sum),
val1.val1 = newMat2[,c(25)]
)
> dfNew
day.day day.rain day.val1 rain.day rain.rain rain.val1 val1.day val1.rain
1 44.78852 29.799418 21.13501 41.98529 26.26154 25.93157 8.150991 26.92203
2 19.18209 32.046444 16.78415 44.71027 56.81022 40.90328 9.834940 14.03124
3 30.74235 9.055886 17.85655 31.92579 66.60485 12.83863 23.823666 16.17805
val1.val1
1 41.294648
2 1.033024
3 27.107049
However, in my real data, I have over 1000 columns, some with many different factor levels and as a result, manually combining them would take a long time. Is there any way to automate this process?
Answer:
With tidyverse functions:
library(tidyverse)
newMat2 %>%
as_tibble(rownames = "id") %>%
pivot_longer(-id) %>%
mutate(name = map_chr(str_extract_all(name, paste(c(colnames(dfOG), ":"), collapse = "|")),
paste0, collapse = "")) %>%
group_by(id, name) %>%
summarise(value = sum(value)) %>%
pivot_wider()
id `day:day` `day:rain` `day:val1` `rain:day` `rain:rain` `rain:val1` `val1:day` `val1:rain` `val1:val1`
1 1 44.8 29.8 21.1 42.0 26.3 25.9 8.15 26.9 41.3
2 2 19.2 32.0 16.8 44.7 56.8 40.9 9.83 14.0 1.03
3 3 30.7 9.06 17.9 31.9 66.6 12.8 23.8 16.2 27.1
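The key step in the pipeline above is the mutate() call, which keeps only the original variable names and the ":" separator from each column name. Below is a minimal base-R sketch of that step in isolation; vars and cols are illustrative stand-ins, not objects from the question.

```r
# Hypothetical stand-ins for colnames(dfOG) and a few pair column names
vars <- c("day", "rain", "val1")
cols <- c("day.1:rain.yes", "val1:day.2", "rain.no:rain.no")

# Alternation pattern: any original variable name, or the ":" separator
pattern <- paste(c(vars, ":"), collapse = "|")

# Extract every match per column name, then glue the pieces back together
pieces <- regmatches(cols, gregexpr(pattern, cols))
keys <- vapply(pieces, paste0, character(1), collapse = "")

print(keys)
```

Because the pattern matches anywhere in the string, this approach assumes that no variable name also occurs inside a level label or inside another variable name; if that can happen in the real data, the extracted keys should be checked before summing.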
Answer:
library(rlist)
df <- as.data.frame(newMat2)
L <- split.default(df, f = gsub("(^[a-z0-9]+).*:([a-z0-9]+).*$", "\\1.\\2", colnames(df)))
rlist::list.cbind(lapply(L, rowSums))
day.day day.rain day.val1 rain.day rain.rain rain.val1 val1.day val1.rain val1.val1
[1,] 44.78852 29.799418 21.13501 41.98529 26.26154 25.93157 8.150991 26.92203 41.294648
[2,] 19.18209 32.046444 16.78415 44.71027 56.81022 40.90328 9.834940 14.03124 1.033024
[3,] 30.74235 9.055886 17.85655 31.92579 66.60485 12.83863 23.823666 16.17805 27.107049
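The same grouping idea can also be sketched in base R without rlist, using rowsum() on the transposed matrix. This is only a sketch under assumptions: the toy matrix m and the helper strip_level are made up for illustration, and it assumes that variable names contain no "." and that level labels contain no ":".

```r
# Toy matrix with the same naming scheme as newMat2 (values are arbitrary)
set.seed(1)
lv <- c("day.1", "day.2", "rain.yes", "rain.no", "val1")
g <- expand.grid(lv, lv)
m <- matrix(runif(3 * 25), nrow = 3)
colnames(m) <- paste0(g$Var2, ":", g$Var1)

# Hypothetical helper: drop everything from the first "." onwards,
# so "day.1" becomes "day" and "val1" is left unchanged
strip_level <- function(x) sub("\\..*$", "", x)

# Build one grouping key per column, e.g. "day.1:rain.yes" -> "day.rain"
parts <- strsplit(colnames(m), ":", fixed = TRUE)
key <- vapply(parts, function(p) paste(strip_level(p), collapse = "."),
              character(1))

# rowsum() adds up rows by group, so transpose, sum, and transpose back;
# reorder = FALSE keeps the groups in order of first appearance
combined <- t(rowsum(t(m), group = key, reorder = FALSE))

print(colnames(combined))
```

With the real data, m would be replaced by newMat2. Since everything stays a numeric matrix, this may scale better to 1000+ columns than splitting a data frame, though that has not been benchmarked here.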