Home > Blockchain >  Aggregate columns based on categories given by another dataframe
Aggregate columns based on categories given by another dataframe

Time:04-20

I have a dataframe where each column has some vector of data. I want to apply the mean columnwise, but filtered by groups which are given by a second dataframe. That is, each column belongs to a group and this information is in the second dataframe.

Here is some example dataset: df is the dataframe with the data vectors, df_category contains the category for each column.

df=structure(list(x1 = c(0.461302090047301, -1.19974381763812, -0.888258056235799, 
0.300889698419314, 0.836911163114131, 0.0540388337324712), x2 = c(1.33736696170763, 
-0.687026295689823, 1.12205295626651, -0.848925266014684, 1.16092168555067, 
0.591202293337843), x3 = c(-0.279052669225263, -0.780435476613128, 
-0.852870619718068, -0.708611614262357, -0.761659405740852, 0.487033696695474
), x4 = c(-0.222767493777229, 1.50328295132467, 0.934670132217215, 
1.37678188537077, 0.343280062984192, 1.23279081824003), x5 = c(-1.08074586121729, 
0.208120194894818, -0.52245832008453, 0.944618465137011, 0.749834485631317, 
-0.81118414509141)), class = "data.frame", row.names = c(NA, 
-6L))

df_category=structure(list(Col_name = structure(1:5, .Label = c("x1", "x2", 
"x3", "x4", "x5"), class = "factor"), Category = structure(c(1L, 
1L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor")), class = "data.frame", row.names = c(NA, 
-5L))

The result I want is this one:

df_result=structure(list(mean_A = c(0.899334525877468, -0.943385056663974, 
0.116897450015357, -0.274017783797685, 0.998916424332403, 0.322620563535157
), mean_B = c(-0.527522008073261, 0.310322556535454, -0.146886269195128, 
0.537596245415141, 0.110485047624885, 0.302880123281364)), class = "data.frame", row.names = c(NA, 
-6L))

CodePudding user response:

We can use tidyverse to reshape the data values, merge the category data, and compute means for groups "A" and "B":

library(tidyverse)

df_result <- df %>% 
  mutate(idx = row_number()) %>% 
  pivot_longer(-idx) %>% 
  inner_join(df_category, c(name = 'Col_name')) %>% 
  group_by(Category, idx) %>% 
  summarize(mean = mean(value)) %>% 
  pivot_wider(names_from = Category, values_from = mean, names_prefix = 'mean_') %>%
  select(-idx)

  mean_A mean_B
   <dbl>  <dbl>
1  0.899 -0.528
2 -0.943  0.310
3  0.117 -0.147
4 -0.274  0.538
5  0.999  0.110
6  0.323  0.303

CodePudding user response:

in Base R:

a <- with(df_category, setNames(Category, Col_name))[names(df)[col(df)]]
tapply(unlist(df), list(row(df), a), mean)

          A          B
1  0.8993345 -0.5275220
2 -0.9433851  0.3103226
3  0.1168975 -0.1468863
4 -0.2740178  0.5375962
5  0.9989164  0.1104850
6  0.3226206  0.3028801

Another option:

sapply(with(df_category, split.default(df[Col_name], Category)), rowMeans)
              A          B
[1,]  0.8993345 -0.5275220
[2,] -0.9433851  0.3103226
[3,]  0.1168975 -0.1468863
[4,] -0.2740178  0.5375962
[5,]  0.9989164  0.1104850
[6,]  0.3226206  0.3028801
  • Related