Grouping rows by multiple keys and get the average of numerical and mode of categorical-CodePudding

I'm having a dataframe with both numerical and categorical variables.

The data frame has 2 keys where I need group by. When the rows are grouped by keys so & li, then categorical variables need to be selected based on mode & numerical by average.

   SO      LI        A         B
1 2000     20        P          22
2 2000     20        P          40
3 1000     10        Q          80

The output needs to be,

   SO      LI        A         B
1 2000     20        P          31
2 1000     10        Q          80

I used the following code so far.

library(plyr)
groupColumns = c("so","li")
dataColumns = c(colnames(numericalColumns(Planning_DF)))
res = ddply(Planning_DF, groupColumns, function(x) colMeans(x[dataColumns]))
head(res)

so the numerical column grouping & average happening. How to get the categorical variables Mode to this?

CodePudding user response：

It is easier with dplyr

library(dplyr)
groupColumns = c("SO","LI")
Planning_DF %>%
    group_by(across(all_of(groupColumns))) %>%
    summarise(across(where(is.numeric), mean), 
           across(where(is.character), Mode), .groups = 'drop')

-output

# A tibble: 2 × 4
     SO    LI     B A    
  <int> <int> <dbl> <chr>
1  1000    10    80 Q    
2  2000    20    31 P

where

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

If we need to do this in plyr, use colwise with a custom function that does the mean or Mode based on the column type

f1 <- function(x) if(is.numeric(x)) mean(x, na.rm = TRUE) else Mode(x)
plyr::ddply(Planning_DF, .variables = groupColumns, 
     .fun = plyr::colwise(f1))

-output

    SO LI A  B
1 1000 10 Q 80
2 2000 20 P 31

data

Planning_DF <- structure(list(SO = c(2000L, 2000L, 1000L), LI = c(20L, 20L, 
10L), A = c("P", "P", "Q"), B = c(22L, 40L, 80L)), 
class = "data.frame", row.names = c("1", 
"2", "3"))