I'm having a dataframe with both numerical and categorical variables.
The data frame has 2 keys where I need group by. When the rows are grouped by keys so & li, then categorical variables need to be selected based on mode & numerical by average.
SO LI A B
1 2000 20 P 22
2 2000 20 P 40
3 1000 10 Q 80
The output needs to be,
SO LI A B
1 2000 20 P 31
2 1000 10 Q 80
I used the following code so far.
library(plyr)
groupColumns = c("so","li")
dataColumns = c(colnames(numericalColumns(Planning_DF)))
res = ddply(Planning_DF, groupColumns, function(x) colMeans(x[dataColumns]))
head(res)
so the numerical column grouping & average happening. How to get the categorical variables Mode to this?
CodePudding user response:
It is easier with dplyr
library(dplyr)
groupColumns = c("SO","LI")
Planning_DF %>%
group_by(across(all_of(groupColumns))) %>%
summarise(across(where(is.numeric), mean),
across(where(is.character), Mode), .groups = 'drop')
-output
# A tibble: 2 × 4
SO LI B A
<int> <int> <dbl> <chr>
1 1000 10 80 Q
2 2000 20 31 P
where
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
If we need to do this in plyr
, use colwise
with a custom function that does the mean
or Mode
based on the column type
f1 <- function(x) if(is.numeric(x)) mean(x, na.rm = TRUE) else Mode(x)
plyr::ddply(Planning_DF, .variables = groupColumns,
.fun = plyr::colwise(f1))
-output
SO LI A B
1 1000 10 Q 80
2 2000 20 P 31
data
Planning_DF <- structure(list(SO = c(2000L, 2000L, 1000L), LI = c(20L, 20L,
10L), A = c("P", "P", "Q"), B = c(22L, 40L, 80L)),
class = "data.frame", row.names = c("1",
"2", "3"))