Home > OS >  reduce number of obervations by summarizing columns
reduce number of obervations by summarizing columns

Time:06-02

I have a dataframe in the following structure:

structure(list(SUB_DISTRICT_CODE = c(90101L, 90101L, 90101L, 
90101L, 90101L, 90101L, 90102L, 90102L, 90102L, 90102L, 90102L, 
90102L, 90103L, 90103L, 90103L, 90103L, 90103L, 90103L), SEX = c(1L, 
1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 
2L), AGR3 = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 
1L, 2L, 3L, 1L, 2L, 3L), TOTAL_per_GROUP = c(184L, 1245L, 456L, 
167L, 1216L, 567L, 91L, 463L, 150L, 96L, 476L, 217L, 118L, 618L, 
256L, 116L, 627L, 293L)), row.names = 21295:21312, class = "data.frame")

At the moment there are 6 entries for every SUB_DISTRICT CODE. In the final dataframe there should only be 3 entries for every SUB_DISTRICT_CODE (one for each unique value of AGR3). The column SEX should be dropped and the TOTAL_per_GROUP values be summarized by the AGR3 column. How can I do this in a easy way (using dplyr)? Thanks

CodePudding user response:

Try using group_by() for AGR3 and SUB_DISTRICT_CODE and then summarise():

df %>% group_by(AGR3, SUB_DISTRICT_CODE) %>% 
  summarise(sum = sum(TOTAL_per_GROUP))

Output:

# AGR3 SUB_DISTRICT_CODE   sum
# <int>             <int> <int>
# 1     1             90101   351
# 2     1             90102   187
# 3     1             90103   234
# 4     2             90101  2461
# 5     2             90102   939
# 6     2             90103  1245
# 7     3             90101  1023
# 8     3             90102   367
# 9     3             90103   549

CodePudding user response:

# Base R solution: base_r_res => data.frame 
base_r_res <- aggregate(
  TOTAL_per_GROUP ~ AGR3   SUB_DISTRICT_CODE,
  data = df, 
  FUN = sum
)

# Send result to console: data.frame => stdout(console)
base_r_res

# Data table solution: import library
library(data.table)

# Aggregate by group: dt_res => data.table
dt_res = setDT(df)[,.(TOTAL_per_GROUP=sum(TOTAL_per_GROUP)), by = list(AGR3, SUB_DISTRICT_CODE)]

# Send result to console: data.table => stdout(console)
dt_res
  • Related