Collapse and summarize while maintaining most frequent character variable by group-CodePudding

I have a data frame:

df <- data.frame(resource = c("gold", "gold", "gold", "silver", "silver", "gold", "silver", "bronze"), amount = c(500, 2000, 4, 8, 100, 2000, 3, 5), unit = c("g", "g", "kg", "ton", "kg", "g", "ton", "kg"), price = c(10, 10, 10000, 50000, 50, 10, 50000, 20))

I want to calculate the total amount of each resource in the most common unit.

My idea to approach this would be to:

calculate the total value of each resource
determine the most common unit of measurement and the corresponding price
divide the total value by this price while maintaining a variable that indicates the measurement unit and ideally the price as well.

The result should look like

resource value  price unit amount
bronze   100    20     kg  100  
gold     85000  10     g   8500
silver   555000 50000  ton 11.1

If two measurement units are equally frequent it can be random, which one is used.

CodePudding user response：

I doubt that the format you desire is really helpful (as PeterK pointed out in the comments), but here we go:

df <- data.frame(resource = c("gold", "gold", "gold", "silver", "silver", "gold", "silver", "bronze"), amount = c(500, 2000, 4, 8, 100, 2000, 3, 5), unit = c("g", "g", "kg", "ton", "kg", "g", "ton", "kg"), price = c(10, 10, 10000, 50000, 50, 10, 50000, 20))

# calculate total value
DT <- setDT(df)[, .(value = sum(amount * price)), by = resource]

# create wide data
#  variables we want to cast wide
cols <- c("amount", "unit")
#  cast to wide
DT.wide <- dcast(setDT(df), resource ~ rowid(resource), value.var = cols)
new_colorder <- CJ(unique(rowid(df$resource)), cols, sorted = FALSE)[, paste(cols, V1, sep = "_")]
#  reorder the relevant columns
setcolorder(DT.wide, c(setdiff(names(DT.wide), new_colorder), new_colorder))

# join together
DT[DT.wide, on = .(resource)]

#    resource  value amount_1 unit_1 amount_2 unit_2 amount_3 unit_3 amount_4 unit_4
# 1:   bronze    100        5     kg       NA   <NA>       NA   <NA>       NA   <NA>
# 2:     gold  85000      500      g     2000      g        4     kg     2000      g
# 3:   silver 555000        8    ton      100     kg        3    ton       NA   <NA>

CodePudding user response：

I found an answer to my edited question following this thread Return most frequent string value for each group and modifying the code to fit my needs:

df %>% group_by(resource) %>% summarize (value = sum(price * amount), freq_unit =names(which.max(table(unit))), freq_price = as.numeric(names(which.max(table(price)))), total_amount = value/freq_price)

gives me

 resource  value freq_unit freq_price total_amount
  <chr>     <dbl> <chr>          <dbl>        <dbl>
1 bronze      100 kg                20          5  
2 gold      85000 g                 10       8500  
3 silver   555000 ton            50000         11.1