I have a dataset as shown in the first table below. I would like to create a new dataframe that combines different versions of the same item (i.e. "googleV1" and "googleV2" are combined) and adds (quantity, revenue fields) or averages (ratings field) the numbers associated so that it reads as just a single line item. The second table below illustrates my desired output. How would I go about combining and adding/averaging values as such? Thank you in advance!
I would like the solution to apply for when there may be multiple columns that work like the "group" column. I would also like to refer to the columns "quantity", "revenue", and "rating" by column number.
item | group | quantity | revenue | rating |
---|---|---|---|---|
googleV1 | blue | 4525 | $523513 | 94% |
googleV2 | blue | 2452 | $134134 | 82% |
123V5 | red | 3563 | $134134 | 82% |
123V6 | red | 34534 | $2345 | 34% |
123V7 | yellow | 4574 | $34535 | 64% |
item | group | quantity | revenue | rating |
---|---|---|---|---|
blue | 6977 | $657647 | 88% | |
123 | red | 38097 | $136479 | 58% |
123 | yellow | 4574 | $34535 | 64% |
CodePudding user response:
We may remove the substring in the first column and then do a group by summarise approach
library(dplyr)
library(stringr)
df1 %>%
group_by(item = str_remove(item, "V\\d $")) %>%
summarise(quantity = sum(quantity),
revenue = str_c("$", sum(as.numeric(str_remove(revenue, fixed("$"))))),
rating = str_c(mean(readr::parse_number(rating)), "%"), .groups = 'drop')
-output
# A tibble: 2 × 4
item quantity revenue rating
<chr> <int> <chr> <chr>
1 123 42671 $171014 60%
2 google 6977 $657647 88%
data
df1 <- structure(list(item = c("googleV1", "googleV2", "123V5", "123V6",
"123V7"), quantity = c(4525L, 2452L, 3563L, 34534L, 4574L),
revenue = c("$523513",
"$134134", "$134134", "$2345", "$34535"), rating = c("94%", "82%",
"82%", "34%", "64%")), class = "data.frame", row.names = c(NA,
-5L))