Aggregate dataframe by condition in R-CodePudding

I have the following DataFrame in R:

 Y       ...    Price      Year           Quantity      Country    
010190   ...   4781       2021               4           Germany    
010190   ...   367        2021               3           Germany 
010190   ...   4781       2021               6           France    
010190   ...   250        2021               3           France    
020190   ...   690        2021               NA          USA        
020190   ...   10         2021               6           USA  
......         ...         ....              ..          ...   
217834  ...    56        2021                3           USA        
217834 ...     567       2021                9           USA

As you see the numbers in Y column startin with 01.., 02..., 21... I want to aggregate such kind of rows from 6 digit to 2 digit by considering different categorical column (e.g. Country and Year) and sum numerical columns like Quantity and Price. Also I want to take into account rows with NAs during caclulation. So, in the end I want such kind of output:

 Y     Price      Year          Quantity   Country
01     5148       2021           7         Germany
01     5031       2021           9          USA
02     700        2021           6          USA
..     ....       ...           ....        ...      
21     623        2021           12         USA

CodePudding user response：

You can use group_by and summarize from dplyr

library(dplyr)

df %>% 
  mutate(Y = sprintf(as.numeric(factor(Y, unique(Y))), fmt = 'd')) %>%
  group_by(Y, Year, Country) %>%
  summarize(across(where(is.numeric), sum))
#> # A tibble: 4 x 5
#> # Groups:   Y, Year [3]
#>    Y      Year Country Price Quantity
#>    <chr> <int> <chr>   <int>    <int>
#>  1 01     2021 France   5031        9
#>  2 01     2021 Germany  5148        7
#>  3 02     2021 USA       700       NA

CodePudding user response：

update: request:

library(dplyr)
df %>% 
  mutate(Y = substr(Y, 1, 2)) %>% 
  group_by(Y, Year, Country) %>% 
  summarise(across(c(Price, Quantity), ~sum(., na.rm = TRUE)))

We could use substr to get the first two characters from Y and group_by and summarise() with sum()

library(dplyr)
df %>% 
  mutate(Y = substr(Y, 1, 2)) %>% 
  group_by(Y, Year, Country) %>% 
  summarise(Price = sum(Price, na.rm = TRUE),
            Quantity = sum(Quantity, na.rm = TRUE)
            )

  Y      Year Country Price Quantity
  <chr> <dbl> <chr>   <dbl>    <dbl>
1 01     2021 France   5031        9
2 01     2021 Germany  5148        7
3 02     2021 USA       700        6
4 21     2021 USA       623       12

CodePudding user response：

Using aggregate and the substring of Y.

aggregate(cbind(Quantity, Price) ~ Y   Year   Country, 
          transform(dat, Y=substr(Y, 1, 2)), sum)
#    Y Year Country Quantity Price
# 1 10 2021  France        9  5031
# 2 10 2021 Germany        7  5148
# 3 20 2021     USA        7   700
# 4 21 2021     USA       12   623

Data:

dat <- structure(list(Y = c(10190L, 10190L, 10190L, 10190L, 20190L, 
20190L, 217834L, 217834L), foo = c("...", "...", "...", "...", 
"...", "...", "...", "..."), Price = c(4781L, 367L, 4781L, 250L, 
690L, 10L, 56L, 567L), Year = c(2021L, 2021L, 2021L, 2021L, 2021L, 
2021L, 2021L, 2021L), model = c(NA, NA, NA, NA, NA, NA, "Tesla", 
"Tesla"), Quantity = c(4L, 3L, 6L, 3L, 1L, 6L, 3L, 9L), Country = c("Germany", 
"Germany", "France", "France", "USA", "USA", "USA", "USA")), class = "data.frame", row.names = c(NA, 
-8L))