Home > Mobile >  Aggregating by average on 2 conditions while accounting for NA
Aggregating by average on 2 conditions while accounting for NA

Time:05-11

I've the following table

Data = structure(list(Countries = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("China", "India", "Vietnam"), class = "factor"), Year = c(2019L, 2018L, 2018L, 2018L, 2017L,  2017L, 2019L, 2019L, 2018L, 2018L, 2017L, 2018L, 2018L, 2018L,2017L, 2017L, 2019L, 2018L, 2018L, 2018L, 2017L, 2017L, 2019L,  2019L, 2019L, 2018L, 2017L, 2017L), Food = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Bread","Rice"), class = "factor"), Price = c(2.8, 2.8, 2.7, NA, 2.6, 2.58, 2.53, 2.5, NA, NA, 2.395, 2.9, 2.8, 2.75, 2.66, 2.5, 11.5,11.3, 11.2, 11, NA, 10.7, 10.7, NA, NA, 10.3, 10.1, 10)), class = "data.frame", row.names = c(NA, -28L))

The table shows the following:

Countries Year Food Price
China 2019 Bread 2.8
China 2018 Bread 2.8
China 2018 Bread 2.7
China 2018 Bread NA
China 2017 Bread 2.6
China 2017 Bread 2.58
India 2019 Bread 2.53
India 2019 Bread 2.5
India 2018 Bread NA
India 2018 Bread NA
India 2017 Bread 2.395
Vietnam 2018 Bread 2.9
Vietnam 2018 Bread 2.8
Vietnam 2018 Bread 2.75
Vietnam 2017 Bread 2.66
Vietnam 2017 Bread 2.5
China 2019 Rice 11.5
China 2018 Rice 11.3
China 2018 Rice 11.2
China 2018 Rice 11.0
China 2017 Rice NA
China 2017 Rice 10.7
Vietnam 2019 Rice 10.7
Vietnam 2019 Rice NA
Vietnam 2019 Rice NA
Vietnam 2018 Rice 10.3
Vietnam 2017 Rice 10.1
Vietnam 2017 Rice 10.0

Does anyone know how to aggregate the price of the individual food items based on the countries and the year (the actual dataset has more countries, year and food item but in the same format) using dplyr and/or tiderverse while accounting for NA? i.e.

  1. If the price of bread for 2018 is 2.8, 2.7 and NA, the average will be (2.8 2.7)/2 instead of (2.8 2.7 0)/3
  2. If the price of bread for the entire year is NA, we can discard this and don't even bother printing it on the output table.

Output table

Countries Year Food Price
China 2019 Bread 2.8
China 2018 Bread 2.8
China 2017 Bread 2.6
India 2019 Bread 2.5
India 2017 Bread 2.4
Vietnam 2018 Bread 2.8
Vietnam 2017 Bread 2.6
China 2019 Rice 11.5
China 2018 Rice 11.2
China 2017 Rice 10.7
Vietnam 2019 Rice 10.7
Vietnam 2018 Rice 10.3
Vietnam 2017 Rice 10.1

Also out of genuine curiosity, is this even possible to do in base R?

CodePudding user response:

For a base R answer, aggregate will drop the missing values automatically (you can change this behaviour with the na.action argument if you want to). So:

aggregate( Price ~ Food   Year   Countries , mean , data=Data)

gives you:

    Food Year Countries     Price
1  Bread 2017     China  2.590000
2   Rice 2017     China 10.700000
3  Bread 2018     China  2.750000
4   Rice 2018     China 11.166667
5  Bread 2019     China  2.800000
6   Rice 2019     China 11.500000
7  Bread 2017     India  2.395000
8  Bread 2019     India  2.515000
9  Bread 2017   Vietnam  2.580000
10  Rice 2017   Vietnam 10.050000
11 Bread 2018   Vietnam  2.816667
12  Rice 2018   Vietnam 10.300000
13  Rice 2019   Vietnam 10.700000

If you want them in a different order just reorder the RHS of the formula.

CodePudding user response:

With dplyr:

Data %>%
  group_by(Countries, Year, Food) %>%
  summarise(Price = mean(Price, na.rm = TRUE), .groups = 'drop') %>%
  filter(!is.na(Price)) %>%
  arrange(Food, Countries, desc(Year))

#> # A tibble: 13 × 4
#>    Countries  Year Food  Price
#>    <fct>     <int> <fct> <dbl>
#>  1 China      2019 Bread  2.8 
#>  2 China      2018 Bread  2.75
#>  3 China      2017 Bread  2.59
#>  4 India      2019 Bread  2.51
#>  5 India      2017 Bread  2.40
#>  6 Vietnam    2018 Bread  2.82
#>  7 Vietnam    2017 Bread  2.58
#>  8 China      2019 Rice  11.5 
#>  9 China      2018 Rice  11.2 
#> 10 China      2017 Rice  10.7 
#> 11 Vietnam    2019 Rice  10.7 
#> 12 Vietnam    2018 Rice  10.3 
#> 13 Vietnam    2017 Rice  10.0

CodePudding user response:

Try this:

library(tidyverse)


Data <- structure(list(Countries = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("China", "India", "Vietnam"), class = "factor"), Year = c(2019L, 2018L, 2018L, 2018L, 2017L, 2017L, 2019L, 2019L, 2018L, 2018L, 2017L, 2018L, 2018L, 2018L, 2017L, 2017L, 2019L, 2018L, 2018L, 2018L, 2017L, 2017L, 2019L, 2019L, 2019L, 2018L, 2017L, 2017L), Food = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Bread", "Rice"), class = "factor"), Price = c(2.8, 2.8, 2.7, NA, 2.6, 2.58, 2.53, 2.5, NA, NA, 2.395, 2.9, 2.8, 2.75, 2.66, 2.5, 11.5, 11.3, 11.2, 11, NA, 10.7, 10.7, NA, NA, 10.3, 10.1, 10)), class = "data.frame", row.names = c(NA, -28L))

Data |> 
  group_by(Countries, Year, Food) |> 
  summarise(Price = round(mean(Price, na.rm = TRUE), 1)) |> 
  arrange(Food, Countries, desc(Year)) |> 
  filter(!is.nan(Price))
#> # A tibble: 13 × 4
#> # Groups:   Countries, Year [8]
#>    Countries  Year Food  Price
#>    <fct>     <int> <fct> <dbl>
#>  1 China      2019 Bread   2.8
#>  2 China      2018 Bread   2.8
#>  3 China      2017 Bread   2.6
#>  4 India      2019 Bread   2.5
#>  5 India      2017 Bread   2.4
#>  6 Vietnam    2018 Bread   2.8
#>  7 Vietnam    2017 Bread   2.6
#>  8 China      2019 Rice   11.5
#>  9 China      2018 Rice   11.2
#> 10 China      2017 Rice   10.7
#> 11 Vietnam    2019 Rice   10.7
#> 12 Vietnam    2018 Rice   10.3
#> 13 Vietnam    2017 Rice   10.1

Created on 2022-05-11 by the reprex package (v2.0.1)

CodePudding user response:

with data.table:

library(data.table)
setDT(Data)[!is.na(Price), .(Price=mean(Price,na.rm=T)), by=.(Countries,Year,Food)]

Output:

    Countries Year  Food     Price
 1:     China 2019 Bread  2.800000
 2:     China 2018 Bread  2.750000
 3:     China 2017 Bread  2.590000
 4:     India 2019 Bread  2.515000
 5:     India 2017 Bread  2.395000
 6:   Vietnam 2018 Bread  2.816667
 7:   Vietnam 2017 Bread  2.580000
 8:     China 2019  Rice 11.500000
 9:     China 2018  Rice 11.166667
10:     China 2017  Rice 10.700000
11:   Vietnam 2019  Rice 10.700000
12:   Vietnam 2018  Rice 10.300000
13:   Vietnam 2017  Rice 10.050000
  • Related