Home > front end >  Error for NA using group_by or aggregate function [aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...
Error for NA using group_by or aggregate function [aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...

Time:01-10

I've recently picked up R programming and have been looking through some group_by/aggregate questions posted here to help me learn better. A question came to my mind earlier today on how group_by/aggregate can incorporate NA data rather than 0.

Given the table and code below (credits to max_lim for allowing me to use his data set), what happens if the field of NA exist (which does happen quite often)?

Farms = c(rep("Farm 1", 6), rep("Farm 2", 6), rep("Farm 3", 6))
Year = rep(c(2020,2020,2019,2019,2018,2018),3)
Cow = c(22,NA,16,12,8,NA,31,NA,3,20,39,34,27,50,NA,NA,NA,NA)
Duck = c(12,12,6,NA,NA,NA,28,13,31,50,33,20,NA,9,19,2,NA,7)
Chicken = c(100,120,80,50,NA,10,27,31,NA,43,NA,28,37,NA,NA,NA,5,43)
Sheep = c(30,20,10,NA,16,13,10,20,20,17,48,12,30,NA,20,NA,27,49)
Horse = c(25,20,16,11,NA,12,14,NA,43,42,10,12,42,NA,16,7,NA,42)
Data = data.frame(Farms, Year, Cow, Duck, Chicken, Sheep, Horse)
Farm Year Cow Duck Chicken Sheep Horse
Farm 1 2020 22 12 100 30 25
Farm 1 2020 NA 12 120 20 20
Farm 1 2019 16 6 80 10 16
Farm 1 2019 12 NA 50 NA 11
Farm 1 2018 8 NA NA 16 NA
Farm 1 2018 NA NA 10 13 12
Farm 2 2020 31 28 27 10 14
Farm 2 2020 NA 13 31 20 NA
Farm 2 2019 3 31 NA 20 43
Farm 2 2019 20 50 43 17 42
Farm 2 2018 39 33 NA 48 10
Farm 2 2018 34 20 28 12 12
Farm 3 2020 27 NA 37 30 42
Farm 3 2020 50 9 NA NA NA
Farm 3 2019 NA 19 NA 20 16
Farm 3 2019 NA 2 NA NA 7
Farm 3 2018 NA NA 5 27 NA
Farm 3 2018 NA 7 43 49 42

If I were to use aggregate(.~Farms Year, Data, mean) here, I would get Error in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) : no rows to aggregate which I assume is because the mean function isn't able to account for NA.

Does anyone know how we can modify the aggregate/group_by function to account for the NA by calculating the average using only years without NA data? i.e. 2020: 10, 2019: NA, 2018:20, 2017:NA, 2016:15 -> the average (after discounting NA years 2019 and 2017) will be (10 20 15) / (3) = 15.

The ideal output will be as follow:

Farm Year Cow Duck Chicken Sheep Horse
Farm 1 2020 22 (avg = 22/1 as one entry is NA) 12 110 25 22.5
Farm 1 2019 14 6 65 10 13.5
Farm 1 2018 8 N.A. (as it's all NA) 10 14.5 12
Farm 2 2020 31 20.5 29 15 14
Farm 2 2019 11.5 40.5 43 18.5 42.5
Farm 2 2018 36.5 26.5 28 30 11
Farm 3 2020 ... ... ... ... ...
Farm 3 2019 ... ... ... ... ...
Farm 3 2018 ... ... ... ... ...

CodePudding user response:

Here is a way to create the wanted data.frame. I think your solution has one error in row 2 (Sheep), where mean(NA, 10) is equal to 5 and not 10.

library(dplyr)

Using aggregate

 Data %>% 
  aggregate(.~Year Farms,., FUN=mean, na.rm=T, na.action=NULL) %>% 
  arrange(Farms, desc(Year)) %>% 
  as.data.frame() %>%  
  mutate_at(names(.), ~replace(., is.nan(.), NA))

Using summarize

Data %>% 
  group_by(Year, Farms) %>% 
  summarize(MeanCow = mean(Cow, na.rm=T),
            MeanDuck =  mean(Duck, na.rm=T),
            MeanChicken = mean(Chicken, na.rm=T),
            MeanSheep = mean(Sheep, na.rm=T),
            MeanHorse = mean(Horse, na.rm=T)) %>% 
  arrange(Farms, desc(Year)) %>% 
  as.data.frame() %>% 
  mutate_at(names(.), ~replace(., is.nan(.), NA))

Solution for both

      Year  Farms  Cow Duck Chicken Sheep Horse
1 2020 Farm 1 22.0 12.0     110  25.0  22.5
2 2019 Farm 1 14.0  6.0      65  10.0  13.5
3 2018 Farm 1  8.0   NA      10  14.5  12.0
4 2020 Farm 2 31.0 20.5      29  15.0  14.0
5 2019 Farm 2 11.5 40.5      43  18.5  42.5
6 2018 Farm 2 36.5 26.5      28  30.0  11.0
7 2020 Farm 3 38.5  9.0      37  30.0  42.0
8 2019 Farm 3   NA 10.5      NA  20.0  11.5
9 2018 Farm 3   NA  7.0      24  38.0  42.0
  •  Tags:  
  • Related