Error for NA using group_by or aggregate function [aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...-CodePudding

I've recently picked up R programming and have been looking through some group_by/aggregate questions posted here to help me learn better. A question came to my mind earlier today on how group_by/aggregate can incorporate NA data rather than 0.

Given the table and code below (credits to max_lim for allowing me to use his data set), what happens if the field of NA exist (which does happen quite often)?

Farms = c(rep("Farm 1", 6), rep("Farm 2", 6), rep("Farm 3", 6))
Year = rep(c(2020,2020,2019,2019,2018,2018),3)
Cow = c(22,NA,16,12,8,NA,31,NA,3,20,39,34,27,50,NA,NA,NA,NA)
Duck = c(12,12,6,NA,NA,NA,28,13,31,50,33,20,NA,9,19,2,NA,7)
Chicken = c(100,120,80,50,NA,10,27,31,NA,43,NA,28,37,NA,NA,NA,5,43)
Sheep = c(30,20,10,NA,16,13,10,20,20,17,48,12,30,NA,20,NA,27,49)
Horse = c(25,20,16,11,NA,12,14,NA,43,42,10,12,42,NA,16,7,NA,42)
Data = data.frame(Farms, Year, Cow, Duck, Chicken, Sheep, Horse)

Farm	Year	Cow	Duck	Chicken	Sheep	Horse
Farm 1	2020	22	12	100	30	25
Farm 1	2020	NA	12	120	20	20
Farm 1	2019	16	6	80	10	16
Farm 1	2019	12	NA	50	NA	11
Farm 1	2018	8	NA	NA	16	NA
Farm 1	2018	NA	NA	10	13	12
Farm 2	2020	31	28	27	10	14
Farm 2	2020	NA	13	31	20	NA
Farm 2	2019	3	31	NA	20	43
Farm 2	2019	20	50	43	17	42
Farm 2	2018	39	33	NA	48	10
Farm 2	2018	34	20	28	12	12
Farm 3	2020	27	NA	37	30	42
Farm 3	2020	50	9	NA	NA	NA
Farm 3	2019	NA	19	NA	20	16
Farm 3	2019	NA	2	NA	NA	7
Farm 3	2018	NA	NA	5	27	NA
Farm 3	2018	NA	7	43	49	42

If I were to use aggregate(.~Farms Year, Data, mean) here, I would get Error in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) : no rows to aggregate which I assume is because the mean function isn't able to account for NA.

Does anyone know how we can modify the aggregate/group_by function to account for the NA by calculating the average using only years without NA data? i.e. 2020: 10, 2019: NA, 2018:20, 2017:NA, 2016:15 -> the average (after discounting NA years 2019 and 2017) will be (10 20 15) / (3) = 15.

The ideal output will be as follow:

Farm	Year	Cow	Duck	Chicken	Sheep	Horse
Farm 1	2020	22 (avg = 22/1 as one entry is NA)	12	110	25	22.5
Farm 1	2019	14	6	65	10	13.5
Farm 1	2018	8	N.A. (as it's all NA)	10	14.5	12
Farm 2	2020	31	20.5	29	15	14
Farm 2	2019	11.5	40.5	43	18.5	42.5
Farm 2	2018	36.5	26.5	28	30	11
Farm 3	2020	...	...	...	...	...
Farm 3	2019	...	...	...	...	...
Farm 3	2018	...	...	...	...	...

CodePudding user response：

Here is a way to create the wanted data.frame. I think your solution has one error in row 2 (Sheep), where mean(NA, 10) is equal to 5 and not 10.

library(dplyr)

Using aggregate

 Data %>% 
  aggregate(.~Year Farms,., FUN=mean, na.rm=T, na.action=NULL) %>% 
  arrange(Farms, desc(Year)) %>% 
  as.data.frame() %>%  
  mutate_at(names(.), ~replace(., is.nan(.), NA))

Using summarize

Data %>% 
  group_by(Year, Farms) %>% 
  summarize(MeanCow = mean(Cow, na.rm=T),
            MeanDuck =  mean(Duck, na.rm=T),
            MeanChicken = mean(Chicken, na.rm=T),
            MeanSheep = mean(Sheep, na.rm=T),
            MeanHorse = mean(Horse, na.rm=T)) %>% 
  arrange(Farms, desc(Year)) %>% 
  as.data.frame() %>% 
  mutate_at(names(.), ~replace(., is.nan(.), NA))

Solution for both

      Year  Farms  Cow Duck Chicken Sheep Horse
1 2020 Farm 1 22.0 12.0     110  25.0  22.5
2 2019 Farm 1 14.0  6.0      65  10.0  13.5
3 2018 Farm 1  8.0   NA      10  14.5  12.0
4 2020 Farm 2 31.0 20.5      29  15.0  14.0
5 2019 Farm 2 11.5 40.5      43  18.5  42.5
6 2018 Farm 2 36.5 26.5      28  30.0  11.0
7 2020 Farm 3 38.5  9.0      37  30.0  42.0
8 2019 Farm 3   NA 10.5      NA  20.0  11.5
9 2018 Farm 3   NA  7.0      24  38.0  42.0