I've recently picked up R programming and have been looking through some group_by/aggregate questions posted here to help me learn better. A question came to my mind earlier today on how group_by/aggregate can incorporate NA data rather than 0.
Given the table and code below (credits to max_lim for allowing me to use his data set), what happens if the field of NA exist (which does happen quite often)?
Farms = c(rep("Farm 1", 6), rep("Farm 2", 6), rep("Farm 3", 6))
Year = rep(c(2020,2020,2019,2019,2018,2018),3)
Cow = c(22,NA,16,12,8,NA,31,NA,3,20,39,34,27,50,NA,NA,NA,NA)
Duck = c(12,12,6,NA,NA,NA,28,13,31,50,33,20,NA,9,19,2,NA,7)
Chicken = c(100,120,80,50,NA,10,27,31,NA,43,NA,28,37,NA,NA,NA,5,43)
Sheep = c(30,20,10,NA,16,13,10,20,20,17,48,12,30,NA,20,NA,27,49)
Horse = c(25,20,16,11,NA,12,14,NA,43,42,10,12,42,NA,16,7,NA,42)
Data = data.frame(Farms, Year, Cow, Duck, Chicken, Sheep, Horse)
Farm | Year | Cow | Duck | Chicken | Sheep | Horse |
---|---|---|---|---|---|---|
Farm 1 | 2020 | 22 | 12 | 100 | 30 | 25 |
Farm 1 | 2020 | NA | 12 | 120 | 20 | 20 |
Farm 1 | 2019 | 16 | 6 | 80 | 10 | 16 |
Farm 1 | 2019 | 12 | NA | 50 | NA | 11 |
Farm 1 | 2018 | 8 | NA | NA | 16 | NA |
Farm 1 | 2018 | NA | NA | 10 | 13 | 12 |
Farm 2 | 2020 | 31 | 28 | 27 | 10 | 14 |
Farm 2 | 2020 | NA | 13 | 31 | 20 | NA |
Farm 2 | 2019 | 3 | 31 | NA | 20 | 43 |
Farm 2 | 2019 | 20 | 50 | 43 | 17 | 42 |
Farm 2 | 2018 | 39 | 33 | NA | 48 | 10 |
Farm 2 | 2018 | 34 | 20 | 28 | 12 | 12 |
Farm 3 | 2020 | 27 | NA | 37 | 30 | 42 |
Farm 3 | 2020 | 50 | 9 | NA | NA | NA |
Farm 3 | 2019 | NA | 19 | NA | 20 | 16 |
Farm 3 | 2019 | NA | 2 | NA | NA | 7 |
Farm 3 | 2018 | NA | NA | 5 | 27 | NA |
Farm 3 | 2018 | NA | 7 | 43 | 49 | 42 |
If I were to use aggregate(.~Farms Year, Data, mean) here, I would get Error in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) : no rows to aggregate which I assume is because the mean function isn't able to account for NA.
Does anyone know how we can modify the aggregate/group_by function to account for the NA by calculating the average using only years without NA data? i.e. 2020: 10, 2019: NA, 2018:20, 2017:NA, 2016:15 -> the average (after discounting NA years 2019 and 2017) will be (10 20 15) / (3) = 15.
The ideal output will be as follow:
Farm | Year | Cow | Duck | Chicken | Sheep | Horse |
---|---|---|---|---|---|---|
Farm 1 | 2020 | 22 (avg = 22/1 as one entry is NA) | 12 | 110 | 25 | 22.5 |
Farm 1 | 2019 | 14 | 6 | 65 | 10 | 13.5 |
Farm 1 | 2018 | 8 | N.A. (as it's all NA) | 10 | 14.5 | 12 |
Farm 2 | 2020 | 31 | 20.5 | 29 | 15 | 14 |
Farm 2 | 2019 | 11.5 | 40.5 | 43 | 18.5 | 42.5 |
Farm 2 | 2018 | 36.5 | 26.5 | 28 | 30 | 11 |
Farm 3 | 2020 | ... | ... | ... | ... | ... |
Farm 3 | 2019 | ... | ... | ... | ... | ... |
Farm 3 | 2018 | ... | ... | ... | ... | ... |
CodePudding user response:
Here is a way to create the wanted data.frame. I think your solution has one error in row 2 (Sheep), where mean(NA, 10) is equal to 5 and not 10.
library(dplyr)
Using aggregate
Data %>%
aggregate(.~Year Farms,., FUN=mean, na.rm=T, na.action=NULL) %>%
arrange(Farms, desc(Year)) %>%
as.data.frame() %>%
mutate_at(names(.), ~replace(., is.nan(.), NA))
Using summarize
Data %>%
group_by(Year, Farms) %>%
summarize(MeanCow = mean(Cow, na.rm=T),
MeanDuck = mean(Duck, na.rm=T),
MeanChicken = mean(Chicken, na.rm=T),
MeanSheep = mean(Sheep, na.rm=T),
MeanHorse = mean(Horse, na.rm=T)) %>%
arrange(Farms, desc(Year)) %>%
as.data.frame() %>%
mutate_at(names(.), ~replace(., is.nan(.), NA))
Solution for both
Year Farms Cow Duck Chicken Sheep Horse
1 2020 Farm 1 22.0 12.0 110 25.0 22.5
2 2019 Farm 1 14.0 6.0 65 10.0 13.5
3 2018 Farm 1 8.0 NA 10 14.5 12.0
4 2020 Farm 2 31.0 20.5 29 15.0 14.0
5 2019 Farm 2 11.5 40.5 43 18.5 42.5
6 2018 Farm 2 36.5 26.5 28 30.0 11.0
7 2020 Farm 3 38.5 9.0 37 30.0 42.0
8 2019 Farm 3 NA 10.5 NA 20.0 11.5
9 2018 Farm 3 NA 7.0 24 38.0 42.0