I have a dataset like this (df)
ID | Community | Time | BIP | EXP |
---|---|---|---|---|
1 | 1 | BF | 12000 | 500 |
1 | 1 | DF | NA | NA |
1 | 1 | AF | 8000 | NA |
2 | 1 | BF | 13000 | 300 |
2 | 1 | DF | 12000 | 200 |
2 | 1 | AF | 11000 | 120 |
This df has 40'000 observations. I would like to find out which of these situation in the time column (BF= Before financial crisis; DF = during finance crisis; AF= After Finance Crisis) has the most missing data in all the columns BIP and EXP?
I gave the following code to find out how many missing data are in BIP and EXP
sapply(df, function(x) sum(is.na(x)))
It shows that BIP has 55 missing data and EXP has 34 missing data. But no information for the time situation.
Could someone please help?
CodePudding user response:
You can use aggregate
:
aggregate(cbind(BIP = is.na(BIP), EXP = is.na(EXP)) ~ Time, data = dat, sum)
#Also works
#with(dat, aggregate(list(BIP = BIP, EXP = EXP), list(Time = Time), function(x) sum(is.na(x))))
Time BIP EXP
1 AF 0 1
2 BF 0 0
3 DF 1 1
Or in dplyr
:
library(dplyr)
dat %>%
group_by(Time) %>%
summarise(across(BIP:EXP, ~ sum(is.na(.x))))
Time BIP EXP
1 AF 0 1
2 BF 0 0
3 DF 1 1