Home > other >  How to find out which situation has the most missing data
How to find out which situation has the most missing data

Time:10-24

I have a dataset like this (df)

ID Community Time BIP EXP
1 1 BF 12000 500
1 1 DF NA NA
1 1 AF 8000 NA
2 1 BF 13000 300
2 1 DF 12000 200
2 1 AF 11000 120

This df has 40'000 observations. I would like to find out which of these situation in the time column (BF= Before financial crisis; DF = during finance crisis; AF= After Finance Crisis) has the most missing data in all the columns BIP and EXP?

I gave the following code to find out how many missing data are in BIP and EXP

sapply(df, function(x) sum(is.na(x)))

It shows that BIP has 55 missing data and EXP has 34 missing data. But no information for the time situation.

Could someone please help?

CodePudding user response:

You can use aggregate:

aggregate(cbind(BIP = is.na(BIP), EXP = is.na(EXP)) ~ Time, data = dat, sum)
#Also works
#with(dat, aggregate(list(BIP = BIP, EXP = EXP), list(Time = Time), function(x) sum(is.na(x))))

  Time BIP EXP
1   AF   0   1
2   BF   0   0
3   DF   1   1

Or in dplyr:

library(dplyr)
dat %>% 
  group_by(Time) %>% 
  summarise(across(BIP:EXP, ~ sum(is.na(.x))))

  Time    BIP   EXP
1 AF        0     1
2 BF        0     0
3 DF        1     1
  •  Tags:  
  • r
  • Related