I have a simple data frame consisting of (account) IDs and default rates for five years. Many default rates are missing. The data can be generated as follows:
ID = rep(1:50, each = 5)
def= rnorm(n=250, mean=0.5, sd=0.2)
ind= which(def %in% sample(def, 100))
def[ind] = NA
df = data.frame(ID = ID, Def = def)
And looks like this:
> head(df, 20)
ID Def
1 1 0.39506938
2 1 NA
3 1 0.42946603
4 1 NA
5 1 NA
6 2 0.45125199
7 2 0.40519126
8 2 NA
9 2 0.65082718
10 2 NA
11 3 NA
12 3 0.46132736
13 3 0.06324983
14 3 0.72630862
15 3 0.63996092
16 4 0.72093890
17 4 NA
18 4 NA
19 4 0.61471461
20 4 0.51788498
How can show the ID numbers for which at least 4 of the 5 default rates are not NAs?
CodePudding user response:
You may try
library(dplyr)
df %>%
group_by(ID) %>%
dplyr::summarize(p = sum(!is.na(Def))/n()) %>%
filter(p >= 0.8) %>% # or > 0.8?
pull(ID)
[1] 2 8 9 11 13 17 20 23 25 27 28 29 33 38 44 45 47 49
CodePudding user response:
If you need to keep the data frame you can do:
library(dplyr)
df |>
group_by(ID) |>
mutate(m = sum(is.na(Def))) |>
filter(m <= 1) |>
select(ID, Def)
EDIT:
Can be further simplified
df |>
group_by(ID) |>
filter(sum(is.na(Def)) <= 1)
CodePudding user response:
Here is an alternative:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(x = 5 - sum(is.na(Def))) %>%
slice(1) %>%
filter(x >=4) %>%
pull(ID)