Home > database >  Return IDs for which a column satisfies a condition
Return IDs for which a column satisfies a condition

Time:06-21

I have a simple data frame consisting of (account) IDs and default rates for five years. Many default rates are missing. The data can be generated as follows:

ID = rep(1:50, each = 5)

def= rnorm(n=250, mean=0.5, sd=0.2)

ind= which(def %in% sample(def, 100))

def[ind] = NA

df = data.frame(ID = ID, Def = def)

And looks like this:

> head(df, 20)
   ID        Def
1   1 0.39506938
2   1         NA
3   1 0.42946603
4   1         NA
5   1         NA
6   2 0.45125199
7   2 0.40519126
8   2         NA
9   2 0.65082718
10  2         NA
11  3         NA
12  3 0.46132736
13  3 0.06324983
14  3 0.72630862
15  3 0.63996092
16  4 0.72093890
17  4         NA
18  4         NA
19  4 0.61471461
20  4 0.51788498

How can show the ID numbers for which at least 4 of the 5 default rates are not NAs?

CodePudding user response:

You may try

library(dplyr)
df %>%
  group_by(ID) %>%
  dplyr::summarize(p = sum(!is.na(Def))/n()) %>%
  filter(p >= 0.8) %>% # or > 0.8?
  pull(ID)
[1]  2  8  9 11 13 17 20 23 25 27 28 29 33 38 44 45 47 49

CodePudding user response:

If you need to keep the data frame you can do:


library(dplyr)

df |> 
  group_by(ID) |> 
  mutate(m = sum(is.na(Def))) |> 
  filter(m <= 1) |> 
  select(ID, Def)

EDIT:

Can be further simplified

df |> 
  group_by(ID) |> 
  filter(sum(is.na(Def)) <= 1)

CodePudding user response:

Here is an alternative:

library(dplyr)

df %>% 
  group_by(ID) %>% 
  mutate(x = 5 - sum(is.na(Def))) %>% 
  slice(1) %>% 
  filter(x >=4) %>% 
  pull(ID)
  • Related