Home > Software engineering >  Create a 3-level indicator variable based on counts of unique column items
Create a 3-level indicator variable based on counts of unique column items

Time:05-31

Background

I've got a dataframe df:

df <- data.frame(ID =    c("a","a","a","b","b","b", "c","c","c","c","c","d","d","d","d"),
                 event = c("red",NA,"blue","white",NA,"white","orange",NA,"gray",NA,"red","black","magenta","purple","brown"),
                 stringsAsFactors=FALSE)

> df
   ID   event
1   a     red
2   a    <NA>
3   a    blue
4   b   white
5   b    <NA>
6   b   white
7   c  orange
8   c    <NA>
9   c    gray
10  c    <NA>
11  c     red
12  d   black
13  d magenta
14  d  purple
15  d   brown

It's got 4 people (ID) in it across 15 rows, each with various event entries that are unique within that person: for example ID=a has 2 unique events, red and blue, while ID=b has only 1 unique event, white, even though there's 2 of "white".

The Problem

I'd like to make a new column, indicator, that counts the number of unique non-NA events within each ID and sorts them into levels:

  • 0 if there's only one unique non-NA event for that person
  • 1 for 2 unique non-NA events
  • 2 for 3 or more unique non-NA events

The level indicator should be applied to every row of indicator for each ID, including event=NA rows -- it can't just be indicator= 2, for example, when there's something non-NA in event.

In other words I'm looking for something like this:

# A tibble: 15 x 3
   ID    event   indicator
   <chr> <chr>       <int>
 1 a     red             1
 2 a     NA              1
 3 a     blue            1
 4 b     white           0
 5 b     NA              0
 6 b     white           0
 7 c     orange          2
 8 c     NA              2
 9 c     gray            2
10 c     NA              2
11 c     red             2
12 d     black           2
13 d     magenta         2
14 d     purple          2
15 d     brown           2

What I've tried

I've started out by trying to just build the middle level to see if I can get ID=a equaling 1, but it's doing so for ID=b instead:

df <- df %>% 
  group_by(ID) %>%
  mutate(indicator = if_else((n_distinct(event)) == 2, 1, 0)) %>%
  ungroup()

# A tibble: 15 x 3
   ID    event   indicator
   <chr> <chr>       <dbl>
 1 a     red             0
 2 a     NA              0
 3 a     blue            0
 4 b     white           1
 5 b     NA              1
 6 b     white           1
 7 c     orange          0
 8 c     NA              0
 9 c     gray            0
10 c     NA              0
11 c     red             0
12 d     black           0
13 d     magenta         0
14 d     purple          0
15 d     brown           0

Why does it think 2 instances of "white" is 2 distinct events? Or is n_distinct not what I think it is?

CodePudding user response:

You can use:

library(dplyr)

df %>%
  group_by(ID) %>%
  mutate(x = pmax(pmin(n_distinct(event, na.rm = TRUE), 3) - 1, 0)) %>%
  ungroup()

# A tibble: 15 × 3
   ID    event       x
   <chr> <chr>   <dbl>
 1 a     red         1
 2 a     NA          1
 3 a     blue        1
 4 b     white       0
 5 b     NA          0
 6 b     white       0
 7 c     orange      2
 8 c     NA          2
 9 c     gray        2
10 c     NA          2
11 c     red         2
12 d     black       2
13 d     magenta     2
14 d     purple      2
15 d     brown       2

You could remove pmax() if you're confident that no groups will consist only of NA. To answer your question, you need to tell n_distinct() to omit NA values, otherwise they will be counted.

CodePudding user response:

With data.table and fcase:

library(data.table)

setDT(df)
df[,indicator:=fcase(sum(!is.na(unique(event)))<=1,0,sum(!is.na(unique(event)))==2,1,default=2),by=ID][]

    ID   event indicator
 1:  a     red         1
 2:  a    <NA>         1
 3:  a    blue         1
 4:  b   white         0
 5:  b    <NA>         0
 6:  b   white         0
 7:  c  orange         2
 8:  c    <NA>         2
 9:  c    gray         2
10:  c    <NA>         2
11:  c     red         2
12:  d   black         2
13:  d magenta         2
14:  d  purple         2
15:  d   brown         2

CodePudding user response:

Another possible solution:

library(dplyr)

df %>% 
  group_by(ID) %>% 
  mutate(indicator = n_distinct(event, na.rm = T) %>% 
           {case_when(. == 1 ~ 0, . == 2 ~ 1, . >= 3 ~ 2)}) %>% ungroup

#> # A tibble: 15 × 3
#>    ID    event   indicator
#>    <chr> <chr>       <dbl>
#>  1 a     red             1
#>  2 a     <NA>            1
#>  3 a     blue            1
#>  4 b     white           0
#>  5 b     <NA>            0
#>  6 b     white           0
#>  7 c     orange          2
#>  8 c     <NA>            2
#>  9 c     gray            2
#> 10 c     <NA>            2
#> 11 c     red             2
#> 12 d     black           2
#> 13 d     magenta         2
#> 14 d     purple          2
#> 15 d     brown           2
  • Related