Home > Software design >  R categorize row based on dummy variables
R categorize row based on dummy variables

Time:03-02

I have a data set of ATM IDs that are coded with dummy variables that tells us whether the ATM is open or closed. The goal is to generate a new column (type) that categorizes each ATM based on its opening/closure behavior. In this data, 1 in the dummy variable tells us that the ATM is open, and 0 tells us that the ATMs is closed. Here is the data and expected output.

data <- tribble(
  ~atm_id, ~nov_2019,  ~feb_2020, ~may_2020, ~aug_2020, ~nov_2020, ~type,
  "xx1", 0,  0, 0, 0, 0,  "A", 
  "xx2", 0,  1, 1, 1, 1,  "B",
  "xx3", 0, 0, 1, 1, 1, "B", 
  "xx4", 0, 0, 0, 1, 1, "B",
  "xx5", 0, 1, 0, 1, 1, "C",
  "xx6", 0, 1, 0, 1, 0, "C"
)

I am trying to mutate the type variable and categorize each type of opening/closure behavior.

  • Type A - ATMs that closed in the first time period, and remained closed (all zeros)
  • Type B - ATMs that closed in the first time period, eventually reopened, and have stayed open so far.
  • Type C - ATMs that closed in the first time period, eventually reopened, and then closed again after reopening - i.e., (0, 1, 0, 1)

The month/year columns go up to 2022, and we will be adding more data later on, so the ideal the code is flexible to accommodate. However, these three are the basic types of opening/closure behaviors, and I need to capture them somehow using row-wise operations or some other method.

CodePudding user response:

You can use c_across with case_when.

  • In the first case, type is A if all values are 0 between nov_2019 and nov_2020
  • In the second case, type is B if there are two distinct consecutive values (using data.table::rleid).
  • Otherwise, type is C. This could be replaced by something like n_distinct(data.table::rleid(c_across(nov_2019:nov_2020))) > 2.
library(dplyr)
data %>% 
  rowwise() %>% 
  mutate(new = case_when(all(c_across(nov_2019:nov_2020) == 0) ~ "A",
                         n_distinct(data.table::rleid(c_across(nov_2019:nov_2020))) == 2 ~ "B",
                         T ~ "C"))

# A tibble: 6 x 8
# Rowwise: 
  atm_id nov_2019 feb_2020 may_2020 aug_2020 nov_2020 type  new  
  <chr>     <dbl>    <dbl>    <dbl>    <dbl>    <dbl> <chr> <chr>
1 xx1           0        0        0        0        0 A     A    
2 xx2           0        1        1        1        1 B     B    
3 xx3           0        0        1        1        1 B     B    
4 xx4           0        0        0        1        1 B     B    
5 xx5           0        1        0        1        1 C     C    
6 xx6           0        1        0        1        0 C     C    

CodePudding user response:

I’d approach this first by making a function that encodes the classification rules you want to enforce on a single vector.

library(dplyr, warn.conflicts = FALSE)

classify_atm <- function(is_open) {
  is_open <- as.logical(is_open)
  case_when(
    first(is_open) ~ NA_character_, # Not specified
    # Remained closed
    all(!is_open) ~ "A",
    # Reopened without any further closing
    all(is_open == cummax(is_open)) ~ "B",
    # Reopened, but closed again at some point -- essentially, others
    TRUE ~ "C",
  )  
}

# Test on some input vectors
classify_atm(c(0, 0, 0))
#> [1] "A"
classify_atm(c(0, 1, 1))
#> [1] "B"
classify_atm(c(0, 1, 0))
#> [1] "C"

Then, use rowwise() with c_across() to form the input for each row:

data <- tribble(
  ~atm_id, ~nov_2019,  ~feb_2020, ~may_2020, ~aug_2020, ~nov_2020, ~type,
  "xx1", 0,  0, 0, 0, 0,  "A", 
  "xx2", 0,  1, 1, 1, 1,  "B",
  "xx3", 0, 0, 1, 1, 1, "B", 
  "xx4", 0, 0, 0, 1, 1, "B",
  "xx5", 0, 1, 0, 1, 1, "C",
  "xx6", 0, 1, 0, 1, 0, "C"
)

data %>% 
  rowwise() %>% 
  mutate(
    new_type = classify_atm(c_across(nov_2019:nov_2020))
  )
#> # A tibble: 6 x 8
#> # Rowwise: 
#>   atm_id nov_2019 feb_2020 may_2020 aug_2020 nov_2020 type  new_type
#>   <chr>     <dbl>    <dbl>    <dbl>    <dbl>    <dbl> <chr> <chr>   
#> 1 xx1           0        0        0        0        0 A     A       
#> 2 xx2           0        1        1        1        1 B     B       
#> 3 xx3           0        0        1        1        1 B     B       
#> 4 xx4           0        0        0        1        1 B     B       
#> 5 xx5           0        1        0        1        1 C     C       
#> 6 xx6           0        1        0        1        0 C     C

CodePudding user response:

If atm is opened on first measured instance then type is NA:

f <- function(m){

  res <- rep(NA_character_, ncol(m))
  indices <- which(m[,1] == 0)

  res[indices] <- apply(
    apply(m[indices,], 1, diff),
    2,
    function(x) if(all(x == 0)) "A" else if (any(x == -1)) "C" else "B"
  )

  res
}

my_cols <- grep("_[0-9]{4}$", names(data), value = TRUE)
data$type <- f(data[,my_cols])
  • Related