Create a new column using dplyr based on string values in all other columns in a data frame in R

Time:06-17

I have a data frame, my_df:

my_df <- structure(list(C1 = c("A", "X", "X", "A", "A"), F2 = c("A", "A", 
"A", "A", "A"), T3 = c("A", "A", "X", "X", "A"), S4 = c("A", 
"A", "A", "A", "X"), B5 = c("A", "A", "A", "A", "A")), class = "data.frame", row.names = c("ID1", 
"ID2", "ID3", "ID4", "ID5"))

> my_df
    C1 F2 T3 S4 B5
ID1  A  A  A  A  A
ID2  X  A  A  A  A
ID3  X  A  X  A  A
ID4  A  A  X  A  A
ID5  A  A  A  X  A

I want to create a new column, new_col, that says "same" if all values in all other columns are identical, otherwise it says "diff". I.e., the resulting data frame would look like:

> my_df
    C1 F2 T3 S4 B5 new_col
ID1  A  A  A  A  A    same
ID2  X  A  A  A  A    diff
ID3  X  A  X  A  A    diff
ID4  A  A  X  A  A    diff
ID5  A  A  A  X  A    diff

What is the best way to achieve this using dplyr?

CodePudding user response:

library(tidyverse)
my_df <- structure(list(C1 = c("A", "X", "X", "A", "A"),
                        F2 = c("A", "A", "A", "A", "A"),
                        T3 = c("A", "A", "X", "X", "A"),
                        S4 = c("A", "A", "A", "A", "X"),
                        B5 = c("A", "A", "A", "A", "A")),
                   class = "data.frame",
                   row.names = c("ID1","ID2", "ID3", "ID4", "ID5"))
my_df %>% 
  rowwise() %>% 
  mutate(new_col = if_else(
    length(unique(c_across(everything()))) == 1,
    "same",
    "diff"
  ))
#> # A tibble: 5 × 6
#> # Rowwise: 
#>   C1    F2    T3    S4    B5    new_col
#>   <chr> <chr> <chr> <chr> <chr> <chr>  
#> 1 A     A     A     A     A     same   
#> 2 X     A     A     A     A     diff   
#> 3 X     A     X     A     A     diff   
#> 4 A     A     X     A     A     diff   
#> 5 A     A     A     X     A     diff
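
If the row-by-row evaluation of rowwise() is too slow on a larger data frame, a vectorised variant may be worth trying. The following is only a sketch, assuming dplyr 1.0.4 or later (where if_all() is available) and assuming all columns are character with no NA values. It compares every column against C1 and returns a plain data frame rather than a rowwise tibble.

```r
library(dplyr)

# Same sample data as in the question
my_df <- data.frame(
  C1 = c("A", "X", "X", "A", "A"),
  F2 = c("A", "A", "A", "A", "A"),
  T3 = c("A", "A", "X", "X", "A"),
  S4 = c("A", "A", "A", "A", "X"),
  B5 = c("A", "A", "A", "A", "A"),
  row.names = c("ID1", "ID2", "ID3", "ID4", "ID5"),
  stringsAsFactors = FALSE
)

# if_all() is TRUE for a row only when every selected column equals C1
result <- my_df %>%
  mutate(new_col = if_else(if_all(everything(), ~ .x == C1), "same", "diff"))

print(result)
```

Comparing against the first column is an arbitrary choice here; any column works as the reference, since a row is "same" only when all columns agree.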

CodePudding user response:

There are several ways to do this. One is to check whether every value in a row equals the value in the first column:

#base R (two alternatives; run either one on the original five-column my_df)
my_df$new_col <- ifelse(rowSums(my_df == my_df[, 1]) == ncol(my_df), "same", "diff")
#my_df$new_col <- ifelse(apply(my_df, 1, function(x) all(x == x[1])), "same", "diff")

#dplyr
my_df %>% 
  dplyr::mutate(new_col = ifelse(rowSums(. == .[, 1]) == ncol(.), "same", "diff"))

    C1 F2 T3 S4 B5 new_col
ID1  A  A  A  A  A    same
ID2  X  A  A  A  A    diff
ID3  X  A  X  A  A    diff
ID4  A  A  X  A  A    diff
ID5  A  A  A  X  A    diff

You can also check whether the number of unique values per row is 1 (this returns a logical vector, one element per row):

apply(my_df, 1, function(x) length(unique(x)) == 1)
#apply(my_df, 1, function(x) dplyr::n_distinct(x) == 1)
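
To get from that logical vector to the requested column, one option is to pass it through ifelse(). This is a minimal base R sketch on the question's sample data; the intermediate name all_same is just an illustrative choice.

```r
# Same sample data as in the question
my_df <- data.frame(
  C1 = c("A", "X", "X", "A", "A"),
  F2 = c("A", "A", "A", "A", "A"),
  T3 = c("A", "A", "X", "X", "A"),
  S4 = c("A", "A", "A", "A", "X"),
  B5 = c("A", "A", "A", "A", "A"),
  row.names = c("ID1", "ID2", "ID3", "ID4", "ID5"),
  stringsAsFactors = FALSE
)

# TRUE for rows that contain exactly one distinct value
all_same <- apply(my_df, 1, function(x) length(unique(x)) == 1)

# Map TRUE/FALSE to the labels asked for in the question
my_df$new_col <- ifelse(all_same, "same", "diff")

print(my_df)
```

Note that unique() counts NA as a value of its own, so a row containing an NA next to other values would be labelled "diff" with this approach.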

CodePudding user response:

data.table option using uniqueN:

library(data.table)
setDT(my_df)[, new_col := c("diff", "same")[(uniqueN(unlist(.SD)) == 1) + 1], by = 1:nrow(my_df)]
my_df

Output:

   C1 F2 T3 S4 B5 new_col
1:  A  A  A  A  A    same
2:  X  A  A  A  A    diff
3:  X  A  X  A  A    diff
4:  A  A  X  A  A    diff
5:  A  A  A  X  A    diff
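
The indexing trick in that data.table call may not be obvious: adding 1 to a logical turns FALSE into 1 and TRUE into 2, which then picks an element out of c("diff", "same"). A small base R illustration, using a hand-written logical vector (all_same is a hypothetical name, with values matching the sample data):

```r
# Whether each of the five sample rows has a single distinct value
all_same <- c(TRUE, FALSE, FALSE, FALSE, FALSE)

# FALSE + 1 is 1 (selects "diff"), TRUE + 1 is 2 (selects "same")
idx <- all_same + 1
labels <- c("diff", "same")[idx]

print(idx)     # 2 1 1 1 1
print(labels)  # "same" "diff" "diff" "diff" "diff"
```

It does the same job as ifelse(all_same, "same", "diff"); which of the two reads better is mostly a matter of taste.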