New column based on values from other columns AND respecting pre-established rules

I'm looking for an algorithm to create a new column based on values from other columns AND respecting pre-established rules. Here's an example:

artificial data

df = data.frame(
  col_1 = c('No','Yes','Yes','Yes','Yes','Yes','No','No','No','Unknown'),
  col_2 = c('Yes','Yes','Unknown','Yes','Unknown','No','Unknown','No','Unknown','Unknown'),
  col_3 = c('Unknown','Yes','Yes','Unknown','Unknown','No','No','Unknown','Unknown','Unknown')
)

The goal is to create a new_column based on the values of col_1, col_2, and col_3. For that, the rules are:

If the value 'Yes' is present in any of the columns, the value of the new_column will be 'Yes';
If the value 'Yes' is not present in any of the columns, but the value 'No' is present, then the value of the new_column will be 'No';
If the values 'Yes' and 'No' are both absent, then the value of the new_column will be 'Unknown'.
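
Read together, the three rules amount to a priority order: 'Yes' beats 'No', which beats 'Unknown'. A minimal sketch of that idea in plain Python, for a row of any length, could look like the following. The names resolve, PRIORITY, and DEFAULT are hypothetical and only serve the illustration.

```python
# Priority order taken from the rules: 'Yes' wins over 'No',
# and 'Unknown' is the fallback when neither is present.
PRIORITY = ('Yes', 'No')   # hypothetical name; values are checked in this order
DEFAULT = 'Unknown'        # hypothetical name; used when no priority value is found


def resolve(values, priority=PRIORITY, default=DEFAULT):
    """Return the first value of `priority` that occurs in `values`, else `default`."""
    present = set(values)
    for candidate in priority:
        if candidate in present:
            return candidate
    return default


# A few rows from the example data; each tuple may hold any number of columns.
rows = [
    ('No', 'Yes', 'Unknown'),
    ('Yes', 'No', 'No'),
    ('No', 'Unknown', 'No'),
    ('Unknown', 'Unknown', 'Unknown'),
]
results = [resolve(row) for row in rows]
print(results)
```

Because the function only looks at which values are present, adding more columns does not change the code.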

I managed to implement this either with case_when(), spelling out every possible combination, or with nested ifelse() calls. But neither solution scales to N variables.

Current solution:

library(dplyr)
df_1 <-
  df %>%
  mutate(
    new_column = ifelse(
      (col_1 == 'Yes' | col_2 == 'Yes' | col_3 == 'Yes'), 'Yes',
      ifelse(
        (col_1 == 'Unknown' & col_2 == 'Unknown' & col_3 == 'Unknown'), 'Unknown','No'
        )
      )
    )

I'm looking for an approach that does this more efficiently and can be extended to N variables.

After searching StackOverflow, I couldn't find a solution to my problem (I know there are several posts about creating a new column based on values from other columns, but none matched this case). Perhaps my search strategy was not the best. If anyone finds a relevant post, please provide the link.

I used R in the code, but the same approach also works in Python using np.where. Solutions in R or Python are welcome.
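
For reference, a Python counterpart of the nested ifelse above might look roughly like this. It is only a sketch, assuming the same data in a pandas DataFrame named df; the helper variables any_yes and all_unknown are made up for readability. It has the same limitation as the R version, since every column is written out by hand.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_1': ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Unknown'],
    'col_2': ['Yes', 'Yes', 'Unknown', 'Yes', 'Unknown', 'No', 'Unknown', 'No', 'Unknown', 'Unknown'],
    'col_3': ['Unknown', 'Yes', 'Yes', 'Unknown', 'Unknown', 'No', 'No', 'Unknown', 'Unknown', 'Unknown'],
})

# Same two conditions as the nested ifelse, spelled out column by column.
any_yes = (df['col_1'] == 'Yes') | (df['col_2'] == 'Yes') | (df['col_3'] == 'Yes')
all_unknown = (df['col_1'] == 'Unknown') & (df['col_2'] == 'Unknown') & (df['col_3'] == 'Unknown')

# Outer np.where handles 'Yes'; the inner one separates 'Unknown' from 'No'.
df['new_column'] = np.where(any_yes, 'Yes', np.where(all_unknown, 'Unknown', 'No'))

print(df)
```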

CodePudding user response：

Try this, using dplyr's rowwise() function:

library(dplyr)

df |>
  rowwise() |>
  # everything() makes the row-wise column selection explicit
  mutate(new_column = case_when(
    any(c_across(everything()) == "Yes") ~ "Yes",
    any(c_across(everything()) == "No") ~ "No",
    TRUE ~ "Unknown")) |>
  ungroup()

output

# A tibble: 10 × 4
   col_1   col_2   col_3   new_column
   <chr>   <chr>   <chr>   <chr>     
 1 No      Yes     Unknown Yes       
 2 Yes     Yes     Yes     Yes       
 3 Yes     Unknown Yes     Yes       
 4 Yes     Yes     Unknown Yes       
 5 Yes     Unknown Unknown Yes       
 6 Yes     No      No      Yes       
 7 No      Unknown No      No        
 8 No      No      Unknown No        
 9 No      Unknown Unknown No        
10 Unknown Unknown Unknown Unknown

data

df <- structure(list(col_1 = c("No", "Yes", "Yes", "Yes", "Yes", "Yes", 
"No", "No", "No", "Unknown"), col_2 = c("Yes", "Yes", "Unknown", 
"Yes", "Unknown", "No", "Unknown", "No", "Unknown", "Unknown"
), col_3 = c("Unknown", "Yes", "Yes", "Unknown", "Unknown", "No", 
"No", "Unknown", "Unknown", "Unknown")), class = "data.frame", row.names = c(NA, 
-10L))

CodePudding user response：

A possible solution:

library(dplyr)

df %>% 
  mutate(col = if_else(if_any(col_1:col_3, ~ .x == "Yes"), "Yes", 
    if_else(if_any(col_1:col_3, ~ .x == "No"), "No", "Unknown")))

#>      col_1   col_2   col_3     col
#> 1       No     Yes Unknown     Yes
#> 2      Yes     Yes     Yes     Yes
#> 3      Yes Unknown     Yes     Yes
#> 4      Yes     Yes Unknown     Yes
#> 5      Yes Unknown Unknown     Yes
#> 6      Yes      No      No     Yes
#> 7       No Unknown      No      No
#> 8       No      No Unknown      No
#> 9       No Unknown Unknown      No
#> 10 Unknown Unknown Unknown Unknown

CodePudding user response：

A solution using Python:

import pandas as pd

df = pd.DataFrame({
  'col_1': ['No','Yes','Yes','Yes','Yes','Yes','No','No','No','Unknown'],
  'col_2': ['Yes','Yes','Unknown','Yes','Unknown','No','Unknown','No','Unknown','Unknown'],
  'col_3': ['Unknown','Yes','Yes','Unknown','Unknown','No','No','Unknown','Unknown','Unknown']
})

df['col_4'] = [('Yes' if 'Yes' in x else ('No' if 'No' in x else 'Unknown')) for x in zip(df['col_1'], df['col_2'], df['col_3'])]

print(df)

Output:

     col_1    col_2    col_3    col_4
0       No      Yes  Unknown      Yes
1      Yes      Yes      Yes      Yes
2      Yes  Unknown      Yes      Yes
3      Yes      Yes  Unknown      Yes
4      Yes  Unknown  Unknown      Yes
5      Yes       No       No      Yes
6       No  Unknown       No       No
7       No       No  Unknown       No
8       No  Unknown  Unknown       No
9  Unknown  Unknown  Unknown  Unknown
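
The zip above names the three columns explicitly. If the number of columns varies, one possible variant is to compare a list of columns at once and let np.select pick the first matching rule. This is a sketch rather than a tested drop-in: the list cols is a hypothetical name and should hold whichever N columns apply.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_1': ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Unknown'],
    'col_2': ['Yes', 'Yes', 'Unknown', 'Yes', 'Unknown', 'No', 'Unknown', 'No', 'Unknown', 'Unknown'],
    'col_3': ['Unknown', 'Yes', 'Yes', 'Unknown', 'Unknown', 'No', 'No', 'Unknown', 'Unknown', 'Unknown'],
})

# Hypothetical list of input columns; extend it to any N columns.
cols = ['col_1', 'col_2', 'col_3']

# One boolean Series per rule, evaluated across all listed columns of each row.
conditions = [
    df[cols].eq('Yes').any(axis=1),  # rule 1: at least one 'Yes'
    df[cols].eq('No').any(axis=1),   # rule 2: at least one 'No'
]

# np.select takes the first condition that holds, so 'Yes' has priority over 'No';
# rows matching neither condition get the default.
df['col_4'] = np.select(conditions, ['Yes', 'No'], default='Unknown')

print(df)
```

Only the cols list changes when more columns are added; the rules themselves stay the same.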
