I'm looking for an algorithm to create a new column based on values from other columns AND respecting pre-established rules. Here's an example:
Artificial data:
df = data.frame(
  col_1 = c('No','Yes','Yes','Yes','Yes','Yes','No','No','No','Unknown'),
  col_2 = c('Yes','Yes','Unknown','Yes','Unknown','No','Unknown','No','Unknown','Unknown'),
  col_3 = c('Unknown','Yes','Yes','Unknown','Unknown','No','No','Unknown','Unknown','Unknown')
)
The goal is to create a new_column based on the values of col_1, col_2, and col_3. For that, the rules are:
- If the value 'Yes' is present in any of the columns, the value of the new_column will be 'Yes';
- If the value 'Yes' is not present in any of the columns, but the value 'No' is present, then the value of the new_column will be 'No';
- If the values 'Yes' and 'No' are both absent, then the value of new_column will be 'Unknown'.
I managed to implement this with case_when(), spelling out every possible combination, and with nested ifelse() calls. But neither approach scales to N variables.
Current solution:
library(dplyr)
df_1 <-
  df %>%
  mutate(
    new_column = ifelse(
      (col_1 == 'Yes' | col_2 == 'Yes' | col_3 == 'Yes'), 'Yes',
      ifelse(
        (col_1 == 'Unknown' & col_2 == 'Unknown' & col_3 == 'Unknown'), 'Unknown', 'No'
      )
    )
  )
I'm looking for an approach that does this more efficiently and that extends to N variables.
After searching StackOverflow, I couldn't find a solution to my problem (I know there are several posts about creating a new column based on values from other columns, but none matched this case). Perhaps my search strategy was not the best. If anyone finds a relevant post, please provide the link.
I used R in the code, but the current solution works in Python using np.where. Solutions in R or Python are welcome.
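For reference, a minimal sketch of what the np.where version of the current solution might look like. The helper names any_yes and all_unknown are only illustrative, and it has the same limitation as the R code: every column is written out by hand.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_1': ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Unknown'],
    'col_2': ['Yes', 'Yes', 'Unknown', 'Yes', 'Unknown', 'No', 'Unknown', 'No', 'Unknown', 'Unknown'],
    'col_3': ['Unknown', 'Yes', 'Yes', 'Unknown', 'Unknown', 'No', 'No', 'Unknown', 'Unknown', 'Unknown'],
})

# Rule 1: 'Yes' appears in at least one column (each column listed explicitly)
any_yes = (df['col_1'] == 'Yes') | (df['col_2'] == 'Yes') | (df['col_3'] == 'Yes')

# Rule 3: every column is 'Unknown'
all_unknown = (
    (df['col_1'] == 'Unknown')
    & (df['col_2'] == 'Unknown')
    & (df['col_3'] == 'Unknown')
)

# Nested np.where mirrors the nested ifelse(): 'Yes' first, then 'Unknown', else 'No'
df['new_column'] = np.where(any_yes, 'Yes', np.where(all_unknown, 'Unknown', 'No'))
print(df)
```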
CodePudding user response:
Try this using the dplyr rowwise() function:
library(dplyr)
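# c_across(everything()) names the columns explicitly; swap in any tidyselect helper
df |>
  rowwise() |>
  mutate(new_column = case_when(any(c_across(everything()) == "Yes") ~ "Yes",
                                any(c_across(everything()) == "No") ~ "No",
                                TRUE ~ "Unknown")) |>
  ungroup()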
- output
# A tibble: 10 × 4
col_1 col_2 col_3 new_column
<chr> <chr> <chr> <chr>
1 No Yes Unknown Yes
2 Yes Yes Yes Yes
3 Yes Unknown Yes Yes
4 Yes Yes Unknown Yes
5 Yes Unknown Unknown Yes
6 Yes No No Yes
7 No Unknown No No
8 No No Unknown No
9 No Unknown Unknown No
10 Unknown Unknown Unknown Unknown
- data
df <- structure(list(col_1 = c("No", "Yes", "Yes", "Yes", "Yes", "Yes",
"No", "No", "No", "Unknown"), col_2 = c("Yes", "Yes", "Unknown",
"Yes", "Unknown", "No", "Unknown", "No", "Unknown", "Unknown"
), col_3 = c("Unknown", "Yes", "Yes", "Unknown", "Unknown", "No",
"No", "Unknown", "Unknown", "Unknown")), class = "data.frame", row.names = c(NA,
-10L))
CodePudding user response:
A possible solution:
library(dplyr)
df %>%
  mutate(col = if_else(if_any(col_1:col_3, ~ .x == "Yes"), "Yes",
                       if_else(if_any(col_1:col_3, ~ .x == "No"), "No", "Unknown")))
#> col_1 col_2 col_3 col
#> 1 No Yes Unknown Yes
#> 2 Yes Yes Yes Yes
#> 3 Yes Unknown Yes Yes
#> 4 Yes Yes Unknown Yes
#> 5 Yes Unknown Unknown Yes
#> 6 Yes No No Yes
#> 7 No Unknown No No
#> 8 No No Unknown No
#> 9 No Unknown Unknown No
#> 10 Unknown Unknown Unknown Unknown
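Since Python solutions were also welcome, the same idea can probably be sketched in pandas: DataFrame.eq(...).any(axis=1) plays the role of if_any(), and np.select() applies the conditions in priority order. This is only a sketch under the assumption that the columns to check are listed in a variable (here called cols, a name chosen for illustration); adding more columns only means extending that list.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_1': ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Unknown'],
    'col_2': ['Yes', 'Yes', 'Unknown', 'Yes', 'Unknown', 'No', 'Unknown', 'No', 'Unknown', 'Unknown'],
    'col_3': ['Unknown', 'Yes', 'Yes', 'Unknown', 'Unknown', 'No', 'No', 'Unknown', 'Unknown', 'Unknown'],
})

# Columns to inspect (illustrative name); works for any number of columns
cols = ['col_1', 'col_2', 'col_3']

# Row-wise "is this value present in any of the selected columns?"
has_yes = df[cols].eq('Yes').any(axis=1)
has_no = df[cols].eq('No').any(axis=1)

# np.select picks the first condition that is True, so 'Yes' outranks 'No'
df['col'] = np.select([has_yes, has_no], ['Yes', 'No'], default='Unknown')
print(df)
```

Because the comparisons run on whole columns rather than row by row, this should stay reasonably fast on large data frames.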
CodePudding user response:
A solution using Python:
import pandas as pd
df = pd.DataFrame({
'col_1': ['No','Yes','Yes','Yes','Yes','Yes','No','No','No','Unknown'],
'col_2': ['Yes','Yes','Unknown','Yes','Unknown','No','Unknown','No','Unknown','Unknown'],
'col_3': ['Unknown','Yes','Yes','Unknown','Unknown','No','No','Unknown','Unknown','Unknown']
})
df['col_4'] = [('Yes' if 'Yes' in x else ('No' if 'No' in x else 'Unknown')) for x in zip(df['col_1'], df['col_2'], df['col_3'])]
print(df)
Output:
col_1 col_2 col_3 col_4
0 No Yes Unknown Yes
1 Yes Yes Yes Yes
2 Yes Unknown Yes Yes
3 Yes Yes Unknown Yes
4 Yes Unknown Unknown Yes
5 Yes No No Yes
6 No Unknown No No
7 No No Unknown No
8 No Unknown Unknown No
9 Unknown Unknown Unknown Unknown
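To extend the list comprehension above to N columns, one option is to build the zip from a list of column names and move the rule into a small helper. The sketch below assumes a hypothetical helper named combine and an illustrative cols list; neither is part of pandas.

```python
import pandas as pd

def combine(values, priority=('Yes', 'No'), default='Unknown'):
    """Return the first label in `priority` found among `values`, else `default`."""
    for label in priority:
        if label in values:
            return label
    return default

df = pd.DataFrame({
    'col_1': ['No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Unknown'],
    'col_2': ['Yes', 'Yes', 'Unknown', 'Yes', 'Unknown', 'No', 'Unknown', 'No', 'Unknown', 'Unknown'],
    'col_3': ['Unknown', 'Yes', 'Yes', 'Unknown', 'Unknown', 'No', 'No', 'Unknown', 'Unknown', 'Unknown'],
})

# Any number of columns can be listed here (illustrative name)
cols = ['col_1', 'col_2', 'col_3']

# zip(*...) yields one tuple per row containing the values of the selected columns
df['new_column'] = [combine(row) for row in zip(*(df[c] for c in cols))]
print(df)
```

Keeping the priority order as a parameter means a different ranking of labels only requires changing the priority tuple, not the logic.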