I have a large dataframe with columns that can have up to 4 options, and I'm trying to turn it into a true/false table with a separate column for each option. ie:
Sample <various_columns> Thing1 Thing2 Thing3
11111 <various_data> N B D
22222 <various_data> A C D
33333 <various_data> N C D
44444 <various_data> N N E
would look like:
Sample <various_columns> Thing1 Thing2 Thing3 Thing1_A Thing2_B Thing2_C Thing3_D Thing3_E
11111 <variousdata> N B D 0 1 0 1 0
22222 <variousdata> A C D 1 0 1 1 0
33333 <variousdata> N C D 0 0 1 1 0
44444 <variousdata> N N E 0 0 0 0 1
I thought of making another df that had the "key" ie.
New_column Original_column Positive_result
Thing1_A Thing1 A
Thing2_B Thing2 B
Thing2_C Thing2 C
Thing3_D Thing3 D
Thing3_E Thing3 E
and then using this to duplicate the columns and code what should be "1" and what should be "0", but I couldn't figure out how to use the second df to add the correct columns etc to the first df. But there's probably also a better way to do this completely. Any ideas?
CodePudding user response:
cbind(df, model.matrix(~.,lapply(df[-1], \(x)relevel(factor(x), 'N')))[,-1])
Sample Thing1 Thing2 Thing1A Thing2B Thing2C
1 11111 N B 0 1 0
2 22222 A C 1 0 1
3 33333 N C 0 0 1
4 44444 N N 0 0 0
CodePudding user response:
Here is a possible tidyverse
option, where we pivot to long format, count, then pivot back to wide format, and finally back join to the original dataframe.
library(tidyverse)
df %>%
select(Sample, Thing1:Thing3) %>%
pivot_longer(-Sample) %>%
filter(value != "N") %>%
mutate(col_names = paste(name, value, sep = "_")) %>%
group_by(Sample, col_names) %>%
count() %>%
arrange(col_names) %>%
pivot_wider(names_from = "col_names",
values_from = "n",
values_fill = 0) %>%
left_join(df, ., by = "Sample")
Output
Sample various_columns Thing1 Thing2 Thing3 Thing1_A Thing2_B Thing2_C Thing3_D Thing3_E
1 11111 various_data N B D 0 1 0 1 0
2 22222 various_data A C D 1 0 1 1 0
3 33333 various_data N C D 0 0 1 1 0
4 44444 various_data N N E 0 0 0 0 1
Data
df <- structure(list(Sample = c(11111L, 22222L, 33333L, 44444L), various_columns = c("various_data",
"various_data", "various_data", "various_data"), Thing1 = c("N",
"A", "N", "N"), Thing2 = c("B", "C", "C", "N"), Thing3 = c("D",
"D", "D", "E")), class = "data.frame", row.names = c(NA, -4L))