Home > OS >  Copy columns in a dataframe based on another dataframe in R
Copy columns in a dataframe based on another dataframe in R

Time:08-25

I have a large dataframe with columns that can have up to 4 options, and I'm trying to turn it into a true/false table with a separate column for each option. ie:

Sample <various_columns>   Thing1  Thing2 Thing3
11111  <various_data>      N       B      D
22222  <various_data>      A       C      D
33333  <various_data>      N       C      D
44444  <various_data>      N       N      E

would look like:

Sample <various_columns>  Thing1  Thing2 Thing3 Thing1_A  Thing2_B  Thing2_C Thing3_D  Thing3_E
11111  <variousdata>       N         B     D       0         1         0       1         0
22222  <variousdata>       A         C     D       1         0         1       1         0
33333  <variousdata>       N         C     D       0         0         1       1         0
44444  <variousdata>       N         N     E       0         0         0       0         1

I thought of making another df that had the "key" ie.

New_column  Original_column  Positive_result
Thing1_A    Thing1           A
Thing2_B    Thing2           B
Thing2_C    Thing2           C
Thing3_D    Thing3           D
Thing3_E    Thing3           E

and then using this to duplicate the columns and code what should be "1" and what should be "0", but I couldn't figure out how to use the second df to add the correct columns etc to the first df. But there's probably also a better way to do this completely. Any ideas?

CodePudding user response:

cbind(df, model.matrix(~.,lapply(df[-1], \(x)relevel(factor(x), 'N')))[,-1])

  Sample Thing1 Thing2 Thing1A Thing2B Thing2C
1  11111      N      B       0       1       0
2  22222      A      C       1       0       1
3  33333      N      C       0       0       1
4  44444      N      N       0       0       0

CodePudding user response:

Here is a possible tidyverse option, where we pivot to long format, count, then pivot back to wide format, and finally back join to the original dataframe.

library(tidyverse)

df %>%
  select(Sample, Thing1:Thing3) %>%
  pivot_longer(-Sample) %>%
  filter(value != "N") %>%
  mutate(col_names = paste(name, value, sep = "_")) %>%
  group_by(Sample, col_names) %>%
  count() %>%
  arrange(col_names) %>%
  pivot_wider(names_from = "col_names",
              values_from = "n",
              values_fill = 0) %>%
  left_join(df, ., by = "Sample")

Output

  Sample various_columns Thing1 Thing2 Thing3 Thing1_A Thing2_B Thing2_C Thing3_D Thing3_E
1  11111    various_data      N      B      D        0        1        0        1        0
2  22222    various_data      A      C      D        1        0        1        1        0
3  33333    various_data      N      C      D        0        0        1        1        0
4  44444    various_data      N      N      E        0        0        0        0        1

Data

df <- structure(list(Sample = c(11111L, 22222L, 33333L, 44444L), various_columns = c("various_data", 
"various_data", "various_data", "various_data"), Thing1 = c("N", 
"A", "N", "N"), Thing2 = c("B", "C", "C", "N"), Thing3 = c("D", 
"D", "D", "E")), class = "data.frame", row.names = c(NA, -4L))
  •  Tags:  
  • r
  • Related