Creating binary data from text values in a column in R-CodePudding

I have a really large data set that I need to create new columns for based on text strings from other columns, and put them in a binary format. I have a person ID column and a set of 99 "Diagnosis Codes" that contain a text string that corresponds to a particular health condition.

Sample of Original Data

Person ID	Diagnosis Code 1	Diagnosis Code 2	Diagnosis Code 3
10	N18.3	V34.2	E73
11	F35.9	X29	D4.0
12	G27.2	J05.1	J60

I need to get the data into this format....

PersonID	N18.3	V34.2	E73	F35.9	G27.2	(plus all other codes)
10	1	1	1	0	0	etc
11	0	0	0	1	0	etc
12	0	0	0	0	1	etc

I have tried transposing, tabluation, so many other different ways and nothing seems to work. I'd appreciate any help!

CodePudding user response：

A data.table option using melt and dcast.

library(data.table)

dcast(melt(setDT(df),id.vars = 'PersonID'),PersonID ~ value, fun.aggregate = length)

#   PersonID D4.0 E73 F35.9 G27.2 J05.1 J60 N18.3 V34.2 X29
#1:       10    0   1     0     0     0   0     1     1   0
#2:       11    1   0     1     0     0   0     0     0   1
#3:       12    0   0     0     1     1   1     0     0   0

CodePudding user response：

We could reshape to 'long' format with pivot_longer and then reshape back to 'wide' with pivot_wider while specifying the values_fn as length and values_fill set to 0

library(dplyr)
library(tidyr)
df1 %>%
    pivot_longer(cols = -PersonID) %>%
    select(-name) %>%
    pivot_wider(names_from= value, values_from = value,
       values_fn= length, values_fill = 0)

-output

# A tibble: 3 x 10
  PersonID N18.3 V34.2   E73 F35.9   X29  D4.0 G27.2 J05.1   J60
     <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1       10     1     1     1     0     0     0     0     0     0
2       11     0     0     0     1     1     1     0     0     0
3       12     0     0     0     0     0     0     1     1     1

Or using recast from reshape2

library(reshape2)
recast(df1, id.var = 'PersonID', PersonID ~ value, length)
  PersonID D4.0 E73 F35.9 G27.2 J05.1 J60 N18.3 V34.2 X29
1       10    0   1     0     0     0   0     1     1   0
2       11    1   0     1     0     0   0     0     0   1
3       12    0   0     0     1     1   1     0     0   0

Or use base R with table after unlisting the columns other than the first one

table(rep(df1[[1]],  ncol(df1)-1),  unlist(df1[-1]))
      D4.0 E73 F35.9 G27.2 J05.1 J60 N18.3 V34.2 X29
  10    0   1     0     0     0   0     1     1   0
  11    1   0     1     0     0   0     0     0   1
  12    0   0     0     1     1   1     0     0   0

data

df1 <- structure(list(PersonID = 10:12, DiagnosisCode1 = c("N18.3", 
"F35.9", "G27.2"), DiagnosisCode2 = c("V34.2", "X29", "J05.1"
), DiagnosisCode3 = c("E73", "D4.0", "J60")), 
class = "data.frame", row.names = c(NA, 
-3L))