I have a really large data set that I need to create new columns for based on text strings from other columns, and put them in a binary format. I have a person ID column and a set of 99 "Diagnosis Codes" that contain a text string that corresponds to a particular health condition.
Sample of Original Data
Person ID | Diagnosis Code 1 | Diagnosis Code 2 | Diagnosis Code 3 |
---|---|---|---|
10 | N18.3 | V34.2 | E73 |
11 | F35.9 | X29 | D4.0 |
12 | G27.2 | J05.1 | J60 |
I need to get the data into this format....
PersonID | N18.3 | V34.2 | E73 | F35.9 | G27.2 | (plus all other codes) |
---|---|---|---|---|---|---|
10 | 1 | 1 | 1 | 0 | 0 | etc |
11 | 0 | 0 | 0 | 1 | 0 | etc |
12 | 0 | 0 | 0 | 0 | 1 | etc |
I have tried transposing, tabluation, so many other different ways and nothing seems to work. I'd appreciate any help!
CodePudding user response:
A data.table
option using melt
and dcast
.
library(data.table)
dcast(melt(setDT(df),id.vars = 'PersonID'),PersonID ~ value, fun.aggregate = length)
# PersonID D4.0 E73 F35.9 G27.2 J05.1 J60 N18.3 V34.2 X29
#1: 10 0 1 0 0 0 0 1 1 0
#2: 11 1 0 1 0 0 0 0 0 1
#3: 12 0 0 0 1 1 1 0 0 0
CodePudding user response:
We could reshape to 'long' format with pivot_longer
and then reshape back to 'wide' with pivot_wider
while specifying the values_fn
as length
and values_fill
set to 0
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = -PersonID) %>%
select(-name) %>%
pivot_wider(names_from= value, values_from = value,
values_fn= length, values_fill = 0)
-output
# A tibble: 3 x 10
PersonID N18.3 V34.2 E73 F35.9 X29 D4.0 G27.2 J05.1 J60
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 10 1 1 1 0 0 0 0 0 0
2 11 0 0 0 1 1 1 0 0 0
3 12 0 0 0 0 0 0 1 1 1
Or using recast
from reshape2
library(reshape2)
recast(df1, id.var = 'PersonID', PersonID ~ value, length)
PersonID D4.0 E73 F35.9 G27.2 J05.1 J60 N18.3 V34.2 X29
1 10 0 1 0 0 0 0 1 1 0
2 11 1 0 1 0 0 0 0 0 1
3 12 0 0 0 1 1 1 0 0 0
Or use base R
with table
after unlist
ing the columns other than the first one
table(rep(df1[[1]], ncol(df1)-1), unlist(df1[-1]))
D4.0 E73 F35.9 G27.2 J05.1 J60 N18.3 V34.2 X29
10 0 1 0 0 0 0 1 1 0
11 1 0 1 0 0 0 0 0 1
12 0 0 0 1 1 1 0 0 0
data
df1 <- structure(list(PersonID = 10:12, DiagnosisCode1 = c("N18.3",
"F35.9", "G27.2"), DiagnosisCode2 = c("V34.2", "X29", "J05.1"
), DiagnosisCode3 = c("E73", "D4.0", "J60")),
class = "data.frame", row.names = c(NA,
-3L))