I am searching an optimal way to change variable categories in dummy variables-CodePudding

I have some patients who receive different treatments at different times. I want to change the treatment they have received into a binary variable that takes the value of 1 if the patient has received the drug at least once and the value of 0 if they have never received it.

I managed to do this but in a tedious way, which could be difficult with dozens of different types of drugs.

I would like to optimize my code which mainly avoids creating all the binary variables related to the drug one by one.

id<-c(rep("A",7),rep("B",4))
medoc<-c("par","mor","mor","par","sed","sed",
         "sed","cur","sed","cur","sed")

mydata<-data.frame(id,medoc)

mydata2<-mydata%>%group_by(id)%>%
  mutate(medoc_str=paste(unique(medoc),collapse  = " "))%>%
  distinct(id,.keep_all = TRUE)

mydata2$par<-NA
mydata2$mor<-NA
mydata2$sed<-NA
mydata2$cur<-NA

mydata2$par<-ifelse(
  grepl("par",mydata2$medoc_str)==TRUE,1,0
)

mydata2$mor<-ifelse(
  grepl("mor",mydata2$medoc_str)==TRUE,1,0
)

mydata2$sed<-ifelse(
  grepl("sed",mydata2$medoc_str)==TRUE,1,0
)

mydata2$cur<-ifelse(
  grepl("cur",mydata2$medoc_str)==TRUE,1,0
)

CodePudding user response：

If I understand it, you want to dummify your variables. We can do it with tidyr::pivot_wider too, but I really like to use specific libraries to do it very easily. I like the fastDummies package:

library(fastDummies)

dummy_cols(mydata, select_columns = 'medoc')

   id medoc medoc_cur medoc_mor medoc_par medoc_sed
1   1   par         0         0         1         0
2   1   mor         0         1         0         0
3   1   mor         0         1         0         0
4   1   par         0         0         1         0
5   1   sed         0         0         0         1
6   1   sed         0         0         0         1
7   1   sed         0         0         0         1
8   2   cur         1         0         0         0
9   2   sed         0         0         0         1
10  2   cur         1         0         0         0
11  2   sed         0         0         0         1

And here is an answer with pivot_wider:

library(tidyr)
library(dplyr)
mydata %>% mutate(index = row_number()) %>%
  pivot_wider(names_from = medoc,
              values_from = medoc,
              values_fn = \(x)  !is.na(x),
              values_fill = 0)

CodePudding user response：

A solution similar to @Guedes's but with different values_fn:

library(dplyr)
library(tidyr)

mydata %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = medoc, values_from = medoc,
              values_fn = function(x) 1, values_fill = 0) %>%
  select(-row)
# A tibble: 11 x 5
      id   par   mor   sed   cur
   <dbl> <dbl> <dbl> <dbl> <dbl>
 1     1     1     0     0     0
 2     1     0     1     0     0
 3     1     0     1     0     0
 4     1     1     0     0     0
 5     1     0     0     1     0
 6     1     0     0     1     0
 7     1     0     0     1     0
 8     2     0     0     0     1
 9     2     0     0     1     0
10     2     0     0     0     1
11     2     0     0     1     0

CodePudding user response：

Using mydata in the Note at the end and assuming that you want one row for each id with binary columns indicating which values of medoc are present (1) or absent (0) we can use table like this. (If you would like counts instead of presence/absence then omit the pmin.)

pmin(table(mydata), 1)
##    medoc
## id  cur mor par sed
##  A   0   1   1   1
##  B   1   0   0   1

or as a data frame and adding medoc_str

library(dplyr)
library(tibble)

mydata %>%
  table %>%
  pmin(1) %>%
  as.data.frame.matrix %>%
  rownames_to_column(var = "id") %>%
  group_by(id) %>%
  mutate(medoc_str = paste(names(cur_data())[c_across() == 1], collapse = " ")) %>%
  ungroup
## # A tibble: 2 x 6
##   id      cur   mor   par   sed medoc_str  
##   <chr> <dbl> <dbl> <dbl> <dbl> <chr>      
## 1 A         0     1     1     1 mor par sed
## 2 B         1     0     0     1 cur sed

Note

Modified mydata to use better id's.

id <- c(rep("A", 7), rep("B", 4))
medoc <- c("par", "mor", "mor", "par", "sed", "sed",
         "sed", "cur", "sed", "cur", "sed")

mydata <- data.frame(id, medoc)

mydata looks like this.

> mydata
   id medoc
1   A   par
2   A   mor
3   A   mor
4   A   par
5   A   sed
6   A   sed
7   A   sed
8   B   cur
9   B   sed
10  B   cur
11  B   sed