How to generate a set of dummy variables dependent on values in several other columns with same pref-CodePudding

I have a data frame I am trying to put in a format I can use for my analysis that looks the following:

ID	Name	Year	K1	K2	K50
1	Contract XYZ	2000	transport	elephants
2	Agreement ABC	2003	pens	music
3	Document 123	2003	elephants
4	Empty Space	2004	music		transport

Basically, I have documents with a unique ID and the year they were signed plus a set of variables name K_1 to K_50 that contain unordered keywords. For each of the keywords I would like to generate a dummy variable (i.e. named transport, pens...) that is one if any of the K_1 to K_50 entries contains that particular string.

I have tried the code below which creates the dummy with the name transport if K1 or K2 contain the code but for 50 keywords in 50 columns it is a lot of manual work. Ideally, I would be able to use the string as the variable name and run through all 50 columns to identify all keywords and create dummies that will be one if the keyword is present in either column for a particular ID. I would, however, also be happy with manually creating dummies and being able to look at all 50 keywords without typing it all in.

document_dummies <- mutate(document,
                  transport = case_when(
                    K_1 == "transport" | K_2 == "transport" ~ 1,
                    TRUE ~ NA_real_ 
                  ))

CodePudding user response：

Does this help you?

library(tidyverse)

document <- data.frame(
  stringsAsFactors = FALSE,
  ID = c(1L, 2L, 3L, 4L),
  Name = c("Contract XYZ","Agreement ABC",
           "Document 123","Empty Space"),
  Year = c(2000L, 2003L, 2003L, 2004L),
  K1 = c("transport", "pens", "elephants", "music"),
  K2 = c("elephants", "music", NA, NA),
  K50 = c(NA, NA, NA, "transport")
)
document %>%
  pivot_longer(starts_with("K")) %>%
  select(-name) %>%
  filter(! is.na(value)) %>%
  mutate(has_property = 1) %>%
  pivot_wider(names_from = value, values_from = has_property)
#> # A tibble: 4 x 7
#>      ID Name           Year transport elephants  pens music
#>   <int> <chr>         <int>     <dbl>     <dbl> <dbl> <dbl>
#> 1     1 Contract XYZ   2000         1         1    NA    NA
#> 2     2 Agreement ABC  2003        NA        NA     1     1
#> 3     3 Document 123   2003        NA         1    NA    NA
#> 4     4 Empty Space    2004         1        NA    NA     1

^{Created on 2021-09-21 by the reprex package (v2.0.1)}

CodePudding user response：

We can use the fastDummies package, with the dummy_columns function. With the example data from @danlooo

document %>% pivot_longer(matches('K\\d '), names_to = NULL) %>%
        filter(!is.na(value)) %>%
        fastDummies::dummy_columns('value') %>%
        rename_with(~str_remove(.x, '^value_'), starts_with('value_'))

# A tibble: 7 x 8
     ID Name           Year value     elephants music  pens transport
  <int> <chr>         <int> <chr>         <int> <int> <int>     <int>
1     1 Contract XYZ   2000 transport         0     0     0         1
2     1 Contract XYZ   2000 elephants         1     0     0         0
3     2 Agreement ABC  2003 pens              0     0     1         0
4     2 Agreement ABC  2003 music             0     1     0         0
5     3 Document 123   2003 elephants         1     0     0         0
6     4 Empty Space    2004 music             0     1     0         0
7     4 Empty Space    2004 transport         0     0     0         1

data

document <- data.frame(
        stringsAsFactors = FALSE,
        ID = c(1L, 2L, 3L, 4L),
        Name = c("Contract XYZ","Agreement ABC",
                 "Document 123","Empty Space"),
        Year = c(2000L, 2003L, 2003L, 2004L),
        K1 = c("transport", "pens", "elephants", "music"),
        K2 = c("elephants", "music", NA, NA),
        K50 = c(NA, NA, NA, "transport")
)

> document
  ID          Name Year        K1        K2       K50
1  1  Contract XYZ 2000 transport elephants      <NA>
2  2 Agreement ABC 2003      pens     music      <NA>
3  3  Document 123 2003 elephants      <NA>      <NA>
4  4   Empty Space 2004     music      <NA> transport