I have a data frame I am trying to put in a format I can use for my analysis that looks the following:
ID | Name | Year | K1 | K2 | ... | K50 |
---|---|---|---|---|---|---|
1 | Contract XYZ | 2000 | transport | elephants | ||
2 | Agreement ABC | 2003 | pens | music | ||
3 | Document 123 | 2003 | elephants | |||
4 | Empty Space | 2004 | music | transport |
Basically, I have documents with a unique ID and the year they were signed plus a set of variables name K_1 to K_50 that contain unordered keywords. For each of the keywords I would like to generate a dummy variable (i.e. named transport, pens...) that is one if any of the K_1 to K_50 entries contains that particular string.
I have tried the code below which creates the dummy with the name transport if K1 or K2 contain the code but for 50 keywords in 50 columns it is a lot of manual work. Ideally, I would be able to use the string as the variable name and run through all 50 columns to identify all keywords and create dummies that will be one if the keyword is present in either column for a particular ID. I would, however, also be happy with manually creating dummies and being able to look at all 50 keywords without typing it all in.
document_dummies <- mutate(document,
transport = case_when(
K_1 == "transport" | K_2 == "transport" ~ 1,
TRUE ~ NA_real_
))
CodePudding user response:
Does this help you?
library(tidyverse)
document <- data.frame(
stringsAsFactors = FALSE,
ID = c(1L, 2L, 3L, 4L),
Name = c("Contract XYZ","Agreement ABC",
"Document 123","Empty Space"),
Year = c(2000L, 2003L, 2003L, 2004L),
K1 = c("transport", "pens", "elephants", "music"),
K2 = c("elephants", "music", NA, NA),
K50 = c(NA, NA, NA, "transport")
)
document %>%
pivot_longer(starts_with("K")) %>%
select(-name) %>%
filter(! is.na(value)) %>%
mutate(has_property = 1) %>%
pivot_wider(names_from = value, values_from = has_property)
#> # A tibble: 4 x 7
#> ID Name Year transport elephants pens music
#> <int> <chr> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 Contract XYZ 2000 1 1 NA NA
#> 2 2 Agreement ABC 2003 NA NA 1 1
#> 3 3 Document 123 2003 NA 1 NA NA
#> 4 4 Empty Space 2004 1 NA NA 1
Created on 2021-09-21 by the reprex package (v2.0.1)
CodePudding user response:
We can use the fastDummies
package, with the dummy_columns
function.
With the example data from @danlooo
document %>% pivot_longer(matches('K\\d '), names_to = NULL) %>%
filter(!is.na(value)) %>%
fastDummies::dummy_columns('value') %>%
rename_with(~str_remove(.x, '^value_'), starts_with('value_'))
# A tibble: 7 x 8
ID Name Year value elephants music pens transport
<int> <chr> <int> <chr> <int> <int> <int> <int>
1 1 Contract XYZ 2000 transport 0 0 0 1
2 1 Contract XYZ 2000 elephants 1 0 0 0
3 2 Agreement ABC 2003 pens 0 0 1 0
4 2 Agreement ABC 2003 music 0 1 0 0
5 3 Document 123 2003 elephants 1 0 0 0
6 4 Empty Space 2004 music 0 1 0 0
7 4 Empty Space 2004 transport 0 0 0 1
data
document <- data.frame(
stringsAsFactors = FALSE,
ID = c(1L, 2L, 3L, 4L),
Name = c("Contract XYZ","Agreement ABC",
"Document 123","Empty Space"),
Year = c(2000L, 2003L, 2003L, 2004L),
K1 = c("transport", "pens", "elephants", "music"),
K2 = c("elephants", "music", NA, NA),
K50 = c(NA, NA, NA, "transport")
)
> document
ID Name Year K1 K2 K50
1 1 Contract XYZ 2000 transport elephants <NA>
2 2 Agreement ABC 2003 pens music <NA>
3 3 Document 123 2003 elephants <NA> <NA>
4 4 Empty Space 2004 music <NA> transport