I have a column with lists of variables.
Seperated by comma plus sometimes values for the variables set by "=".
See picture.
I want the variables as columns and within the columns TRUE/FALSE or 1/0 values plus if there is a value set by "=" an extra column for this value.
I guess it's a similar question to Pandas convert a column of list to dummies but I need it in R.
CodePudding user response:
Since you haven't provided explicit data, I needed to recreate one from your screenshots (please, update at least textual data the next time, it helps recreate your task).
Those chunks of code are explained with comments, they use tidyverse
functions from packages included at the top of the chunk. Result is what you asked for with the exception that columns eventnumber_value
are named value_eventnumber
since naming a variable or column with a name that starts with number is not a good practice.
I don't know what you need the data for, but from my experience the wide format of the data is less useful than wide format for most of the cases. Especially here, since I expect, that one event may happen only for one ID. Thus, dat_pivoted
is more convenient to operate on.
library(tibble)
library(tidyr)
library(dplyr)
library(stringr)
dat <- tribble(
~post_event_list, ~date_time,
"239=20.00,200,20149,100,101,102,103,104,105,106,107,108,114,198", "2022-03-01 00:23:50",
"257,159", "2022-03-01 00:02:51",
"201,109,110,111,112", "2022-03-01 00:57:23"
)
dat_pivoted <- dat %>%
mutate(post_event_list = str_split(post_event_list, ",")) %>% # transform comma separated strings into character vectors
unnest_longer(post_event_list) %>% # split characters into separate rows
separate(post_event_list, sep = "=", into = c("var", "val"), fill = "right") %>% # separate variables from values (case of 'X=Y'), put NA as value if there is no value
mutate(val = as.numeric(val)) # treat 'val' column as numeric
dat_values <- dat_pivoted %>%
pivot_wider(id_cols = date_time, names_from = var, names_prefix = "value_", values_from = val) %>% # turn data into wide format -- make a column per each event value, present or not
select(!where(~ all(is.na(.x)))) # select only those values columns, where not every element is NA
dat_indicator <- dat_pivoted %>%
mutate(val = TRUE) %>% # each row indicates a presence of event -- change all values to TRUE
pivot_wider(id_cols = date_time, names_from = var, values_from = val, values_fill = FALSE) # pivot columns again, replacing resulting NAs witth FALSE
dat_transformed <- left_join(dat_indicator, dat_values)