Convert a column of list to dummy in R

Time:04-02

I have a column containing lists of variables, separated by commas, sometimes with a value assigned to a variable via "=".

See picture.

I want the variables as columns, holding TRUE/FALSE or 1/0 values, plus an extra column for the value wherever one is set with "=".

I guess it's similar to the question "Pandas convert a column of list to dummies", but I need it in R.

CodePudding user response:

Since you haven't provided explicit data, I had to recreate a sample from your screenshots (next time, please include the data as text; it makes the task much easier to reproduce).

The code below is explained with comments and uses tidyverse functions from the packages loaded at the top. The result is what you asked for, except that the columns you called eventnumber_value are named value_eventnumber, since starting a variable or column name with a digit is not good practice.
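
As a small illustration of the naming point, here is a sketch in base R. The data frame `df` and its contents are made up for this example and are not part of the question's data. A name that starts with a digit is non-syntactic, so it has to be wrapped in backticks every time it is used, while a prefixed name does not:

```r
# Hypothetical two-column data frame; check.names = FALSE keeps the
# non-syntactic name "239" instead of letting R rewrite it.
df <- data.frame(
  `239` = c(TRUE, FALSE),
  value_239 = c(20, NA),
  check.names = FALSE
)

# A column name starting with a digit must be quoted with backticks;
# writing df$239 would be a syntax error.
indicator <- df$`239`

# A prefixed, syntactic name can be used directly.
values <- df$value_239

# make.names() shows how base R would sanitise the name on its own.
sanitised <- make.names("239")
```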

I don't know what you need the data for, but in my experience the wide format is less useful than the long format in most cases. Especially here, since I expect that one event can occur for only one ID. Thus, dat_pivoted (the long format) is more convenient to work with.

library(tibble)
library(tidyr)
library(dplyr)
library(stringr)

dat <- tribble(
  ~post_event_list, ~date_time,
  "239=20.00,200,20149,100,101,102,103,104,105,106,107,108,114,198", "2022-03-01 00:23:50",
  "257,159", "2022-03-01 00:02:51",
  "201,109,110,111,112", "2022-03-01 00:57:23"
)

dat_pivoted <- dat %>%
  mutate(post_event_list = str_split(post_event_list, ",")) %>% # transform comma separated strings into character vectors
  unnest_longer(post_event_list) %>% # split characters into separate rows
  separate(post_event_list, sep = "=", into = c("var", "val"), fill = "right") %>% # separate variables from values (case of 'X=Y'), put NA as value if there is no value
  mutate(val = as.numeric(val)) # treat 'val' column as numeric

dat_values <- dat_pivoted %>%
  pivot_wider(id_cols = date_time, names_from = var, names_prefix = "value_", values_from = val) %>% # turn data into wide format -- make a column per each event value, present or not
  select(!where(~ all(is.na(.x)))) # select only those values columns, where not every element is NA

dat_indicator <- dat_pivoted %>%
  mutate(val = TRUE) %>% # each row indicates a presence of event -- change all values to TRUE
  pivot_wider(id_cols = date_time, names_from = var, values_from = val, values_fill = FALSE) # pivot to wide format again, replacing the resulting NAs with FALSE

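dat_transformed <- left_join(dat_indicator, dat_values, by = "date_time") # attach the value_* columns to the indicator columns, matching rows on date_time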
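
To give an idea of why the long format can be easier to work with, here is a minimal sketch. It assumes a small hand-written table, `events_long`, with the same three columns as dat_pivoted; the table and the two summaries are illustrative only and were not requested in the question:

```r
library(tibble)
library(dplyr)

# Hypothetical long-format data shaped like dat_pivoted:
# one row per event per timestamp, val is NA when no value was set.
events_long <- tribble(
  ~var,  ~val, ~date_time,
  "239", 20,   "2022-03-01 00:23:50",
  "200", NA,   "2022-03-01 00:23:50",
  "257", NA,   "2022-03-01 00:02:51",
  "159", NA,   "2022-03-01 00:02:51"
)

# Count how many events occurred at each timestamp.
events_per_time <- events_long %>%
  count(date_time, name = "n_events")

# Keep only the events that carry a value.
events_with_value <- events_long %>%
  filter(!is.na(val))
```

Both summaries are single-verb operations on the long table; on the wide table they would require working across many columns at once.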
