Home > Enterprise >  Convert Duplicated data from long to wide, while preserving key-value pairs in R
Convert Duplicated data from long to wide, while preserving key-value pairs in R

Time:11-03

I need to create a unique dataset (based on "id" variable) from a duplicated dataset, where the number of duplicates is highly variable. I created a dummy dataset below (with 5 key-value pairs), and it captures the essence of the real dataset.

library(dplyr)

df <- data.frame(id = c(1, 2, 3, 3, 2, 3), key = c(NA, "UJD02 JFF00", "UJD05 TPX10 DV071", "KFC10 DR036 UGC12 UEN05 XXA00", "UJD05", "DR036 UJD05 JFF00 TPX10"), value1 = c(23, NA, 45, 67, 11, 1), value2 = c(45, NA, 23, NA, 25, 78), value3 = c(89, NA, 103, 6700, 89, 50), value4 = c(786, 670, 987, 67, 12, 14), value5 = c(10, NA, 29, 15, 51, 3))

The real dataset contains both unique and duplicated observations as determined by the "id" variable, the number of duplicates is varies from two to those with way more than 2 duplicates. The "id" variable indicates which observations are to be converted from long to wide in order to end up with a dataset of unique "id" records only i.e. no duplicated "id"s. The "key" variable is a composite variable that can contain between missing (i.e. NA) up to 30 tab separated values. There are exactly 30 "valueX" variables (i.e. value1 - value30). In an observation (irrespective of duplication status), each key is coupled to a value, e.g. key1 to value1, key2 to value2 ... key30 to value30. Duplication indicates that the data was collected at a different time point for the same customer, therefore the respective key-value pairs of the duplicates should not be mixed up in the process of long to wide conversion.

The first thing I did was to split the composite "key" variable, into 30 variables (key1 - key30), resulting in the dataset similar to "df2".

df %>% separate(key, c("key1", "key2", "key3", "key4", "key5")) -> df2

But after that I am not sure how to proceed with de-duplication based on "id"s while avoiding mixing up key-value pairs. Perhaps I might need to dynamically rename (by numbering) key-value pairs to indicate the duplicates? Now really sure.

So where I need help is on how to then convert the long dataset (df2) into a wide dataset ending up with unique "id" variables (just 3 rows of id 1 - 3) with key-value pairs that are not mixed-up i.e. indicate which key records go with which value pair. E.g. within the dummy dataset, for id = 3, which is duplicated 3 times, I would end up with key1-value1 ... key15-value15

Greatly appreciate any help!

EDITED TO PROVIDE EXAMPLE OUTPUT

Below is the desired output

library(wrapr)

resultX <- wrapr::build_frame(
  "id"  , "key1"       , "key2"       , "key3"       , "key4", "key5", "key6"       , "key7"       , "key8"       , "key9"       , "key10"      , "key11"      , "key12"      , "key13"      , "key14"      , "key15", "value1", "value2", "value3", "value4", "value5", "value6", "value7", "value8", "value9", "value10", "value11", "value12", "value13", "value14", "value15" |
    1   , NA_character_, NA_character_, NA_character_, NA    , NA    , NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA     , 23      , 45      , 89      , 786     , 10      , NA_real_, NA_real_, NA_real_, NA_real_, NA_real_ , NA_real_ , NA_real_ , NA_real_ , NA_real_ , NA_real_  |
    2   , "UJD02"      , "JFF00"      , NA_character_, NA    , NA    , "UJD05"      , NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA     , NA_real_, NA_real_, NA_real_, 670     , NA_real_, 11      , 25      , 89      , 12      , 51       , NA_real_ , NA_real_ , NA_real_ , NA_real_ , NA_real_  |
    3   , "UJD05"      , "TPX10"      , "DV071"      , NA    , NA    , "KFC10"      , "DR036"      , "UGC12"      , "UEN05"      , "XXA00"      , "DR036"      , "UJD05"      , "JFF00"      , "TPX10"      , NA     , 45      , 23      , 103     , 987     , 29      , 67      , NA_real_, 6700    , 67      , 15       , 1        , 78       , 50       , 14       , 3         )

CodePudding user response:

Two pivots do it:

library(dplyr)
library(tidyr) # pivot_*
df2 %>%
  pivot_longer(-id, names_pattern = "(.*?)([0-9] )", names_to = c(".value", "iter")) %>%
  group_by(id) %>%
  mutate(iter = row_number()) %>%
  pivot_wider(id, names_from = "iter", values_from = c("key", "value"), names_sep = "") %>%
  ungroup()
# # A tibble: 3 x 31
#      id key1  key2  key3  key4  key5  key6  key7  key8  key9  key10 key11 key12 key13 key14 key15 value1 value2 value3 value4 value5 value6 value7 value8 value9 value10 value11 value12 value13 value14 value15
#   <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
# 1     1 <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>      23     45     89    786     10     NA     NA     NA     NA      NA      NA      NA      NA      NA      NA
# 2     2 UJD02 JFF00 <NA>  <NA>  <NA>  UJD05 <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>      NA     NA     NA    670     NA     11     25     89     12      51      NA      NA      NA      NA      NA
# 3     3 UJD05 TPX10 DV071 <NA>  <NA>  KFC10 DR036 UGC12 UEN05 XXA00 DR036 UJD05 JFF00 TPX10 <NA>      45     23    103    987     29     67     NA   6700     67      15       1      78      50      14       3
  • Related