I need to create a unique dataset (based on "id" variable) from a duplicated dataset, where the number of duplicates is highly variable. I created a dummy dataset below (with 5 key-value pairs), and it captures the essence of the real dataset.
library(dplyr)
df <- data.frame(id = c(1, 2, 3, 3, 2, 3), key = c(NA, "UJD02 JFF00", "UJD05 TPX10 DV071", "KFC10 DR036 UGC12 UEN05 XXA00", "UJD05", "DR036 UJD05 JFF00 TPX10"), value1 = c(23, NA, 45, 67, 11, 1), value2 = c(45, NA, 23, NA, 25, 78), value3 = c(89, NA, 103, 6700, 89, 50), value4 = c(786, 670, 987, 67, 12, 14), value5 = c(10, NA, 29, 15, 51, 3))
The real dataset contains both unique and duplicated observations as determined by the "id" variable, the number of duplicates is varies from two to those with way more than 2 duplicates. The "id" variable indicates which observations are to be converted from long to wide in order to end up with a dataset of unique "id" records only i.e. no duplicated "id"s. The "key" variable is a composite variable that can contain between missing (i.e. NA) up to 30 tab separated values. There are exactly 30 "valueX" variables (i.e. value1 - value30). In an observation (irrespective of duplication status), each key is coupled to a value, e.g. key1 to value1, key2 to value2 ... key30 to value30. Duplication indicates that the data was collected at a different time point for the same customer, therefore the respective key-value pairs of the duplicates should not be mixed up in the process of long to wide conversion.
The first thing I did was to split the composite "key" variable, into 30 variables (key1 - key30), resulting in the dataset similar to "df2".
df %>% separate(key, c("key1", "key2", "key3", "key4", "key5")) -> df2
But after that I am not sure how to proceed with de-duplication based on "id"s while avoiding mixing up key-value pairs. Perhaps I might need to dynamically rename (by numbering) key-value pairs to indicate the duplicates? Now really sure.
So where I need help is on how to then convert the long dataset (df2) into a wide dataset ending up with unique "id" variables (just 3 rows of id 1 - 3) with key-value pairs that are not mixed-up i.e. indicate which key records go with which value pair. E.g. within the dummy dataset, for id = 3, which is duplicated 3 times, I would end up with key1-value1 ... key15-value15
Greatly appreciate any help!
EDITED TO PROVIDE EXAMPLE OUTPUT
Below is the desired output
library(wrapr)
resultX <- wrapr::build_frame(
"id" , "key1" , "key2" , "key3" , "key4", "key5", "key6" , "key7" , "key8" , "key9" , "key10" , "key11" , "key12" , "key13" , "key14" , "key15", "value1", "value2", "value3", "value4", "value5", "value6", "value7", "value8", "value9", "value10", "value11", "value12", "value13", "value14", "value15" |
1 , NA_character_, NA_character_, NA_character_, NA , NA , NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA , 23 , 45 , 89 , 786 , 10 , NA_real_, NA_real_, NA_real_, NA_real_, NA_real_ , NA_real_ , NA_real_ , NA_real_ , NA_real_ , NA_real_ |
2 , "UJD02" , "JFF00" , NA_character_, NA , NA , "UJD05" , NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_, NA , NA_real_, NA_real_, NA_real_, 670 , NA_real_, 11 , 25 , 89 , 12 , 51 , NA_real_ , NA_real_ , NA_real_ , NA_real_ , NA_real_ |
3 , "UJD05" , "TPX10" , "DV071" , NA , NA , "KFC10" , "DR036" , "UGC12" , "UEN05" , "XXA00" , "DR036" , "UJD05" , "JFF00" , "TPX10" , NA , 45 , 23 , 103 , 987 , 29 , 67 , NA_real_, 6700 , 67 , 15 , 1 , 78 , 50 , 14 , 3 )
CodePudding user response:
Two pivots do it:
library(dplyr)
library(tidyr) # pivot_*
df2 %>%
pivot_longer(-id, names_pattern = "(.*?)([0-9] )", names_to = c(".value", "iter")) %>%
group_by(id) %>%
mutate(iter = row_number()) %>%
pivot_wider(id, names_from = "iter", values_from = c("key", "value"), names_sep = "") %>%
ungroup()
# # A tibble: 3 x 31
# id key1 key2 key3 key4 key5 key6 key7 key8 key9 key10 key11 key12 key13 key14 key15 value1 value2 value3 value4 value5 value6 value7 value8 value9 value10 value11 value12 value13 value14 value15
# <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 23 45 89 786 10 NA NA NA NA NA NA NA NA NA NA
# 2 2 UJD02 JFF00 <NA> <NA> <NA> UJD05 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> NA NA NA 670 NA 11 25 89 12 51 NA NA NA NA NA
# 3 3 UJD05 TPX10 DV071 <NA> <NA> KFC10 DR036 UGC12 UEN05 XXA00 DR036 UJD05 JFF00 TPX10 <NA> 45 23 103 987 29 67 NA 6700 67 15 1 78 50 14 3