Home > OS >  Transform long-to-wide database spreading data on multiple columns in R
Transform long-to-wide database spreading data on multiple columns in R

Time:03-07

I am struggling in transforming a long dataframe to wide version, but with some complications. I have a column, ID, with duplicated entry - each refers to a timepoint for an individual. I have also other columns (namely, visit, var1 and var2), in which data regarding each timepoint are reported. Here is a reproducible example:

df <- data.frame(ID=c(1,1,1,1,2,2,2,3,3,3),
                 visit=c(1,4,5,7,1,3,4,2,5,6),
                 var1=c("AF","no","no","no","AG","AG","no","BA","BA","BA"),
                 var2=c("good","good","good","bad","good","good","bad","good","good","good"))

And the output:

   ID visit var1 var2
1   1     1   AF good
2   1     4   no good
3   1     5   no good
4   1     7   no  bad
5   2     1   AG good
6   2     3   AG good
7   2     4   no  bad
8   3     2   BA good
9   3     5   BA good
10  3     6   BA good

I really need to end up with a dataframe which contains only one row per ID, and multiple columns for each of the other variables, with a numeric suffix for example (such as visit_1, visit_2, visit_3 etc.).

The output I have in mind is like this:

  ID visit_1 visit_2 visit_3 visit_4 var1_1 var1_2 var1_3 var1_4 var2_1 var2_2 var2_3 var2_4
1  1       1       4       5       7     AF     no     no     no   good   good   good    bad
2  2       1       3       4      NA     AG     AG     no   <NA>   good   good    bad   <NA>
3  3       2       5       6      NA     BA     BA     BA   <NA>   good   good   good   <NA>.

in which essentially each entry of the columns visit, var1 and var2 has been placed in a separate, sequential column based on the ID column.

I have tried data.frame::dcast and tidyr::spread, and also pivot_wider() but it seems like these formula will end up with multiple column based on the actual values rather than producing fixed columns instead. For example, with pivot_wider():

df %>% pivot_wider(names_from = ID, values_from = c("visit","var1","var2"))

It returns me an error saying

Warning messages:
1: Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates 
2: Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates 
3: Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates 

Anyone can help?

CodePudding user response:

You have to add a sequential ID for each group:

library(tidyverse)
df %>% 
  group_by(ID) %>% 
  mutate(count = row_number()) %>% 
  pivot_wider(ID, names_from = count, values_from = c(visit, var1, var2))

# A tibble: 3 x 13
# Groups:   ID [3]
#     ID visit_1 visit_2 visit_3 visit_4 var1_1 var1_2 var1_3 var1_4 var2_1 var2_2 var2_3 var2_4
#  <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
#1     1       1       4       5       7 AF     no     no     no     good   good   good   bad   
#2     2       1       3       4      NA AG     AG     no     NA     good   good   bad    NA    
#3     3       2       5       6      NA BA     BA     BA     NA     good   good   good   NA    
  • Related