Home > Back-end >  pivot_wider() update now requires name specification and need a new solution
pivot_wider() update now requires name specification and need a new solution

Time:08-09

I am working with health data and have written code to group operation codes with their laterality code into code groups by the day on which they were performed.

I have written code which worked with tidyr(1.1.4) and now fails on tidyr(1.2.0).

reprex below

    library(tidyverse)

df <- tribble(
~row, ~opdate_grp, ~side, ~OPERTN, ~OPDATE, ~opgrp,
1,1,"bilateral","W373","2018-03-23",1,
1,1,"bilateral","Z941","2018-03-23",1,
2,1,"bilateral","W374","2018-10-12",1,
2,1,"bilateral","Z941","2018-10-12",1,
3,1,"left","W383","2020-12-08",1,
3,1,"left","Z943","2020-12-08",1,
4,2,"none","O325","2013-03-31",2,
5,1,"none","G47","2012-01-22",1,
5,2,"none","C488","2012-01-25",2,
6,1,"none","X90","2018-12-25",1,
6,2,"none","T36","2018-12-28",2
        )

The next chunk was written to reshape the data such that the resulting df included a single column for OPERTN & OPDATE and opdate_grp was an index indicating sequence of operations and dates so they could be filtered later on in the work flow.

The previous code (below) worked with tidyr(version 1.2.0).

df %>% 
  # Make operation groups into a row for string searching
  pivot_wider(values_from= c(OPERTN, OPDATE), 
              names_sep = "",
              values_fill=NA,
              values_fn = list(OPERTN= ~unlist(.x)%>% toString(), 
                               OPDATE = ~.x[1])) 

to produce a table where each row number has operation code groups which are indexed using opgrp:

# A tibble: 8 × 6
#    row opdate_grp side      OPERTN     OPDATE     opgrp
#  <dbl>      <dbl> <chr>     <chr>      <chr>      <dbl>
#     1          1 bilateral W373, Z941 2018-03-23     1
#     2          1 bilateral W374, Z941 2018-10-12     1
#     3          1 left      W383, Z943 2020-12-08     1
#     4          2 none      O325       2013-03-31     1
#     5          1 none      G47        2012-01-22     1
#     5          2 none      C488       2012-01-25     2
#     6          1 none      X90        2018-12-25     1
#     6          2 none      T36        2018-12-28     2

However, when specifying names_from = opdate_grp, which is required in tidyr(1.2.0), the resulting df contains seperate columns for each operation code gorup and date, which is annoying and stops my code from working. The reult is below:

df %>% 
  # Make operation groups into a row for string searching
  pivot_wider(values_from= c(OPERTN, OPDATE), 
              names_from= opgrp,
              names_sep = "_",
              #values_fill=NA,
              values_fn = list(OPERTN= ~unlist(.x)%>% toString(), 
                               OPDATE = ~.x[1])) 
# A tibble: 8 × 7
#    row opdate_grp side      OPERTN1    OPERTN2 OPDATE1    OPDATE2   
#  <dbl>      <dbl> <chr>     <chr>      <chr>   <chr>      <chr>     
#     1          1 bilateral W373, Z941 NA      2018-03-23 NA        
#     2          1 bilateral W374, Z941 NA      2018-10-12 NA        
#     3          1 left      W383, Z943 NA      2020-12-08 NA        
#     4          2 none      NA         O325    NA         2013-03-31
#     5          1 none      G47        NA      2012-01-22 NA        
#     5          2 none      NA         C488    NA         2012-01-25
#     6          1 none      X90        NA      2018-12-25 NA        
#     6          2 none      NA         T36     NA         2018-12-28

What I have tried

I have tried repeating reshaping functions to bring the multiple columns into one but this is clunky and does not scale well.

I have tried rewriting pivot_wider() code but this fails as I cannot alter the invisible processes.

Question:

Is there a work around to prevent this new behaviour or to fix this?

My real data contains up to 15 operation groups for each row and over 10 million rows, so the solution should scale and be stable (i hope) with future updates.

CodePudding user response:

Below code works in tidyr 1.2.0

library(tidyr)
library(dplyr)
df %>%
  pivot_wider(values_from = c(OPERTN, OPDATE), 
    names_from = opgrp, values_fn = list(OPERTN= ~unlist(.x)%>% 
               toString(), 
                               OPDATE = ~.x[1]), names_sep = "")

-output

# A tibble: 8 × 7
    row opdate_grp side      OPERTN1    OPERTN2 OPDATE1    OPDATE2   
  <dbl>      <dbl> <chr>     <chr>      <chr>   <chr>      <chr>     
1     1          1 bilateral W373, Z941 <NA>    2018-03-23 <NA>      
2     2          1 bilateral W374, Z941 <NA>    2018-10-12 <NA>      
3     3          1 left      W383, Z943 <NA>    2020-12-08 <NA>      
4     4          2 none      <NA>       O325    <NA>       2013-03-31
5     5          1 none      G47        <NA>    2012-01-22 <NA>      
6     5          2 none      <NA>       C488    <NA>       2012-01-25
7     6          1 none      X90        <NA>    2018-12-25 <NA>      
8     6          2 none      <NA>       T36     <NA>       2018-12-28

If the output required is a single column, use group_by with summarise

df %>%
   group_by(row, opdate_grp, side, OPDATE) %>% 
   summarise(OPERTN = toString(OPERTN), .groups = 'drop')
# A tibble: 8 × 5
    row opdate_grp side      OPERTN     OPDATE    
  <dbl>      <dbl> <chr>     <chr>      <chr>     
1     1          1 bilateral W373, Z941 2018-03-23
2     2          1 bilateral W374, Z941 2018-10-12
3     3          1 left      W383, Z943 2020-12-08
4     4          2 none      O325       2013-03-31
5     5          1 none      G47        2012-01-22
6     5          2 none      C488       2012-01-25
7     6          1 none      X90        2018-12-25
8     6          2 none      T36        2018-12-28
  • Related