pivot_wider when names_from are duplicated-CodePudding

I need help pivoting data and then removing unnecessary rows. I ran a query in SQL that returned a rather large output (60 columns and 2 million rows). A lot of the rows are near duplicates, with the only unique value coming from a "diagnosis" column. I would like to pivot those to new columns and then remove the duplicates. The wrinkle is that my names_from column has duplicates. Here is an example

id <- c("1","1","1","2","2","2")
priority <- c("primary","secondary","tertiary","primary","primary","secondary")
diagnosis <- c("depression","anxiety","anorexia","depression","anxiety","ptsd")

data <- data.frame(id,priority,diagnosis)

id	priority	diagnosis
1	primary	depression
1	secondary	anxiety
1	tertiary	anorexia
2	primary	depression
2	primary	anxiety
2	secondary	PTSD

When I try to pivot_wider, person 2 has both depression and anxiety listed in "primary", but I want them in separate columns.

data2 <- pivot_wider(data,names_from=priority,values_from = diagnosis)

Here is what it is showing me:

id	primary	secondary	tertiary
1	depression	anxiety	anorexia
2	c("depression,"anxiety")	ptsd	NULL

So how can I get it to separate out depression and anxiety for person 2, ideally into something like primary_1, primary_2?

CodePudding user response：

Create a sequence by 'id', 'primary' and then do the reshaping

library(dplyr)
library(stringr)
library(tidyr)
library(data.table)
data %>% 
  mutate(priority = str_c(priority, '_', rowid(id, priority))) %>% 
  pivot_wider(names_from = priority, values_from = diagnosis)

-output

# A tibble: 2 × 5
  id    primary_1  secondary_1 tertiary_1 primary_2
  <chr> <chr>      <chr>       <chr>      <chr>    
1 1     depression anxiety     anorexia   <NA>     
2 2     depression ptsd        <NA>       anxiety

Or another option is to return the list and then use unnest_wider on the list columns

data %>%
   pivot_wider(names_from = priority, values_from = diagnosis,
    values_fn = list) %>%
   unnest_wider(where(is.list), names_sep = "_")

-output

# A tibble: 2 × 5
  id    primary_1  primary_2 secondary_1 tertiary_1
  <chr> <chr>      <chr>     <chr>       <chr>     
1 1     depression <NA>      anxiety     anorexia  
2 2     depression anxiety   ptsd        <NA>

If we need to rename with 1, 2 only when the lengths are greater than 1, do the unnest separately after using unnest_wider

data %>%
   pivot_wider(names_from = priority, values_from = diagnosis,
    values_fn = list) %>% 
   unnest_wider(where(~ is.list(.x) && max(lengths(.x)) > 1),  
         names_sep = "_") %>% 
   unnest(where(is.list))

-output

# A tibble: 2 × 5
  id    primary_1  primary_2 secondary tertiary
  <chr> <chr>      <chr>     <chr>     <chr>   
1 1     depression <NA>      anxiety   anorexia
2 2     depression anxiety   ptsd      <NA>

If the data is really big, then dcast may be more efficient

dcast(setDT(data), id ~ paste0(priority,  "_", rowid(id, priority)), 
   value.var = 'diagnosis')
Key: <id>
       id  primary_1 primary_2 secondary_1 tertiary_1
   <char>     <char>    <char>      <char>     <char>
1:      1 depression      <NA>     anxiety   anorexia
2:      2 depression   anxiety        ptsd       <NA>