I need help pivoting data and then removing unnecessary rows. I ran a query in SQL that returned a rather large output (60 columns and 2 million rows). A lot of the rows are near duplicates, with the only unique value coming from a "diagnosis" column. I would like to pivot those to new columns and then remove the duplicates. The wrinkle is that my names_from column has duplicates. Here is an example
id <- c("1","1","1","2","2","2")
priority <- c("primary","secondary","tertiary","primary","primary","secondary")
diagnosis <- c("depression","anxiety","anorexia","depression","anxiety","ptsd")
data <- data.frame(id,priority,diagnosis)
id | priority | diagnosis |
---|---|---|
1 | primary | depression |
1 | secondary | anxiety |
1 | tertiary | anorexia |
2 | primary | depression |
2 | primary | anxiety |
2 | secondary | PTSD |
When I try to pivot_wider, person 2 has both depression and anxiety listed in "primary", but I want them in separate columns.
data2 <- pivot_wider(data,names_from=priority,values_from = diagnosis)
Here is what it is showing me:
id | primary | secondary | tertiary |
---|---|---|---|
1 | depression | anxiety | anorexia |
2 | c("depression,"anxiety") | ptsd | NULL |
So how can I get it to separate out depression and anxiety for person 2, ideally into something like primary_1, primary_2?
CodePudding user response:
Create a sequence by 'id', 'primary' and then do the reshaping
library(dplyr)
library(stringr)
library(tidyr)
library(data.table)
data %>%
mutate(priority = str_c(priority, '_', rowid(id, priority))) %>%
pivot_wider(names_from = priority, values_from = diagnosis)
-output
# A tibble: 2 × 5
id primary_1 secondary_1 tertiary_1 primary_2
<chr> <chr> <chr> <chr> <chr>
1 1 depression anxiety anorexia <NA>
2 2 depression ptsd <NA> anxiety
Or another option is to return the list
and then use unnest_wider
on the list
columns
data %>%
pivot_wider(names_from = priority, values_from = diagnosis,
values_fn = list) %>%
unnest_wider(where(is.list), names_sep = "_")
-output
# A tibble: 2 × 5
id primary_1 primary_2 secondary_1 tertiary_1
<chr> <chr> <chr> <chr> <chr>
1 1 depression <NA> anxiety anorexia
2 2 depression anxiety ptsd <NA>
If we need to rename with 1, 2 only when the lengths are greater than 1, do the unnest
separately after using unnest_wider
data %>%
pivot_wider(names_from = priority, values_from = diagnosis,
values_fn = list) %>%
unnest_wider(where(~ is.list(.x) && max(lengths(.x)) > 1),
names_sep = "_") %>%
unnest(where(is.list))
-output
# A tibble: 2 × 5
id primary_1 primary_2 secondary tertiary
<chr> <chr> <chr> <chr> <chr>
1 1 depression <NA> anxiety anorexia
2 2 depression anxiety ptsd <NA>
If the data is really big, then dcast
may be more efficient
dcast(setDT(data), id ~ paste0(priority, "_", rowid(id, priority)),
value.var = 'diagnosis')
Key: <id>
id primary_1 primary_2 secondary_1 tertiary_1
<char> <char> <char> <char> <char>
1: 1 depression <NA> anxiety anorexia
2: 2 depression anxiety ptsd <NA>