I have some duplicate IDs in my df, but I want only 1 row per ID. I cannot use unique()
or distinct()
because then some data would be erased, as the rows are not identical.
Please see this example:
# The style of df I have
df <- data.frame(IDs=c(1,1,2,3,4,4,4,5),
Intervention=c("Progesterone", "Stitch", NA, "Stitch", "Progesterone", "Stitch", "Pessary", "Progesterone"),
Other_data1= c(22,22,32,44,24,24,24,NA),
Other_data2=c("a","a","b","c","d","d","d","e"))
df
# IDs Intervention Other_data1 Other_data2
# 1 1 Progesterone 22 a
# 2 1 Stitch 22 a
# 3 2 <NA> 32 b
# 4 3 Stitch 44 c
# 5 4 Progesterone 24 d
# 6 4 Stitch 24 d
# 7 4 Pessary 24 d
# 8 5 Progesterone NA e
So if I used unique()
I would lose the full information in the df$Intervention
column.
Please could someone let me know how I can get the df into this format:
# The style of df I want
df_I_want <- data.frame(IDs=c(1,2,3,4,5),
Progesterone=c("Yes", NA, "No", "Yes", "Yes"),
Stitch=c("Yes", NA, "Yes", "Yes", "No"),
Pessary=c("No", NA, "No", "Yes", "No"),
Other_data1= c(22,32,44,24,NA),
Other_data2=c("a","b","c","d","e"))
df_I_want
# IDs Progesterone Stitch Pessary Other_data1 Other_data2
# 1 1 Yes Yes No 22 a
# 2 2 <NA> <NA> <NA> 32 b
# 3 3 No Yes No 44 c
# 4 4 Yes Yes Yes 24 d
# 5 5 Yes No No NA e
My real df contains thousands of rows x hundreds of columns, so I have many cases of df$Other_data
so I cannot really manually type out excluding these rows when reshaping the df. But there is only 1 column where the data differs by the row, as in the above example with df$Intervention
.
CodePudding user response:
Here is another pivot_wider
solution, but here I use mutate
and case_when
to identify their corresponding values under the newly expanded columns.
If all of the three newly expanded columns are NA
, they should remain NA
. Otherwise, treat NA
as "No" and non-NA
as "Yes".
Note that within across()
, you should input the column position (or column names) of the newly expanded columns (e.g. Progesterone
, Stitch
and Pessary
are newly created, and they are in position 4 to 6, therefore 4:6
).
Edit: Added length(unique(na.omit(df$Intervention)))
when calculating and comparing the number of NA
s across the newly expanded columns so that it's more dynamic
library(tidyverse)
df %>%
pivot_wider(names_from = Intervention, values_from = Intervention) %>%
select(-"NA") %>%
mutate(across(4:6, ~case_when(rowSums(is.na(across(4:6))) == length(unique(na.omit(df$Intervention))) ~ NA_character_,
is.na(.x) ~ "No",
!is.na(.x) ~ "Yes")))
# A tibble: 5 × 6
IDs Other_data1 Other_data2 Progesterone Stitch Pessary
<dbl> <dbl> <chr> <chr> <chr> <chr>
1 1 22 a Yes Yes No
2 2 32 b NA NA NA
3 3 44 c No Yes No
4 4 24 d Yes Yes Yes
5 5 NA e Yes No No
CodePudding user response:
Updated: I have now updated the code so that you have "Yes" and "No" for each intervention.
You can use the function pivot_wider()
to achieve this:
library(tidyverse)
df <- data.frame(IDs=c(1,1,2,3,4,4,4,5),
Intervention=c("Progesterone", "Stitch", NA, "Stitch", "Progesterone", "Stitch", "Pessary", "Progesterone"),
Other_data1= c(22,22,32,44,24,24,24,NA),
Other_data2=c("a","a","b","c","d","d","d","e"))
df %>%
pivot_wider(names_from = Intervention,
values_from = Intervention) %>%
select(-c(`NA`)) %>%
mutate(across(.cols = Progesterone:Pessary,
.fns = ~case_when(is.na(.) ~ "No",
TRUE ~ "Yes")))
#> # A tibble: 5 × 6
#> IDs Other_data1 Other_data2 Progesterone Stitch Pessary
#> <dbl> <dbl> <chr> <chr> <chr> <chr>
#> 1 1 22 a Yes Yes No
#> 2 2 32 b No No No
#> 3 3 44 c No Yes No
#> 4 4 24 d Yes Yes Yes
#> 5 5 NA e Yes No No
Created on 2022-08-19 with reprex v2.0.2