Home > front end >  Merge rows in 1 df by ID and create new columns of differing data in R (in very large df)
Merge rows in 1 df by ID and create new columns of differing data in R (in very large df)

Time:08-20

I have some duplicate IDs in my df, but I want only 1 row per ID. I cannot use unique() or distinct() because then some data would be erased, as the rows are not identical.

Please see this example:

# The style of df I have 
df <- data.frame(IDs=c(1,1,2,3,4,4,4,5),
                 Intervention=c("Progesterone", "Stitch", NA, "Stitch", "Progesterone", "Stitch", "Pessary", "Progesterone"),
                 Other_data1= c(22,22,32,44,24,24,24,NA), 
                 Other_data2=c("a","a","b","c","d","d","d","e"))
df
#   IDs Intervention Other_data1 Other_data2
# 1   1 Progesterone          22           a
# 2   1       Stitch          22           a
# 3   2         <NA>          32           b
# 4   3       Stitch          44           c
# 5   4 Progesterone          24           d
# 6   4       Stitch          24           d
# 7   4      Pessary          24           d
# 8   5 Progesterone          NA           e

So if I used unique() I would lose the full information in the df$Intervention column.

Please could someone let me know how I can get the df into this format:

# The style of df I want
df_I_want <- data.frame(IDs=c(1,2,3,4,5),
                        Progesterone=c("Yes", NA, "No", "Yes", "Yes"),
                        Stitch=c("Yes", NA, "Yes", "Yes", "No"),
                        Pessary=c("No", NA, "No", "Yes", "No"),
                        Other_data1= c(22,32,44,24,NA), 
                        Other_data2=c("a","b","c","d","e"))
df_I_want
#   IDs Progesterone Stitch Pessary Other_data1 Other_data2
# 1   1          Yes    Yes      No          22           a
# 2   2         <NA>   <NA>    <NA>          32           b
# 3   3           No    Yes      No          44           c
# 4   4          Yes    Yes     Yes          24           d
# 5   5          Yes     No      No          NA           e

My real df contains thousands of rows x hundreds of columns, so I have many cases of df$Other_data so I cannot really manually type out excluding these rows when reshaping the df. But there is only 1 column where the data differs by the row, as in the above example with df$Intervention.

CodePudding user response:

Here is another pivot_wider solution, but here I use mutate and case_when to identify their corresponding values under the newly expanded columns.

If all of the three newly expanded columns are NA, they should remain NA. Otherwise, treat NA as "No" and non-NA as "Yes".

Note that within across(), you should input the column position (or column names) of the newly expanded columns (e.g. Progesterone, Stitch and Pessary are newly created, and they are in position 4 to 6, therefore 4:6).

Edit: Added length(unique(na.omit(df$Intervention))) when calculating and comparing the number of NAs across the newly expanded columns so that it's more dynamic

library(tidyverse)

df %>% 
  pivot_wider(names_from = Intervention, values_from = Intervention) %>% 
  select(-"NA") %>% 
  mutate(across(4:6, ~case_when(rowSums(is.na(across(4:6))) == length(unique(na.omit(df$Intervention))) ~ NA_character_,
                                is.na(.x) ~ "No",
                                !is.na(.x) ~ "Yes")))

# A tibble: 5 × 6
    IDs Other_data1 Other_data2 Progesterone Stitch Pessary
  <dbl>       <dbl> <chr>       <chr>        <chr>  <chr>  
1     1          22 a           Yes          Yes    No     
2     2          32 b           NA           NA     NA     
3     3          44 c           No           Yes    No     
4     4          24 d           Yes          Yes    Yes    
5     5          NA e           Yes          No     No     

CodePudding user response:

Updated: I have now updated the code so that you have "Yes" and "No" for each intervention.

You can use the function pivot_wider() to achieve this:

library(tidyverse)
df <- data.frame(IDs=c(1,1,2,3,4,4,4,5),
                 Intervention=c("Progesterone", "Stitch", NA, "Stitch", "Progesterone", "Stitch", "Pessary", "Progesterone"),
                 Other_data1= c(22,22,32,44,24,24,24,NA), 
                 Other_data2=c("a","a","b","c","d","d","d","e"))

df %>%
  pivot_wider(names_from = Intervention, 
              values_from = Intervention) %>%
  select(-c(`NA`)) %>%
  mutate(across(.cols = Progesterone:Pessary, 
                .fns = ~case_when(is.na(.) ~ "No",
                                  TRUE ~ "Yes")))
#> # A tibble: 5 × 6
#>     IDs Other_data1 Other_data2 Progesterone Stitch Pessary
#>   <dbl>       <dbl> <chr>       <chr>        <chr>  <chr>  
#> 1     1          22 a           Yes          Yes    No     
#> 2     2          32 b           No           No     No     
#> 3     3          44 c           No           Yes    No     
#> 4     4          24 d           Yes          Yes    Yes    
#> 5     5          NA e           Yes          No     No

Created on 2022-08-19 with reprex v2.0.2

  • Related