Home > front end >  Merging rows in a dataframe R with duplicate id's
Merging rows in a dataframe R with duplicate id's

Time:03-21

I have a question considering merging rows in a dataframe:

I have seen a couple of questions regarding merging rows, however I have a hard time understanding them and applying them to my situation:

I have a dataframe with a structure like this:

person_id    test_date    serial_number   freezer_number   test_1 test_2 test_3 test_4
x             01/01/2010  c               d                positive NA   NA      NA
x             05/01/2010  a               b                NA   positive NA      NA
y             02/02/2020  e               f                positive NA   NA      NA
......................................

I want to merge the rows so that the data of the other columns remain intact (mainly the test date), however I want the rows of the test number and the person_id to merge so that the same individual is in 1 row with multiple tests.

This would be the ideal output:

person_id    test_date    serial_number   freezer_number  test_date2  test_1 test_2  test_3 test_4
x             01/01/2010  c               d               05/01/2010  positive positive NA      NA
y             02/02/2020  e               f                           positive NA   NA      NA
......................................

How do I go about this? I have tried the "aggregate()" functions before, however this is very unclear to me.

Any help is appreciated, I can give more information to clarify my current code and data frame!

CodePudding user response:

You could use summarize_all, grouped by person_id. This preserves the variables in each first row per person_id not being NA. I added a pivot_wider to preserve the different test_dates (as pointed out by @Andrea M).

library(dplyr)
library(lubridate)

df1 <- df %>%
  group_by(person_id) %>% 
  mutate(id = seq_along(person_id)) %>% 
  pivot_wider(names_from = id,
              values_from = test_date,
              names_prefix = "test_date") %>%  
  summarize_all(list(~ .[!is.na(.)][1]))

Output

> df1
# A tibble: 2 x 9
  person_id serial_number freezer_number test_1   test_2   test_3 test_4 test_date1 test_date2
  <chr>     <chr>         <chr>          <chr>    <chr>    <lgl>  <lgl>  <chr>      <chr>     
1 x         c             d              positive positive NA     NA     01/01/2010 05/01/2010
2 y         e             f              positive NA       NA     NA     02/02/2020 NA

CodePudding user response:

What you're trying to do is reshaping the data from long format (one row per test) to wide format (one row per person, tests are in separate columns). This can be done in many ways, for example with tidyr::pivot_wider().

However there's a complicating factor - your dataset is not quite in long format because there are already multiple columns per test result. So you first need to fix that.

        # Load libraries
library(tidyr)
library(dplyr)
library(stringr)

        # Create dataset
df <- tribble(~person_id, ~test_date, ~serial_number, ~freezer_number, ~test_1, ~test_2, ~test_3, ~test_4,
"x", "01/01/2010", "c", "d", "positive", NA, NA, NA,
"x", "05/01/2010", "a", "b", NA, "positive", NA, NA,
"y", "02/02/2020", "e", "f", "positive", NA, NA, NA)

df2 <- df %>% 
        # Add a column indicating test number
        group_by(person_id) %>%
        mutate(test_number = row_number(),
        # Gather the test results into a single column
               test_result = paste0(test_1, test_2, test_3, test_4) %>%
                      str_remove_all("NA")) %>%
        select(-(test_1:test_4)) %>%
        # Reshape from long to wide
        pivot_wider(names_from = test_number,
                    values_from = c(test_date, serial_number,
                                    freezer_number, test_result)) %>% 
        # Reorder the columns
        relocate(ends_with("1"), .before = ends_with("2"))

df2
  •  Tags:  
  • r
  • Related