I have a question considering merging rows in a dataframe:
I have seen a couple of questions regarding merging rows, however I have a hard time understanding them and applying them to my situation:
I have a dataframe with a structure like this:
person_id test_date serial_number freezer_number test_1 test_2 test_3 test_4
x 01/01/2010 c d positive NA NA NA
x 05/01/2010 a b NA positive NA NA
y 02/02/2020 e f positive NA NA NA
......................................
I want to merge the rows so that the data of the other columns remain intact (mainly the test date), however I want the rows of the test number and the person_id to merge so that the same individual is in 1 row with multiple tests.
This would be the ideal output:
person_id test_date serial_number freezer_number test_date2 test_1 test_2 test_3 test_4
x 01/01/2010 c d 05/01/2010 positive positive NA NA
y 02/02/2020 e f positive NA NA NA
......................................
How do I go about this? I have tried the "aggregate()
" functions before, however this is very unclear to me.
Any help is appreciated, I can give more information to clarify my current code and data frame!
CodePudding user response:
You could use summarize_all
, grouped by person_id
. This preserves the variables in each first row per person_id
not being NA
.
I added a pivot_wider
to preserve the different test_date
s (as pointed out by @Andrea M).
library(dplyr)
library(lubridate)
df1 <- df %>%
group_by(person_id) %>%
mutate(id = seq_along(person_id)) %>%
pivot_wider(names_from = id,
values_from = test_date,
names_prefix = "test_date") %>%
summarize_all(list(~ .[!is.na(.)][1]))
Output
> df1
# A tibble: 2 x 9
person_id serial_number freezer_number test_1 test_2 test_3 test_4 test_date1 test_date2
<chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <chr> <chr>
1 x c d positive positive NA NA 01/01/2010 05/01/2010
2 y e f positive NA NA NA 02/02/2020 NA
CodePudding user response:
What you're trying to do is reshaping the data from long format (one row per test) to wide format (one row per person, tests are in separate columns). This can be done in many ways, for example with tidyr::pivot_wider()
.
However there's a complicating factor - your dataset is not quite in long format because there are already multiple columns per test result. So you first need to fix that.
# Load libraries
library(tidyr)
library(dplyr)
library(stringr)
# Create dataset
df <- tribble(~person_id, ~test_date, ~serial_number, ~freezer_number, ~test_1, ~test_2, ~test_3, ~test_4,
"x", "01/01/2010", "c", "d", "positive", NA, NA, NA,
"x", "05/01/2010", "a", "b", NA, "positive", NA, NA,
"y", "02/02/2020", "e", "f", "positive", NA, NA, NA)
df2 <- df %>%
# Add a column indicating test number
group_by(person_id) %>%
mutate(test_number = row_number(),
# Gather the test results into a single column
test_result = paste0(test_1, test_2, test_3, test_4) %>%
str_remove_all("NA")) %>%
select(-(test_1:test_4)) %>%
# Reshape from long to wide
pivot_wider(names_from = test_number,
values_from = c(test_date, serial_number,
freezer_number, test_result)) %>%
# Reorder the columns
relocate(ends_with("1"), .before = ends_with("2"))
df2