I have repeat rows of ethnicity data of random people with dates on when the ethnicity was assigned. I would like to assign each person only one ethnicity category (white, asian, black, other, mixed) based on these conditions: (1) if the person has multiple ethnicity, then the most common one (highest ethnicity_n) will be assigned. (2) if the patient has multiple ethnicity in equal numbers (e.g. 1 mixed, 1 Asian, 1 Other) then the most recent one will be assigned. I organized my data in a way so that I have a count of each ethnicity per patient and I wrote code to make the date of ethnicity in descending order. However, when I run code to get the first row of my organized table I end up with random ethnicity assigned for each person.
person | ethnicity | ethnicity_n | ethnicity_date |
---|---|---|---|
1 | white | 4 | 04/09/2019 |
1 | white | 4 | 04/09/2018 |
1 | white | 4 | 04/09/2017 |
1 | white | 4 | 04/09/2016 |
1 | other | 1 | 04/09/2015 |
2 | asian | 1 | 04/09/2019 |
2 | other | 1 | 04/09/2018 |
2 | mixed | 1 | 04/09/2017 |
3 | black | 2 | 04/09/2016 |
3 | black | 2 | 04/09/2015 |
I used this code to make the table above
df %>%
group_by(person,ethnicity_n,ethnicity_date) %>%
arrange(person,ethnicity_n,desc(ethnicity_date))
I want the final table to look like this
person | ethnicity | ethnicity_n | ethnicity_date |
---|---|---|---|
1 | white | 4 | 04/09/2019 |
2 | asian | 1 | 04/09/2019 |
3 | black | 2 | 04/09/2016 |
I tried all of these codes to get the second table but each time the ethnicity is not meeting the conditions that i want
df %>%
group_by(person,ethnicity_n,ethnicity_date) %>%
arrange(person,ethnicity_n,desc(ethnicity_date)) %>%
slice(1L)
df %>%
group_by(person) %>%
arrange(person,ethnicity_n,desc(ethnicity_date)) %>%
slice(1L)
df %>%
group_by(person,ethnicity_n,ethnicity_date) %>%
arrange(person,ethnicity_n,desc(ethnicity_date)) %>%
filter(row_number()==1)
df %>%
group_by(person) %>%
arrange(person,ethnicity_n,desc(ethnicity_date)) %>%
filter(row_number()==1)
Data:
df <- structure(list(person = c(
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L,
3L
), ethnicity = c(
"white", "white", "white", "white", "other",
"asian", "other", "mixed", "black", "black"
), ethnicity_n = c(
4L,
4L, 4L, 4L, 1L, 1L, 1L, 1L, 2L, 2L
), ethnicity_date = c(
"04/09/2019",
"04/09/2018", "04/09/2017", "04/09/2016", "04/09/2015", "04/09/2019",
"04/09/2018", "04/09/2017", "04/09/2016", "04/09/2015"
)), class = "data.frame", row.names = c(
NA,
-10L
))
CodePudding user response:
Your main problem was that your ethnicity_date
was a character vector, rather than a Date
.
I have assumed it's in month-day-year format but if it's day-month-year format you can just change format = "%m/%d/%Y"
to format = "%d/%m/%Y"
.
Note that as we are using max()
functions for ethnicity_date
and ethnicity_n
, it is not necessary to arrange()
the data in order first. You will need to supply na.rm=TRUE
to the max()
function if you have any NA
values in either column.
df |>
mutate(
ethnicity_date = as.Date(
ethnicity_date,
format = "%m/%d/%Y"
)
) |>
group_by(person) |>
filter(
ethnicity_n == max(ethnicity_n)
) |>
filter(
ethnicity_date == max(ethnicity_date)
) |>
slice(1L) # in case there are still ties
# # A tibble: 3 x 4
# # Groups: person [3]
# person ethnicity ethnicity_n ethnicity_date
# <int> <chr> <int> <date>
# 1 1 white 4 2019-04-09
# 2 2 asian 1 2019-04-09
# 3 3 black 2 2016-04-09
I kept your slice(1L)
at the end in case there is an individual with more than one row with the same ethnicity_n
and ethnicity_date
, but you can remove it if you want to keep both rows in that case.
CodePudding user response:
EDIT: Thank you for adding a sample dataset, that is much easier.
I think slice_max()
is what you are looking for, see below. It is also better to format the date properly first
library(dplyr)
df %>%
mutate(ethnicity_date = as.Date(ethnicity_date, format = "%d/%m/%Y")) %>%
group_by(person) %>%
slice_max(ethnicity_n) %>%
slice_max(ethnicity_date) %>%
ungroup()
# A tibble: 3 × 4
# person ethnicity ethnicity_n ethnicity_date
# <int> <chr> <int> <date>
# 1 1 white 4 2019-09-04
# 2 2 asian 1 2019-09-04
# 3 3 black 2 2016-09-04