I have a dataset I'm working with in R that has an ID number (ID), year they submitted data (Year), some other data (which isn't relevant to my question, but just consider them as "columns"), and a date of registration on our systems (DateR).
This dateR is autogenerated from the dataset I am using, and is supposed to represent the "earliest" date the ID number appears on our systems.
However, due to some kind problem with how the data is being pulled that I can't get fixed, the date is being recorded as a new date that updates every year, instead of simply the earliest date.
Thus, my goal would be to create a script that reworks the data and does the following two checks:
Firstly, it checks the row and identifies which rows have the matching ID number Secondly, it then applies the "earliest" date of all of the matching ID numbers in the date column.
So below is the example of a Dataset like what I am using
# | ID1 | YearSubmitted | Data | DateR |
---|---|---|---|---|
1 | 12345 | 2017 | 100 | 22-03-2017 |
2 | 12345 | 2018 | 100 | 22-03-2018 |
3 | 12345 | 2019 | 100 | 22-03-2019 |
4 | 22221 | 2018 | 100 | 22-03-2018 |
5 | 22221 | 2019 | 100 | 22-03-2019 |
This is what I would like it to look like (I have bolded the changed numbers for clarity)
# | ID1 | YearSubmitted | Data | DateR |
---|---|---|---|---|
1 | 12345 | 2017 | 100 | 22-03-2017 |
2 | 12345 | 2018 | 100 | 22-03-2017 |
3 | 12345 | 2019 | 100 | 22-03-2017 |
4 | 22221 | 2018 | 100 | 22-03-2018 |
5 | 22221 | 2019 | 100 | 22-03-2018 |
Most of the reference questions I have searched for this reference either replacing data with values fromanother column like If data present, replace with data from another column based on row ID, or use the replacement value as pulled from another dataframe like Replace a value in a dataframe by using other matching IDs of another dataframe in R.
I would prefer to acheive in dplyr if possible.
Preferably I'd like to start this with
data %>% group_by(ID1, Yearsubmitted) %>% mutate(across(c(DateR),
And I understand I could use the match function .. but I just draw a blank from this point on.
Thus, I would appreciate advice on how to:
Conditionally change the date if it's matching ID1 values, and secondly, to change all dates to the earliest value in the date column (DateR).
Thanks for your time.
CodePudding user response:
Try this:
quux %>%
mutate(DateR = as.Date(DateR, format = "%d-%m-%Y")) %>%
group_by(ID1) %>%
mutate(DateR = min(DateR)) %>%
ungroup()
# # A tibble: 5 × 5
# `#` ID1 YearSubmitted Data DateR
# <int> <int> <int> <int> <date>
# 1 1 12345 2017 100 2017-03-22
# 2 2 12345 2018 100 2017-03-22
# 3 3 12345 2019 100 2017-03-22
# 4 4 22221 2018 100 2018-03-22
# 5 5 22221 2019 100 2018-03-22
This involves converting DateR
to a "real" Date
-class object, where numeric comparisons (such as min
) are unambiguous and correct.
Data
quux <- structure(list("#" = 1:5, ID1 = c(12345L, 12345L, 12345L, 22221L, 22221L), YearSubmitted = c(2017L, 2018L, 2019L, 2018L, 2019L), Data = c(100L, 100L, 100L, 100L, 100L), DateR = c("22-03-2017", "22-03-2018", "22-03-2019", "22-03-2018", "22-03-2019")), class = "data.frame", row.names = c(NA, -5L))
CodePudding user response:
Here is a similar approach using dplyr
s first
function after using arrange
to sort the years:
df %>%
group_by(ID1) %>%
arrange(YearSubmitted,.by_group = TRUE) %>%
mutate(DateR = first(DateR))
ID1 YearSubmitted Data DateR
<int> <int> <int> <chr>
1 12345 2017 100 22-03-2017
2 12345 2018 100 22-03-2017
3 12345 2019 100 22-03-2017
4 22221 2018 100 22-03-2018
5 22221 2019 100 22-03-2018