Aggregating Rows based on multiple conditions (matching in one column, not matching in another)-CodePudding

I have a dataset I'm working with in R that has an ID number (ID), year they submitted data (Year), some other data (which isn't relevant to my question, but just consider them as "columns"), and a date of registration on our systems (DateR).

This dateR is autogenerated from the dataset I am using, and is supposed to represent the "earliest" date the ID number appears on our systems.

However, due to some kind problem with how the data is being pulled that I can't get fixed, the date is being recorded as a new date that updates every year, instead of simply the earliest date.

Thus, my goal would be to create a script that reworks the data and does the following two checks:

Firstly, it checks the row and identifies which rows have the matching ID number Secondly, it then applies the "earliest" date of all of the matching ID numbers in the date column.

So below is the example of a Dataset like what I am using

#	ID1	YearSubmitted	Data	DateR
1	12345	2017	100	22-03-2017
2	12345	2018	100	22-03-2018
3	12345	2019	100	22-03-2019
4	22221	2018	100	22-03-2018
5	22221	2019	100	22-03-2019

This is what I would like it to look like (I have bolded the changed numbers for clarity)

#	ID1	YearSubmitted	Data	DateR
1	12345	2017	100	22-03-2017
2	12345	2018	100	22-03-2017
3	12345	2019	100	22-03-2017
4	22221	2018	100	22-03-2018
5	22221	2019	100	22-03-2018

Most of the reference questions I have searched for this reference either replacing data with values fromanother column like If data present, replace with data from another column based on row ID, or use the replacement value as pulled from another dataframe like Replace a value in a dataframe by using other matching IDs of another dataframe in R.

I would prefer to acheive in dplyr if possible.

Preferably I'd like to start this with

data %>% group_by(ID1, Yearsubmitted) %>% mutate(across(c(DateR),

And I understand I could use the match function .. but I just draw a blank from this point on.

Thus, I would appreciate advice on how to:

Conditionally change the date if it's matching ID1 values, and secondly, to change all dates to the earliest value in the date column (DateR).

Thanks for your time.

CodePudding user response：

Try this:

quux %>%
  mutate(DateR = as.Date(DateR, format = "%d-%m-%Y")) %>%
  group_by(ID1) %>%
  mutate(DateR = min(DateR)) %>%
  ungroup()
# # A tibble: 5 × 5
#     `#`   ID1 YearSubmitted  Data DateR     
#   <int> <int>         <int> <int> <date>    
# 1     1 12345          2017   100 2017-03-22
# 2     2 12345          2018   100 2017-03-22
# 3     3 12345          2019   100 2017-03-22
# 4     4 22221          2018   100 2018-03-22
# 5     5 22221          2019   100 2018-03-22

This involves converting DateR to a "real" Date-class object, where numeric comparisons (such as min) are unambiguous and correct.

Data

quux <- structure(list("#" = 1:5, ID1 = c(12345L, 12345L, 12345L, 22221L, 22221L), YearSubmitted = c(2017L, 2018L, 2019L, 2018L, 2019L), Data = c(100L, 100L, 100L, 100L, 100L), DateR = c("22-03-2017", "22-03-2018", "22-03-2019", "22-03-2018", "22-03-2019")), class = "data.frame", row.names = c(NA, -5L))

CodePudding user response：

Here is a similar approach using dplyrs first function after using arrange to sort the years:

df %>% 
  group_by(ID1) %>% 
  arrange(YearSubmitted,.by_group = TRUE) %>% 
  mutate(DateR = first(DateR))

   ID1 YearSubmitted  Data DateR     
  <int>         <int> <int> <chr>     
1 12345          2017   100 22-03-2017
2 12345          2018   100 22-03-2017
3 12345          2019   100 22-03-2017
4 22221          2018   100 22-03-2018
5 22221          2019   100 22-03-2018