I have a large dataset that was built by combining data from multiple sources. Hence, there are a number of rows that are duplicates. I know how to remove duplicates using dplyr and distinct but I would like to have it always keep the row based on a specific value in a cell (source file). Essentially we have a ranking of which sources we prefer. Below is a very simplified dataset to use as an example:
mydata = data.frame (species =c ('myli','myli','myli','myli','myli','stili','stili','stili'),
count = c (10,10,15,15,12,10,10,10),
year =c(2020,2020,2021,2021,2019,2017,2017,2018),
source =c('zd','steam','ted','steam','zd','steam','ted','steam'))
mydata
species count year source
1 myli 10 2020 zd
2 myli 10 2020 steam
3 myli 15 2021 ted
4 myli 15 2021 steam
5 myli 12 2019 zd
6 stili 10 2017 steam
7 stili 10 2017 ted
8 stili 10 2018 steam
I do the following to remove the duplicates:
library(dplyr)
# Remove duplicate rows of the dataframe using 'species', 'count', and 'year' variables
distinct(mydata, species, count, year, .keep_all= TRUE)
species count year source
1 myli 10 2020 zd
2 myli 15 2021 ted
3 myli 12 2019 zd
4 stili 10 2017 steam
5 stili 10 2018 steam
However, I want to ensure that the rows that are kept when there are duplicates prioritize the 'source' in the following order: zd > ted > steam so the final table looks like:
species count year source
1 myli 10 2020 zd
2 myli 15 2021 ted
3 myli 12 2019 zd
4 stili 10 2017 ted
5 stili 10 2018 steam
So essentially the original rows '1', '3','5', '7' and '8' are kept and the duplicate rows '2','4', and '6' are dropped.
I appreciate any suggestions on how to do that last step to prioritize which original row to keep of the duplicated rows.
Thank you very much, Amanda
CodePudding user response:
Since your prioritization happens to be in reverse alphabetical order, in this case you can simply arrange(desc(source))
prior to your distinct()
call
mydata %>%
arrange(desc(source)) %>%
distinct(species,count,year,.keep_all = T)
Output
species count year source
1 myli 10 2020 zd
2 myli 12 2019 zd
3 myli 15 2021 ted
4 stili 10 2017 ted
5 stili 10 2018 steam
CodePudding user response:
Distinct respect the ordering. So as your criteria is alphabetically ordered* you can do it as simple as this:
mydata |>
arrange(desc(source)) |>
distinct(species, count, year, .keep_all= TRUE)
.* In other cases you'd need to make a variable with the order.