remove duplicates but keep the row based on a specific column


I have a large dataset that was built by combining data from multiple sources, so a number of rows are duplicates. I know how to remove duplicates using dplyr and distinct(), but I would like it to always keep the row based on a specific value in one column (the source file). Essentially we have a ranking of which sources we prefer. Below is a very simplified dataset to use as an example:

mydata = data.frame(species = c('myli','myli','myli','myli','myli','stili','stili','stili'),
                    count = c(10,10,15,15,12,10,10,10),
                    year = c(2020,2020,2021,2021,2019,2017,2017,2018),
                    source = c('zd','steam','ted','steam','zd','steam','ted','steam'))

mydata

  species count year source
1    myli    10 2020     zd
2    myli    10 2020  steam
3    myli    15 2021    ted
4    myli    15 2021  steam
5    myli    12 2019     zd
6   stili    10 2017  steam
7   stili    10 2017    ted
8   stili    10 2018  steam

I do the following to remove the duplicates:

library(dplyr)
 
# Remove duplicate rows of the dataframe using 'species', 'count', and 'year' variables
distinct(mydata, species, count, year, .keep_all = TRUE)

  species count year source
1    myli    10 2020     zd
2    myli    15 2021    ted
3    myli    12 2019     zd
4   stili    10 2017  steam
5   stili    10 2018  steam

However, I want to ensure that the rows that are kept when there are duplicates prioritize the 'source' in the following order: zd > ted > steam so the final table looks like:

  species count year source
1    myli    10 2020     zd
2    myli    15 2021    ted
3    myli    12 2019     zd
4   stili    10 2017    ted
5   stili    10 2018  steam

So essentially the original rows 1, 3, 5, 7 and 8 are kept and the duplicate rows 2, 4 and 6 are dropped.

I appreciate any suggestions on how to do that last step to prioritize which original row to keep of the duplicated rows.

Thank you very much, Amanda

CodePudding user response:

Since your prioritization happens to be in reverse alphabetical order, in this case you can simply arrange(desc(source)) before your distinct() call:

mydata %>% 
  arrange(desc(source)) %>% 
  distinct(species, count, year, .keep_all = TRUE)

Output

  species count year source
1    myli    10 2020     zd
2    myli    12 2019     zd
3    myli    15 2021    ted
4   stili    10 2017    ted
5   stili    10 2018  steam
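
If the preferred order ever stops lining up with alphabetical order, a more general variant (a sketch, not part of the answer above) is to arrange by an explicit priority vector using match(); the priority vector here is just the zd > ted > steam ranking from the question:

library(dplyr)

# Hypothetical explicit ranking: match() returns 1 for 'zd', 2 for 'ted', 3 for 'steam',
# so arranging by it puts the preferred source first within each duplicate group.
priority <- c('zd', 'ted', 'steam')

mydata %>% 
  arrange(match(source, priority)) %>% 
  distinct(species, count, year, .keep_all = TRUE)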

CodePudding user response:

distinct() respects the row order. So, since your criterion happens to be alphabetically ordered*, you can do it as simply as this:

mydata |>
  arrange(desc(source)) |>
  distinct(species, count, year, .keep_all= TRUE)

* In other cases you would need to create a variable encoding the desired order, as in the sketch below.
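
For example, a minimal sketch of that ordering variable, assuming the same zd > ted > steam ranking (the priority vector and the source_rank helper column are illustrative names, not from the answer), could look like this:

library(dplyr)

# Assumed preference ranking; adjust to your real source hierarchy
priority <- c('zd', 'ted', 'steam')

mydata |>
  mutate(source_rank = factor(source, levels = priority)) |>  # encode the preference as factor levels
  arrange(source_rank) |>                                     # preferred sources sort first
  distinct(species, count, year, .keep_all = TRUE) |>         # keep the first (most preferred) row
  select(-source_rank)                                        # drop the helper column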
