If a left join in R returns multiple matches, creating duplicates in one column, choose which to kee-CodePudding

I have two data frames: plant_names- species names, and plant_data - species names, species IDs, and name origin (if it is the main name or a synonym). Sometimes in plant_data one species name is listed as both a main name and a synonym for another species, and sometimes it is listed as a synonym for two separate species. I want to left_join the two dfs, but when multiple matches are returned, I want to keep only scientific names and not synonyms.

this is what the plant_data looks like

plant_data <- data.frame(scientific_name = c("CAREX FOENEA", 
                                             "CAREX SICCATA", "CAREX FOENEA", 
                                             "ABIES BIFOLIA", "ABIES MENZIESII", 
                                             "CAREX ECHINATA", "CAREX MURICATA",
                                             "CAREX ANGUSTIOR", "CAREX MURICATA"),
                         name_origin = c("scientific_name", 
                                         "scientific_name", "synonym", 
                                         "scientific_name", "synonym",
                                         "scientific_name", "synonym",
                                         "scientific_name", "synonym"),
                         id = c(1, 2, 2, 3, 3, 4, 4, 5, 5))

#looks like this
  scientific_name     name_origin id
1    CAREX FOENEA scientific_name  1
2   CAREX SICCATA scientific_name  2
3    CAREX FOENEA         synonym  2
4   ABIES BIFOLIA scientific_name  3
5 ABIES MENZIESII         synonym  3
6  CAREX ECHINATA scientific_name  4
7  CAREX MURICATA         synonym  4
8 CAREX ANGUSTIOR scientific_name  5
9  CAREX MURICATA         synonym  5

When a name is listed multiple times in plant_data, the left join returns multiple rows for that species.

library(dplyr)

#plant names
plant_names <- data.frame(scientific_name = c("CAREX FOENEA", "CAREX FOENEA", 
                                              "ABIES MENZIESII", "ABIES MENZIESII", 
                                              "CAREX MURICATA"))

#joining
joined_df <- left_join(plant_names, plant_data, by = "scientific_name")

#produces this
  scientific_name     name_origin id
1    CAREX FOENEA scientific_name  1
2    CAREX FOENEA         synonym  2
3    CAREX FOENEA scientific_name  1
4    CAREX FOENEA         synonym  2
5 ABIES MENZIESII         synonym  3
6 ABIES MENZIESII         synonym  3
7  CAREX MURICATA         synonym  4
8  CAREX MURICATA         synonym  5

What I want to do is implement the following rules for the data:

When a species name returns a main species and a synonym to a different species during the joining process, just keep the main species
When a species name returns two synonyms from two different species, delete them both
If a duplicate in the result is caused by plant_names having duplicates, keep them-even if both are synonyms.

My approach so far has been to accept the results of the join and try to get rid of rows I don't want by filtering out observations that are duplicated in scientific_name but NOT duplicated in id AND they are listed as synonyms....but nothing is quite working yet because I don't want to get rid of duplicates caused my duplicates in plant_names

here's what I'm trying

data_clean <- joined_df %>%
  dplyr::mutate(name_dup = (duplicated(scientific_name) | duplicated(scientific_name, fromLast = TRUE)) & 
                  (!duplicated(id) & !duplicated(id, fromLast = TRUE))) %>%
  dplyr::filter(!name_dup | (name_dup & .data$name_origin == "scientific_name"))

data_clean
  scientific_name     name_origin id name_dup
1    CAREX FOENEA scientific_name  1    FALSE
2    CAREX FOENEA         synonym  2    FALSE
3    CAREX FOENEA scientific_name  1    FALSE
4    CAREX FOENEA         synonym  2    FALSE
5 ABIES MENZIESII         synonym  4    FALSE
6 ABIES MENZIESII         synonym  4    FALSE

What I want is

  scientific_name     name_origin id
1    CAREX FOENEA scientific_name  1
2    CAREX FOENEA scientific_name  1
3 ABIES MENZIESII         synonym  4
4 ABIES MENZIESII         synonym  4

Any help would be greatly appreciated!

CodePudding user response：

Here, I add a row number to plant_names so that we can sort within each row's joined rows and keep the scientific_name one, if available.

joined_df <- left_join(mutate(plant_names, row = row_number()), plant_data) %>%
  group_by(row) %>%
  slice_min(name_origin != "scientific_name") %>% # FALSE will be first, ie is "min"
  ungroup()

Result

# A tibble: 4 × 4
  scientific_name   row name_origin        id
  <chr>           <int> <chr>           <dbl>
1 CAREX FOENEA        1 scientific_name     1
2 CAREX FOENEA        2 scientific_name     1
3 ABIES MENZIESII     3 synonym             4
4 ABIES MENZIESII     4 synonym             4

CodePudding user response：

I think I found a solution based on @jon's answer!

joined_df <- left_join(mutate(plant_names, row = row_number()), plant_data) %>%
  group_by(row) %>%
  mutate(dup = n() > 1) %>% 
  ungroup() %>% 
  filter(!dup | name_origin == "scientific_name") %>% 
  select(-dup)

# A tibble: 4 × 4
  scientific_name   row name_origin        id
  <chr>           <int> <chr>           <dbl>
1 CAREX FOENEA        1 scientific_name     1
2 CAREX FOENEA        2 scientific_name     1
3 ABIES MENZIESII     3 synonym             3
4 ABIES MENZIESII     4 synonym             3