I have two data frames: plant_names
- species names, and plant_data
- species names, species IDs, and name origin (if it is the main name or a synonym). Sometimes in plant_data
one species name is listed as both a main name and a synonym for another species, and sometimes it is listed as a synonym for two separate species. I want to left_join the two dfs, but when multiple matches are returned, I want to keep only scientific names and not synonyms.
this is what the plant_data
looks like
plant_data <- data.frame(scientific_name = c("CAREX FOENEA",
"CAREX SICCATA", "CAREX FOENEA",
"ABIES BIFOLIA", "ABIES MENZIESII",
"CAREX ECHINATA", "CAREX MURICATA",
"CAREX ANGUSTIOR", "CAREX MURICATA"),
name_origin = c("scientific_name",
"scientific_name", "synonym",
"scientific_name", "synonym",
"scientific_name", "synonym",
"scientific_name", "synonym"),
id = c(1, 2, 2, 3, 3, 4, 4, 5, 5))
#looks like this
scientific_name name_origin id
1 CAREX FOENEA scientific_name 1
2 CAREX SICCATA scientific_name 2
3 CAREX FOENEA synonym 2
4 ABIES BIFOLIA scientific_name 3
5 ABIES MENZIESII synonym 3
6 CAREX ECHINATA scientific_name 4
7 CAREX MURICATA synonym 4
8 CAREX ANGUSTIOR scientific_name 5
9 CAREX MURICATA synonym 5
When a name is listed multiple times in plant_data
, the left join returns multiple rows for that species.
library(dplyr)
#plant names
plant_names <- data.frame(scientific_name = c("CAREX FOENEA", "CAREX FOENEA",
"ABIES MENZIESII", "ABIES MENZIESII",
"CAREX MURICATA"))
#joining
joined_df <- left_join(plant_names, plant_data, by = "scientific_name")
#produces this
scientific_name name_origin id
1 CAREX FOENEA scientific_name 1
2 CAREX FOENEA synonym 2
3 CAREX FOENEA scientific_name 1
4 CAREX FOENEA synonym 2
5 ABIES MENZIESII synonym 3
6 ABIES MENZIESII synonym 3
7 CAREX MURICATA synonym 4
8 CAREX MURICATA synonym 5
What I want to do is implement the following rules for the data:
- When a species name returns a main species and a synonym to a different species during the joining process, just keep the main species
- When a species name returns two synonyms from two different species, delete them both
- If a duplicate in the result is caused by
plant_names
having duplicates, keep them-even if both are synonyms.
My approach so far has been to accept the results of the join and try to get rid of rows I don't want by filtering out observations that are duplicated in scientific_name
but NOT duplicated in id
AND they are listed as synonyms....but nothing is quite working yet because I don't want to get rid of duplicates caused my duplicates in plant_names
here's what I'm trying
data_clean <- joined_df %>%
dplyr::mutate(name_dup = (duplicated(scientific_name) | duplicated(scientific_name, fromLast = TRUE)) &
(!duplicated(id) & !duplicated(id, fromLast = TRUE))) %>%
dplyr::filter(!name_dup | (name_dup & .data$name_origin == "scientific_name"))
data_clean
scientific_name name_origin id name_dup
1 CAREX FOENEA scientific_name 1 FALSE
2 CAREX FOENEA synonym 2 FALSE
3 CAREX FOENEA scientific_name 1 FALSE
4 CAREX FOENEA synonym 2 FALSE
5 ABIES MENZIESII synonym 4 FALSE
6 ABIES MENZIESII synonym 4 FALSE
What I want is
scientific_name name_origin id
1 CAREX FOENEA scientific_name 1
2 CAREX FOENEA scientific_name 1
3 ABIES MENZIESII synonym 4
4 ABIES MENZIESII synonym 4
Any help would be greatly appreciated!
CodePudding user response:
Here, I add a row number to plant_names
so that we can sort within each row's joined rows and keep the scientific_name one, if available.
joined_df <- left_join(mutate(plant_names, row = row_number()), plant_data) %>%
group_by(row) %>%
slice_min(name_origin != "scientific_name") %>% # FALSE will be first, ie is "min"
ungroup()
Result
# A tibble: 4 × 4
scientific_name row name_origin id
<chr> <int> <chr> <dbl>
1 CAREX FOENEA 1 scientific_name 1
2 CAREX FOENEA 2 scientific_name 1
3 ABIES MENZIESII 3 synonym 4
4 ABIES MENZIESII 4 synonym 4
CodePudding user response:
I think I found a solution based on @jon's answer!
joined_df <- left_join(mutate(plant_names, row = row_number()), plant_data) %>%
group_by(row) %>%
mutate(dup = n() > 1) %>%
ungroup() %>%
filter(!dup | name_origin == "scientific_name") %>%
select(-dup)
# A tibble: 4 × 4
scientific_name row name_origin id
<chr> <int> <chr> <dbl>
1 CAREX FOENEA 1 scientific_name 1
2 CAREX FOENEA 2 scientific_name 1
3 ABIES MENZIESII 3 synonym 3
4 ABIES MENZIESII 4 synonym 3