I have a large dataset with two columns (a sample is given below): one holds the complete species name with its authority, and the other holds only the species name. I would like to create a new column containing the part that does not match between the two, that is, only the authority.
Data sample:
Column_A                             Column_B
Crocidura jacksoni Thomas, 1904      Crocidura jacksoni
Pelomys fallax (Peters, 1852)        Pelomys fallax
Ictonyx striatus (Perry, 1810)       Ictonyx striatus
Acomys cahirinus (É.Geoffroy, 1803)  Acomys cahirinus
I have tried the following using dplyr and stringr, as I saw in another question here:
df$New_column <- df %>% filter(str_detected(Column_B, Column_A, negative=TRUE))
But I get this error:
Error in UseMethod("filter") : no applicable method for 'filter' applied to an object of class "logical"
...or some of my columns aren't recognized.
This is the desired result:
Column_A                             Column_B            New_column
Crocidura jacksoni Thomas, 1904      Crocidura jacksoni  Thomas, 1904
Pelomys fallax (Peters, 1852)        Pelomys fallax      (Peters, 1852)
Ictonyx striatus (Perry, 1810)       Ictonyx striatus    (Perry, 1810)
Acomys cahirinus (É.Geoffroy, 1803)  Acomys cahirinus    (É.Geoffroy, 1803)
Thank you in advance.
CodePudding user response:
You could remove the matched pattern using str_remove, like this (a trailing space is appended to the pattern with paste0 so that it is removed from the output as well):
library(stringr)
library(dplyr)
df %>%
  mutate(New_column = str_remove(Column_A, paste0(Column_B, " ")))
#> # A tibble: 4 × 3
#>   Column_A                            Column_B           New_column
#>   <chr>                               <chr>              <chr>
#> 1 Crocidura jacksoni Thomas, 1904     Crocidura jacksoni Thomas, 1904
#> 2 Pelomys fallax (Peters, 1852)       Pelomys fallax     (Peters, 1852)
#> 3 Ictonyx striatus (Perry, 1810)      Ictonyx striatus   (Perry, 1810)
#> 4 Acomys cahirinus (É.Geoffroy, 1803) Acomys cahirinus   (É.Geoffroy, 1803)
Created on 2023-01-11 with reprex v2.0.2
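One caveat, offered as a hedged sketch rather than as part of the answer above: str_remove interprets its pattern as a regular expression, so a species name containing characters such as ( or ) would be read as regex syntax and might not match. Wrapping the pattern in stringr's fixed() makes the match literal. The data frame df_literal and its last row (a name with a subgenus in parentheses) are illustrative assumptions, not part of the original data.

```r
library(stringr)
library(dplyr)
library(tibble)

# Two of the sample rows plus one hypothetical row (the last one) whose
# species name contains parentheses, which are regex metacharacters.
df_literal <- tribble(
  ~Column_A,                                           ~Column_B,
  "Crocidura jacksoni Thomas, 1904",                   "Crocidura jacksoni",
  "Pelomys fallax (Peters, 1852)",                     "Pelomys fallax",
  "Apodemus (Sylvaemus) flavicollis (Melchior, 1834)", "Apodemus (Sylvaemus) flavicollis"
)

# fixed() tells stringr to treat each pattern as a literal string,
# so the parentheses in the name are matched as-is.
result_literal <- df_literal %>%
  mutate(New_column = str_remove(Column_A, fixed(paste0(Column_B, " "))))

result_literal$New_column
#> [1] "Thomas, 1904"     "(Peters, 1852)"   "(Melchior, 1834)"
```

Without fixed(), the third row would be left unchanged, because the regex group (Sylvaemus) does not match the literal parentheses in Column_A.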
Data:
library(tibble)
df <- tribble(~Column_A , ~Column_B ,
"Crocidura jacksoni Thomas, 1904" , "Crocidura jacksoni" ,
"Pelomys fallax (Peters, 1852)" , "Pelomys fallax",
"Ictonyx striatus (Perry, 1810)" , "Ictonyx striatus",
"Acomys cahirinus (É.Geoffroy, 1803)", "Acomys cahirinus")
CodePudding user response:
An alternative that not only gets rid of the superfluous whitespace but also removes the parentheses is to wrap the result in trimws:
library(tidyverse)
df %>%
  mutate(New_column = trimws(str_remove(Column_A, Column_B),
                             whitespace = "[\\s()]"))
# A tibble: 4 × 3
  Column_A                            Column_B           New_column
  <chr>                               <chr>              <chr>
1 Crocidura jacksoni Thomas, 1904     Crocidura jacksoni Thomas, 1904
2 Pelomys fallax (Peters, 1852)       Pelomys fallax     Peters, 1852
3 Ictonyx striatus (Perry, 1810)      Ictonyx striatus   Perry, 1810
4 Acomys cahirinus (É.Geoffroy, 1803) Acomys cahirinus   É.Geoffroy, 1803
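For anyone who prefers to avoid the tidyverse dependency, the same idea can probably be sketched in base R. This is only a minimal sketch under a few assumptions: the vectors col_a and col_b are hypothetical stand-ins for the two columns (using three of the sample rows), and the whitespace argument of trimws requires a reasonably recent R (3.6.0 or later, as far as I know).

```r
# Hypothetical stand-ins for Column_A and Column_B (three of the sample rows).
col_a <- c("Crocidura jacksoni Thomas, 1904",
           "Pelomys fallax (Peters, 1852)",
           "Ictonyx striatus (Perry, 1810)")
col_b <- c("Crocidura jacksoni",
           "Pelomys fallax",
           "Ictonyx striatus")

# sub() is not vectorised over its pattern, so pair the elements with mapply().
# fixed = TRUE matches the species name literally instead of as a regex.
remainder <- mapply(function(full, name) sub(name, "", full, fixed = TRUE),
                    col_a, col_b, USE.NAMES = FALSE)

# Strip leading and trailing whitespace and parentheses, as in the answer above.
authority <- trimws(remainder, whitespace = "[\\s()]")

authority
#> [1] "Thomas, 1904" "Peters, 1852" "Perry, 1810"
```

Note that trimws only strips characters at the two ends of each string, so any parentheses inside the authority would be kept.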