Filter based on non match of string between columns in R

Time:01-12

I have a large dataset with two columns (a small sample is shown below): one holds the complete species name with its authority, and the other holds only the species name. I would like to create a new column containing the part that does not match between the two, that is, only the authority.

Data sample:

Column_A                              Column_B               
Crocidura jacksoni Thomas, 1904       Crocidura jacksoni     
Pelomys fallax (Peters, 1852)         Pelomys fallax         
Ictonyx striatus (Perry, 1810)        Ictonyx striatus       
Acomys cahirinus (É.Geoffroy, 1803)   Acomys cahirinus       

I tried the following using dplyr and stringr, as I saw in another question here:

df$New_column <- df %>% filter(str_detect(Column_B, Column_A, negate=TRUE))

But I get this error:

Error in UseMethod("filter") : no applicable method for 'filter' applied to an object of class "logical"

...or some of my columns aren't recognized.

This is the desired result:

Column_A                              Column_B               New_column
Crocidura jacksoni Thomas, 1904       Crocidura jacksoni     Thomas, 1904
Pelomys fallax (Peters, 1852)         Pelomys fallax         (Peters, 1852)
Ictonyx striatus (Perry, 1810)        Ictonyx striatus       (Perry, 1810)
Acomys cahirinus (É.Geoffroy, 1803)   Acomys cahirinus       (É.Geoffroy, 1803)

Thank you in advance.

CodePudding user response:

You could remove the matched pattern using str_remove, like this (paste0 appends a space to the pattern so that the separating space is also removed from the output):

library(stringr)
library(dplyr)

df %>%
  mutate(New_column = str_remove(Column_A, paste0(Column_B, " ")))
#> # A tibble: 4 × 3
#>   Column_A                            Column_B           New_column        
#>   <chr>                               <chr>              <chr>             
#> 1 Crocidura jacksoni Thomas, 1904     Crocidura jacksoni Thomas, 1904      
#> 2 Pelomys fallax (Peters, 1852)       Pelomys fallax     (Peters, 1852)    
#> 3 Ictonyx striatus (Perry, 1810)      Ictonyx striatus   (Perry, 1810)     
#> 4 Acomys cahirinus (É.Geoffroy, 1803) Acomys cahirinus   (É.Geoffroy, 1803)

Created on 2023-01-11 with reprex v2.0.2


Data:

library(tibble)
df <- tribble(
  ~Column_A,                             ~Column_B,
  "Crocidura jacksoni Thomas, 1904",     "Crocidura jacksoni",
  "Pelomys fallax (Peters, 1852)",       "Pelomys fallax",
  "Ictonyx striatus (Perry, 1810)",      "Ictonyx striatus",
  "Acomys cahirinus (É.Geoffroy, 1803)", "Acomys cahirinus"
)
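
One caveat worth noting: str_remove interprets its pattern as a regular expression, so a value in Column_B that contains characters such as `.` or `(` would not be matched literally. A minimal sketch, assuming the same column names as above and a shortened two-row sample, wraps the pattern in stringr's fixed() and trims the leftover space with str_trim() instead of pasting a space onto the pattern:

```r
library(stringr)
library(dplyr)
library(tibble)

# Shortened sample for illustration (same column names as in the question)
df <- tribble(
  ~Column_A,                         ~Column_B,
  "Crocidura jacksoni Thomas, 1904", "Crocidura jacksoni",
  "Pelomys fallax (Peters, 1852)",   "Pelomys fallax"
)

result <- df %>%
  mutate(
    # fixed() makes each Column_B value a literal string, not a regex;
    # str_trim() removes the space left between name and authority
    New_column = str_trim(str_remove(Column_A, fixed(Column_B)))
  )

print(result$New_column)
```

This keeps the parentheses around the authority, as in the desired result.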

CodePudding user response:

An alternative solution that not only removes the superfluous whitespace but also strips the parentheses uses trimws, whose whitespace argument takes a regular expression describing the characters to trim from both ends:

library(tidyverse)
df %>%
  mutate(New_column = trimws(str_remove(Column_A, Column_B), 
                             whitespace = "[\\s()]")) 
# A tibble: 4 × 3
  Column_A                            Column_B           New_column      
  <chr>                               <chr>              <chr>           
1 Crocidura jacksoni Thomas, 1904     Crocidura jacksoni Thomas, 1904    
2 Pelomys fallax (Peters, 1852)       Pelomys fallax     Peters, 1852    
3 Ictonyx striatus (Perry, 1810)      Ictonyx striatus   Perry, 1810     
4 Acomys cahirinus (É.Geoffroy, 1803) Acomys cahirinus   É.Geoffroy, 1803
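
If no extra packages are wanted, the same idea can be sketched in base R. This is only an illustration under one assumption that the sample data satisfies but the full dataset may not: Column_B is always a literal prefix of Column_A. Rows where that does not hold get NA here rather than a silently wrong value.

```r
# Shortened sample for illustration (same column names as in the question)
df <- data.frame(
  Column_A = c("Crocidura jacksoni Thomas, 1904", "Pelomys fallax (Peters, 1852)"),
  Column_B = c("Crocidura jacksoni", "Pelomys fallax"),
  stringsAsFactors = FALSE
)

# Assumption: Column_B is a literal prefix of Column_A in every row
is_prefix <- startsWith(df$Column_A, df$Column_B)

# Drop the prefix by position, then trim the surrounding whitespace
remainder <- trimws(substring(df$Column_A, nchar(df$Column_B) + 1))

# Rows that violate the assumption are marked NA
df$New_column <- ifelse(is_prefix, remainder, NA_character_)

print(df$New_column)
```

Because the prefix is removed by position rather than by pattern, no regular-expression escaping is involved.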