Extract the difference between two strings from a table and write it in a new column in R

Time:04-13

I am new to coding and I am trying to solve this problem in R:

I have a table with two columns, each holding chains of strings of unequal length. The elements of each chain are separated by a symbol (here a semicolon). For each row, I want to extract the elements that appear in only one of the two chains and write them to a new column.

Stringchain 1      Stringchain 2          Result
A00;B01;C02;D03    A00;B01;C02;D03;E04    E04
E04;F05;G06;H07    F05;G06;H07;I08        E04;I08
....               ....                   ....

I got a result when comparing just two string chains: I tokenized each string, stored the tokens in a vector, and then used the function setdiff() in both directions.

library(tokenizers)

# Split each chain into its elements at ";"
Step_1 <- unlist(tokenize_paragraphs(string_chain_1, ";"))
Step_2 <- unlist(tokenize_paragraphs(string_chain_2, ";"))

# Elements found in only one of the two chains
Step_3 <- setdiff(Step_1, Step_2)
Step_4 <- setdiff(Step_2, Step_1)

Step_5 <- c(Step_3, Step_4)

But I don't know how to do this for each row of a table. Does anyone have any ideas?

CodePudding user response:

Here is something similar to your approach using the tidyverse. I created two helper columns, vec_1 and vec_2, that hold the split strings as list columns of character vectors. The trick is to use map from purrr (or lapply from base R) to operate on each row.

library(dplyr)
library(tokenizers)
library(purrr)

df %>% 
  mutate(vec_1 = map(Stringchain_1, tokenize_regex, pattern = ";", simplify = TRUE),
         vec_2 = map(Stringchain_2, tokenize_regex, pattern = ";", simplify = TRUE),
         Result = map2_chr(vec_1, vec_2,
                           ~ paste(c(setdiff(.x, .y), setdiff(.y, .x)),
                                   collapse = ";")))

This gives you the result. You can now drop any unneeded columns.

# A tibble: 2 x 5
  Stringchain_1   Stringchain_2       vec_1     vec_2     Result 
  <chr>           <chr>               <list>    <list>    <chr>  
1 A00;B01;C02;D03 A00;B01;C02;D03;E04 <chr [4]> <chr [5]> E04    
2 E04;F05;G06;H07 F05;G06;H07;I08     <chr [4]> <chr [4]> E04;I08
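
If the intermediate list columns are not needed at all, one possible variation is to split the strings directly inside map2_chr. This is only a sketch: it swaps tokenize_regex for base R's strsplit, and the data frame df below is an assumed stand-in built from the sample rows in the question.

```r
library(dplyr)
library(purrr)

# Assumed sample data, using the column names from the answer above
df <- data.frame(
  Stringchain_1 = c("A00;B01;C02;D03", "E04;F05;G06;H07"),
  Stringchain_2 = c("A00;B01;C02;D03;E04", "F05;G06;H07;I08"),
  stringsAsFactors = FALSE
)

out <- df %>%
  mutate(Result = map2_chr(
    strsplit(Stringchain_1, ";", fixed = TRUE),  # one character vector per row
    strsplit(Stringchain_2, ";", fixed = TRUE),
    # elements only in the first chain, followed by elements only in the second
    ~ paste(c(setdiff(.x, .y), setdiff(.y, .x)), collapse = ";")
  ))

print(out)
```

Because strsplit is vectorised and returns one list element per row, no helper columns are created and nothing has to be dropped afterwards.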

CodePudding user response:

We can do this in base R: split both columns with strsplit, loop over the corresponding list elements with Map to get the setdiff in both directions, and paste the sorted result back together.

df1$Result <- unlist(Map(
    function(x, y) paste(sort(union(setdiff(y, x), setdiff(x, y))), collapse = ";"),
    strsplit(df1$Stringchain1, ";"),
    strsplit(df1$Stringchain2, ";")))

-output

> df1
     Stringchain1        Stringchain2  Result
1 A00;B01;C02;D03 A00;B01;C02;D03;E04     E04
2 E04;F05;G06;H07     F05;G06;H07;I08 E04;I08

data

df1 <- structure(list(Stringchain1 = c("A00;B01;C02;D03", "E04;F05;G06;H07"
), Stringchain2 = c("A00;B01;C02;D03;E04", "F05;G06;H07;I08")), row.names = c(NA, 
-2L), class = "data.frame")
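
For repeated use, the logic of the base R answer above could be wrapped in a small function. The name sym_diff and its sep argument are hypothetical and chosen only for illustration; the sketch assumes both inputs are character vectors of equal length without missing values.

```r
# Hypothetical helper (not part of the answers above): for each pair of
# delimited strings, return the elements that occur in only one of the two.
sym_diff <- function(a, b, sep = ";") {
  mapply(
    function(x, y) {
      # union() of both one-sided differences, sorted, then re-joined
      paste(sort(union(setdiff(x, y), setdiff(y, x))), collapse = sep)
    },
    strsplit(a, sep, fixed = TRUE),
    strsplit(b, sep, fixed = TRUE),
    USE.NAMES = FALSE
  )
}

df1 <- data.frame(
  Stringchain1 = c("A00;B01;C02;D03", "E04;F05;G06;H07"),
  Stringchain2 = c("A00;B01;C02;D03;E04", "F05;G06;H07;I08"),
  stringsAsFactors = FALSE
)

df1$Result <- sym_diff(df1$Stringchain1, df1$Stringchain2)
print(df1)
```

When two chains contain exactly the same elements, both setdiff calls return an empty vector and the result for that row is an empty string.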