I am new to coding and I am trying to solve this problem in R:
I have two columns in a table that are filled with string chains of unequal length. The elements of each chain are separated by a symbol. For each row, I want to extract the elements that differ between the two string chains and write them to a new column.
Stringchain 1 | Stringchain 2 | Result |
---|---|---|
A00;B01;C02;D03 | A00;B01;C02;D03;E04 | E04 |
E04;F05;G06;H07 | F05;G06;H07;I08 | E04;I08 |
.... | .... | .... |
I got a result when comparing only two string chains by tokenizing each string and writing the result into a vector. Then I used the function setdiff().
library(tokenizers)
# Split each chain on ";" and flatten the result to a character vector
Step_1 <- unlist(tokenize_paragraphs(string_chain_1, ";"))
Step_2 <- unlist(tokenize_paragraphs(string_chain_2, ";"))
# Elements that appear in only one of the two chains
Step_3 <- setdiff(Step_1, Step_2)
Step_4 <- setdiff(Step_2, Step_1)
Step_5 <- c(Step_3, Step_4)
But I don't know how to do this for each row in a table. Does anyone have any ideas?
CodePudding user response:
Here is something similar to your approach using the tidyverse. I created two dummy columns, vec_1 and vec_2, where I converted the strings to list columns of vectors. The trick is to use map (or lapply from base) to operate over each row.
library(dplyr)
library(tokenizers)
library(purrr)
df %>%
  mutate(vec_1 = map(Stringchain_1, tokenize_regex, pattern = ";", simplify = TRUE),
         vec_2 = map(Stringchain_2, tokenize_regex, pattern = ";", simplify = TRUE),
         Result = map2_chr(vec_1, vec_2,
                           ~ paste(c(setdiff(.x, .y), setdiff(.y, .x)),
                                   collapse = ";")))
This gives you the result. You can now drop any unneeded columns.
# A tibble: 2 x 5
Stringchain_1 Stringchain_2 vec_1 vec_2 Result
<chr> <chr> <list> <list> <chr>
1 A00;B01;C02;D03 A00;B01;C02;D03;E04 <chr [4]> <chr [5]> E04
2 E04;F05;G06;H07 F05;G06;H07;I08 <chr [4]> <chr [4]> E04;I08
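As a rough sketch of the last step, the helper list columns can be dropped with select() once Result has been computed. This variant also swaps tokenize_regex for base strsplit, which is a substitution on my part to avoid the tokenizers dependency, not part of the code above. The data frame df and its column names are assumed to match the table in the question.

```r
library(dplyr)
library(purrr)

# Example data shaped like the question's table (column names are assumed)
df <- tibble(
  Stringchain_1 = c("A00;B01;C02;D03", "E04;F05;G06;H07"),
  Stringchain_2 = c("A00;B01;C02;D03;E04", "F05;G06;H07;I08")
)

result <- df %>%
  mutate(
    # strsplit returns one character vector per row, stored as a list column
    vec_1 = strsplit(Stringchain_1, ";", fixed = TRUE),
    vec_2 = strsplit(Stringchain_2, ";", fixed = TRUE),
    # Elements found in only one of the two vectors, joined back with ";"
    Result = map2_chr(vec_1, vec_2,
                      ~ paste(c(setdiff(.x, .y), setdiff(.y, .x)),
                              collapse = ";"))
  ) %>%
  select(-vec_1, -vec_2)  # drop the helper list columns

print(result)
```

Here result keeps only Stringchain_1, Stringchain_2 and Result.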
CodePudding user response:
We may do this in base R by splitting both columns with strsplit, getting the setdiff by looping over the corresponding list elements with Map, and combining the results with paste:
df1$Result <- unlist(Map(function(x, y) paste(sort(union(setdiff(y, x),
setdiff(x, y))), collapse = ";"),
strsplit(df1$Stringchain1, ";"), strsplit(df1$Stringchain2, ";")))
-output
> df1
Stringchain1 Stringchain2 Result
1 A00;B01;C02;D03 A00;B01;C02;D03;E04 E04
2 E04;F05;G06;H07 F05;G06;H07;I08 E04;I08
data
df1 <- structure(list(Stringchain1 = c("A00;B01;C02;D03", "E04;F05;G06;H07"
), Stringchain2 = c("A00;B01;C02;D03;E04", "F05;G06;H07;I08")), row.names = c(NA,
-2L), class = "data.frame")
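If the same comparison is needed in several places, the base R approach could be wrapped in a small function. The sketch below is only one way to do it: chain_diff is a hypothetical helper name, not an existing function, and it assumes that neither column contains NA values. Unlike the answer above, it does not sort the result, so elements found only in the first chain come first.

```r
# Hypothetical helper: elements that occur in only one of two delimited strings
chain_diff <- function(a, b, sep = ";") {
  x <- strsplit(a, sep, fixed = TRUE)[[1]]
  y <- strsplit(b, sep, fixed = TRUE)[[1]]
  paste(c(setdiff(x, y), setdiff(y, x)), collapse = sep)
}

# Same example data as above, rebuilt here so the snippet runs on its own
df1 <- data.frame(
  Stringchain1 = c("A00;B01;C02;D03", "E04;F05;G06;H07"),
  Stringchain2 = c("A00;B01;C02;D03;E04", "F05;G06;H07;I08"),
  stringsAsFactors = FALSE
)

# mapply walks both columns in parallel, one row at a time
df1$Result <- mapply(chain_diff, df1$Stringchain1, df1$Stringchain2,
                     USE.NAMES = FALSE)
print(df1)
```

Two chains with the same elements in a different order give an empty string, because setdiff ignores order.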