Determine (dis)similarity of multi-word strings on a word-by-word basis

I'm working on string distance in multi-word strings, as in this toy data:

df <- data.frame(
  col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz")
)

I'd like to determine the (dis)similarity of each row compared to the next row on a word-by-word basis. I use this code:

library(dplyr)
library(tidyr)
library(stringr)   # for str_c()
library(stringdist)
df %>%
  mutate(col2 = lead(col1, 1),
         id = row_number()) %>%
  pivot_longer(
    # select columns:
    cols = c(col1, col2),
    # determine name of new column:
    names_to = c(".value", "Col_N"), 
    # define capture groups (...) for new column:
    names_pattern = "^([a-z]+)([0-9])$") %>%
  # separate each word into its own row:
  separate_rows(col, sep = "\\s") %>%
  # recast into wider format:
  pivot_wider(id_cols = c(id, Col_N), 
              names_from = Col_N, 
              values_from = col) %>%
  # unnest lists:
  unnest(.) %>%
  # calculate string distance:
  mutate(distance = stringdist(`1`, `2`)) %>%
  group_by(id) %>%
  # reconnect same-string words and distance values:
  summarise(col1 = str_c(unique(`1`), collapse = " "),
            col2 = str_c(unique(`2`), collapse = " "),
            distance = str_c(distance, collapse = ", "))
# A tibble: 5 x 4
     id col1         col2         distance
* <int> <chr>        <chr>        <chr>   
1     1 ab           ab bc        0, 2    
2     2 ab bc        yyyy         4, 4    
3     3 yyyy         yyyy pw hhhh 0, 4, 4 
4     4 yyyy pw hhhh wstjz        5, 5, 5 
5     5 wstjz        NA           NA

While the result seems to be okay, there are three problems with it: a) there are a number of warnings, b) the code seems quite convoluted, and c) distance is of type character. So I'm wondering if there's a better way to determine the (dis)similarity of strings word by word?

CodePudding user response：

A solution:

df <- data.frame(
  col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz"),
  stringsAsFactors=FALSE
)

comps = function(a.row){
  paste(stringdist(unlist(strsplit(as.character(a.row[1]), ' ')), 
                   unlist(strsplit(as.character(a.row[2]), ' '))), 
        collapse = ' ')
  
}
df %>%
  mutate(col2 = lead(col1, 1)) %>%
  mutate(distance = apply(., 1, comps))

There should be a way to avoid as.character in the strsplit calls.
I'm not sure that you can have a column of vectors in a data frame, which might be the reason for all the warnings and for distance being of type character. Here I cast the distances into a string to keep all the values in the same column.
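
On both points: strsplit() is vectorised over a character column, and a data frame can hold a list column, so the distances can stay numeric. A minimal base R sketch, assuming the default stringdist method is what is wanted; `words` and `next_words` are names chosen here for illustration only:

```r
library(stringdist)

df <- data.frame(
  col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz"),
  stringsAsFactors = FALSE
)

# split every string into its words; strsplit() works on the whole column at once
words <- strsplit(df$col1, " ")
# words of the following row; the last row has no successor, hence NA
next_words <- strsplit(c(df$col1[-1], NA_character_), " ")

# one numeric vector per row, stored as a list column
df$distance <- Map(stringdist, words, next_words)

str(df$distance)
```

Each element of df$distance is a numeric vector, so it can be summarised directly (for example with sapply(df$distance, min)) without parsing a string.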

CodePudding user response：

How about something like this:

mydf <- data.frame(
  col1 = c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "wstjz")
)
mydf


library(dplyr)
library(stringdist)
mydf %>% 
  mutate(col1_lead = lead(col1)) %>% 
  apply(1, function(x){
    stringdist(
      unlist(strsplit(x["col1"], " ")), 
      unlist(strsplit(x["col1_lead"], " "))
    )}
  ) %>% 
  cbind() %>% 
  `colnames<-`("distance") %>% 
  cbind(mydf)

CodePudding user response：

Here is my simple idea.
I create list columns holding the words and calculate the distance row by row, using unlist because stringdist needs a vector. The distance is kept as a list column.

library(stringr)   # for str_split()

ans <- df %>%
  as_tibble() %>% 
  mutate(id = row_number(),   # not used
         col2 = lead(col1, 1),
         sep_col1 = str_split(col1, " "),
         sep_col2 = str_split(col2, " ")) %>%    # or str_split(lead(col1, 1), " ")
  rowwise() %>% 
  mutate(dist = list(stringdist(unlist(sep_col1), unlist(sep_col2))),
         for_just_look = paste(dist, collapse = ", ")) %>% 
  ungroup()

ans

#   col1            id col2         sep_col1  sep_col2  dist      for_just_look
#   <chr>        <int> <chr>        <list>    <list>    <list>    <chr>   
# 1 ab               1 ab bc        <chr [1]> <chr [2]> <dbl [2]> 0, 2    
# 2 ab bc            2 yyyy         <chr [2]> <chr [1]> <dbl [2]> 4, 4    
# 3 yyyy             3 yyyy pw hhhh <chr [1]> <chr [3]> <dbl [3]> 0, 4, 4 
# 4 yyyy pw hhhh     4 wstjz        <chr [3]> <chr [1]> <dbl [3]> 5, 5, 5 
# 5 wstjz            5 NA           <chr [1]> <chr [1]> <dbl [1]> NA

CodePudding user response：

Leaving aside my comments below, a straightforward approach would be this:

library(data.table)
library(dplyr)        # for lead()
library(stringr)      # for str_split()
library(stringdist)
setDT(df)

df[, col1 := list(str_split(col1, " "))]
df[, col2 := lead(col1, 1)]
df[, distance := lapply(.I, function(x) { stringdist(col1[x][[1]], col2[x][[1]]) })]

Be careful with any stringdist-like function: on a huge dataset, making all the comparisons is computationally intensive. Also keep in mind what you are going to use the distance values for. Are you truly interested in the distance itself, or only in, say, all pairs with a distance < x? If the latter, you would most likely not consider "a" compared to "axxxxxxxxxxxxxxx" a close match, and you can see that difference from the length of the strings alone, which takes far fewer resources to calculate than the actual distance.
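
To illustrate the point about string length: for edit distances such as Levenshtein, the difference in length is a lower bound on the distance, so pairs that cannot pass a threshold can be skipped before the expensive call. A rough sketch; the threshold `max_dist` and the names `candidate` and `close_enough` are made up for illustration:

```r
library(stringdist)

a <- c("a", "yyyy", "ab")
b <- c("axxxxxxxxxxxxxxx", "yyyx", "bc")
max_dist <- 2  # hypothetical threshold

# cheap check: at least |nchar(a) - nchar(b)| edits are always needed
len_diff <- abs(nchar(a) - nchar(b))
candidate <- len_diff <= max_dist

# compute the real distance only for the pairs that survive the cheap check
close_enough <- rep(FALSE, length(a))
close_enough[candidate] <-
  stringdist(a[candidate], b[candidate], method = "lv") <= max_dist

close_enough
```

Whether this pays off depends on how many pairs the length check rules out; on strings of similar length it saves little.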

It would also be a waste of computation to blindly compute row by row. Let's take a slightly longer sample set:

c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "yyyy", "yyyy pw hhhh", "wstjz", "wstjz")

Here you would calculate the distance between yyyy and yyyy three times, although it should be computed only once (actually, you should catch those cases with an "is equal" check first), and likewise three times for yyyy and hhhh / hhhh and yyyy.

With a small dataset you probably do not have to worry, but with large sets and longer strings... it can become messy / slow pretty fast.
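
One way to avoid the repeated work described above is to compute each distinct word pair only once and look the results up afterwards. A sketch, assuming the default stringdist method (which is symmetric) is wanted; `pairs`, `key` and `uniq` are illustrative names, not part of any package:

```r
library(stringdist)

x <- c("ab", "ab bc", "yyyy", "yyyy pw hhhh", "yyyy", "yyyy pw hhhh", "wstjz", "wstjz")
w1 <- strsplit(x[-length(x)], " ")   # words of each row
w2 <- strsplit(x[-1], " ")           # words of the following row

# all word pairs, recycled to the longer side as stringdist() would do
pairs <- do.call(rbind, Map(function(a, b, i) {
  n <- max(length(a), length(b))
  data.frame(id = i, a = rep_len(a, n), b = rep_len(b, n),
             stringsAsFactors = FALSE)
}, w1, w2, seq_along(w1)))

# treat (a, b) and (b, a) as the same pair; words contain no spaces,
# so a space is a safe separator for the lookup key
key <- paste(pmin(pairs$a, pairs$b), pmax(pairs$a, pairs$b), sep = " ")
uniq <- !duplicated(key)

# compute each distinct pair once, then look the values up by key
d <- stringdist(pairs$a[uniq], pairs$b[uniq])
names(d) <- key[uniq]
pairs$distance <- unname(d[key])

c(all_pairs = nrow(pairs), computed = sum(uniq))
```

The bookkeeping has a cost of its own, so this is only worth it when many pairs repeat or the strings are long.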