Home > database >  Group strings that have the same words but in a different order
Group strings that have the same words but in a different order

Time:12-17

I have an example concatenated text field (please see sample data below) that is created from two or three different fields, however there is no guarantee that the order of the words will be the same. I would like to create a new dataset where fields with the same words, regardless of order, are collapsed. However, since I do not know in advance what words will be concatenated together, the code will have to recognize that all words in both strings match.

Code for example data:

var1<-c("BLUE|RED","RED|BLUE","WHITE|BLACK|ORANGE","BLACK|WHITE|ORANGE")
freq<-c(1,1,1,1)
have<-as.data.frame(cbind(var1,freq))

Have:

var1                 freq
BLUE|RED              1
RED|BLUE              1
WHITE|BLACK|ORANGE    1
BLACK|WHITE|ORANGE    1

How can I collapse the data into what I want below?

color               freq
BLUE|RED              2
WHITE|BLACK|ORANGE    2

CodePudding user response:

data.frame(table(sapply(strsplit(have$var1, '\\|'), 
             function(x)paste(sort(x), collapse = '|'))))

                Var1 Freq
1 BLACK|ORANGE|WHITE    2
2           BLUE|RED    2

In the world of piping: R > 4.0

have$var1 |>
  strsplit('\\|')|>
  sapply(\(x)paste0(sort(x), collapse = "|"))|>
  table()|>
  data.frame()

CodePudding user response:

Here is a tidyverse approach:

library(dplyr)
library(tidyr)

have %>% 
  group_by(id=row_number()) %>% 
  separate_rows(var1) %>% 
  arrange(var1, .by_group = TRUE) %>% 
  mutate(var1 = paste(var1, collapse = "|")) %>% 
  slice(1) %>% 
  ungroup() %>% 
  count(var1, name = "freq")
  var1                freq
  <chr>              <int>
1 BLACK|ORANGE|WHITE     2
2 BLUE|RED               2
  • Related