Make strings uniform -- "AB-CD" === "CD-AB"

Sample dput() below:

structure(list(group = c(34676739L, 45938970L, 22731473L, 40083768L, 
22527333L, 51629537L, 26299463L, 27420157L, 24898717L, 43569190L, 
34573189L, 44503577L, 25471327L, 44630117L, 19048782L, 39710425L, 
33535680L, 54358561L, 27363448L, 39386432L, 44150096L, 24614702L, 
36219027L, 39609036L, 10803983L, 54770896L, 27574728L, 40912817L, 
24679610L, 40261463L), partners = c("US-GB", "US-JP", "US-JP", 
"US-GB", "GB-US", "US-GB", "GB-US", "US-GB", "US-GB", "US-JP", 
"US-GB", "US-GB", "US-GB", "GB-US", "JP-US", "US-JP", "JP-US", 
"JP-US", "US-GB", "US-GB", "US-JP", "US-GB", "GB-US", "GB-US", 
"US-GB", "US-GB", "US-JP", "JP-US", "US-GB", "US-GB")), row.names = c(NA, 
-30L), class = c("data.table", "data.frame"))

What I want to do: create a new variable, say partners_final, that normalizes the entries in partners. You'll see that partners has the entries US-GB, GB-US, US-JP, and JP-US. These simply represent relationships between business partners, so technically US-JP == JP-US and US-GB == GB-US.

However, these entries are (obviously) not equivalent in R, which makes it tough when doing empirics. So what I want to do is make a new variable partners_final that gives a uniform business partner-pair regardless of the order of the two partners.

NOTE that in my actual data set, there are many, many partners. I need to do something that applies to the entire data set, e.g., partners_final must reflect that AB-CD === CD-AB for all pairs AB, CD. Is there any way I can do this in R (preferably avoiding pivot as there are some country pairs that don't show up in the data that might later need to be accounted for)?

CodePudding user response:

I assume you need some rule for ordering each pair. Here I order the two codes alphabetically: split each string on the hyphen, sort the parts, and re-join them.

library(stringr)
d <- structure(list(
  group = c(
    34676739L, 45938970L, 22731473L, 40083768L,
    22527333L, 51629537L, 26299463L, 27420157L, 24898717L, 43569190L,
    34573189L, 44503577L, 25471327L, 44630117L, 19048782L, 39710425L,
    33535680L, 54358561L, 27363448L, 39386432L, 44150096L, 24614702L,
    36219027L, 39609036L, 10803983L, 54770896L, 27574728L, 40912817L,
    24679610L, 40261463L
  ),
  partners = c(
    "US-GB", "US-JP", "US-JP",
    "US-GB", "GB-US", "US-GB", "GB-US", "US-GB", "US-GB", "US-JP",
    "US-GB", "US-GB", "US-GB", "GB-US", "JP-US", "US-JP", "JP-US",
    "JP-US", "US-GB", "US-GB", "US-JP", "US-GB", "GB-US", "GB-US",
    "US-GB", "US-GB", "US-JP", "JP-US", "US-GB", "US-GB"
  )
),
row.names = c(NA, -30L), class = c("data.table", "data.frame")
)


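# Split each pair on "-", sort the two codes alphabetically, and re-join.
# vapply() returns a character vector; lapply() would create a list column.
d$partners_final <- vapply(
  str_split(d$partners, "-"),
  function(x) paste(sort(x), collapse = "-"),
  character(1)
)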
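
As a possible alternative to sorting row by row, a vectorized sketch in base R could use pmin() and pmax(), which compare character vectors element-wise. It assumes every entry holds exactly two codes separated by a single hyphen. The vector partners and the helper names left and right are illustrative only and are not part of the original answer.

```r
# Small illustrative vector; the same calls should work on d$partners.
partners <- c("US-GB", "GB-US", "US-JP", "JP-US")

# Assumption: exactly two codes per entry, with no hyphen inside a code.
left  <- sub("-.*$", "", partners)     # everything before the first "-"
right <- sub("^[^-]*-", "", partners)  # everything after the first "-"

# pmin()/pmax() pick the smaller/larger code of each pair element-wise,
# so each pair ends up in sorted order without an explicit loop.
partners_final <- paste(pmin(left, right), pmax(left, right), sep = "-")

partners_final
```

Because the normalized form is computed from each string alone, country pairs that only show up in the data later would be normalized by the same rule, with no lookup table to maintain.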