Make strings uniform -- "AB-CD" === "CD-AB"

Time:10-11

Sample dput() below:

structure(list(group = c(34676739L, 45938970L, 22731473L, 40083768L, 
22527333L, 51629537L, 26299463L, 27420157L, 24898717L, 43569190L, 
34573189L, 44503577L, 25471327L, 44630117L, 19048782L, 39710425L, 
33535680L, 54358561L, 27363448L, 39386432L, 44150096L, 24614702L, 
36219027L, 39609036L, 10803983L, 54770896L, 27574728L, 40912817L, 
24679610L, 40261463L), partners = c("US-GB", "US-JP", "US-JP", 
"US-GB", "GB-US", "US-GB", "GB-US", "US-GB", "US-GB", "US-JP", 
"US-GB", "US-GB", "US-GB", "GB-US", "JP-US", "US-JP", "JP-US", 
"JP-US", "US-GB", "US-GB", "US-JP", "US-GB", "GB-US", "GB-US", 
"US-GB", "US-GB", "US-JP", "JP-US", "US-GB", "US-GB")), row.names = c(NA, 
-30L), class = c("data.table", "data.frame"))

What I want to do: create a new variable, say partners_final, that normalizes the entries in partners. The column contains the entries US-GB, GB-US, US-JP, and JP-US. These simply represent relationships between business partners, so US-JP and JP-US denote the same pair, as do US-GB and GB-US.

However, R treats these strings as different values, which complicates the empirical analysis. So I want partners_final to give one uniform label for each partner pair, regardless of the order of the two partners.

Note that my actual data set has many more partners, so I need something that applies to the entire data set: partners_final must reflect that AB-CD === CD-AB for all pairs AB, CD. Is there a way to do this in R (preferably avoiding pivot, since some country pairs don't show up in the data and might need to be accounted for later)?

CodePudding user response:

You need some rule for ordering, I assume. Here I will assume alphabetical ordering within each pair, so that both US-GB and GB-US become GB-US. You can split each string on the hyphen, sort the two parts, and re-join them.

library(stringr)
d <- structure(list(
  group = c(
    34676739L, 45938970L, 22731473L, 40083768L,
    22527333L, 51629537L, 26299463L, 27420157L, 24898717L, 43569190L,
    34573189L, 44503577L, 25471327L, 44630117L, 19048782L, 39710425L,
    33535680L, 54358561L, 27363448L, 39386432L, 44150096L, 24614702L,
    36219027L, 39609036L, 10803983L, 54770896L, 27574728L, 40912817L,
    24679610L, 40261463L
  ),
  partners = c(
    "US-GB", "US-JP", "US-JP",
    "US-GB", "GB-US", "US-GB", "GB-US", "US-GB", "US-GB", "US-JP",
    "US-GB", "US-GB", "US-GB", "GB-US", "JP-US", "US-JP", "JP-US",
    "JP-US", "US-GB", "US-GB", "US-JP", "US-GB", "GB-US", "GB-US",
    "US-GB", "US-GB", "US-JP", "JP-US", "US-GB", "US-GB"
  )
),
row.names = c(NA, -30L), class = c("data.table", "data.frame")
)


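# Split each pair on "-", sort the two codes alphabetically, and re-join.
# vapply() returns a character vector; lapply() would create a list column.
d$cleaned <- vapply(str_split(d$partners, "-"),
                    function(x) paste(sort(x), collapse = "-"), character(1))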
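
If you would rather not depend on stringr, the same split, sort, and re-join idea can be sketched in base R. This is only a sketch: normalize_pair is a hypothetical helper name, and it assumes every entry holds exactly two codes separated by a single hyphen, with no missing values. pmin() and pmax() compare character vectors element-wise, so the whole column is handled without a per-row sort.

```r
# Hypothetical helper (not from the original answer): puts the two codes of
# each "AB-CD" pair in alphabetical order using base R only.
# Assumes exactly two codes per entry and no NA values.
normalize_pair <- function(x, sep = "-") {
  parts <- strsplit(x, sep, fixed = TRUE)
  a <- vapply(parts, `[`, character(1), 1L)  # first code of each pair
  b <- vapply(parts, `[`, character(1), 2L)  # second code of each pair
  # The alphabetically smaller code goes first, the larger one second.
  paste(pmin(a, b), pmax(a, b), sep = sep)
}

partners <- c("US-GB", "GB-US", "US-JP", "JP-US")
partners_final <- normalize_pair(partners)
print(partners_final)
# "GB-US" "GB-US" "JP-US" "JP-US"
```

On the data above this would be used as d$partners_final <- normalize_pair(d$partners). Because the label depends only on the two codes, pairs that first appear later get the same treatment without any change to the code.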