Home > OS >  Sorting strings lexicographically based upon certain character order
Sorting strings lexicographically based upon certain character order

Time:01-21

I am working my way through the Rosalind problems and have become stuck on the problem Ordering Strings of Varying Length Lexicographically.

So far I have succesfully managed to create all the right permutations of letters of varying lengths. The main issue is now how to sort them based on the order the letters were supplied.

The example input are the letters D N A. But can be at most 12 unique letters in n <= 4 permutations.

For the example n = 3. This yields 39 different permutations with replications, but these are then to be sorted lexicographically in the order of D before N before A.

Meaning that the correct order is:

Correct Incorrect
D A
DD AA
DDD AAA
DDN AAD
DDA AAN
DN AD
DND ADA
DNN ADD
DNA ADN
DA AN
DAD ANA
DAN AND
DAA ANN
... ...
AAD NNA
AAN NND
AAA NNN

My first thought was to convert them to factors with levels and then attempt to sort them based on their levels but I cannot quite make it work.

so far I create the list of all permutations and then attempt to sort it, but don't know how to get the sorting to follow the given order of D N A

text_input <- c("D", "N", "A")
n <- 3

empty_df <- data.frame(matrix("", ncol = n))

temp_df <- data.frame()

for (i in n:1) {
  temp_df <- data.frame(arrangements::permutations(text_input, k = i, replace = TRUE))
  empty_df <- bind_rows(empty_df, temp_df)
}

result_df <- replace(empty_df, is.na(empty_df), "") |> 
  unite(col = combined, everything(), sep = "", remove = FALSE) |> 
  mutate(across(2:(n 2), ~ factor(.x, levels = text_input)),
         across(2:(n 2), ~ str_replace_na(.x, replacement = "")))

result_vec <- tail(result_df$combined, -1)

CodePudding user response:

I'll use the sample data you have, Correct, randomized to ensure we get the order right.

quux <- structure(list(Correct = c("D", "DD", "DDD", "DDN", "DDA", "DN", "DND", "DNN", "DNA", "DA", "DAD", "DAN", "DAA", "AAD", "AAN", "AAA"), Incorrect = c("A", "AA", "AAA", "AAD", "AAN", "AD", "ADA", "ADD", "ADN", "AN", "ANA", "AND", "ANN", "NNA", "NND", "NNN")), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 15L, 16L, 17L), class = "data.frame")
set.seed(42)
quuxR <- quux[sample(nrow(quux)),]
quuxR$Correct
#  [1] "D"   "DDA" "AAA" "DNA" "DA"  "DDN" "DD"  "AAD" "DNN" "DND" "DAD" "DAA" "AAN" "DAN" "DDD" "DN" 

ltrs <- c("D", "N", "A")
inds <- lapply(strsplit(quux$Correct, ""), match, table = ltrs)
inds <- lapply(inds, `length<-`, max(lengths(inds)))
quuxR$Correct[do.call(order, c(do.call(Map, c(f=c, inds)), list(na.last = FALSE)))]
#  [1] "D"   "DD"  "DDD" "DDN" "DDA" "DN"  "DND" "DNN" "DNA" "DA"  "DAD" "DAN" "DAA" "AAD" "AAN" "AAA"
identical(quux$Correct, quuxR$Correct[do.call(order, c(do.call(Map, c(f=c, inds)), list(na.last = FALSE)))])
# [1] TRUE

Steps:

  1. strsplit(.., "") splits a string into individual letters.
  2. match(.., table=ltrs) replaces a letter with the index in ltrs, which gives us the proper priority/sort of the letters.
  3. `length<-` is because we are going to subsequently rbind them, but that will only work if they all have the same length. Doing this step (along with max(lengths(..)) fills the shorter vectors with NAs so that all sub-vectors have the same number of elements.
  4. do.call(Map, c(f=c, inds)) transposes a list with m vectors length n into a n-length list, each with m elements. This is very helpful to have for the next step ...
  5. do.call(order, ..) is similar to calling order(L[[1]], L[[2]], L[[3]]) (if L is the transposed list), but more programmatic. We add list(na.last=FALSE) as an argument to the list of vectors so that we can get the right ordering of shorter strings.

CodePudding user response:

You can use chartr and order its result and use this to subset the original unordered vector.

x <- c("DD", "AAN", "DDD", "DNA", "DAA", "DAD", "DDN", "DDA", "DND", 
       "DAN", "DNN", "DA", "AAA", "AAD", "DN", "D") #Unordered
y <- c("D", "DD", "DDD", "DDN", "DDA", "DN", "DND", "DNN", "DNA", "DA",
       "DAD", "DAN", "DAA", "AAD", "AAN", "AAA") #Target order

z <- x[order(chartr("DNA", "ABC", x))]
identical(z, y)
#[1] TRUE

z
# [1] "D"   "DD"  "DDD" "DDN" "DDA" "DN"  "DND" "DNN" "DNA" "DA"  "DAD" "DAN"
#[13] "DAA" "AAD" "AAN" "AAA"
  • Related