How to find mirrored values in R or shell?

Time:02-03

I'm having trouble figuring out how to find mirrored values in R. "Mirror" may be an incorrect term here, which I think is challenging my search for answers. Maybe "inverse duplicate" or "reverse duplicate" would make more sense. Whatever it's called, I think it's a pretty simple idea that may or may not have a simple solution.

Here's my data:


df <- data.frame("unique"=c("chr1_10_20:chr1_10_20","chr1_10_20:chr1_20_30", "chr1_10_20:chr1_30_40", "chr1_20_30:chr1_10_20"))

> df 
                 unique
1 chr1_10_20:chr1_10_20
2 chr1_10_20:chr1_20_30
3 chr1_10_20:chr1_30_40
4 chr1_20_30:chr1_10_20


I am interested in finding the rows which are mirrors of other rows, using the colon (:) as a central divider. You can see what I mean by this by looking at rows 2 and 4. On row 2, the left side of the : is chr1_10_20, and the right side is chr1_20_30. Row 4 is exactly the inverse, with the left side reading chr1_20_30 and the right side chr1_10_20.

I would like to obtain the row numbers that display this mirrored quality. My desired output for the above example would be:

2 4

or

2,4

I am working in R, but might consider using the shell if it's easier. Thanks in advance!

User response:

Use a regex to capture the text before and after the colon, then build a new string in after:before order and save it as a new variable, flip.

Then use match to find the matching flipped row:

df <- data.frame(text=c("chr1_10_20:chr1_10_20","chr1_10_20:chr1_20_30", "chr1_10_20:chr1_30_40", "chr1_20_30:chr1_10_20"))

df$flip <- gsub("^(.*):(.*)$", "\\2:\\1", df$text)

df$matchrow <- match(df$text, df$flip)

df
#                   text                  flip matchrow
#1 chr1_10_20:chr1_10_20 chr1_10_20:chr1_10_20        1
#2 chr1_10_20:chr1_20_30 chr1_20_30:chr1_10_20        4
#3 chr1_10_20:chr1_30_40 chr1_30_40:chr1_10_20       NA
#4 chr1_20_30:chr1_10_20 chr1_10_20:chr1_20_30        2
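
Note that row 1 matches itself, because both sides of its colon are identical. To get from the matchrow column to the row numbers asked for in the question, one option is to keep only rows whose match is a different row. This is a minimal sketch rather than part of the original answer; mirror_rows is a name introduced here for illustration, and it assumes each entry contains exactly one colon.

```r
df <- data.frame(text = c("chr1_10_20:chr1_10_20", "chr1_10_20:chr1_20_30",
                          "chr1_10_20:chr1_30_40", "chr1_20_30:chr1_10_20"),
                 stringsAsFactors = FALSE)

# Swap the two sides of the colon, as in the answer above
df$flip <- gsub("^(.*):(.*)$", "\\2:\\1", df$text)
df$matchrow <- match(df$text, df$flip)

# Keep rows that have a match and whose match is not the row itself
mirror_rows <- which(!is.na(df$matchrow) & df$matchrow != seq_len(nrow(df)))
mirror_rows
```

Because match returns only the first hit, exact duplicate rows in the data could need extra handling.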

User response:

Here's one simple way of solving this problem. We split each entry at the colon into two columns, then loop over the rows to find where a row's left side equals another row's right side and its right side equals that row's left side. Rows whose two sides are identical are excluded, since they would only match themselves. We can then use which with arr.ind=TRUE to pull out the matching row pairs.

df <- data.frame("unique"=c("chr1_10_20:chr1_10_20","chr1_10_20:chr1_20_30", "chr1_10_20:chr1_30_40", "chr1_20_30:chr1_10_20"))

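# Split each entry at the colon into a two-column character matrix
# (base R, so no pipe package is needed; as.character() guards against factors)
df_split <- do.call(rbind, strsplit(as.character(df$unique), ":"))

# Column i of sim_mat flags the rows that mirror row i, ignoring rows
# whose two sides are identical
sim_mat <- sapply(1:nrow(df_split), function(i){
  df_split[i,1]==df_split[,2] & df_split[i,2]==df_split[,1] & df_split[i,1]!=df_split[i,2]
})
# Each mirrored pair appears twice, as (row, col) and (col, row); keep it once
all_matches <- which(sim_mat, arr.ind = TRUE)
all_matches[all_matches[,"row"]<all_matches[,"col"]]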

which gives

[1] 2 4
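
The same check can probably be written without an explicit loop. The sketch below is an alternative to the code above, not the answerer's method: it rebuilds each entry with its sides swapped and asks whether that swapped string appears anywhere in the data. The names parts, flipped and mirrored are introduced here for illustration, and the code assumes exactly one colon per entry.

```r
df <- data.frame(unique = c("chr1_10_20:chr1_10_20", "chr1_10_20:chr1_20_30",
                            "chr1_10_20:chr1_30_40", "chr1_20_30:chr1_10_20"),
                 stringsAsFactors = FALSE)

# Two-column character matrix: left side, right side
parts <- do.call(rbind, strsplit(df$unique, ":", fixed = TRUE))

# Rebuild each entry with its sides swapped
flipped <- paste(parts[, 2], parts[, 1], sep = ":")

# Mirrored rows: the swapped entry exists in the data and the two sides differ
mirrored <- which(flipped %in% df$unique & parts[, 1] != parts[, 2])
mirrored
```

This returns every row that has a mirror somewhere in the data, so with several mirrored pairs the result is a flat vector of row numbers rather than a list of pairs.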