Check if a string is a subset in another string in R

Time:04-14

I have the data below, with two columns: ID and Code (character type).

ID <- c(1,1,1,2,2,3,3,3, 4, 4)
Code <- c("0011100000", "0001100000", "1001100000", "1100000000", 
          "1000000000", "1000000000", "0100000000", "0010000000", "0010000001", "0010000001")
df <- data.frame(ID, Code)

I need to remove records (within each ID) based on the pattern of the Code value. That is:

For each ID, we look at the values of Code and remove the rows whose Code is a subset of another row's Code, meaning every position that holds a 1 also holds a 1 in the other row.

For example, for ID=1, row #2 is a subset of row #1, so we remove row #2. But row #3 is NOT a subset of row #1 or #2, so we keep it.

For ID=2, row #5 is a subset of row #4, so we remove it.

For ID=3, they are all different, so we keep them all.

For ID=4, the Code is the same for both records, so we keep only the first one.
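
The subset rule above can be sketched in base R as follows. The helper name `is_subset` is hypothetical (it is not part of the question), and the sketch assumes both strings have the same length and contain only 0s and 1s.

```r
# Hypothetical helper: TRUE when every position that is "1" in `a`
# is also "1" in `b` (assumes equal-length strings of 0s and 1s).
is_subset <- function(a, b) {
  a_bits <- strsplit(a, "")[[1]] == "1"
  b_bits <- strsplit(b, "")[[1]] == "1"
  all(b_bits[a_bits])
}

is_subset("0001100000", "0011100000")  # row #2 vs row #1 of ID=1: TRUE
is_subset("1001100000", "0011100000")  # row #3 vs row #1 of ID=1: FALSE
```

Note that two identical codes are subsets of each other, which is why the ID=4 case needs the extra "keep the first one" rule.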

Here is the expected final view of the results:

[image: table of the expected result]

CodePudding user response:

It's not that pretty, but checking every pair of rows within each ID with a self-join will do it.

Convert to a data.table

library(data.table)
setDT(df)

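Make a row counter, then find the positions of all the 1s in each string and save them in a list column.

# rn: row number in the original data
df[, rn := .I]
# ones: list column holding the positions of "1" in each Code
df[, ones := gregexpr("1", Code)]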
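
As a small illustration of what ends up in the `ones` column, here is what `gregexpr` returns for two of the codes. This is only a sketch; `as.integer` is used here just to strip the match attributes for display.

```r
# gregexpr returns one integer vector of match positions per input string
ones <- gregexpr("1", c("0011100000", "1000000000"))

as.integer(ones[[1]])  # 3 4 5
as.integer(ones[[2]])  # 1

# A string with no "1" at all yields -1 rather than an empty vector
as.integer(gregexpr("1", "0000000000")[[1]])  # -1
```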

Join each group to itself, and compare the lists where the row numbers don't match. Then collect the rows whose list of 1 positions is a subset of another row's list, and drop these rows from the original data. In the case of duplicates, only the first occurrence of the duplicate is removed.

df[
  funion(
    df[df, on=c("ID","rn>rn"), if(all(i.ones[[1]] %in% ones[[1]])) .(Code=i.Code), by=.EACHI][, -"rn"],
    df[df, on=c("ID","rn<rn"), if(all(i.ones[[1]] %in% ones[[1]])) .(Code=i.Code), by=.EACHI][, -"rn"]
  ),
  on=c("ID","Code"),
  mult="first",
  drop := 1
]
df[is.na(drop), -c("rn","ones","drop")]


#   ID       Code
#1:  1 0011100000
#2:  1 1001100000
#3:  2 1100000000
#4:  3 1000000000
#5:  3 0100000000
#6:  3 0010000000
#7:  4 0010000001
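
For comparison, the same rule can be written as a plain loop in base R. This is a minimal sketch rather than the method used above: the helper `is_subset` and the variable names `keep` and `result` are hypothetical, it compares every pair of rows within an ID (so it is quadratic in group size), and for exact duplicates it keeps the first occurrence instead of the last.

```r
ID <- c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4)
Code <- c("0011100000", "0001100000", "1001100000", "1100000000",
          "1000000000", "1000000000", "0100000000", "0010000000",
          "0010000001", "0010000001")
df <- data.frame(ID, Code, stringsAsFactors = FALSE)

# Hypothetical helper: TRUE when every "1" position in `a` is also "1" in `b`
is_subset <- function(a, b) {
  a_bits <- strsplit(a, "")[[1]] == "1"
  b_bits <- strsplit(b, "")[[1]] == "1"
  all(b_bits[a_bits])
}

keep <- logical(nrow(df))
for (i in seq_len(nrow(df))) {
  # all other rows with the same ID
  others <- setdiff(which(df$ID == df$ID[i]), i)
  # row i is dropped if it is a subset of another row; when the two codes
  # are identical, only the later row (j < i) is treated as redundant
  redundant <- vapply(others, function(j) {
    is_subset(df$Code[i], df$Code[j]) &&
      (df$Code[i] != df$Code[j] || j < i)
  }, logical(1))
  keep[i] <- !any(redundant)
}

result <- df[keep, ]
result
```

On the sample data this keeps rows 1, 3, 4, 6, 7, 8 and 9, which shows the same ID and Code values as the data.table output above.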