Extract unique values within text list, where each item has constant common prefix or suffix or both

Time:12-17

I have a set of variable names with different prefixes and suffixes. There are two types. The first type has only a prefix followed by a number. The second type has a prefix, then a number, then a suffix. The numbers in each type are unordered. Here is some example code for the two types:

VarNamesType1 <- paste0( "Prefix1" ,  c(2,1,44,22)) 
> VarNamesType1
[1] "Prefix12"  "Prefix11"  "Prefix144" "Prefix122"

Here are the variable names with a prefix and suffix

VarNamesType2 <- paste0( "Pre2" ,  c(9,3,5,7) , "Suffix2") 
> VarNamesType2
[1] "Pre29Suffix2"  "Pre23Suffix2"  "Pre25Suffix2"  "Pre27Suffix2" 

Is there a way to find the unique values within each list of variable names? That is, for VarNamesType1 the code should find the values 2, 1, 44, 22, and for VarNamesType2 it should find 9, 3, 5, 7. Is it possible to find the unique numbers for both types with the same code? Any ideas or suggestions would be greatly appreciated. Thanks.

EDIT1- Thank you to the poster who showed a solution that removes all text. However, the prefix and suffix can contain numbers too, so removing text will not work. I've updated the example code.

EDIT2- I've now been able to use this to find the prefix part. I'm not sure how to find the suffix part.

find_common_start <- function(strings) {
  max_length = min(nchar(strings))
  for(len in max_length:1) {
    if(length(unique(substr(strings, start = 1, stop = len))) == 1) {
      return(substr(strings[[1]], start = 1, stop = len))
    }
  }
}

> find_common_start(VarNamesType1)
[1] "Prefix1"
> find_common_start(VarNamesType2)
[1] "Pre2"

Can this be adapted to do the suffix?

CodePudding user response:

We can use readr::parse_number, or remove all letters or extract all numbers with regexes.

With parse_number

readr::parse_number(VarNamesType1)

[1]  2  1 44 22

readr::parse_number(VarNamesType2)

[1] 9 3 5 7

With regex

stringr::str_extract(VarNamesType2, '\\d+') |>
    as.integer()

[1] 9 3 5 7

All values in the example data are already unique, but if we are interested in unique values for any dataset, we can pipe the output into unique(), as in:

readr::parse_number(VarNamesType1) |> unique()

EDIT

The OP has pointed out that the suffixes and prefixes may contain numbers. In that case, parse_number() will not work, and we have to use a regex-based approach.

For that, the "prefix" and "suffix" patterns must be consistent and known in advance. We can use stringr::str_remove_all to strip both, anchoring the prefix with "^" and the suffix with "$" and joining the two patterns with "|":

library(glue)
library(stringr)

prefix<-'Pre2'
suffix<-'Suffix2'

str_remove_all(VarNamesType2, glue('^{prefix}|{suffix}$')) |>
    as.integer()

[1] 9 3 5 7
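
If the prefix or suffix might contain regex metacharacters (for example "." or "("), the glue() pattern above would treat them as regex syntax. A minimal base R sketch that strips the known affixes by position instead is shown here. The helper name strip_affixes is hypothetical and not part of any package; startsWith(), endsWith() and substring() are base R.

```r
# Hypothetical helper: strip a known literal prefix and suffix without regexes.
strip_affixes <- function(x, prefix = "", suffix = "") {
  # Drop the prefix only from strings that actually start with it
  has_prefix <- startsWith(x, prefix)
  x <- ifelse(has_prefix, substring(x, nchar(prefix) + 1), x)
  # Drop the suffix only from strings that actually end with it
  has_suffix <- endsWith(x, suffix)
  ifelse(has_suffix, substring(x, 1, nchar(x) - nchar(suffix)), x)
}

vars <- paste0("Pre2", c(9, 3, 5, 7), "Suffix2")
result <- as.integer(strip_affixes(vars, prefix = "Pre2", suffix = "Suffix2"))
result
```

Because the affixes are matched literally, no escaping is needed, and strings that lack the prefix or suffix are returned unchanged.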

CodePudding user response:

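I finally understood the question. To find the varying values buried between constants (prefix and suffix), we can first split each string into single characters, then use purrr::pmap to collect the unique characters at each position, and finally keep only the positions with more than one unique value. This assumes all strings have the same length.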

library(purrr)


pmap(strsplit(VarNamesType2, ''), ~unique(c(...)))%>%
    keep(~length(.x) > 1) %>%
    unlist()%>%
    as.integer()

[1] 9 3 5 7
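
The same position-by-position idea can be sketched in base R with a character matrix. This is an illustration under the same assumption that all strings have equal length; the variable names (vars, chars, varies, middle) are made up for the example. Unlike the pipeline above, it pastes the varying positions back together per string, so multi-character values stay intact.

```r
vars <- paste0("Pre2", c(9, 3, 5, 7), "Suffix2")

# One row per string, one column per character position
# (only valid when every string has the same number of characters)
chars <- do.call(rbind, strsplit(vars, ""))

# A position "varies" if more than one distinct character occurs in that column
varies <- apply(chars, 2, function(col) length(unique(col)) > 1)

# Keep the varying columns and glue them back together row by row
middle <- apply(chars[, varies, drop = FALSE], 1, paste0, collapse = "")
middle
```

Note that a position inside the number that happens to be identical across all strings would be treated as constant and dropped, so this approach is only safe when the varying part differs at every position.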

CodePudding user response:

I eventually wrote this, which answers my own question.

find_unique <- function(FindUnique) {
  n <- nchar(FindUnique)
  prefix_len <- 0
  suffix_len <- 0
  for (i in seq_len(min(n))) {
    # longest i for which the first i characters agree across all strings
    if (length(unique(substr(FindUnique, start = 1, stop = i))) == 1) {
      prefix_len <- i }
    # longest i for which the last i characters agree across all strings
    if (length(unique(substr(FindUnique, start = n - i + 1, stop = n))) == 1) {
      suffix_len <- i }
  }
  # strip by position, so regex characters in the prefix or suffix are harmless
  return(substr(FindUnique, start = prefix_len + 1, stop = n - suffix_len))
}
    
> find_unique(VarNamesType1) 
[1] "2"  "1"  "44" "22"
> find_unique(VarNamesType2) 
[1] "9"  "3"  "5"  "7"

CodePudding user response:

A working (although a bit convoluted) tidyverse answer. This relies on splitting the strings into lists of single characters, then counting the consecutive character positions that have only a single unique value, both in the natural order (prefix) and in rev()erse order (suffix).

library(dplyr)
library(stringr)
library(purrr)
library(data.table)
library(tidyr)

splitted_strings<-list(
    strsplit(VarNamesType2, ''),
    rev_char_list = map(strsplit(VarNamesType2, ''), rev)
)

indexes<-splitted_strings %>%
    map_int(., \(x) sum(
        x %>%
        tibble(temp = .) %>%
        unnest_wider(temp)%>%
        map_int(~length(unique(.x))) %>%
        data.table::rleid(.)==1
        )) %>%
    set_names(c('prefix', 'suffix'))

str_sub(VarNamesType2,
        start = indexes['prefix'] + 1,
        end = -(indexes['suffix'] + 1))

[1] "9"  "3"  "5"  "7"
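
The "count constant positions forwards, then backwards" idea can also be sketched in base R without the tidyverse dependencies. This is only an illustration: common_prefix_len and reverse_strings are hypothetical helper names, and the sketch assumes the varying part differs in its first and last character across at least two strings.

```r
# Hypothetical helper: number of leading positions shared by all strings
common_prefix_len <- function(x) {
  chars <- strsplit(x, "")
  len <- 0
  for (i in seq_len(min(lengths(chars)))) {
    # Stop at the first position where the strings disagree
    if (length(unique(vapply(chars, `[`, character(1), i))) > 1) break
    len <- i
  }
  len
}

# Hypothetical helper: reverse each string character by character
reverse_strings <- function(x) {
  vapply(strsplit(x, ""), function(ch) paste(rev(ch), collapse = ""), character(1))
}

extract_middle <- function(x) {
  p <- common_prefix_len(x)                    # shared prefix length
  s <- common_prefix_len(reverse_strings(x))   # shared suffix length
  substr(x, p + 1, nchar(x) - s)
}

type1 <- extract_middle(paste0("Prefix1", c(2, 1, 44, 22)))
type2 <- extract_middle(paste0("Pre2", c(9, 3, 5, 7), "Suffix2"))
```

Since the suffix is located by reversing the strings and reusing the prefix search, the strings do not need to have equal lengths, and the same function handles both variable types.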