R conditional logic to replace a character in a string, based on the preceding and following characters

I have a vector of strings in which I have replaced the spaces with underscores. I'm going to convert the underscores back to spaces; however, syntax errors in the original data mean that some of the spaces shouldn't actually be spaces. I have some simple conditional logic to describe when an underscore should be replaced with a space, when it should be replaced with a dash (-), and when it should be removed altogether.

The strings are chemical compound names. In cases where the underscore follows or precedes a number, the underscore should be replaced with a dash ("-"). In cases where the underscore precedes and follows a letter, it should be replaced with a space (" "). And where an underscore precedes or follows a dash, the underscore should be removed without replacement. More than one of these scenarios may apply in different places in a given string. An additional issue is that where a numerical digit directly follows or precedes a letter, there should be a dash between them.

Here is a minimal dataset that demonstrates all of these scenarios and the desired result. Note that the actual dataset has over 35 thousand entries (only 670 unique ones though).

names
[1] "1,8_cineole" "geranyl_acetate" "AR_curcumene" "trans_trans_a-farnesene" "trans_muurola_4,5_diene"
[6] "p_cymene" "a_-_pinene" "cadina_3,5_diene" "germacrene_D" "trans_cadina1,4diene"

converted_names
[1] "1,8-cineole" "geranyl acetate" "AR curcumene" "trans trans a-farnesene" "trans muurola-4,5-diene"
[6] "p cymene" "a-pinene" "cadina-3,5-diene" "germacrene D" "trans cadina-1,4-diene"

I was thinking about approaching this through nested loops that iterate through the names list and then split the string for each name and iterate through the individual characters of the name, but I'm getting a bit lost in applying the conditional logic required to substitute individual characters in the string.

convert_compound_names<-function(x){
    # positions of underscores and digits in each name (not used yet)
    underscore_locations<-lapply(strsplit(x,""),function(x) which(x=="_"))
    digit_locations<-lapply(strsplit(x,""),function(x) grep("\\d",x))
    for(i in seq_along(x)){
      # split the i-th name into single characters
      split_name<-unlist(strsplit(x[i],""))
          for (j in seq_along(split_name)){
         #some conditional logic to replace underscores here
          }
      # glue the characters back together
      x[i]<-paste0(split_name,collapse="")
     }
    return(x) 
}

I also wondered if the conditional logic could be incorporated into a gsub function and the looping may not be necessary..?
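
On the gsub question: a chain of gsub() calls can probably express all of the rules without an explicit loop, since gsub() is vectorized over the whole character vector. The following is a minimal base R sketch rather than a solution checked against the full dataset. The function name convert_with_gsub and the order of the steps are assumptions for illustration, and the lookarounds need perl = TRUE. Note that the last step turns every leftover underscore into a space, which is slightly broader than the "between two letters" rule.

```r
# Hypothetical helper: the name and the rule order are assumptions, not from the post.
convert_with_gsub <- function(x) {
  # 1. Drop underscores that touch a dash: "a_-_pinene" -> "a-pinene"
  x <- gsub("_(?=-)|(?<=-)_", "", x, perl = TRUE)
  # 2. Underscores next to a digit become dashes: "1,8_cineole" -> "1,8-cineole"
  x <- gsub("_(?=\\d)|(?<=\\d)_", "-", x, perl = TRUE)
  # 3. Insert a dash where a letter and a digit touch directly
  x <- gsub("([A-Za-z])(\\d)", "\\1-\\2", x, perl = TRUE)
  x <- gsub("(\\d)([A-Za-z])", "\\1-\\2", x, perl = TRUE)
  # 4. Every underscore that is left becomes a space
  gsub("_", " ", x, fixed = TRUE)
}

compound_names <- c("1,8_cineole", "geranyl_acetate", "AR_curcumene",
                    "trans_trans_a-farnesene", "trans_muurola_4,5_diene",
                    "p_cymene", "a_-_pinene", "cadina_3,5_diene",
                    "germacrene_D", "trans_cadina1,4diene")

converted <- convert_with_gsub(compound_names)
print(converted)
```

The dash-adjacent underscores are removed first so that the later steps never see them; the digit rules run before the final catch-all so that only letter-to-letter underscores remain to become spaces.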

For the record, I'm a chemist, not a programmer or data-scientist, so any advice, suggestions, or moral support would be appreciated.

Thanks for reading.

CodePudding user response：

I worked out the conditional logic needed to address the substitution of underscores within the loops that I proposed above:

convert_compound_names<-function(x){

    for(i in seq_along(x)){
      split_name<-unlist(strsplit(x[i],""))
          for (j in seq_along(split_name)){
            if(split_name[j]=="_"){
              # neighbouring characters; "" if the underscore is first or last
              prev<-if(j>1) split_name[j-1] else ""
              nxt<-if(j<length(split_name)) split_name[j+1] else ""
              # next to a digit: dash; next to a dash: drop; between letters: space
              if(grepl("\\d",prev)||grepl("\\d",nxt)){split_name[j]<-"-"}
              else if(prev=="-"||nxt=="-"){split_name[j]<-""}
              else if(grepl("[a-zA-Z]",prev)&&grepl("[a-zA-Z]",nxt)){split_name[j]<-" "}
            }
          }
        x[i]<-paste0(split_name,collapse="")
    }
    return(x) 
}

However, I'm sure there's a more straightforward way of doing this.

CodePudding user response：

chem_names <- c("1,8_cineole", "geranyl_acetate", "AR_curcumene", "trans_trans_a-farnesene",
                "trans_muurola_4,5_diene", "p_cymene", "a_-_pinene", "cadina_3,5_diene",
                "germacrene_D", "trans_cadina1,4diene")

This sounds like a regex problem, which I'm still new at, but I think the code below will do what you want.

Here, I first replace every underscore that is followed by a digit with "-". The lookahead (?=\\d) checks that a digit (\\d) comes after the underscore (\\_) without making it part of the match, so only the underscore is replaced. The same idea with a lookbehind, (?<=\\d), handles an underscore that follows a digit. Any remaining underscores then become spaces.
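
To see what the lookaround buys you, here is a small sketch using strings from the question. The variable names are made up for illustration.

```r
library(stringr)

# Without a lookahead the digit is part of the match, so it is replaced too.
digit_lost <- str_replace_all("muurola_4,5", "_\\d", "-")
print(digit_lost)   # "muurola-,5"

# With a lookahead only the underscore is matched; the digit stays.
digit_kept <- str_replace_all("muurola_4,5", "_(?=\\d)", "-")
print(digit_kept)   # "muurola-4,5"

# A lookbehind does the same for an underscore that follows a digit.
after_digit <- str_replace_all("1,8_cineole", "(?<=\\d)_", "-")
print(after_digit)  # "1,8-cineole"
```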

library(dplyr); library(stringr)
data.frame(chem_names) %>%
  mutate(chem_names2 = chem_names %>%
           str_replace_all("\\_(?=\\d)", "-") %>%  # replace _# with -
           str_replace_all("(?<=\\d)\\_", "-") %>% # replace #_ with -
           str_replace_all("\\_", " "))            # replace _ with space

Result

                chem_names             chem_names2
1              1,8_cineole             1,8-cineole
2          geranyl_acetate         geranyl acetate
3             AR_curcumene            AR curcumene
4  trans_trans_a-farnesene trans trans a-farnesene
5  trans_muurola_4,5_diene trans muurola-4,5-diene
6                 p_cymene                p cymene
7               a_-_pinene              a - pinene
8         cadina_3,5_diene        cadina-3,5-diene
9             germacrene_D            germacrene D
10    trans_cadina1,4diene    trans cadina1,4diene
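
Rows 7 and 10 still differ from the desired output in the question ("a-pinene" and "trans cadina-1,4-diene"). One possible way to cover them is to add two more rules: remove underscores that touch a dash, and insert a dash where a letter and a digit touch directly. The sketch below is an assumption about how that could look, not a tested solution for the full dataset. It passes a named vector to str_replace_all(), which applies the pattern = replacement pairs in order; the names rules and chem_names3 are hypothetical.

```r
library(stringr)

chem_names <- c("1,8_cineole", "geranyl_acetate", "AR_curcumene", "trans_trans_a-farnesene",
                "trans_muurola_4,5_diene", "p_cymene", "a_-_pinene", "cadina_3,5_diene",
                "germacrene_D", "trans_cadina1,4diene")

# Hypothetical rule set; the order matters because the rules run one after another.
rules <- c(
  "_(?=-)|(?<=-)_"     = "",         # underscore touching a dash: remove it
  "_(?=\\d)|(?<=\\d)_" = "-",        # underscore next to a digit: dash
  "([A-Za-z])(\\d)"    = "\\1-\\2",  # letter directly followed by digit: add dash
  "(\\d)([A-Za-z])"    = "\\1-\\2",  # digit directly followed by letter: add dash
  "_"                  = " "         # any underscore left over: space
)

chem_names3 <- str_replace_all(chem_names, rules)
print(chem_names3)
```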