How to split a column in R using lower case sensitivity as a factor and look-behind feature

Time:06-01

I have a large data frame in R with a single column of strings that mix lower-case and upper-case letters.

df1 <- data.frame(a = c('GCCTTGATTTTTTGGCGGGGACCGTcatGGCGTCGC', 'GATTTTTTGGCGGGGACCGTcatGGCGTCGC', 'TCACCACCATCtCATTCTGC', 'ACTGGTTCCAcCAGCGGGTCACGAC'), 
                  stringsAsFactors = FALSE)

I would like the output to take all of the 'upper case letters' to the left of any lower case letters; i.e., something similar to a look-behind feature.

For example

GCCTTGATTTTTTGGCGGGGACCGTcatGGCGTCGC would become GCCTTGATTTTTTGGCGGGGACCGT
GATTTTTTGGCGGGGACCGTcatGGCGTCGC would become GATTTTTTGGCGGGGACCGT
ACTGGTTCCAcCAGCGGGTCACGAC would become ACTGGTTCCA

I am only interested in the upper-case characters to the left of the first lower-case character. I would also like the code not to fail when a string contains no lower-case characters.

I have tried looking at "Splitting strings by case", but I cannot seem to adapt it to look behind for upper case.

I really thank you in advance for your help.

CodePudding user response:

Code:

library(tidyverse)

df1 <- data.frame(a = c('GCCTTGATTTTTTGGCGGGGACCGTcatGGCGTCGC', 'GATTTTTTGGCGGGGACCGTcatGGCGTCGC', 'TCACCACCATCtCATTCTGC', 'ACTGGTTCCAcCAGCGGGTCACGAC', 'BAARA'), 
                  stringsAsFactors = FALSE)
df1


df1$a <- str_trim(str_extract(df1$a , "([:upper:]|[:space:]){2,}"))
df1

Output:

                          a
1 GCCTTGATTTTTTGGCGGGGACCGT
2      GATTTTTTGGCGGGGACCGT
3               TCACCACCATC
4                ACTGGTTCCA
5                     BAARA    # This row has no lower-case character to begin with

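To put NA where the string does not have any lower-case characters:

# Re-create df1 from the original strings before running this loop,
# because the previous step overwrote column a.
for (i in 1:nrow(df1)) {
  if (is.na(str_extract(df1[i, 'a'], "([:lower:]|[:space:]){1,}"))) {
    df1[i, 'a'] <- NA           # no lower-case letter in this row
  } else {
    df1[i, 'a'] <- str_trim(str_extract(df1[i, 'a'], "([:upper:]|[:space:]){2,}"))
  }
  df1[i, 'b'] <- df1[i, 'a']    # copy the result into a new column b
}
df1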

Output:

                          a
1 GCCTTGATTTTTTGGCGGGGACCGT
2      GATTTTTTGGCGGGGACCGT
3               TCACCACCATC
4                ACTGGTTCCA
5                      <NA>
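
The row-by-row loop above can also be written without a loop. The following is only a sketch in base R, assuming the same sample data; unlike the loop, it leaves column a untouched and writes the result to a new column b. The name has_lower is introduced here for illustration.

```r
# Same sample data as in the answer above, including one all-uppercase row.
df1 <- data.frame(
  a = c("GCCTTGATTTTTTGGCGGGGACCGTcatGGCGTCGC",
        "GATTTTTTGGCGGGGACCGTcatGGCGTCGC",
        "TCACCACCATCtCATTCTGC",
        "ACTGGTTCCAcCAGCGGGTCACGAC",
        "BAARA"),
  stringsAsFactors = FALSE
)

# TRUE for rows that contain at least one lower-case letter (hypothetical helper name).
has_lower <- grepl("[[:lower:]]", df1$a)

# Drop the first lower-case letter and everything after it;
# rows without any lower-case letter become NA.
df1$b <- ifelse(has_lower, sub("[[:lower:]].*", "", df1$a), NA_character_)
df1
```

Because grepl, sub and ifelse are all vectorised, this should behave the same on a large data frame without indexing individual rows.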

CodePudding user response:

sub("([A-Z]+)[a-z].*", "\\1", df1$a)

# [1] "GCCTTGATTTTTTGGCGGGGACCGT"
# [2] "GATTTTTTGGCGGGGACCGT"
# [3] "TCACCACCATC"
# [4] "ACTGGTTCCA"

CodePudding user response:

You can use sub with [a-z].* or [[:lower:]].* to remove the first lower case letter and everything after.

sub("[a-z].*", "", df1$a)
#sub("[[:lower:]].*", "", df1$a) #Alternative
#[1] "GCCTTGATTTTTTGGCGGGGACCGT" "GATTTTTTGGCGGGGACCGT"     
#[3] "TCACCACCATC"               "ACTGGTTCCA"               

Set to NA where there is no lower case:

df1 <- rbind(df1, "ABC")               #Add without lower case
is.na(df1$a) <- !grepl("[a-z]", df1$a) #set NA where no lower case
sub("[a-z].*", "", df1$a)
#[1] "GCCTTGATTTTTTGGCGGGGACCGT" "GATTTTTTGGCGGGGACCGT"     
#[3] "TCACCACCATC"               "ACTGGTTCCA"               
#[5] NA                         
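
If you would rather not modify df1 in place, the same idea can be wrapped in a small function that works from the position of the first lower-case letter. This is a sketch only: upper_prefix and its keep_whole argument are names made up for this example, not part of base R.

```r
# Hypothetical helper: return the part of each string before the first
# lower-case letter. Strings with no lower-case letter become NA, or are
# returned unchanged when keep_whole = TRUE.
upper_prefix <- function(x, keep_whole = FALSE) {
  # Position of the first lower-case letter; -1 when there is none.
  pos <- as.integer(regexpr("[a-z]", x))
  out <- substr(x, 1, pos - 1)
  none <- which(pos == -1L)
  out[none] <- if (keep_whole) x[none] else NA_character_
  out
}

upper_prefix(c("ACTGGTTCCAcCAGCGGGTCACGAC", "ABC"))
upper_prefix("ABC", keep_whole = TRUE)
```

The keep_whole switch covers the case where an all-upper-case string should be kept as is instead of being turned into NA.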

CodePudding user response:

You can do it all in one line with a positive lookahead regex that captures everything up to the first lower-case letter. str_extract returns NA when there is no match, so strings without a lower-case letter need no special handling.

stringr::str_extract(df1$a, ".+?(?=[a-z])")

#[1] "GCCTTGATTTTTTGGCGGGGACCGT" "GATTTTTTGGCGGGGACCGT"      "TCACCACCATC"              
#[4] "ACTGGTTCCA"                NA    

To add a new column b with the result as asked in the comments:

df1 |> dplyr::mutate(b = stringr::str_extract(a, ".+?(?=[a-z])"))

#                                      a                         b
# 1 GCCTTGATTTTTTGGCGGGGACCGTcatGGCGTCGC GCCTTGATTTTTTGGCGGGGACCGT
# 2      GATTTTTTGGCGGGGACCGTcatGGCGTCGC      GATTTTTTGGCGGGGACCGT
# 3                 TCACCACCATCtCATTCTGC               TCACCACCATC
# 4            ACTGGTTCCAcCAGCGGGTCACGAC                ACTGGTTCCA
# 5                                BAARA                      <NA>
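
For completeness, the lookahead approach above can be sketched in base R as well, assuming you prefer not to load stringr. The pattern here is a slightly stricter variant chosen for this example: it is anchored with ^ and only accepts A-Z before the lookahead, so a string that starts with a lower-case letter gives NA instead of a partial match. The vector x and the names m, hit and res are illustrative.

```r
x <- c("GCCTTGATTTTTTGGCGGGGACCGTcatGGCGTCGC", "TCACCACCATCtCATTCTGC", "BAARA")

# perl = TRUE enables lookahead syntax in base R regular expressions.
m <- regexpr("^[A-Z]+(?=[a-z])", x, perl = TRUE)

# regexpr() returns -1 for elements without a match.
hit <- as.integer(m) != -1L

res <- rep(NA_character_, length(x))  # default: NA when nothing matches
res[hit] <- regmatches(x, m)          # regmatches() returns only the matched elements
res
```

Whether the stricter pattern is what you want depends on the data; for sequences that always begin with upper-case letters both patterns should give the same result.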