I have a large data frame in R with a single column of strings that mix lowercase and uppercase letters.
df1 <- data.frame(a = c('GCCTTGATTTTTTGGCGGGGACCGTcatGGCGTCGC', 'GATTTTTTGGCGGGGACCGTcatGGCGTCGC', 'TCACCACCATCtCATTCTGC', 'ACTGGTTCCAcCAGCGGGTCACGAC'),
stringsAsFactors = FALSE)
I would like the output to keep only the uppercase letters to the left of any lowercase letters, i.e. something similar to a look-behind feature.
For example:
GCCTTGATTTTTTGGCGGGGACCGTcatGGCGTCGC would become GCCTTGATTTTTTGGCGGGGACCGT
GATTTTTTGGCGGGGACCGTcatGGCGTCGC would become GATTTTTTGGCGGGGACCGT
ACTGGTTCCAcCAGCGGGTCACGAC would become ACTGGTTCCA
I am only interested in the uppercase characters to the left of the first lowercase character. I would also like the code not to fall over when a string contains no lowercase characters.
I have tried looking at "Splitting strings by case", but I cannot seem to adapt it to look behind for uppercase.
I really thank you in advance for your help.
CodePudding user response:
Code:
library(tidyverse)
df1 <- data.frame(a = c('GCCTTGATTTTTTGGCGGGGACCGTcatGGCGTCGC', 'GATTTTTTGGCGGGGACCGTcatGGCGTCGC', 'TCACCACCATCtCATTCTGC', 'ACTGGTTCCAcCAGCGGGTCACGAC', 'BAARA'),
stringsAsFactors = FALSE)
df1
df1$a <- str_trim(str_extract(df1$a , "([:upper:]|[:space:]){2,}"))
df1
Output:
a
1 GCCTTGATTTTTTGGCGGGGACCGT
2 GATTTTTTGGCGGGGACCGT
3 TCACCACCATC
4 ACTGGTTCCA
5 BAARA #This one has no lowercase characters at all
Putting NA where the string does not have any lowercase characters:
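# Re-create df1 as above first: the previous step overwrote column a.
for (i in 1:nrow(df1)) {
  if (is.na(str_extract(df1[i, 'a'], "([:lower:]|[:space:]){1,}"))) {
    # No lowercase (or whitespace) run found in this row
    df1[i, 'a'] <- NA
  } else {
    df1[i, 'a'] <- str_trim(str_extract(df1[i, 'a'], "([:upper:]|[:space:]){2,}"))
  }
  df1[i, 'b'] <- df1[i, 'a']
}
df1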
Output:
a
1 GCCTTGATTTTTTGGCGGGGACCGT
2 GATTTTTTGGCGGGGACCGT
3 TCACCACCATC
4 ACTGGTTCCA
5 <NA>
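The row-by-row loop can probably be avoided, since R's string functions are vectorised. The following is only a sketch in base R (swapping stringr for `regexpr()` and `substr()`); the vector `x` and the names `pos` and `res` are made up for illustration and are not part of the original answer.

```r
# Hypothetical sample input: two strings with lowercase letters, one without
x <- c("GCCTTGATTTTTTGGCGGGGACCGTcatGGCGTCGC", "TCACCACCATCtCATTCTGC", "BAARA")

# Position of the first lowercase letter in each string (-1 when there is none)
pos <- as.integer(regexpr("[a-z]", x))

# Keep everything before that position; use NA where no lowercase letter exists
res <- ifelse(pos > 0, substr(x, 1, pos - 1), NA_character_)
res
```

Assigning `res` to a new column (for example `df1$b <- res`) would leave the original column untouched.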
CodePudding user response:
sub("([A-Z]+)[a-z].*", "\\1", df1$a)
# [1] "GCCTTGATTTTTTGGCGGGGACCGT"
# [2] "GATTTTTTGGCGGGGACCGT"
# [3] "TCACCACCATC"
# [4] "ACTGGTTCCA"
CodePudding user response:
You can use sub with [a-z].* or [[:lower:]].* to remove the first lowercase letter and everything after it.
sub("[a-z].*", "", df1$a)
#sub("[[:lower:]].*", "", df1$a) #Alternative
#[1] "GCCTTGATTTTTTGGCGGGGACCGT" "GATTTTTTGGCGGGGACCGT"
#[3] "TCACCACCATC" "ACTGGTTCCA"
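On the "should not fall over" requirement: as far as I can tell, `sub()` simply returns the input unchanged when the pattern does not match, so a string without lowercase letters passes through as-is. A small sketch, with a made-up vector `x` and result name `trimmed`:

```r
# Hypothetical input: one string with a lowercase letter, one without
x <- c("ACTGGTTCCAcCAGCGGGTCACGAC", "BAARA")

# Strip the first lowercase letter and everything after it;
# strings with no lowercase letter do not match and are returned as they are
trimmed <- sub("[a-z].*", "", x)
trimmed
```

Whether an all-uppercase string should come back unchanged or as NA depends on what you need downstream.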
Set to NA where there is no lowercase letter:
df1 <- rbind(df1, "ABC") #Add without lower case
is.na(df1$a) <- !grepl("[a-z]", df1$a) #set NA where no lower case
sub("[a-z].*", "", df1$a)
#[1] "GCCTTGATTTTTTGGCGGGGACCGT" "GATTTTTTGGCGGGGACCGT"
#[3] "TCACCACCATC" "ACTGGTTCCA"
#[5] NA
CodePudding user response:
You can do it all in one line of code with a positive lookahead regex (capturing everything up to the first lowercase letter), so you don't need to deal with the NAs separately. Either there is a match or there is not.
stringr::str_extract(df1$a, ".+?(?=[a-z])")
#[1] "GCCTTGATTTTTTGGCGGGGACCGT" "GATTTTTTGGCGGGGACCGT" "TCACCACCATC"
#[4] "ACTGGTTCCA" NA
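If you would rather not depend on stringr, a similar lookahead can be sketched in base R with `regexpr(..., perl = TRUE)` and `regmatches()`. Note that this is a variation, not the code above: it uses the stricter pattern `^[A-Z]+` instead of a lazy `.`, and `regmatches()` drops non-matches, so the NAs have to be filled in by hand. The names `x`, `m` and `out` are made up for illustration.

```r
# Hypothetical input: one string with a lowercase letter, one without
x <- c("ACTGGTTCCAcCAGCGGGTCACGAC", "BAARA")

# Leading uppercase run that is immediately followed by a lowercase letter;
# perl = TRUE is needed for the (?=...) lookahead. No match gives -1.
m <- regexpr("^[A-Z]+(?=[a-z])", x, perl = TRUE)

# regmatches() returns only the matched pieces, so start from all-NA
# and fill in the positions that did match
out <- rep(NA_character_, length(x))
out[m > 0] <- regmatches(x, m)
out
```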
To add a new column b with the result as asked in the comments:
df1 |> dplyr::mutate(b = stringr::str_extract(a, ".+?(?=[a-z])"))
#                                      a                         b
# 1 GCCTTGATTTTTTGGCGGGGACCGTcatGGCGTCGC GCCTTGATTTTTTGGCGGGGACCGT
# 2      GATTTTTTGGCGGGGACCGTcatGGCGTCGC      GATTTTTTGGCGGGGACCGT
# 3                 TCACCACCATCtCATTCTGC               TCACCACCATC
# 4            ACTGGTTCCAcCAGCGGGTCACGAC                ACTGGTTCCA
# 5                                BAARA                      <NA>