Delimiting string in R

Time:10-13

I hope everyone is having a blast. I have run into this challenge:

I want to be able to extract one portion of a string in the following manner:

  1. The string may have no dot, one dot, or plenty of them
  2. I want to extract the part of the string before the first dot; if there is no dot, I want the whole string
  3. I want to use a regex to achieve this

    test<-c("This_This-This.Not This",
            "This_This-This.not_.this",
            "This_This-This",
            "this",
            "this.Not This")

Since I need to use a regex, I have been trying this expression:

str_match(test,"(^[a-zA-Z].+)[\\.\\b]?")[,2]

but what I get is:

> str_match(test,"(^[a-zA-Z].+)[\\.\\b]?")[,2]
[1] "This_This-This.Not This"  "This_This-This.not_.this"
[3] "This_This-This"           "this"
[5] "this.Not This"
> 

My desired output is:

"This_This-This"
"This_This-This"
"This_This-This"
"this"
"this"

This is my thought process behind the regex:

str_match(test,"(^[a-zA-Z].+)[\\.\\b]?")[,2]

(^[a-zA-Z].+) = captures the group before the dot: the string always starts with an uppercase or lowercase letter, followed by any other characters, hence the .+

[\.\b]? = a dot or a word boundary that may or may not be present, hence the ?

It is not giving me what I want, and I would be so happy if you guys could help me understand my mistake here. Thank you so much!!!

CodePudding user response:

Actually, rather than extracting, a regex replacement should work well here:

test <- c("This_This-This.Not This",
          "This_This-This.not_.this",
          "This_This-This",
          "this",
          "this.Not This")
output <- sub("\\..*", "", test)
output

[1] "This_This-This" "This_This-This" "This_This-This" "this"          
[5] "this"

Replacement works well here because it is a no-op for any input without a dot: sub() finds no match and returns the original string unchanged.
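
The no-op behaviour for strings without a dot can be checked directly. This is only a minimal sketch on the same test vector; the names first_part and dotless are illustrative and not part of the answer above.

```r
test <- c("This_This-This.Not This",
          "This_This-This.not_.this",
          "This_This-This",
          "this",
          "this.Not This")

# Remove everything from the first literal dot to the end of the string.
first_part <- sub("\\..*", "", test)

# Flag the inputs that contain no dot at all (fixed = TRUE treats "." literally).
dotless <- !grepl(".", test, fixed = TRUE)

# For those inputs sub() finds nothing to replace, so they come back unchanged.
print(identical(first_part[dotless], test[dotless]))

print(first_part)
```

Note that sub() replaces only the first match, which is enough here because the match runs from the first dot to the end of the string.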

CodePudding user response:

My regex is "lazily match anything up to either the first dot or the end of the string".

library(stringr)
str_match(test, "^(.*?)(\\.|$)")[, 2]

Result:

[1] "This_This-This" "This_This-This" "This_This-This" "this" "this"          
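
The lazy quantifier is what makes this pattern stop at the first dot, and it also hints at why a greedy attempt returns whole strings. The sketch below compares the two in base R (so it runs without stringr), using sub() with perl = TRUE as a stand-in for str_match. The variable names lazy, greedy and negated are illustrative, and the negated character class is an alternative I am adding for comparison, not something from the answers above.

```r
test <- c("This_This-This.Not This",
          "This_This-This.not_.this",
          "This_This-This",
          "this",
          "this.Not This")

# Lazy: "(.*?)" grows one character at a time and stops as soon as
# "(\\.|$)" can match, i.e. at the first dot or at the end of the string.
lazy <- sub("^(.*?)(\\.|$).*", "\\1", test, perl = TRUE)

# Greedy: "(.*)" first takes the whole string, and "$" then matches at the
# end, so the capture group holds the entire input.
greedy <- sub("^(.*)(\\.|$).*", "\\1", test, perl = TRUE)

# Negated class: "[^.]*" cannot cross a dot, so no lazy quantifier is needed.
negated <- sub("^([^.]*).*", "\\1", test)

print(lazy)
print(identical(greedy, test))   # greedy capture equals the original strings
print(identical(negated, lazy))  # both stop at the first dot
```

The same reasoning applies to a greedy group followed by an optional part: since the optional part is allowed to match nothing, the engine never has to give back any of the characters the greedy group consumed.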