Extract substring based on a pattern (including it) and n characters after this pattern

Time:06-23

I'm struggling with regex. I'd like to do something like this. Dummy data:

    strings <- c('asr#2ldf;wwABC=0.0732sss;63d!;',
                 'ggggABC=0.0001#$Gsxxaaafo',
                 'zzzdd$rfr67333dsass',
                 'ABC=0.9882ssssFGJJTRRREWE!!ddww',
                 'ABC=0.0921',
                 'sshdasljhg[aois^*3342222346677777ABC=0.752164sssdds33')
    df <- data.frame(strings)

And the desired result is:

                                                strings     result
1                        asr#2ldf;wwABC=0.0732sss;63d!; ABC=0.0732
2                             ggggABC=0.0001#$Gsxxaaafo ABC=0.0001
3                                   zzzdd$rfr67333dsass       <NA>
4                       ABC=0.9882ssssFGJJTRRREWE!!ddww ABC=0.9882
5                                            ABC=0.0921 ABC=0.0921
6 sshdasljhg[aois^*3342222346677777ABC=0.752164sssdds33 ABC=0.7521

I'd like to extract ABC together with = and the number, keeping four decimal places (truncated rather than rounded, as row 6 shows). If there's no ABC, the result should be NA. The strings may have different lengths and can contain any symbol, but ABC occurs only once per string, and its position differs from string to string. How can I do this?

CodePudding user response:

A possible solution:

library(tidyverse)

df %>% 
  mutate(result = str_extract(strings, "ABC=\\d+\\.\\d{4}"))

#>                                                 strings     result
#> 1                        asr#2ldf;wwABC=0.0732sss;63d!; ABC=0.0732
#> 2                             ggggABC=0.0001#$Gsxxaaafo ABC=0.0001
#> 3                                   zzzdd$rfr67333dsass       <NA>
#> 4                       ABC=0.9882ssssFGJJTRRREWE!!ddww ABC=0.9882
#> 5                                            ABC=0.0921 ABC=0.0921
#> 6 sshdasljhg[aois^*3342222346677777ABC=0.752164sssdds33 ABC=0.7521

CodePudding user response:

One solution using dplyr and stringr (the dot is escaped so that it matches only a literal decimal point):

df %>% mutate(a = str_extract(strings, "ABC=[0-9]\\.[0-9]{4}"))
                                                strings          a
1                        asr#2ldf;wwABC=0.0732sss;63d!; ABC=0.0732
2                             ggggABC=0.0001#$Gsxxaaafo ABC=0.0001
3                                   zzzdd$rfr67333dsass       <NA>
4                       ABC=0.9882ssssFGJJTRRREWE!!ddww ABC=0.9882
5                                            ABC=0.0921 ABC=0.0921
6 sshdasljhg[aois^*3342222346677777ABC=0.752164sssdds33 ABC=0.7521
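
If loading extra packages is not an option, the same idea can be sketched in base R with `regexpr()` and `regmatches()`. This is only a sketch: the helper name `extract_abc` is made up for illustration, and it assumes the value always has at least one digit before the dot and at least four digits after it.

```r
strings <- c('asr#2ldf;wwABC=0.0732sss;63d!;',
             'ggggABC=0.0001#$Gsxxaaafo',
             'zzzdd$rfr67333dsass',
             'ABC=0.9882ssssFGJJTRRREWE!!ddww',
             'ABC=0.0921',
             'sshdasljhg[aois^*3342222346677777ABC=0.752164sssdds33')
df <- data.frame(strings)

# Hypothetical helper: first match of `pattern` in each string, NA if none
extract_abc <- function(x, pattern = "ABC=[0-9]+\\.[0-9]{4}") {
  x <- as.character(x)                  # guard against factor columns
  m <- regexpr(pattern, x)              # match start per string, -1 if no match
  out <- rep(NA_character_, length(x))  # non-matching strings stay NA
  out[m != -1] <- regmatches(x, m)      # fill in the matched substrings
  out
}

df$result <- extract_abc(df$strings)
print(df)
```

Because the pattern stops after four decimal digits, longer values such as 0.752164 are cut off, not rounded.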

CodePudding user response:

If the numbers are not always decimals, the regex could look like this:

"ABC=\\d+\\.?\\d{0,4}"
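
To see how a pattern with an optional decimal part behaves, here is a small base R check. It assumes the intended pattern is `ABC=\\d+\\.?\\d{0,4}` (one or more digits, an optional dot, then up to four more digits). The input strings and the name `extract_first` are invented for illustration.

```r
# Made-up inputs: integer value, long decimal, trailing dot, no "ABC=" at all
x <- c("xxABC=12yy", "ABC=3.14159zz", "ABC=7.", "noABChere")

pattern <- "ABC=\\d+\\.?\\d{0,4}"

# Hypothetical helper: first match per string, NA when nothing matches
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)  # perl = TRUE so \\d is handled PCRE-style
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

res <- extract_first(x, pattern)
print(res)
```

Note that this variant also accepts a bare trailing dot ("ABC=7."), which may or may not be wanted.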