I've seen a lot of similar questions, but I wasn't able to get the desired output.
I have a string means_variab_textimput_x2_200.txt
and I want to capture ONLY what is between the second and third underscores: textimput
I'm using R with stringr. I've tried many things, but none solved the issue:
my_string <- "means_variab_textimput_x2_200.txt"
str_extract(my_string, '[_]*[^_]*[_]*[^_]*[_]*[^_]*')
"means_variab_textimput"
str_extract(my_string, '^(?:([^_]+)_){4}')
"means_variab_textimput_x2_"
str_extract(my_string, '[_]*[^_]*[_]*[^_]*[_]*[^_]*\\.') ## the closer I got was this
"_textimput_x2_200."
Any ideas? Ps: I'm VERY new to Regex, so details would be much appreciated :)
Additional question: can I also get only a "part" of the word? Let's say, instead of textimput, only text, but without counting the words? It would be good to know both possibilities.
Several related questions were helpful, but I couldn't get the final expected results. Thanks in advance.
CodePudding user response:
stringr uses ICU-based regular expressions. One option would be a regex lookaround, but the text before the target has no fixed length here, so a lookbehind, (?<=...), wouldn't work. Another option is to either remove the surrounding substrings with str_remove, or use str_replace to match the whole string, capture the third word (a run of characters other than _, i.e. [^_]+), and replace the match with the backreference (\\1) to the captured word:
library(stringr)
str_replace(my_string, "^[^_]+_[^_]+_([^_]+)_.*", "\\1")
[1] "textimput"
If we need only the first four characters of that word:
str_replace(my_string, "^[^_]+_[^_]+_([^_]{4}).*", "\\1")
[1] "text"
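The str_remove route mentioned above is not shown, so here is a minimal sketch of it, alongside two other stringr functions (str_match and str_split_fixed) that may be easier to read than a full replacement. The variable names are illustrative and not part of the original answer.

```r
library(stringr)

my_string <- "means_variab_textimput_x2_200.txt"

# str_match() returns a matrix: column 1 is the full match,
# column 2 is the first capture group.
via_match <- str_match(my_string, "^[^_]+_[^_]+_([^_]+)")[, 2]

# str_split_fixed() splits on "_" into at most n columns;
# the third column holds the target word.
via_split <- str_split_fixed(my_string, "_", n = 4)[, 3]

# str_remove() twice: drop the first two fields, then drop
# everything from the next underscore to the end.
via_remove <- str_remove(str_remove(my_string, "^([^_]+_){2}"), "_.*$")

print(c(via_match, via_split, via_remove))
```

All three are vectorised over my_string, so they should also work on a character vector of file names.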
In base R, it is easier to split with strsplit and get the third word by indexing:
strsplit(my_string, "_")[[1]][3]
# [1] "textimput"
Or use perl = TRUE in regexpr, where \\K discards everything matched so far, so only the third word is returned:
regmatches(my_string, regexpr("^([^_]+_){2}\\K[^_]+", my_string, perl = TRUE))
# [1] "textimput"
For the substring:
regmatches(my_string, regexpr("^([^_]+_){2}\\K[^_]{4}", my_string, perl = TRUE))
[1] "text"
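If this is needed for several file names or for different positions, the strsplit idea can be wrapped in a small function. The sketch below is only one way to do it: nth_field and its arguments are hypothetical names made up for illustration, not part of base R.

```r
# Hypothetical helper: return the n-th separator-delimited field of each
# string, optionally truncated to its first max_chars characters.
nth_field <- function(x, n, sep = "_", max_chars = NULL) {
  parts <- strsplit(x, sep, fixed = TRUE)
  # Take the n-th piece, or NA when a string has fewer than n fields.
  out <- vapply(
    parts,
    function(p) if (length(p) >= n) p[n] else NA_character_,
    character(1)
  )
  if (!is.null(max_chars)) out <- substr(out, 1, max_chars)
  out
}

my_string <- "means_variab_textimput_x2_200.txt"
whole_word <- nth_field(my_string, 3)
first_four <- nth_field(my_string, 3, max_chars = 4)
too_short  <- nth_field("a_b", 3)

print(whole_word)
print(first_four)
```

Using substr for the partial word avoids a second regular expression, and a string with too few fields yields NA instead of an error.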
CodePudding user response:
Following up on the question asked in the comments about restricting the size of the extracted word: this can easily be achieved with a quantifier. If, for example, you want to extract only the first 4 letters:
sub("[^_]+_[^_]+_([^_]{4}).*$", "\\1", my_string)
[1] "text"
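One caveat with {4}: it requires exactly four characters, so a third word shorter than that will not match. A possible variant, sketched below, uses a range quantifier {1,4} ("up to four") and anchors the pattern with ^ so the match always starts at the first field. Both changes are assumptions added here rather than part of the answer above, and short_string is a made-up input.

```r
my_string    <- "means_variab_textimput_x2_200.txt"
short_string <- "means_variab_abc_x2_200.txt"  # hypothetical: third word has 3 characters

# ^ anchors at the start; {1,4} greedily captures up to four non-underscore characters.
pattern <- "^[^_]+_[^_]+_([^_]{1,4}).*$"

first_four   <- sub(pattern, "\\1", my_string)
short_result <- sub(pattern, "\\1", short_string)

print(first_four)
print(short_result)
```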