Extracting a word between two path separators that comes after a specific word

Time:09-29

I have the following path stored as a Python string, 'C:\ABC\DEF\GHI\App\Module\feature\src', and I would like to extract the word Module, which sits between \App\ and \feature\ in the path. The '\' separators around it should not be extracted; only the string Module should be.

I have a few ideas on how to do it:

  1. Write a RegEx that matches a string between \App\ and \feature\
  2. Write a RegEx that matches a string after \App\ --> App\\[A-Za-z0-9]*\\, and then split that matched string in order to find the Module.

I think the first solution is better, but unfortunately it goes beyond my RegEx knowledge and I am not sure how to do it.

I would much appreciate any help.

Thank you in advance!

CodePudding user response:

You are looking for groups. With some small modifications you can extract only the part between App and feature.

(?:App\\\\)([A-Za-z0-9]*)(?:\\\\feature)

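The parentheses ( ) define a capture group, which you can retrieve with match.group(1). Using (?:foo) defines a non-capturing group, i.e. one that is not included in your result. Note that if the pattern is written as a regular (non-raw) Python string literal, the four backslashes collapse to the regex \\, which matches a single literal backslash; in a raw string you would write \\ instead. Try the expression here: https://regex101.com/r/24mkLO/1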
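In Python, the capture-group approach might look like the minimal sketch below. The raw-string pattern and the variable names are assumptions made for illustration; the character class is the one from the expression above.

```python
import re

path = r'C:\ABC\DEF\GHI\App\Module\feature\src'

# In a raw string, \\ is the regex for one literal backslash.
pattern = r'App\\([A-Za-z0-9]*)\\feature'

match = re.search(pattern, path)
# group(1) holds only the text captured by the parentheses
module = match.group(1) if match else None
print(module)  # Module
```

Checking the match before calling group(1) avoids an AttributeError when the path does not contain both markers.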

CodePudding user response:

We can do that with str.find, something like:

path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'

# Slice from just after the start marker up to the last occurrence of the end marker
print(path[path.find(start) + len(start):path.rfind(end)])

output

Module
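One caveat with this approach is that find and rfind return -1 when the marker is missing, so the slice silently yields the wrong text. A possible variant using str.partition is sketched below; the helper name between is hypothetical, and unlike the rfind call above it stops at the first end marker after the start marker.

```python
def between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None  # start marker missing
    middle, found_end, _ = rest.partition(end)
    if not found_end:
        return None  # end marker missing
    return middle

path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
print(between(path, '\\App\\', '\\feature\\'))    # Module
print(between(path, '\\Nope\\', '\\feature\\'))   # None
```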

CodePudding user response:

The regex you want is:

(?<=\\App\\).*?(?=\\feature\\)

Explanation of the regex:

  • (?<=behind)rest matches all instances of rest if there is behind immediately before it. It's called a positive lookbehind
  • rest(?=ahead) matches all instances of rest where there is ahead immediately after it. This is a positive lookahead.
  • \ is a reserved character in regex patterns, so to use it as a literal part of the pattern we have to escape it; hence, \\
  • .* matches any character (except a newline, by default), zero or more times.
  • ? makes the match non-greedy, so it stops at the first \feature\ that follows \App\ rather than the last one.
  • The pattern also assumes that there is only one path component between \App\ and \feature\; since . matches \ as well, any extra directories in between would be included in the match.

The full code would be something like:

import re

path = 'C:\\ABC\\DEF\\GHI\\App\\Module\\feature\\src'
start = '\\App\\'
end = '\\feature\\'

# re.escape turns each literal backslash into the regex \\
pattern = rf"(?<={re.escape(start)}).*?(?={re.escape(end)})"

print(pattern)                            # (?<=\\App\\).*?(?=\\feature\\)
print(re.search(pattern, path)[0])        # Module
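To see how greediness and extra backslashes affect the result, here is a small sketch on a hypothetical path (not the one from the question) that has an extra directory and a second \feature\ segment. The [^\\]+ variant is an assumption about how one might restrict the match to a single path component.

```python
import re

# Hypothetical path with an extra directory and a second \feature\ segment
path = r'C:\App\Module\Sub\feature\x\feature\src'

# Lazy: stops at the first \feature\ after \App\
lazy = re.search(r'(?<=\\App\\).*?(?=\\feature\\)', path)
# Greedy: runs up to the last \feature\
greedy = re.search(r'(?<=\\App\\).*(?=\\feature\\)', path)
# [^\\]+ refuses to cross a backslash, so it needs exactly one component
one_segment = re.search(r'(?<=\\App\\)[^\\]+(?=\\feature\\)', path)

print(lazy[0])       # Module\Sub
print(greedy[0])     # Module\Sub\feature\x
print(one_segment)   # None
```

On the path from the question all three patterns would return Module; they differ only when the assumptions listed above do not hold.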

A link on regex lookarounds that may be helpful: https://www.regular-expressions.info/lookaround.html
