Cannot seem to figure out this regex involving forward slash

I am trying to capture instances in my dataframe where a string has the following format:

'/random a/random b/random c/capture this/random again/random/random'

Where a string is preceded by four instances of '/', and more than two '/' appear after it, I would like the string captured and returned in a different column. If it is not applicable to that row, return None.

In this instance 'capture this' should be captured and placed into a new column.

This is what I tried:

def extract_special_string(df, column):
    df['special_string_a'] = df[column].apply(lambda x: re.search(r'(?<=\/{4})[^\/]+(?=\/[^\/]{2,})', x).group(0) if re.search(r'(?<=\/{4})[^\/]+(?=\/[^\/]{2,})', x) else None)


extract_special_string(df, 'column')

However, nothing is being captured. Can anybody help with this regex? Thanks.

CodePudding user response：

You can use

df['special_string_a'] = df[column].str.extract(r'^(?:[^/]*/){4}([^/]+)(?:/[^/]*){2}', expand=False)

Details:

^ - start of string
(?:[^/]*/){4} - four occurrences of any zero or more chars other than / and then a / char
([^/]+) - Capturing group 1: one or more chars other than a / char
(?:/[^/]*){2} - two occurrences of a / char and then any zero or more chars other than /.
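
To sanity-check the pattern from this answer outside pandas, here is a minimal sketch using only the standard re module. The helper name extract_fourth_segment and the two short sample strings are made up for illustration. Series.str.extract applies the same pattern to every row and leaves NaN where there is no match.

```python
import re

# Pattern from this answer: four "segment + /" prefixes, the captured segment,
# then at least two more "/segment" parts (there is no end anchor, so more may follow).
PATTERN = re.compile(r'^(?:[^/]*/){4}([^/]+)(?:/[^/]*){2}')

def extract_fourth_segment(text):
    """Hypothetical helper: return the captured segment, or None if the string does not fit."""
    m = PATTERN.search(text)
    return m.group(1) if m else None

samples = [
    '/random a/random b/random c/capture this/random again/random/random',
    '/a/b/c/d/e',  # only one "/" after the candidate segment
    '/a/b',        # fewer than four "/" in total
]
results = [extract_fourth_segment(s) for s in samples]
print(results)
```

Only the first sample fits the required shape, so the other two come back as None.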

CodePudding user response：

An alternative regex approach would be to use non-greedy quantifiers.

import re
s = '/random a/random b/random c/capture this/random again/random/random'
pattern = r'/(?:.*?/){3}(.*?)(?:/.*?){2,}'
m = re.match(pattern, s)
print(m.group(1))  # 'capture this'

/(?:.*?/){3} - matches the part before capture this, taking as few characters as possible between each pair of /s (a non-capturing group is used so the contents are ignored)
(.*?) - captures capture this (since this is a capturing group, its contents can be fetched with <match_object>.group(1))
(?:/.*?){2,} - same idea as the first part: two or more occurrences of a / followed by as few characters as possible
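
One thing to watch for when applying this to every row: re.match returns None for strings that do not fit, so calling .group(1) directly would raise an AttributeError. A minimal sketch of a guarded wrapper follows; the name capture_lazy and the short test strings are assumptions for illustration.

```python
import re

# Lazy pattern from this answer, compiled once for reuse.
LAZY = re.compile(r'/(?:.*?/){3}(.*?)(?:/.*?){2,}')

def capture_lazy(text):
    """Hypothetical helper: return group 1, or None when the string does not match."""
    m = LAZY.match(text)  # match() anchors at the start of the string
    return m.group(1) if m else None

full = capture_lazy('/random a/random b/random c/capture this/random again/random/random')
short = capture_lazy('/a/b/c/d/e')    # only five "/" in total, six are needed
empty = capture_lazy('/a/b/c//e/f')   # fourth segment is empty

print(repr(full), repr(short), repr(empty))
```

Here full is 'capture this' and short is None. The last case returns an empty string rather than None, because (.*?) is allowed to match nothing, whereas a [^/]+ group requires at least one character. If empty segments should count as "not applicable", the result can be checked for truthiness before it is stored.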