I am trying to capture instances in my dataframe where a string has the following format:
'/random a/random b/random c/capture this/random again/random/random'
Where a string is preceded by four instances of '/', and more than two '/' appear after it, I would like the string captured and returned in a different column. If that does not apply to a row, return None.
In this instance 'capture this' should be captured and placed into a new column.
This is what I tried:
import re

def extract_special_string(df, column):
    df['special_string_a'] = df[column].apply(lambda x: re.search(r'(?<=\/{4})[^\/]+(?=\/[^\/]{2,})', x).group(0) if re.search(r'(?<=\/{4})[^\/]+(?=\/[^\/]{2,})', x) else None)

extract_special_string(df, 'column')
However nothing is being captured. Can anybody help with this regex? Thanks.
CodePudding user response:
You can use
df['special_string_a'] = df[column].str.extract(r'^(?:[^/]*/){4}([^/]+)(?:/[^/]*){2}', expand=False)
Details:
^ - start of string
(?:[^/]*/){4} - four occurrences of any zero or more chars other than / and then a / char
([^/]+) - Capturing group 1: one or more chars other than a / char
(?:/[^/]*){2} - two occurrences of a / char and then any zero or more chars other than /.
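To see which rows would get a value and which would get nothing, the same pattern can be tried on a few strings with plain `re`. This is only a sketch: the helper name `extract_segment` and every sample string except the first are made up for illustration, and in a dataframe `str.extract` would give NaN where this helper gives None.

```python
import re

# Pattern from the answer above: four "anything-but-slash, then slash" groups,
# the captured segment, then at least two more "/..." parts.
PATTERN = re.compile(r'^(?:[^/]*/){4}([^/]+)(?:/[^/]*){2}')

def extract_segment(text):
    """Return the segment after the fourth '/', or None if the pattern does not match."""
    m = PATTERN.search(text)
    return m.group(1) if m else None

samples = [
    '/random a/random b/random c/capture this/random again/random/random',
    '/a/b/c/too short',   # no '/' after the candidate segment
    '/a/b/c/d/e',         # only one '/' after the candidate segment
    '/a/b/c/d/e/f',       # exactly two '/' after it, so it matches
    'no slashes at all',
]

results = [extract_segment(s) for s in samples]
print(results)
```

Note that {2} here means "at least two" trailing slashes, since nothing anchors the end of the string. If strictly more than two are required, changing {2} to {3} would be one way to express that.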
CodePudding user response:
An alternative regex approach would be to use non-greedy quantifiers.
import re
s = '/random a/random b/random c/capture this/random again/random/random'
pattern = r'/(?:.*?/){3}(.*?)(?:/.*?){2,}'
m = re.match(pattern, s)
print(m.group(1)) # 'capture this'
/(?:.*?/){3} - match the part before capture this, matching any but as few as possible characters between each pair of /s (use a non-capturing group to ignore the contents)
(.*?) - capture capture this (since this is a capturing group, we can fetch the contents from <match_object>.group(1))
(?:/.*?){2,} - same as the first part, match as few characters as possible in between each pair of /s
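One difference between the two approaches is worth checking before picking one. Because (.*?) can match zero characters, the non-greedy pattern returns an empty string when the target segment is empty, whereas ([^/]+) in the first answer does not match at all. A small sketch comparing them follows; the helper `capture` and the second sample string are assumptions added for illustration.

```python
import re

# The two patterns from the answers above.
LAZY = re.compile(r'/(?:.*?/){3}(.*?)(?:/.*?){2,}')
NEGATED = re.compile(r'^(?:[^/]*/){4}([^/]+)(?:/[^/]*){2}')

def capture(pattern, text):
    """Return group 1 of a match anchored at the start of text, or None."""
    m = pattern.match(text)
    return m.group(1) if m else None

normal = '/random a/random b/random c/capture this/random again/random/random'
empty_segment = '/a/b/c//d/e/f'  # nothing between the fourth and fifth '/'

print(capture(LAZY, normal), capture(NEGATED, normal))
print(repr(capture(LAZY, empty_segment)), capture(NEGATED, empty_segment))
```

If empty segments should count as "not applicable", the negated character class version is probably the safer choice, or the empty string can be mapped to None afterwards.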