Pandas - using str.contains to match string

I have a column in my pandas df that looks like this:

Cycle 1 (0 h)
           A
           B
           C
Cycle 2 (0 h 43 min)
           A
           B
           C

I'm trying to match 'Cycle' and extract the digits inside the parentheses, ideally keeping only those numbers, separated by a comma.

I've tried:

df['1'] = df['1'].str.contains('Cycle', regex=True).str.extract('(\d+)')

But it gets rid of the Cycle line completely. I thought that, after extracting the digits, I could use str.split() to retain only the pertinent numbers and then separate them with a comma, but I can't seem to extract the numbers at all.

Answer:

You can use

rx = r'^Cycle\s+\d+\s+\((\d+)(?:\s*\w+\s*(\d+))?.*'
df['1'] = df['1'].str.replace(rx, lambda x: f'{x.group(1)},{x.group(2)}' if x.group(2) else x.group(1), regex=True)

Here, the ^Cycle\s+\d+\s+\((\d+)(?:\s*\w+\s*(\d+))?.* pattern is searched for and, if there is a match, it is replaced with the Group 1 and Group 2 contents joined by a comma, or with the Group 1 value alone, depending on whether Group 2 matched.

Details:

^ - start of string
Cycle - the literal word Cycle
\s+ - one or more whitespaces
\d+ - one or more digits
\s+ - one or more whitespaces
\( - a ( char
(\d+) - Group 1 (\1): one or more digits
(?:\s*\w+\s*(\d+))? - an optional non-capturing group matching the sequence of
- \s*\w+\s* - one or more word chars enclosed with zero or more whitespace chars
- (\d+) - Group 2 (\2): one or more digits
.* - the rest of the string.
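
To check the breakdown above, here is a minimal sketch that writes the same pattern with re.VERBOSE so each piece carries its own comment. The name rx_verbose and the "hours"/"minutes" labels are my assumptions based on the sample strings, not part of the original answer.

```python
import re

# Same pattern as in the answer, spread over several lines with re.VERBOSE.
# In verbose mode, whitespace in the pattern is ignored and # starts a comment.
rx_verbose = re.compile(r"""
    ^Cycle          # literal word at the start of the string
    \s+\d+\s+       # the cycle number, surrounded by whitespace
    \(              # a literal opening parenthesis
    (\d+)           # Group 1: first number (hours in the samples)
    (?:             # optional non-capturing group
        \s*\w+\s*   # a unit word such as h, with optional whitespace around it
        (\d+)       # Group 2: second number (minutes in the samples)
    )?
    .*              # the rest of the string
""", re.VERBOSE)

m1 = rx_verbose.match("Cycle 1 (0 h)")
m2 = rx_verbose.match("Cycle 2 (0 h 43 min)")
print(m1.groups())            # ('0', None)
print(m2.groups())            # ('0', '43')
print(rx_verbose.match("A"))  # None, rows without Cycle do not match
```

Group 2 is None when the optional part does not take part in the match, which is what the replacement callable tests for.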

If Group 2 matched, the replacement is the Group 1 and Group 2 values joined by a comma; otherwise, it is only the Group 1 value.
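
The replacement rule can also be sketched outside pandas with plain re.sub and a named function instead of a lambda. The function name join_groups is hypothetical and only used for illustration.

```python
import re

rx = re.compile(r'^Cycle\s+\d+\s+\((\d+)(?:\s*\w+\s*(\d+))?.*')

def join_groups(match):
    """Return 'G1,G2' when Group 2 took part in the match, else just 'G1'."""
    first, second = match.group(1), match.group(2)
    return f"{first},{second}" if second else first

print(rx.sub(join_groups, "Cycle 1 (0 h)"))         # 0
print(rx.sub(join_groups, "Cycle 2 (0 h 43 min)"))  # 0,43
print(rx.sub(join_groups, "A"))                     # A (no match, left unchanged)
```

The same function could be passed as the repl argument of Series.str.replace with regex=True, since pandas hands the callable a regex match object.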

Pandas test:

import pandas as pd
df = pd.DataFrame({'1': ['Cycle 1 (0 h)', 'Cycle 1 (0 h 48 min)']})
rx = r'^Cycle\s+\d+\s+\((\d+)(?:\s*\w+\s*(\d+))?.*'
df['1'].str.replace(rx, lambda x: f'{x.group(1)},{x.group(2)}' if x.group(2) else x.group(1), regex=True)
# => 0       0
# => 1    0,48
# => Name: 1, dtype: object
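
As a further sketch, the same replacement can be run on a column shaped like the one in the question, to show that rows without Cycle pass through untouched. The DataFrame below is my reconstruction of the question's data (without the leading indentation), and the variable names is_cycle, out and parts are hypothetical.

```python
import pandas as pd

# Assumed reconstruction of the column from the question.
df = pd.DataFrame({'1': ['Cycle 1 (0 h)', 'A', 'B', 'C',
                         'Cycle 2 (0 h 43 min)', 'A', 'B', 'C']})

rx = r'^Cycle\s+\d+\s+\((\d+)(?:\s*\w+\s*(\d+))?.*'

# str.contains only returns a boolean mask; the original text is gone,
# so there is nothing left to extract digits from afterwards.
is_cycle = df['1'].str.contains(r'^Cycle', regex=True)
print(is_cycle.tolist())
# [True, False, False, False, True, False, False, False]

# str.replace rewrites matching rows and leaves the others as they are.
out = df['1'].str.replace(
    rx,
    lambda m: f'{m.group(1)},{m.group(2)}' if m.group(2) else m.group(1),
    regex=True,
)
print(out.tolist())
# ['0', 'A', 'B', 'C', '0,43', 'A', 'B', 'C']

# Alternative: str.extract returns one column per capture group,
# with missing values where a row or an optional group did not match.
parts = df['1'].str.extract(rx)
print(parts[is_cycle])
```

If the numbers are needed as separate columns rather than as one comma-joined string, the str.extract variant may be the more convenient starting point.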