I have a column in my pandas df that looks like this:
Cycle 1 (0 h)
A
B
C
Cycle 2 (0 h 43 min)
A
B
C
I'm trying to match 'Cycle' and extract the digits. Ideally, I'd like for my output to look like this:
0
A
B
C
0,43
A
B
C
I've tried:
df['1'] = df['1'].str.contains('Cycle', regex=True).str.extract('(\d+)')
But it gets rid of the Cycle line completely. I thought that after extracting the digits I could use str.split()
and retain only the pertinent numbers, which I could then separate by a comma. But I can't seem to extract the numbers.
CodePudding user response:
You can use
rx = r'^Cycle\s+\d+\s+\((\d+)(?:\s*\w+\s*(\d+))?.*'
df['1'] = df['1'].str.replace(rx, lambda x: f'{x.group(1)},{x.group(2)}' if x.group(2) else x.group(1), regex=True)
Here, the ^Cycle\s+\d+\s+\((\d+)(?:\s*\w+\s*(\d+))?.*
pattern is searched for, and if there is a match, it is replaced with Group 1 + , +
Group 2 contents, or with only the Group 1 value, depending on whether Group 2 matched.
Details:
^ - start of string
Cycle - a word
\s+ - one or more whitespaces
\d+ - one or more digits
\s+ - one or more whitespaces
\( - a ( char
(\d+) - Group 1 (\1): one or more digits
(?:\s*\w+\s*(\d+))? - an optional non-capturing group matching a sequence of
    \s*\w+\s* - one or more word chars enclosed with zero or more whitespace chars
    (\d+) - Group 2 (\2): one or more digits
.* - the rest of the string.
If Group 2 matched, the replacement is Group 1 + , +
Group 2 values; otherwise, it is only the Group 1 value.
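To see the two groups in action without pandas, here is a minimal sketch using only the standard re module. The helper name cycle_to_time is made up for illustration; the pattern and the replacement logic are the same as above.

```python
import re

# Same pattern as above, compiled once.
rx = re.compile(r'^Cycle\s+\d+\s+\((\d+)(?:\s*\w+\s*(\d+))?.*')

def cycle_to_time(text):
    """Return 'h' or 'h,min' for a Cycle line; leave any other value unchanged."""
    def repl(m):
        # Group 2 is None when the line has no minutes part.
        return f'{m.group(1)},{m.group(2)}' if m.group(2) else m.group(1)
    return rx.sub(repl, text)

print(cycle_to_time('Cycle 1 (0 h)'))         # 0
print(cycle_to_time('Cycle 2 (0 h 43 min)'))  # 0,43
print(cycle_to_time('A'))                     # A
```

Strings that do not start with Cycle never match, so they pass through as they are.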
Pandas test:
import pandas as pd
df = pd.DataFrame({'1': ['Cycle 1 (0 h)', 'Cycle 1 (0 h 48 min)']})
rx = r'^Cycle\s+\d+\s+\((\d+)(?:\s*\w+\s*(\d+))?.*'
df['1'].str.replace(rx, lambda x: f'{x.group(1)},{x.group(2)}' if x.group(2) else x.group(1), regex=True)
# => 0       0
# => 1    0,48
# => Name: 1, dtype: object
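For the full column from the question, a sketch along the same lines could look like this. The sample data is retyped from the question, and join_groups is just a hypothetical name for the replacement callable. The point is that the A, B, C rows are left alone because the pattern only matches lines starting with Cycle.

```python
import pandas as pd

# Column from the question: Cycle headers mixed with other values.
df = pd.DataFrame({'1': ['Cycle 1 (0 h)', 'A', 'B', 'C',
                         'Cycle 2 (0 h 43 min)', 'A', 'B', 'C']})

rx = r'^Cycle\s+\d+\s+\((\d+)(?:\s*\w+\s*(\d+))?.*'

def join_groups(m):
    # Join hours and minutes with a comma only when minutes were captured.
    return f'{m.group(1)},{m.group(2)}' if m.group(2) else m.group(1)

df['1'] = df['1'].str.replace(rx, join_groups, regex=True)
result = df['1'].tolist()
print(result)  # ['0', 'A', 'B', 'C', '0,43', 'A', 'B', 'C']
```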