Pandas - using str.contains to match string

Time:10-06

I have a column in my pandas df that looks like this:

Cycle 1 (0 h)
           A
           B
           C
Cycle 2 (0 h 43 min)
           A
           B
           C

I'm trying to match 'Cycle' and extract the digits. Ideally, I'd like for my output to look like this:

       0
       A
       B
       C
      0,43
       A
       B
       C

I've tried:

df['1'] = df['1'].str.contains('Cycle', regex=True).str.extract('(\d+)')

But it gets rid of the Cycle line completely - I thought that after extracting the digits I could use str.split() and retain only the pertinent numbers which I could then separate by a comma. But I can't seem to extract the numbers.

CodePudding user response:

You can use

rx = r'^Cycle\s+\d+\s+\((\d+)(?:\s*\w+\s*(\d+))?.*'
df['1'] = df['1'].str.replace(rx, lambda x: f'{x.group(1)},{x.group(2)}' if x.group(2) else x.group(1), regex=True)

Here, the ^Cycle\s+\d+\s+\((\d+)(?:\s*\w+\s*(\d+))?.* pattern is searched for in each value. If there is a match, the whole value is replaced with the Group 1 and Group 2 contents joined by a comma, or with the Group 1 value alone if Group 2 did not match. Values that do not match the pattern are left unchanged.

Details:

  • ^ - start of string
  • Cycle - the literal word Cycle
  • \s+ - one or more whitespaces
  • \d+ - one or more digits
  • \s+ - one or more whitespaces
  • \( - a ( char
  • (\d+) - Group 1 (\1): one or more digits
  • (?:\s*\w+\s*(\d+))? - an optional non-capturing group matching a sequence of
    • \s*\w+\s* - one or more word chars surrounded by zero or more whitespace chars
    • (\d+) - Group 2 (\2): one or more digits
  • .* - the rest of the string.

If Group 2 matched, the replacement is the Group 1 and Group 2 values separated by a comma; otherwise, it is the Group 1 value only.
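To see the two groups in isolation, the same pattern can be run with the standard re module outside pandas. This is only a sketch: the helper name cycle_digits and the sample list are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as in the answer, compiled once.
rx = re.compile(r'^Cycle\s+\d+\s+\((\d+)(?:\s*\w+\s*(\d+))?.*')

def cycle_digits(text):
    """Hypothetical helper: return 'h' or 'h,min' for Cycle lines, else the text unchanged."""
    m = rx.match(text)
    if m is None:
        return text  # non-matching values such as 'A' pass through
    hours, minutes = m.group(1), m.group(2)
    # Group 2 is None when the optional part did not match.
    return f'{hours},{minutes}' if minutes else hours

samples = ['Cycle 1 (0 h)', 'Cycle 2 (0 h 43 min)', 'A']
results = [cycle_digits(s) for s in samples]
print(results)  # ['0', '0,43', 'A']
```

The check on Group 2 is what decides whether a comma is added, which mirrors the lambda passed to str.replace.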

Pandas test:

import pandas as pd
df = pd.DataFrame({'1': ['Cycle 1 (0 h)', 'Cycle 1 (0 h 48 min)']})
rx = r'^Cycle\s+\d+\s+\((\d+)(?:\s*\w+\s*(\d+))?.*'
df['1'].str.replace(rx, lambda x: f'{x.group(1)},{x.group(2)}' if x.group(2) else x.group(1), regex=True)
# => 0       0
# => 1    0,48
# => Name: 1, dtype: object
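
The test above only contains Cycle rows. A minimal sketch with the mixed column from the question, assuming the A/B/C rows are plain strings, shows that rows not matching the pattern are kept as they are. The function name join_groups is chosen here for illustration only.

```python
import pandas as pd

# Column shaped like the one in the question (assumed values).
df = pd.DataFrame({'1': ['Cycle 1 (0 h)', 'A', 'B', 'C',
                         'Cycle 2 (0 h 43 min)', 'A', 'B', 'C']})
rx = r'^Cycle\s+\d+\s+\((\d+)(?:\s*\w+\s*(\d+))?.*'

def join_groups(m):
    # Join hours and minutes with a comma only when minutes were captured.
    return f'{m.group(1)},{m.group(2)}' if m.group(2) else m.group(1)

df['1'] = df['1'].str.replace(rx, join_groups, regex=True)
result = df['1'].tolist()
print(result)  # ['0', 'A', 'B', 'C', '0,43', 'A', 'B', 'C']
```

Because str.replace only rewrites the matched part of each value, no separate str.contains filter is needed.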