Home > Software engineering >  pandas: split column value based on multiple conditions
pandas: split column value based on multiple conditions

Time:01-11

I am trying to convert a str type column to multiple rows based on different scenarios:

for example, I have the column X as

col A   col B    col X
abc     xyz      1. hello
                 2. hi
                 3. bye

abc2   xyz2      1) bonjur 2) yalla

abc3   xyz3      1. single 2. line 3. format

abc4   xyz4      1. some_val

What I tried:

df['col X'] = df['col X'].str.split('\n')
new_df = df.explode('col X')

But this only works for the first row.

What I expected:


col A   col B    col X
abc     xyz      1. hello

abc     xyz      2. hi
abc     xyz      3. bye

abc2   xyz2      1) bonjur
abc2   xyz2      2) yalla

abc3   xyz3      1. single
abc3   xyz3      2. line
abc3   xyz3      3. format

abc4   xyz4      1. some_val

CodePudding user response:

You can use a regex with lookarounds to split on the space before a digit, then explode:

out = (df
  .assign(**{'col X': df['col X'].str.split(r'(?<=.)\s (?=\d)')})
  .explode('col X', ignore_index=True)
)

Output!

  col A col B        col X
0   abc   xyz     1. hello
1   abc   xyz        2. hi
2   abc   xyz       3. bye
3  abc2  xyz2    1) bonjur
4  abc2  xyz2     2) yalla
5  abc3  xyz3    1. single
6  abc3  xyz3      2. line
7  abc3  xyz3    3. format
8  abc4  xyz4  1. some_val

Regex:

(?<=.)  # match anything before
\s      # split on spaces/newlines if matches before/after are met
(?=\d)  # match digit after
  • Related