I am trying to convert a str
type column to multiple rows based on different scenarios:
for example, I have the column X as
col A col B col X
abc xyz 1. hello
2. hi
3. bye
abc2 xyz2 1) bonjur 2) yalla
abc3 xyz3 1. single 2. line 3. format
abc4 xyz4 1. some_val
What I tried:
df['col X'] = df['col X'].str.split('\n')
new_df = df.explode('col X')
But this only works for the first row.
What I expected:
col A col B col X
abc xyz 1. hello
abc xyz 2. hi
abc xyz 3. bye
abc2 xyz2 1) bonjur
abc2 xyz2 2) yalla
abc3 xyz3 1. single
abc3 xyz3 2. line
abc3 xyz3 3. format
abc4 xyz4 1. some_val
CodePudding user response:
You can use a regex with lookarounds to split
on the space before a digit, then explode
:
out = (df
.assign(**{'col X': df['col X'].str.split(r'(?<=.)\s (?=\d)')})
.explode('col X', ignore_index=True)
)
Output!
col A col B col X
0 abc xyz 1. hello
1 abc xyz 2. hi
2 abc xyz 3. bye
3 abc2 xyz2 1) bonjur
4 abc2 xyz2 2) yalla
5 abc3 xyz3 1. single
6 abc3 xyz3 2. line
7 abc3 xyz3 3. format
8 abc4 xyz4 1. some_val
Regex:
(?<=.) # match anything before
\s # split on spaces/newlines if matches before/after are met
(?=\d) # match digit after