Flatten and merge a pandas DataFrame with a MultiIndex

Time:09-07

As part of a larger data cleanup exercise, I am using pandas.Series.str.extractall(pat) on a DataFrame column. Because a row can contain several matches, extractall() returns a DataFrame with a MultiIndex (original row, match number). I would like to flatten that result and merge it into a DataFrame with one row per original row.

I've spent countless hours with various combinations of apply, agg, groupby, join and haven't made any progress. Any suggestions?

Simplified example

import pandas as pd

df = pd.DataFrame([["cat 123"], ["hat dog 36776"], ["dog 345"], ["fish 456890 hat"]], columns=['A'])
pat = r'(?:(?P<Mammal>(?:cat)|(?:dog))|(?P<Fish>(?:fish))|(?P<Hat>(?:hat)))'
df_ea = df['A'].str.extractall(pat)

Results in this table

        Mammal  Fish  Hat
  match
0 0        cat   NaN  NaN
1 0        NaN   NaN  hat
  1        dog   NaN  NaN
2 0        dog   NaN  NaN
3 0        NaN  fish  NaN
  1        NaN   NaN  hat

Desired result

df_test = pd.DataFrame([["cat", "", ""],
                        ["dog", "", "hat"],
                        ["dog", "", ""],
                        ["", "fish", "hat"]],
                       columns=['Mammal', 'Fish', 'Hat'])

  Mammal  Fish  Hat
0    cat
1    dog        hat
2    dog
3         fish  hat

CodePudding user response:

You could use groupby as follows:

df_ea = df['A'].str.extractall(pat).groupby(level=0).first()

This will be the result:

  Mammal  Fish   Hat
0    cat  None  None
1    dog  None   hat
2    dog  None  None
3   None  fish   hat
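
To spell out why this works: level 0 of the MultiIndex is the original row label and level 1 ("match") numbers the matches within that row, so grouping on level 0 collapses each row's matches, and first() keeps the first non-null value per column. The sketch below reproduces that step. The `joined` variant is not part of the answer above; it is a hypothetical alternative for the case where a row could match the same group more than once (for example, both "cat" and "dog"), which first() would silently reduce to one value.

```python
import pandas as pd

df = pd.DataFrame(
    [["cat 123"], ["hat dog 36776"], ["dog 345"], ["fish 456890 hat"]],
    columns=["A"],
)
pat = r"(?:(?P<Mammal>(?:cat)|(?:dog))|(?P<Fish>(?:fish))|(?P<Hat>(?:hat)))"

# One row per match, indexed by (original row label, match number).
matches = df["A"].str.extractall(pat)

# Group on the original row label; first() skips missing values,
# so each column keeps its first actual match for that row.
flat = matches.groupby(level=0).first()

# Hypothetical variant: keep every match per column, comma-separated.
# Columns with no match in a row become empty strings.
joined = matches.groupby(level=0).agg(lambda s: ",".join(s.dropna()))
```

Note that extractall() only returns rows that matched at least once, so a row of `df` without any match would be absent from both `flat` and `joined`.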

CodePudding user response:

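You've already done most of the work; you're only missing one last line. Applied to the grouped result above (not to the raw extractall() output, which still has one row per match), it resets the index and replaces the missing values with empty strings, as in the desired output:

df_ea = df_ea.reset_index(drop=True).fillna('')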
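
Putting both answers together, here is a minimal end-to-end sketch that also covers the "merge" part of the question. It assumes the goal is to attach the flattened columns to the original `df`; the names `flat`, `merged`, and `result` are illustrative and do not come from the answers. It uses `DataFrame.join`, which aligns on the row index, so any row of `df` without a match would simply get missing values before `fillna('')` is applied.

```python
import pandas as pd

df = pd.DataFrame(
    [["cat 123"], ["hat dog 36776"], ["dog 345"], ["fish 456890 hat"]],
    columns=["A"],
)
pat = r"(?:(?P<Mammal>(?:cat)|(?:dog))|(?P<Fish>(?:fish))|(?P<Hat>(?:hat)))"

# Flatten: one row per original row, first non-null match per column.
flat = df["A"].str.extractall(pat).groupby(level=0).first()

# Merge: align on the row index, then blank out the missing values.
merged = df.join(flat).fillna("")

# The three extracted columns, comparable to df_test from the question.
result = merged[["Mammal", "Fish", "Hat"]]
print(merged)
```

Joining on the index avoids relying on `reset_index(drop=True)` producing the same row order as `df`, which only holds when every row has at least one match and `df` has a default integer index.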
