index compatibility of dataframe with multiindex result from apply on group

Time:10-14

We need to apply an algorithm to a column of a DataFrame. The data has to be grouped by a key, and the result should become a new column of the same DataFrame. Since this is a common use case, we wonder whether our approach is a sound one.

The following code shows our approach in simplified form.

import numpy as np
import pandas as pd

np.random.seed(42)

N = 100
key = np.random.randint(0, 2, N).cumsum()
x = np.random.rand(N)
data = dict(key=key, x=x)

df = pd.DataFrame(data)

This generates a DataFrame as follows.

    key         x
0     0  0.969585
1     1  0.775133
2     1  0.939499
3     1  0.894827
4     1  0.597900
..  ...       ...
95   53  0.036887
96   54  0.609564
97   55  0.502679
98   56  0.051479
99   56  0.278646

We then apply an example function to each group of the DataFrame.

def magic(x, const):
    return (x + np.abs(np.random.rand(len(x))) + float(const)).round(1)

def pandas_confrom_magic(df_per_key, const=1):
    index = df_per_key['x'].index # preserve index
    x = df_per_key['x'].to_numpy()
    y = magic(x, const) # perform some pandas incompatible magic
    return pd.Series(y, index=index) # reconstruct index

g = df.groupby('key')
y_per_g = g.apply(lambda df: pandas_confrom_magic(df, const=5))

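Assigning the result as a new column with df['y'] = y_per_g raises a TypeError, because y_per_g carries a two-level MultiIndex (key, original row label) while df has a single-level index.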

TypeError: incompatible index of inserted column with frame index

Thus a compatible MultiIndex has to be set on df first.

df.index.name = 'index'
df = df.set_index('key', append=True).reorder_levels(['key', 'index'])
df['y'] = y_per_g
df.reset_index('key', inplace=True)

This yields the intended result.

       key         x    y
index                    
0        0  0.969585  6.9
1        1  0.775133  6.0
2        1  0.939499  6.1
3        1  0.894827  6.4
4        1  0.597900  6.6
...    ...       ...  ...
95      53  0.036887  6.0
96      54  0.609564  6.0
97      55  0.502679  6.5
98      56  0.051479  6.0
99      56  0.278646  6.1

Now we wonder whether there is a more straightforward way of dealing with the index, and whether our overall approach is a reasonable one.

CodePudding user response:

Use Series.droplevel to remove the first level (key) of the MultiIndex. The result then has the same index as df, so the assignment works:

g = df.groupby('key')
df['y'] = g.apply(lambda df: pandas_confrom_magic(df, const=5)).droplevel('key')
print(df)

    key         x    y
0     0  0.969585  6.9
1     1  0.775133  6.0
2     1  0.939499  6.1
3     1  0.894827  6.4
4     1  0.597900  6.6
..  ...       ...  ...
95   53  0.036887  6.0
96   54  0.609564  6.0
97   55  0.502679  6.5
98   56  0.051479  6.0
99   56  0.278646  6.1

[100 rows x 3 columns]
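As a possible alternative that the answer above does not mention: when the function only needs a single column, groupby(...).transform returns a result aligned to the original index, so no level has to be dropped. This is a sketch under assumptions: the frame is a small hypothetical one, and per_group_values is a deterministic stand-in for the random magic function.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for the 100-row example.
df = pd.DataFrame({"key": [0, 1, 1, 2, 2, 2],
                   "x": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]})

def per_group_values(s, const=1):
    # s is the "x" column of one group, as a Series.
    y = np.cumsum(s.to_numpy()) + float(const)  # NumPy-only computation
    return pd.Series(y, index=s.index)          # same length and labels as the group

# transform aligns the per-group results with the original frame index.
df["y"] = df.groupby("key")["x"].transform(lambda s: per_group_values(s, const=5))
print(df)
```

This only fits functions that return one value per input row. For functions that need several columns of the group at once, apply followed by droplevel remains the more general route.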