How to group a dataframe, apply function and avoid index duplication

Time:03-30

Here is my dataframe:

import pandas as pd

data = [[1, 1, 1, 0], 
        [1, 2, 1, 1], 
        [1, 2, 2, 2], 
        [2, 1, 1, 3]]
df = pd.DataFrame(data, columns=["idx 1", "idx 2", "idx 3", "value"])
>>> df
   idx 1  idx 2  idx 3  value
0      1      1      1      0
1      1      2      1      1
2      1      2      2      2
3      2      1      1      3

I have written a "sandwich" function that duplicates the first and last rows, placing a copy of the first row at the top and a copy of the last row at the bottom:

def sandwich(df):
    return pd.concat([df[:1], df, df[-1:]])
>>> df.apply(sandwich)
   idx 1  idx 2  idx 3  value
0      1      1      1      0  # new
0      1      1      1      0
1      1      2      1      1
2      1      2      2      2
3      2      1      1      3
3      2      1      1      3  # new
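
Because sandwich expects a whole DataFrame, it can also be called on the frame directly; as far as I understand, df.apply passes the function one column at a time, which here ends up rebuilding the same table. A minimal, self-contained sketch of the direct call (the parameter name frame and the variable direct are only illustrative):

```python
import pandas as pd

data = [[1, 1, 1, 0],
        [1, 2, 1, 1],
        [1, 2, 2, 2],
        [2, 1, 1, 3]]
df = pd.DataFrame(data, columns=["idx 1", "idx 2", "idx 3", "value"])


def sandwich(frame):
    # frame[:1] and frame[-1:] are positional row slices, so each one is a
    # one-row DataFrame that keeps its original index label.
    return pd.concat([frame[:1], frame, frame[-1:]])


# Call the function on the whole frame instead of going through df.apply.
direct = sandwich(df)
print(direct)
```

The copied rows keep the labels of the rows they were copied from, which is why the index contains repeated values.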

The trouble arises when I want to group the dataframe first and then apply the sandwich function to each group.

>>> df.groupby(["idx 1", "idx 2"]).apply(sandwich)
               idx 1  idx 2  idx 3  value
idx 1 idx 2                              
1     1     0      1      1      1      0
            0      1      1      1      0
            0      1      1      1      0
      2     1      1      2      1      1
            1      1      2      1      1
            2      1      2      2      2
            2      1      2      2      2
2     1     3      2      1      1      3
            3      2      1      1      3
            3      2      1      1      3

The rows are correct, but the group keys now appear twice: pandas adds "idx 1" and "idx 2" as index levels while also keeping them as columns. I can drop the index afterwards to get the desired result:

>>> df.groupby(["idx 1", "idx 2"]).apply(sandwich).reset_index(drop=True)
   idx 1  idx 2  idx 3  value
0      1      1      1      0
1      1      1      1      0
2      1      1      1      0
3      1      2      1      1
4      1      2      1      1
5      1      2      2      2
6      1      2      2      2
7      2      1      1      3
8      2      1      1      3
9      2      1      1      3

However, creating an index only to drop it immediately afterwards doesn't seem like the best approach. Is there a better way?

Answer:

Use group_keys=False in groupby:

df.groupby(["idx 1", "idx 2"], group_keys=False).apply(sandwich)

This keeps the index produced by sandwich and does not add the group keys as extra index levels:

   idx 1  idx 2  idx 3  value
0      1      1      1      0
0      1      1      1      0
0      1      1      1      0
1      1      2      1      1
1      1      2      1      1
2      1      2      2      2
2      1      2      2      2
3      2      1      1      3
3      2      1      1      3
3      2      1      1      3
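
Note that this result still carries the repeated labels 0, 0, 0, 1, ... rather than the 0 to 9 index shown in the question. If a fresh index is wanted without building the group index first, one possible alternative (not part of the answer above, just a sketch) is to loop over the groups and let pd.concat renumber the rows. The names pieces and result are illustrative. As far as I know, this also avoids depending on whether a given pandas version passes the grouping columns to apply.

```python
import pandas as pd

data = [[1, 1, 1, 0],
        [1, 2, 1, 1],
        [1, 2, 2, 2],
        [2, 1, 1, 3]]
df = pd.DataFrame(data, columns=["idx 1", "idx 2", "idx 3", "value"])


def sandwich(frame):
    # Repeat the first and last row around the original rows.
    return pd.concat([frame[:1], frame, frame[-1:]])


# Iterating over a groupby yields (key, sub-frame) pairs; each sub-frame
# still contains the "idx 1" and "idx 2" columns.
pieces = [sandwich(group) for _, group in df.groupby(["idx 1", "idx 2"])]

# ignore_index=True numbers the combined rows 0..n-1 in a single step,
# so no group index is created and then discarded.
result = pd.concat(pieces, ignore_index=True)
print(result)
```

The trade-off is an explicit Python loop over the groups, which is fine for a handful of groups but may be slower than a built-in groupby operation when there are very many.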