Pandas pad dataframe groups

Time:06-14

I have a dataframe e.g.:

  my_label   value
0        A   1
1        A   85
2        B   65
3        B   41
4        B   21
5        C   3

I want to group by my_label and pad each group to the next multiple of a given length, filling with the group's last value. For example, for a multiple of 4, this would give:

  my_label   value
0        A   1
1        A   85
2        A   85
3        A   85
4        B   65
5        B   41
6        B   21
7        B   21
8        C   3
9        C   3
10       C   3
11       C   3

I wrote the following, which I expected to work, but for some reason the reindex doesn't add rows at the end of each group.

def _pad(group, seq_len):
    pad_number = seq_len - (len(group) % seq_len)
    if pad_number != seq_len:
        group = group.reindex(range(len(group) + pad_number)).ffill()
    return group
df = (df.groupby('my_label')
        .apply(_pad, (4))
        .reset_index(drop = True))

Here is the code to build the DataFrame above for testing:

import pandas as pd
df = pd.DataFrame({"my_label":["A","A","B","B","B","C"], "value":[1,85,65,41,21,3]})

CodePudding user response:

You can concatenate, per group, a dummy DataFrame holding the missing rows, then ffill:

N = 4
out = (df
 .groupby('my_label', group_keys=False)
 .apply(lambda d: pd.concat([d, pd.DataFrame(columns=d.columns,
                                             index=range(-len(d) % N))]))
 .ffill()
 .reset_index(drop=True)
)

or, directly concatenating the last row as many times as needed:

(df
 .groupby('my_label', group_keys=False)
 .apply(lambda d: pd.concat([d, d.loc[[d.index[-1]]*(-len(d) % N)]]))
 .reset_index(drop=True)
)

output:

   my_label  value
0         A      1
1         A     85
2         A     85
3         A     85
4         B     65
5         B     41
6         B     21
7         B     21
8         C      3
9         C      3
10        C      3
11        C      3
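
As for why the `_pad` in the question doesn't pad at the end: each group keeps its original row labels (group B has labels 2, 3, 4), so `reindex(range(...))` asks for labels 0, 1, 2, ... that mostly don't exist in the group, instead of appending new rows. A minimal sketch of a fix, assuming it is fine to renumber each group first. `pad_group` is a name made up here, and the loop over the groups stands in for `groupby.apply` only so that the label column is kept regardless of pandas version:

```python
import pandas as pd

df = pd.DataFrame({"my_label": ["A", "A", "B", "B", "B", "C"],
                   "value": [1, 85, 65, 41, 21, 3]})

def pad_group(group, seq_len):
    # Rows missing to reach the next multiple of seq_len (0 if already there).
    pad_number = -len(group) % seq_len
    # Renumber the rows 0..len-1 so the extra labels land after the last row.
    group = group.reset_index(drop=True)
    # The new labels create NaN rows at the end, which ffill then fills.
    return group.reindex(range(len(group) + pad_number)).ffill()

# Iterating over the groupby yields (label, sub-DataFrame) pairs.
out = pd.concat(
    [pad_group(g, 4) for _, g in df.groupby("my_label", sort=False)],
    ignore_index=True,
)
```

Because the padding rows pass through NaN before the ffill, `value` may come back as float; cast it with `astype(int)` if that matters.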

CodePudding user response:

You can simply solve this by creating an index that represents your desired output, aligning that to your existing data, and then forward filling.

index = pd.MultiIndex.from_product([df['my_label'].unique(), range(4)], names=['my_label', None])

out = (
    df.set_index(
        ['my_label', df.groupby('my_label').cumcount()]
    )
    .reindex(index, method='ffill')
)

print(out)
            value
my_label         
A        0    1.0
         1   85.0
         2   85.0
         3   85.0
B        0   65.0
         1   41.0
         2   21.0
         3   21.0
C        0    3.0
         1    3.0
         2    3.0
         3    3.0
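
The reindex above pads every group to exactly 4 rows and, as the printout shows, turns `value` into floats. If groups can be longer than 4 and should go to the next multiple instead, one possible alternative (not the method above, just a sketch) is to repeat the last row of each group the required number of times. It assumes the rows of each label are contiguous; `sizes`, `is_last` and `repeats` are names chosen for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"my_label": ["A", "A", "B", "B", "B", "C"],
                   "value": [1, 85, 65, 41, 21, 3]})
N = 4

# Size of the group each row belongs to, aligned with df's rows.
sizes = df.groupby("my_label")["my_label"].transform("size")
# True on the last row of each label.
is_last = ~df.duplicated("my_label", keep="last")
# Last rows appear once plus the missing count; all other rows appear once.
repeats = np.where(is_last, 1 + (-sizes) % N, 1)

# Positional repeat, so it does not depend on the index labels.
out = df.iloc[np.repeat(np.arange(len(df)), repeats)].reset_index(drop=True)
```

No NaN is ever introduced this way, so `value` keeps its integer dtype.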