Home > Blockchain >  Add/ delete rows based on value counts of certain column
Add/ delete rows based on value counts of certain column

Time:11-15

Given the below dataframe

import pandas as pd

df = pd.DataFrame({
    "ID": [ "1", "1", "1", "1", "1", "2", "2", "3", "3", "3", "4", "4"],
    "Feature": [ 2, 6, 4, 5, 6, 3, 1, 6, 3, 5, 7, 1]
})

such that if the value count of certain ID is less than 4 (e.g. value count of ID 2,3,4 are 2,3,2 respectively), then copy the last row that appears for that ID and pad it so that the value count of that ID becomes 4, and if the value count of certain ID is greater than 4 (e.g. value count of ID 1 is 5), then delete the last rows that appears for that ID so that the value count of that ID becomes 4.

So given the above dataframe:

          ID Feature
0         1  2
1         1  6
2         1  4
3         1  5
4         1  6
5         2  3
6         2  1
7         3  6
8         3  3
9         3  5
10        4  7
11        4  1

the outcome should be

          ID Feature
0         1  2
1         1  6
2         1  4
3         1  5
4         2  3
5         2  1
6         2  1
7         2  1
8         3  6
9         3  3
10        3  5
11        3  5
12        4  7
13        4  1
14        4  1
15        4  1

Is there any efficient way/ vectorized way to do that? Thank you

CodePudding user response:

You could craft a MultiIndex from a product and reindex, this will limit the number of rows to 4, then ffill to add the missing values:

idx = pd.MultiIndex.from_product([df['ID'].unique(), range(4)], names=['ID', ''])

(df.set_index(['ID', df.groupby('ID').cumcount()])
   .reindex(idx)
   .ffill(downcast='infer')
   .droplevel(1)
   .reset_index()
)

output:

   ID  Feature
0   1        2
1   1        6
2   1        4
3   1        5
4   2        3
5   2        1
6   2        1
7   2        1
8   3        6
9   3        3
10  3        5
11  3        5
12  4        7
13  4        1
14  4        1

CodePudding user response:

Use DataFrame.reindex with method='ffill' for filter and add values to MultiIndex created by MultiIndex.from_product with counter by GroupBy.cumcount, last remove helper level in MultiIndex:

df['g'] = df.groupby('ID').cumcount()

mux = pd.MultiIndex.from_product([df['ID'].unique(), range(4)], names=['ID','g'])
df = (df.set_index(['ID','g'])
        .reindex(mux, method='ffill')
        .reset_index(level=1, drop=True)
        .reset_index())
print (df)
   ID  Feature
0   1        2
1   1        6
2   1        4
3   1        5
4   2        3
5   2        1
6   2        1
7   2        1
8   3        6
9   3        3
10  3        5
11  3        5
12  4        7
13  4        1
14  4        1
15  4        1
  • Related