I have a dataframe contains data column, customer column and size like this:
Customer | Date | Size |
---|---|---|
Cust1 | 20/10/2021 | 4 |
Cust1 | 21/10/2021 | 5 |
Cust1 | 22/10/2021 | 6 |
Cust1 | 23/10/2021 | 6 |
Cust2 | 20/10/2021 | 4 |
Cust2 | 21/10/2021 | 5 |
Cust2 | 22/10/2021 | 6 |
Cust2 | 23/10/2021 | 6 |
Cust3 | 20/10/2021 | 4 |
Cust3 | 21/10/2021 | 5 |
Cust3 | 22/10/2021 | 6 |
Cust3 | 23/10/2021 | 6 |
I need to extract every nth date from a customer and delete the rest. In this example - every 2nd:
|Customer|Date|Size|
|--------|----|----|
|Cust1 |20/10/2021|4|
|Cust1 |22/10/2021|6|
|Cust2 |20/10/2021|4|
|Cust2 |22/10/2021|6|
|Cust3 |20/10/2021|4|
|Cust3 |22/10/2021|6|
Sorry for the bad format, but table formatting doesn't work for the 2nd table.
In reality it's every 10th and day, starting from the most recent one. Tried with group and iloc but isn't working:
df_10 = df.iloc[::10, :]
AttributeError: 'DataFrameGroupBy' object has no attribute 'iloc'
I don't insist to use groupby at all to be honest, but can't find working solution so far.
Thank you
CodePudding user response:
You can use:
df.loc[df.groupby('Customer').cumcount().mod(2).eq(0)]
output:
Customer Date Size
0 Cust1 20/10/2021 4
2 Cust1 22/10/2021 6
4 Cust2 20/10/2021 4
6 Cust2 22/10/2021 6
8 Cust3 20/10/2021 4
10 Cust3 22/10/2021 6
explanation:
df.groupby('Customer').cumcount()
creates a count per group ([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
), then mod(2)
takes the modulo to give [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
, and eq(0)
enables to select the 0 values.
If you want to take every N rows, starting from the K's one (first being 0): .mod(N).eq(K)
CodePudding user response:
Perhaps this can help? Mark every Nth row per group using pandas
With a dataset that resembles yours one solution could be:
df= pd.DataFrame({'Customer': ['Cust1']*20 ['Cust2']*20,
'date': range(1,40 1),
'size':[10]*40})
n=3 # for every 3th row
df.groupby('Customer', group_keys=False).apply(lambda x: x.iloc[n-1::n])
´´´