Pandas groupby and max value

I'm trying to get the highest price for each group using pandas.

An additional condition: for each row, the maximum should only consider rows dated within the 180 days before the date in that row's date column.

Below is the DataFrame I have.

    import pandas as pd

    df = pd.DataFrame({'num': ["50110-374","50110-374","50110-374","50110-374","50110-374","50110-374","50110-374","50110-374","50110-374","50110-374","50110-3421","50110-3421","50110-3421","50110-3421","50110-3421","50110-3421","50110-3421"],
                       'date': ["2017-11-24","2018-02-08","2018-03-08","2018-03-17","2018-04-11","2018-04-16","2018-05-05","2018-06-04","2018-06-16","2018-07-13","2019-01-28","2019-02-10","2019-03-16","2019-03-16","2019-06-07","2019-06-30", "2022-06-30"],
                       'type': ["39","39","39","39","39","39","39","39","39","39","73","73","73","73","73","73","73"],
                       'price': [17000,12500,14000,14000,18000,13000,14250,15000,12900,15000,35500,34500,35000,37000,33300,34800, 32000]})

        num         date        type  price
    0   50110-374   2017-11-24  39    17000
    1   50110-374   2018-02-08  39    12500
    2   50110-374   2018-03-08  39    14000
    3   50110-374   2018-03-17  39    14000
    4   50110-374   2018-04-11  39    18000
    5   50110-374   2018-04-16  39    13000
    6   50110-374   2018-05-05  39    14250
    7   50110-374   2018-06-04  39    15000
    8   50110-374   2018-06-16  39    12900
    9   50110-374   2018-07-13  39    15000
    10  50110-3421  2019-01-28  73    35500
    11  50110-3421  2019-02-10  73    34500
    12  50110-3421  2019-03-16  73    35000
    13  50110-3421  2019-03-16  73    37000
    14  50110-3421  2019-06-07  73    33300
    15  50110-3421  2019-06-30  73    34800
    16  50110-3421  2022-06-30  73    32000

Below is the result I want, with a max column added.

(The max at index 16 is 32000 because no earlier row falls within 180 days of its date.)

        num         date        type  price  max
    0   50110-374   2017-11-24  39    17000  NaN
    1   50110-374   2018-02-08  39    12500  17000.0
    2   50110-374   2018-03-08  39    14000  17000.0
    3   50110-374   2018-03-17  39    14000  17000.0
    4   50110-374   2018-04-11  39    18000  18000.0
    5   50110-374   2018-04-16  39    13000  18000.0
    6   50110-374   2018-05-05  39    14250  18000.0
    7   50110-374   2018-06-04  39    15000  18000.0
    8   50110-374   2018-06-16  39    12900  18000.0
    9   50110-374   2018-07-13  39    15000  18000.0
    10  50110-3421  2019-01-28  73    35500  NaN
    11  50110-3421  2019-02-10  73    34500  35500.0
    12  50110-3421  2019-03-16  73    35000  35500.0
    13  50110-3421  2019-03-16  73    37000  37000.0
    14  50110-3421  2019-06-07  73    33300  37000.0
    15  50110-3421  2019-06-30  73    34800  37000.0
    16  50110-3421  2022-06-30  73    32000  32000.0

Thank you for reading it.

I tried this function. It worked, but it takes too long to run.

    def maxDeal(date):
        # totalMonthList and priceApi come from elsewhere in my code (not shown here)
        testDate = date
        dateIndex = totalMonthList.index(testDate)
        testRange = totalMonthList[dateIndex-720:dateIndex + 1]

        # rows recorded on this date
        tmpCdDf = priceApi[priceApi['date'] == testDate][['num','type','date']]
        tmpCdLst = list(tmpCdDf['num'].drop_duplicates())
        # max price within the date range, grouped by date and type
        maxDf = df[(df['num'].isin(tmpCdLst)) & (df['date'].isin(testRange))].groupby(['date','type'])['price'].max().reset_index()
        tmpCdDf = pd.merge(tmpCdDf,maxDf, how='left', on=['num','type'] )
        maxValue = list(tmpCdDf['price'])
        df.loc[df['date'] == date, 'max'] = maxValue

CodePudding user response：

Step1

Convert the date column to datetime and build a Series holding each date plus 180 days.

    df['date'] = pd.to_datetime(df['date'])
    s = df['date'] + pd.Timedelta('180d')

Step2

Apply a function row by row, using apply with axis=1.

For each row, build a condition, mask the price column with where, and take the max.

Conditions:

the index must be less than or equal to the index of the row
date + 180d must be greater than or equal to the date of the row
the type must be the same as the type of the row

    df.apply(lambda x: df['price'].where((df.index <= x.name) & s.ge(x['date']) & df['type'].eq(x['type'])).max(), axis=1)

result

0     17000.0
1     17000.0
2     17000.0
3     17000.0
4     18000.0
5     18000.0
6     18000.0
7     18000.0
8     18000.0
9     18000.0
10    35500.0
11    35500.0
12    35500.0
13    37000.0
14    37000.0
15    37000.0
16    32000.0
dtype: float64
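
To make the three conditions concrete, here is a minimal sketch that evaluates them for a single row (index 7, picked arbitrarily). The DataFrame is rebuilt from the question's data so the snippet runs on its own; the names row, mask, window and row_max are introduced only for this illustration.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "num": ["50110-374"] * 10 + ["50110-3421"] * 7,
    "date": pd.to_datetime([
        "2017-11-24", "2018-02-08", "2018-03-08", "2018-03-17", "2018-04-11",
        "2018-04-16", "2018-05-05", "2018-06-04", "2018-06-16", "2018-07-13",
        "2019-01-28", "2019-02-10", "2019-03-16", "2019-03-16", "2019-06-07",
        "2019-06-30", "2022-06-30",
    ]),
    "type": ["39"] * 10 + ["73"] * 7,
    "price": [17000, 12500, 14000, 14000, 18000, 13000, 14250, 15000, 12900,
              15000, 35500, 34500, 35000, 37000, 33300, 34800, 32000],
})
s = df["date"] + pd.Timedelta("180d")

row = df.loc[7]  # the row dated 2018-06-04
mask = (
    (df.index <= row.name)         # only this row and the rows above it
    & s.ge(row["date"])            # dated no more than 180 days before this row
    & df["type"].eq(row["type"])   # same type as this row
)
window = df.loc[mask, ["date", "price"]]
row_max = window["price"].max()

print(window)
print(row_max)
```

Index 0 drops out of the window because 2017-11-24 plus 180 days is 2018-05-23, which is earlier than 2018-06-04, so the maximum for index 7 is taken over indexes 1 to 7.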

Step3

Assign the result to a max column.

    df['max'] = df.apply(lambda x: df['price'].where((df.index <= x.name) & s.ge(x['date']) & df['type'].eq(x['type'])).max(), axis=1)

df

        num         date        type  price  max
    0   50110-374   2017-11-24  39    17000  17000.0
    1   50110-374   2018-02-08  39    12500  17000.0
    2   50110-374   2018-03-08  39    14000  17000.0
    3   50110-374   2018-03-17  39    14000  17000.0
    4   50110-374   2018-04-11  39    18000  18000.0
    5   50110-374   2018-04-16  39    13000  18000.0
    6   50110-374   2018-05-05  39    14250  18000.0
    7   50110-374   2018-06-04  39    15000  18000.0
    8   50110-374   2018-06-16  39    12900  18000.0
    9   50110-374   2018-07-13  39    15000  18000.0
    10  50110-3421  2019-01-28  73    35500  35500.0
    11  50110-3421  2019-02-10  73    34500  35500.0
    12  50110-3421  2019-03-16  73    35000  35500.0
    13  50110-3421  2019-03-16  73    37000  37000.0
    14  50110-3421  2019-06-07  73    33300  37000.0
    15  50110-3421  2019-06-30  73    34800  37000.0
    16  50110-3421  2022-06-30  73    32000  32000.0

Full Code

    import pandas as pd

    df['date'] = pd.to_datetime(df['date'])
    s = df['date'] + pd.Timedelta('180d')
    df['max'] = df.apply(lambda x: df['price'].where((df.index <= x.name) & s.ge(x['date']) & df['type'].eq(x['type'])).max(), axis=1)
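
Because apply with axis=1 rescans the whole frame for every row, it can get slow as the data grows. One possible alternative, sketched here under stated assumptions, is a time-based rolling window computed per group. The sketch assumes the groups are defined by num (the code above matches on type, which gives the same grouping for this data), that dates are sorted ascending within each group, and that a row dated exactly 180 days earlier should count, which is why closed="both" is used to mirror the s.ge(...) condition. The helper name rolling_max_180d is made up for this example.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "num": ["50110-374"] * 10 + ["50110-3421"] * 7,
    "date": pd.to_datetime([
        "2017-11-24", "2018-02-08", "2018-03-08", "2018-03-17", "2018-04-11",
        "2018-04-16", "2018-05-05", "2018-06-04", "2018-06-16", "2018-07-13",
        "2019-01-28", "2019-02-10", "2019-03-16", "2019-03-16", "2019-06-07",
        "2019-06-30", "2022-06-30",
    ]),
    "type": ["39"] * 10 + ["73"] * 7,
    "price": [17000, 12500, 14000, 14000, 18000, 13000, 14250, 15000, 12900,
              15000, 35500, 34500, 35000, 37000, 33300, 34800, 32000],
})


def rolling_max_180d(prices: pd.Series) -> pd.Series:
    """Max price over the current row and the preceding 180 days (one group)."""
    # Re-index this group's prices by date so rolling() can use a time window.
    # Assumption: the dates are already sorted ascending within the group.
    dates = pd.DatetimeIndex(df.loc[prices.index, "date"])
    by_date = pd.Series(prices.to_numpy(), index=dates)
    windowed = by_date.rolling("180D", closed="both").max()
    # Restore the original row labels so transform() can align the result.
    return pd.Series(windowed.to_numpy(), index=prices.index)


df["max"] = df.groupby("num")["price"].transform(rolling_max_180d)
print(df)
```

On the question's data this should give the same max column as the apply version. If the frame is not already sorted by date within each num, sorting it first would be needed, since time-based rolling windows require a monotonic index.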

CodePudding user response：

You can try groupby with transform:

    df["max"] = df.groupby('num')["price"].transform('max')
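
As a quick check, this sketch runs the unwindowed transform on the question's num and price columns. The date column is left out because transform('max') does not use it, and the name overall_max is only for illustration.

```python
import pandas as pd

# num and price copied from the question; date is not needed here.
df = pd.DataFrame({
    "num": ["50110-374"] * 10 + ["50110-3421"] * 7,
    "price": [17000, 12500, 14000, 14000, 18000, 13000, 14250, 15000, 12900,
              15000, 35500, 34500, 35000, 37000, 33300, 34800, 32000],
})

# Broadcast each group's overall maximum back onto every row of that group.
overall_max = df.groupby("num")["price"].transform("max")
print(overall_max.tolist())
```

On this data every row of 50110-374 gets 18000 and every row of 50110-3421 gets 37000, so index 16 comes out as 37000 instead of the desired 32000. The 180-day condition would still have to be applied separately.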