How to get the smallest index in a DataFrame after using groupby

I want to keep every record whose create_date falls between from_date and to_date. For records whose create_date falls outside that period, I want to group by 'indicator' and keep only the record with the largest index in each group.

from_date = '2022-01-01'
to_date = '2022-04-10'

   indicator    create_date
0      A         2022-01-03
1      B         2021-12-30
2      B         2021-07-11
3      C         2021-02-10
4      C         2021-09-08
5      C         2021-07-24
6      C         2021-01-30

Here is the result I want:

   indicator   create_date
0      A         2022-01-03
2      B         2021-07-11
6      C         2021-01-30

I've been looking for a solution for a long time, but I have only found answers on "How to get the index of the smallest value", and I can't find a way to compare the index numbers themselves.

CodePudding user response:

You can create a helper column that holds the index values, get the maximal index per indicator with DataFrameGroupBy.idxmax, and then select those rows with DataFrame.loc:

df2 = df.loc[df.assign(tmp=df.index).groupby('indicator')['tmp'].idxmax()]
print (df2)
  indicator create_date
0         A  2022-01-03
2         B  2021-07-11
6         C  2021-01-30
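
The one-liner above can be unpacked into separate steps. The following is a minimal sketch that rebuilds the sample frame from the question; the names helper and largest_labels are only illustrative and do not come from the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "indicator": ["A", "B", "B", "C", "C", "C", "C"],
    "create_date": ["2022-01-03", "2021-12-30", "2021-07-11",
                    "2021-02-10", "2021-09-08", "2021-07-24", "2021-01-30"],
})

# Copy the index into a helper column so that idxmax compares index labels.
helper = df.assign(tmp=df.index)

# For each indicator, idxmax returns the label of the row whose tmp is largest.
largest_labels = helper.groupby("indicator")["tmp"].idxmax()

# Select those rows from the original frame (the helper column is not kept).
df2 = df.loc[largest_labels]
print(df2)
```

Because tmp is just a copy of the index, the label returned by idxmax equals the largest index value in each group.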

EDIT: If you need to select the maximal index per indicator only among the rows whose create_date does not fall between from_date and to_date, use boolean indexing and then join the two parts with concat:

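import pandas as pd

from_date = '2022-01-01'
to_date = '2022-04-10'

df['create_date'] = pd.to_datetime(df['create_date'])

# True for rows whose create_date lies between from_date and to_date (inclusive)
m = df['create_date'].between(from_date, to_date)

# among the rows outside the period, keep the maximal index per indicator
df2 = df.loc[df.assign(tmp=df.index)[~m].groupby('indicator')['tmp'].idxmax()]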
print (df2)
  indicator create_date
2         B  2021-07-11
6         C  2021-01-30

# combine the in-period rows with the selected out-of-period rows, in index order
df = pd.concat([df[m], df2]).sort_index()
print (df)
  indicator create_date
0         A  2022-01-03
2         B  2021-07-11
6         C  2021-01-30
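
For reuse, the steps of this EDIT could be wrapped in a function. This is only a sketch under the same assumptions as the answer (columns named indicator and create_date); the function name keep_in_range_or_last is hypothetical, and the final sort_index is an addition to restore the original row order.

```python
import pandas as pd


def keep_in_range_or_last(df, from_date, to_date):
    """Hypothetical helper, not part of pandas.

    Keeps every row whose create_date is inside [from_date, to_date], plus,
    among the remaining rows, the row with the largest index per indicator.
    """
    dates = pd.to_datetime(df["create_date"])
    in_range = dates.between(from_date, to_date)

    # Rows outside the period: find the largest index label per indicator.
    outside = df[~in_range]
    last_labels = (
        outside.assign(tmp=outside.index)
        .groupby("indicator")["tmp"]
        .idxmax()
    )

    # Join both parts and restore the original index order.
    return pd.concat([df[in_range], df.loc[last_labels]]).sort_index()


# Sample data copied from the question.
df = pd.DataFrame({
    "indicator": ["A", "B", "B", "C", "C", "C", "C"],
    "create_date": ["2022-01-03", "2021-12-30", "2021-07-11",
                    "2021-02-10", "2021-09-08", "2021-07-24", "2021-01-30"],
})

out = keep_in_range_or_last(df, "2022-01-01", "2022-04-10")
print(out)
```

Note that an indicator with rows both inside and outside the period keeps all of its in-period rows plus one out-of-period row, which matches the behaviour of the code in this answer.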

CodePudding user response:

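You can try taking the row with the largest index in each group with apply, then concatenating it with the rows inside the period and sorting by index: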

df['create_date'] = pd.to_datetime(df['create_date'])
m = df['create_date'].between(from_date, to_date)

df_ = df[~m].groupby('indicator', as_index=False).apply(lambda g: g.loc[[max(g.index)]]).droplevel(level=0)
out = pd.concat([df[m], df_], axis=0).sort_index()

print(out)

  indicator create_date
0         A  2022-01-03
2         B  2021-07-11
6         C  2021-01-30
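
As a variation on this answer (not the answerer's code), the largest index per group can also be computed without apply, by grouping the index itself by the indicator column. A minimal sketch, assuming the same sample data as in the question:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "indicator": ["A", "B", "B", "C", "C", "C", "C"],
    "create_date": ["2022-01-03", "2021-12-30", "2021-07-11",
                    "2021-02-10", "2021-09-08", "2021-07-24", "2021-01-30"],
})
df["create_date"] = pd.to_datetime(df["create_date"])

from_date = "2022-01-01"
to_date = "2022-04-10"
m = df["create_date"].between(from_date, to_date)

# Rows outside the period.
outside = df[~m]

# Turn the index into a Series and take its maximum per indicator.
last_labels = outside.index.to_series().groupby(outside["indicator"]).max()

# Join with the in-period rows and restore index order.
out = pd.concat([df[m], df.loc[last_labels.tolist()]]).sort_index()
print(out)
```

This avoids the extra index level that apply produces, so no droplevel call is needed.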