I want to keep every record whose create_date falls within the period from from_date to to_date. For the records whose create_date falls outside that period, I want to group by 'indicator' and keep only the record with the largest index in each group.
from_date = '2022-01-01'
to_date = '2022-04-10'
  indicator create_date
0         A  2022-01-03
1         B  2021-12-30
2         B  2021-07-11
3         C  2021-02-10
4         C  2021-09-08
5         C  2021-07-24
6         C  2021-01-30
Here is the result I want:
  indicator create_date
0         A  2022-01-03
2         B  2021-07-11
6         C  2021-01-30
I've been looking for a solution for a long time, but I only found answers along the lines of "How to get the index of the smallest value", and I can't find a way to compare the index values within each group.
CodePudding user response:
You can create a helper column that holds the index values, get the index of the maximal value per indicator with DataFrameGroupBy.idxmax, and finally select those rows with DataFrame.loc:
df2 = df.loc[df.assign(tmp=df.index).groupby('indicator')['tmp'].idxmax()]
print (df2)
  indicator create_date
0         A  2022-01-03
2         B  2021-07-11
6         C  2021-01-30
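For anyone who wants to run this end to end, here is a minimal sketch. The construction of df is an assumption on my part, since the question only shows the printed frame; the selection itself is the same idxmax approach as above.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed construction).
df = pd.DataFrame({
    "indicator": ["A", "B", "B", "C", "C", "C", "C"],
    "create_date": ["2022-01-03", "2021-12-30", "2021-07-11",
                    "2021-02-10", "2021-09-08", "2021-07-24", "2021-01-30"],
})

# The helper column 'tmp' copies the index labels so they can be compared.
# idxmax then returns, per indicator, the label of the row with the largest 'tmp'.
largest = df.assign(tmp=df.index).groupby("indicator")["tmp"].idxmax()

# Select those rows by label.
df2 = df.loc[largest]
print(df2)
```

Note that this version ignores the dates entirely: it keeps the largest-index row of every indicator.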
EDIT: If you need to select the maximal index only among the rows whose create_date does not fall between from_date and to_date, filter those rows with boolean indexing, then join the result with the matching rows by concat:
from_date = '2022-01-01'
to_date = '2022-04-10'
df['create_date'] = pd.to_datetime(df['create_date'])
m = df['create_date'].between(from_date, to_date)
df2 = df.loc[df.assign(tmp=df.index)[~m].groupby('indicator')['tmp'].idxmax()]
print (df2)
  indicator create_date
2         B  2021-07-11
6         C  2021-01-30
df = pd.concat([df[m], df2])
print (df)
  indicator create_date
0         A  2022-01-03
2         B  2021-07-11
6         C  2021-01-30
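The same steps can be wrapped in a small function so that the original df is not overwritten. This is only a sketch: keep_in_range_or_last is a hypothetical helper name, not a pandas API, and the sample frame is rebuilt from the question.

```python
import pandas as pd

def keep_in_range_or_last(df, from_date, to_date):
    """Keep rows dated inside the period; for the remaining rows keep the
    largest-index row per indicator. Hypothetical helper for illustration."""
    dates = pd.to_datetime(df["create_date"])
    # Series.between includes both endpoints by default.
    m = dates.between(pd.Timestamp(from_date), pd.Timestamp(to_date))
    outside = df[~m]
    # Index label of the largest index per indicator among out-of-period rows.
    last_idx = outside.assign(tmp=outside.index).groupby("indicator")["tmp"].idxmax()
    return pd.concat([df[m], df.loc[last_idx]]).sort_index()

# Sample data rebuilt from the question (assumed construction).
df = pd.DataFrame({
    "indicator": ["A", "B", "B", "C", "C", "C", "C"],
    "create_date": ["2022-01-03", "2021-12-30", "2021-07-11",
                    "2021-02-10", "2021-09-08", "2021-07-24", "2021-01-30"],
})

out = keep_in_range_or_last(df, "2022-01-01", "2022-04-10")
print(out)
```

One thing to be aware of: if an indicator has rows both inside and outside the period, this keeps all of its in-period rows plus its largest-index out-of-period row.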
CodePudding user response:
You can try:
df['create_date'] = pd.to_datetime(df['create_date'])
m = df['create_date'].between(from_date, to_date)
df_ = df[~m].groupby('indicator', as_index=False).apply(lambda g: g.loc[[max(g.index)]]).droplevel(level=0)
out = pd.concat([df[m], df_], axis=0).sort_index()
print(out)
  indicator create_date
0         A  2022-01-03
2         B  2021-07-11
6         C  2021-01-30
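If the index is already sorted in ascending order, as it is in the sample data, the largest index in each group is simply the last row of that group, so the apply call could probably be swapped for GroupBy.tail(1). This is a different technique from the one above and it depends on that ordering assumption; the sample frame is again rebuilt from the question.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed construction).
df = pd.DataFrame({
    "indicator": ["A", "B", "B", "C", "C", "C", "C"],
    "create_date": ["2022-01-03", "2021-12-30", "2021-07-11",
                    "2021-02-10", "2021-09-08", "2021-07-24", "2021-01-30"],
})
from_date = "2022-01-01"
to_date = "2022-04-10"

df["create_date"] = pd.to_datetime(df["create_date"])
m = df["create_date"].between(from_date, to_date)

# tail(1) takes the last row of each group by position, so it equals
# "largest index" only when the index is sorted ascending (assumption).
assert df.index.is_monotonic_increasing

df_ = df[~m].groupby("indicator").tail(1)
out = pd.concat([df[m], df_]).sort_index()
print(out)
```

tail keeps the original index labels, so no droplevel step is needed here.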