Pandas most efficient way to filter dataframe based on groupby mask

Time:06-27

I would like to filter a dataframe based on the result of a groupby on one of its columns. For example, given a dataframe with columns ticker, year, price, I'd like to drop the tickers whose first year is >= 1990, keeping only the tickers whose earliest year is before 1990.

More technically, I want to keep the rows whose ticker evaluates to True in df.groupby('ticker')['year'].min() < 1990

I am currently doing it this way:

ticker_min_date_bool = df.groupby('ticker')['year'].min() < 1990 # get booleans
tickers_filt = [i for i in ticker_min_date_bool.index if ticker_min_date_bool[i]] # make list of tickers with criteria
df_new = df[df.ticker.isin(tickers_filt)] # filter df based on above list

However, this feels a little clumsy at three lines, and it doesn't seem to scale well to larger datasets.

Are there any dataframe methods that accomplish this more efficiently?

Answer:

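Just use transform. It broadcasts each group's minimum back onto every row of that group, so the comparison gives a boolean mask aligned with df's index: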

ticker_min_date_bool = df.groupby('ticker')['year'].transform('min') < 1990
df_new = df[ticker_min_date_bool]
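To see what the mask looks like, here is a minimal sketch on a small made-up dataframe. The tickers, years and prices are invented for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the question.
df = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "BBB", "CCC"],
    "year":   [1985, 1995, 1992, 2000, 1989],
    "price":  [10.0, 12.0, 20.0, 25.0, 5.0],
})

# transform('min') returns one value per original row (not one per group),
# so the result shares df's index and can be used directly as a mask.
first_year = df.groupby("ticker")["year"].transform("min")
mask = first_year < 1990

# Keeps every row of AAA and CCC, including AAA's 1995 row.
df_new = df[mask]
print(df_new)
```

Note that AAA's 1995 row is kept: the condition is on the ticker's earliest year, not on each row's own year.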

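Or without groupby. A ticker's minimum year is below 1990 exactly when at least one of its rows has year < 1990, so it is enough to collect those tickers and filter with isin: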

s = df.loc[df['year'] < 1990, 'ticker'] # tickers that have at least one row before 1990
df_new = df[df['ticker'].isin(s)] # keep all rows belonging to those tickers
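As a quick sanity check, the sketch below runs both approaches on the same made-up dataframe and compares the results. The data is hypothetical, and it assumes there are no missing values in ticker or year.

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the question.
df = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "BBB", "CCC"],
    "year":   [1985, 1995, 1992, 2000, 1989],
    "price":  [10.0, 12.0, 20.0, 25.0, 5.0],
})

# Approach 1: groupby + transform gives a row-aligned mask.
via_transform = df[df.groupby("ticker")["year"].transform("min") < 1990]

# Approach 2: no groupby; find tickers with any row before 1990, then isin.
early = df.loc[df["year"] < 1990, "ticker"]
via_isin = df[df["ticker"].isin(early)]

# Both select the same rows in the same order on this data.
print(via_transform.equals(via_isin))
```

Which one is faster depends on the data, so it is worth timing both on a realistic dataset (for example with timeit) rather than assuming.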