Processing multiple modes in pandas

Time:04-01

I'm obviously dealing with slightly more complex and realistic data, but to showcase my trouble, let's assume we have these data:

import pandas as pd
import numpy as np

purchases_df = pd.DataFrame({"user_id": [100, 101, 100, 101, 200],
                      "date": ['2022-01-01', '2022-01-01','2022-01-01','2022-01-01', '2022-01-01'],
                      "purchase": ['cookies', 'jam', 'jam', 'jam', 'cashews']})

I want to find the modal values of purchases for each date and user:

agg_mode = purchases_df.groupby(['date', 'user_id'])['purchase'].agg(pd.Series.mode)
agg_mode

agg_mode shows that user_id 100 has two modal values: [cookies, jam]. That is totally fine with me; for the real data we've come up with a set of rules for which mode to pick when there's a tie. The problem is that, to apply that heuristic, I need to be able to check whether the returned set of modal values contains certain values (say, if cookies and jam are both returned, we'd always stick with jam). I can't find a simple way to process the returned multimodal values:

agg_mode_df = purchases_df.groupby(['date', 'user_id'])['purchase'].agg(pd.Series.mode).to_frame()
agg_mode_df.reset_index(inplace=True)
agg_mode_df

agg_mode_df is a DataFrame whose purchase column (which now holds the modal values) has object dtype: it contains a numpy ndarray wherever a user_id has more than one mode. I couldn't find a working way to convert the modal value(s) of every single user to a list.

Am I overthinking this?

Thanks in advance!

CodePudding user response:

IIUC, try:

agg_mode = purchases_df.groupby(['date', 'user_id'])['purchase'].agg(lambda x: x.mode().tolist())
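
Once every group yields a plain list, the tie-break can be applied inside the aggregation itself. Below is a minimal sketch, not a definitive implementation: the PRIORITY list and the pick_mode helper are hypothetical names made up for illustration, and the only rule taken from the question is "if cookies and jam tie, keep jam". Swap in your real rules.

```python
import pandas as pd

purchases_df = pd.DataFrame({
    "user_id": [100, 101, 100, 101, 200],
    "date": ["2022-01-01"] * 5,
    "purchase": ["cookies", "jam", "jam", "jam", "cashews"],
})

# Hypothetical tie-break order: the earliest entry found among the modes wins.
PRIORITY = ["jam", "cookies", "cashews"]

def pick_mode(values):
    """Return a single modal value, resolving ties with PRIORITY."""
    modes = values.mode().tolist()   # always a list, even for a single mode
    if len(modes) == 1:
        return modes[0]
    for candidate in PRIORITY:
        if candidate in modes:       # plain membership test on the list
            return candidate
    return modes[0]                  # no rule matched: fall back to the first mode

picked = (
    purchases_df
    .groupby(["date", "user_id"])["purchase"]
    .agg(pick_mode)
)
print(picked)
```

With the sample data, user 100 (tied between cookies and jam) ends up with jam, while users 101 and 200 keep their single mode. If you would rather keep the lists and decide later, the same `in` check works on each element of the column produced by the one-liner above.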