I have a dataset from which I need to extract user and user data who purchased in two months Jan and Feb. Could you please help me with the code?
The desired output should be like this below
CodePudding user response:
Filter:
df[(df["Purchase month"]=="Jan") | (df["Purchase month"]=="Feb")]
Sort:
df.sort_values(by=["user_id"])
CodePudding user response:
First, we can use a groupby
to get the customer buying in at least 2 different months :
>>> df_grouped = df.groupby(['user_id'], as_index=False)['purchase_month'].nunique()
>>> valid_users = df_grouped[df_grouped['purchase_month']>=2]['user_id'].tolist()
>>> valid_users
[1, 3]
Then we can filter the first DataFrame with those users to get the expected result :
>>> df[df["user_id"].isin(valid_users)].sort_values(by=["user_id"])
user_id purchase_month product
0 1 jan football
3 1 feb bed
2 3 feb toaster
4 3 jan printer
CodePudding user response:
Try:
groupby
and create a list of months for each User id- Only keep rows where months include both Jan and Feb i.e. the set
{"Jan", "Feb"}
is a subset of months
months = df.groupby("User id")["Purchase month"].agg(list)
output = df[df["User id"].isin(months[months.map({"Jan", "Feb"}.issubset)].index)]
>>> output
User id Purchase month Product
0 1 Jan Football
2 3 Feb Toaster
3 1 Feb Bed
4 3 Jan Printer