I'm given a Pandas dataframe with columns user_id, item_id and timestamp. Suppose first that each user_id - item_id interaction occurs at most once, i.e. there is only one timestamp for any particular interaction:
user_id item_id timestamp
0 1 2 123
1 1 3 145
2 4 6 123
3 5 7 198
Given a parameter threshold, I filter out those user_ids that appear <= threshold times:
data = data.groupby("user_id").filter(lambda x: len(x) > threshold)
and get (for threshold = 1):
user_id item_id timestamp
0 1 2 123
1 1 3 145
because only user "1" has interacted with more than threshold items.
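For reference, the step above can be reproduced end to end on the sample data (a minimal sketch using the same column names and filter as in the question):

```python
import pandas as pd

# sample data from the question
data = pd.DataFrame({
    "user_id":   [1, 1, 4, 5],
    "item_id":   [2, 3, 6, 7],
    "timestamp": [123, 145, 123, 198],
})

threshold = 1
# keep only users whose group has more than `threshold` rows
data = data.groupby("user_id").filter(lambda x: len(x) > threshold)
print(data)
# only user 1 has more than one row, so only their two rows remain
```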
Now, suppose that there may be multiple user_id - item_id interactions, i.e., there could be several timestamps with a particular interaction:
user_id item_id timestamp
0 1 2 123
1 1 2 145
2 4 6 123
3 4 7 198
What would be the most elegant (and fastest) way to filter out those users that have <= threshold unique interactions? The desired output would then be:
user_id item_id timestamp
0 4 6 123
1 4 7 198
(because user "1" has interacted with only 1 unique item, while user "4" remains, since he has interacted with 2 items).
One way I thought of (not that elegant, huh?):
data_cold = data.groupby('user_id').agg({'item_id':lambda x: x.nunique()})
data_cold = data_cold.reset_index()
data_cold = data_cold[data_cold.item_id > threshold]
data = data[data['user_id'].isin(data_cold.user_id)]
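For what it's worth, this multi-step version does produce the desired output on the second sample (a runnable sketch of the question's own approach, with the sample data inlined):

```python
import pandas as pd

# second sample: user 1 interacts with 1 unique item, user 4 with 2
data = pd.DataFrame({
    "user_id":   [1, 1, 4, 4],
    "item_id":   [2, 2, 6, 7],
    "timestamp": [123, 145, 123, 198],
})
threshold = 1

# count unique items per user
data_cold = data.groupby("user_id").agg({"item_id": lambda x: x.nunique()}).reset_index()
# keep users with more than `threshold` unique items
data_cold = data_cold[data_cold.item_id > threshold]
data = data[data["user_id"].isin(data_cold.user_id)]
print(data)  # only user 4's rows remain
```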
CodePudding user response:
If I understand correctly, you could group by both columns and then count group sizes with size(). After that, interactions that occur only once will have a count of 1 and can be filtered out if wanted:
data = data.groupby(["user_id", "item_id"]).size().reset_index(name = 'counts')
uniques = data[data['counts'] == 1]
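To carry this through to the user-level filter asked for in the question, the pair counts can be aggregated once more per user (this completion is my assumption, not part of the original answer; `pairs`, `keep`, and `result` are illustrative names):

```python
import pandas as pd

data = pd.DataFrame({
    "user_id":   [1, 1, 4, 4],
    "item_id":   [2, 2, 6, 7],
    "timestamp": [123, 145, 123, 198],
})
threshold = 1

# one row per (user_id, item_id) pair, with its occurrence count
pairs = data.groupby(["user_id", "item_id"]).size().reset_index(name="counts")
# the number of rows per user in `pairs` is the number of unique interactions
keep = pairs.groupby("user_id").size()
keep = keep[keep > threshold].index
result = data[data["user_id"].isin(keep)]
print(result)  # only user 4 has more than `threshold` unique interactions
```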
CodePudding user response:
I would also entertain the idea of using groupby on both columns user_id and item_id, but aggregating differently:
import pandas as pd
#sample data
from io import StringIO
data1 = """
user_id item_id timestamp
0 1 2 123
1 1 2 145
2 4 6 123
3 4 7 198
4 4 7 172
5 1 1 163
6 1 1 172
7 2 4 871
8 4 3 883 """
df = pd.read_csv(StringIO(data1), sep=r"\s{2,}", engine="python")
threshold = 3
#strategy: count occurrences per user_id-item_id pair, then unstack to one row per user
#dropna(thresh=threshold) keeps rows (users) with at least `threshold` non-NaN entries,
#i.e. users who interacted with at least `threshold` distinct items
filtered_ID = df.groupby(["user_id", "item_id"]).size().unstack().dropna(thresh=threshold).index
#then filter the original dataframe for the retrieved user_id's
print(df[df["user_id"].isin(filtered_ID)])
Sample output:
user_id item_id timestamp
2 4 6 123
3 4 7 198
4 4 7 172
8 4 3 883