I have the following dataframe
| product_id | weight_in_g |
|---|---|
| 1 | 50 |
| 2 | 120 |
| 3 | 130 |
| 4 | 200 |
| 5 | 42 |
| 6 | 90 |
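For reproducibility, the same frame can be rebuilt like this (values taken straight from the table above):

import pandas as pd

df = pd.DataFrame({
    "product_id": [1, 2, 3, 4, 5, 6],
    "weight_in_g": [50, 120, 130, 200, 42, 90],
})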
I am trying to match products based on weight, within a deviation of 50, using this loop:
list1 = []
for row in df[['product_id', 'weight_in_g']].itertuples():
    high = row.weight_in_g + 50
    low = row.weight_in_g - 50
    id = df['product_id'].loc[((df['weight_in_g'] >= low) & (df['weight_in_g'] <= high)) | (df['weight_in_g'] == 0)]
    list1.append(id)
df['weight_matches'] = list1
del list1
Which gives me the output:
| product_id | weight_in_g | weight_matches |
|---|---|---|
| 1 | 50 | 1, 5, 6 |
| 2 | 120 | 2, 3, 6 |
| 3 | 130 | 3, 2, 6 |
| 4 | 200 | 4, 6 |
| 5 | 42 | 5, 6, 1 |
| 6 | 90 | 6, 1 |
I'm using this as a filter together with text embeddings. For that reason I'm also including all rows whose weight is "0", which is about 35% of the dataset (I'd rather keep those values than leave 35% of my dataset unmatched).
This works with 10,000 and 20,000 rows, but if I go higher than that my notebook runs out of memory (13 GB RAM).
Is there any way I can make this more memory efficient?
CodePudding user response:
You are using memory very inefficiently. The `id` variable is of type `pd.Series`, which stores two things: the index and the values, plus some small overhead. pandas defaults to `int64` for integers, so a single matched product ID takes 128 bits / 16 bytes (64 for the value and 64 for its index label).
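To see that overhead concretely, here is a small illustrative snippet (the 100_000-row Series is just an assumption for the demo, not your data):

import numpy as np
import pandas as pd

# A hypothetical 100_000-row int64 column, purely for illustration
s = pd.Series(np.arange(100_000), dtype="int64")

# Filtering with .loc keeps an explicit int64 index next to the int64 values,
# so each surviving row costs roughly 8 + 8 = 16 bytes
sub = s.loc[s % 2 == 0]
print(sub.memory_usage(index=True) / len(sub))        # ~16 bytes per row

# The same values as a plain int32 numpy array cost 4 bytes per row
print(sub.to_numpy(dtype="int32").nbytes / len(sub))  # 4.0 bytes per row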
Unrelated to memory, you are also using loops, which are slow. You can make this run faster by using numpy broadcasting to take advantage of vectorization.
import numpy as np

# If your product ID can fit into int32, convert it to conserve memory
product_ids = df["product_id"].to_numpy(dtype="int32")

# Convert columns to numpy arrays
weight = df["weight_in_g"].to_numpy()
weight_high = weight + 50
weight_low = weight - 50

# Raise weight by one dimension to prepare for broadcasting
weight = weight[:, None]

# A naive implementation of numpy broadcasting will require n*n bytes for the
# `mask` array. When n = 50_000, the memory demand is *at least* 2.5GB, which my
# weak PC can't afford. Chunking is a trade-off: a bigger chunk size results in
# faster performance but demands more memory. Pick a size that best suits you.
chunk_size = 5000
weight_matches = []
for i in np.arange(0, len(weight), chunk_size):
    chunk = weight[i : i + chunk_size]
    mask = (chunk == 0) | ((weight_low <= chunk) & (chunk <= weight_high))
    weight_matches += [product_ids[np.nonzero(m)] for m in mask]

df["weight_matches"] = weight_matches
The above code took 8.7s to run on 50k rows, and I was able to tweak it to run on 100k rows. However, keep in mind that every solution has a limit; if your requirement exceeds that limit, you need to find a new approach.
CodePudding user response:
Here is one way to do it
# do a self join
# query where the difference in weights is within +/- 50
# groupby and accumulate the matching products as a list
# finally reset the index and rename the column
(df.merge(df, how='cross', suffixes=(None, '_y'))
   .query('abs(weight_in_g - weight_in_g_y) <= 50')
   .groupby(['product_id', 'weight_in_g'])['product_id_y'].agg(list)
   .reset_index()
   .rename(columns={'product_id_y': 'weight_matches'})
)
product_id weight_in_g weight_matches
0 1 50 [1, 5, 6]
1 2 120 [2, 3, 6]
2 3 130 [2, 3, 6]
3 4 200 [4]
4 5 42 [1, 5, 6]
5 6 90 [1, 2, 3, 5, 6]
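The output above does not apply the zero-weight rule from your question. If you also want every product with `weight_in_g == 0` to count as a match, the filter can be widened as sketched below (untested against your full data, and note that a cross join materializes n*n rows, so very large frames may still need to be processed in chunks):

(df.merge(df, how='cross', suffixes=(None, '_y'))
   .query('abs(weight_in_g - weight_in_g_y) <= 50 or weight_in_g_y == 0')
   .groupby(['product_id', 'weight_in_g'])['product_id_y'].agg(list)
   .reset_index()
   .rename(columns={'product_id_y': 'weight_matches'})
)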