I have the following dataframe
| product_id | weight_in_g |
|---|---|
| 1 | 50 |
| 2 | 120 |
| 3 | 130 |
| 4 | 200 |
| 5 | 42 |
| 6 | 90 |
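For reproducibility, the same frame can be rebuilt like this (values taken straight from the table above):

import pandas as pd

df = pd.DataFrame({
    "product_id": [1, 2, 3, 4, 5, 6],
    "weight_in_g": [50, 120, 130, 200, 42, 90],
})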
I am trying to match products based on weight, within a deviation of 50, using this loop:
list1 = []
for row in df[['product_id', 'weight_in_g']].itertuples():
    high = row.weight_in_g + 50
    low = row.weight_in_g - 50
    id = df['product_id'].loc[((df['weight_in_g'] >= low) & (df['weight_in_g'] <= high)) | (df['weight_in_g'] == 0)]
    list1.append(id)
df['weight_matches'] = list1
del list1
Which gives me the output:
| product_id | weight_in_g | weight_matches |
|---|---|---|
| 1 | 50 | 1, 5, 6 |
| 2 | 120 | 2, 3, 6 |
| 3 | 130 | 3, 2, 6 |
| 4 | 200 | 4, 6 |
| 5 | 42 | 5, 6, 1 |
| 6 | 90 | 6, 1 |
I'm using this as a filter together with text embeddings. For that reason I'm also including all rows whose weight is "0", which is about 35% of the dataset (I'd rather keep those values than leave 35% of my dataset unmatched).
This works with 10,000 and 20,000 rows, but if I go higher than that my notebook runs out of memory (13 GB RAM).
Is there any way I can make this more memory efficient?
CodePudding user response:
You are using memory very inefficiently. The `id` variable is of type `pd.Series`, which stores two things: the index and the values, plus some small overhead. pandas defaults to `int64` for integers, so a single matched product ID takes 128 bits / 16 bytes (64 for the value and 64 for its index label).
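To see that overhead concretely, here is a small illustrative snippet (the 100_000-row Series is just an assumption for the demo, not your data):

import numpy as np
import pandas as pd

# A hypothetical 100_000-row int64 column, purely for illustration
s = pd.Series(np.arange(100_000), dtype="int64")

# Filtering with .loc keeps an explicit int64 index next to the int64 values,
# so each surviving row costs roughly 8 + 8 = 16 bytes
sub = s.loc[s % 2 == 0]
print(sub.memory_usage(index=True) / len(sub))        # ~16 bytes per row

# The same values as a plain int32 numpy array cost 4 bytes per row
print(sub.to_numpy(dtype="int32").nbytes / len(sub))  # 4.0 bytes per row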
Unrelated to memory, you are also using loops, which are slow. You can make this run faster by using numpy broadcasting to take advantage of vectorization.
import numpy as np

# If your product ID can fit into int32, convert it to conserve memory
product_ids = df["product_id"].to_numpy(dtype="int32")

# Convert columns to numpy arrays
weight = df["weight_in_g"].to_numpy()
weight_high = weight + 50
weight_low = weight - 50

# Raise weight by one dimension to prepare for broadcasting
weight = weight[:, None]

# A naive implementation of numpy broadcasting will require n*n bytes for the
# `mask` array. When n = 50_000, the memory demand is *at least* 2.5GB, which my
# weak PC can't afford. Chunking is a trade-off: a bigger chunk size results in
# faster performance but demands more memory. Pick a size that best suits you.
chunk_size = 5000
weight_matches = []
for i in np.arange(0, len(weight), chunk_size):
    chunk = weight[i : i + chunk_size]
    mask = (chunk == 0) | ((weight_low <= chunk) & (chunk <= weight_high))
    weight_matches += [product_ids[np.nonzero(m)] for m in mask]

df["weight_matches"] = weight_matches
The above code took 8.7s to run on 50k rows, and I was able to tweak it to run on 100k rows. However, keep in mind that every solution has a limit; if your requirement exceeds that limit, you need to find a new approach.
CodePudding user response:
Here is one way to do it
# do a self join
# query where the difference in weights is within +/- 50
# groupby and accumulate the matching products as a list
# finally reset the index and rename the column
(df.merge(df, how='cross', suffixes=(None, '_y'))
   .query('abs(weight_in_g - weight_in_g_y) <= 50')
   .groupby(['product_id', 'weight_in_g'])['product_id_y'].agg(list)
   .reset_index()
   .rename(columns={'product_id_y': 'weight_matches'})
)
product_id weight_in_g weight_matches
0 1 50 [1, 5, 6]
1 2 120 [2, 3, 6]
2 3 130 [2, 3, 6]
3 4 200 [4]
4 5 42 [1, 5, 6]
5 6 90 [1, 2, 3, 5, 6]
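The output above does not apply the zero-weight rule from your question. If you also want every product with `weight_in_g == 0` to count as a match, the filter can be widened as sketched below (untested against your full data, and note that a cross join materializes n*n rows, so very large frames may still need to be processed in chunks):

(df.merge(df, how='cross', suffixes=(None, '_y'))
   .query('abs(weight_in_g - weight_in_g_y) <= 50 or weight_in_g_y == 0')
   .groupby(['product_id', 'weight_in_g'])['product_id_y'].agg(list)
   .reset_index()
   .rename(columns={'product_id_y': 'weight_matches'})
)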