Optimizing the rewriting of the range values in a column into separate rows-CodePudding

I have a dataframe clothes_acc with column shoe_size containing values like:

index   shoe_size
134     37-38
963     43-45
968     39-42
969     43-45
970     37-39

What I want to do is to write the whole range of values separately in each line. So I would get:

index   shoe_size
134     37
134     38
963     43
963     44
963     45
968     39
968     40
968     41
968     42
...

Currently, I have the following code which works fine except it is very slow for the dataframe with 500k rows. (clothes_acc actually contains other values in the column that are not important here, which is why I take a subset of the dataframe with the mentioned values and save it in the tmp variable).

for i, row in tqdm(tmp.iterrows(), total=tmp.shape[0]):
    clothes_acc = clothes_acc.drop([i])
    spl = [int(s) for s in row['shoe_size']]
    for j in range(spl[0],spl[1] 1):
        replicate = row.copy()
        replicate['shoe_size'] = str(j)
        clothes_acc = clothes_acc.append(replicate)
        
clothes_acc.reset_index(drop=True,inplace=True)

Could anyone please suggest an improvement?

CodePudding user response：

Convert the string range to a list of integer sizes and call explode():

df['shoe_size'] = df.apply(lambda x: 
    [i for i in range(int(x['shoe_size'].split('-')[0]), int(x['shoe_size'].split('-')[1])   1)], 
    axis=1)
df = df.explode(column='shoe_size')

For example, if df is:

df = pd.DataFrame({
    'shoe_size': ['37-38', '43-45', '39-42', '43-45', '37-39']
})

... this will give the following result:

  shoe_size
0        37
0        38
1        43
1        44
1        45
2        39
2        40
2        41
2        42
3        43
3        44
3        45
4        37
4        38
4        39

CodePudding user response：

One option (more memory intensive) is to extract the bounds of the ranges, merge on all possible values and then filter to where the merged value is between the range. This will work okay when the shoe_sizes overlap for many of the products so that the cross join isn't insanely huge.

import pandas as pd

# Bring ranges over to df
ranges = (clothes_acc['shoe_size'].str.split('-', expand=True)
            .apply(pd.to_numeric)
            .rename(columns={0: 'lower', 1: 'upper'}))
clothes_acc = pd.concat([clothes_acc, ranges], axis=1)
#   index shoe_size  lower  upper
#0    134     37-38     37     38
#1    963     43-45     43     45
#2    968     39-42     39     42
#3    969     43-45     43     45
#4    970     37-39     37     39  

vals = pd.DataFrame({'shoe_size': np.arange(clothes_acc.lower.min(), 
                                            clothes_acc.upper.max() 1)})

res = (clothes_acc.drop(columns='shoe_size')
            .merge(vals, how='cross')
            .query('lower <= shoe_size <= upper')
            .drop(columns=['lower', 'upper']))

print(res)
    index  shoe_size
0     134         37
1     134         38
15    963         43
16    963         44
17    963         45
20    968         39
21    968         40
22    968         41
23    968         42
33    969         43
34    969         44
35    969         45
36    970         37
37    970         38
38    970         39