I have a dataframe clothes_acc
with column shoe_size
containing values like:
index shoe_size
134 37-38
963 43-45
968 39-42
969 43-45
970 37-39
What I want to do is to write the whole range of values separately in each line. So I would get:
index shoe_size
134 37
134 38
963 43
963 44
963 45
968 39
968 40
968 41
968 42
...
Currently, I have the following code which works fine except it is very slow for the dataframe with 500k rows. (clothes_acc
actually contains other values in the column that are not important here, which is why I take a subset of the dataframe with the mentioned values and save it in the tmp
variable).
for i, row in tqdm(tmp.iterrows(), total=tmp.shape[0]):
clothes_acc = clothes_acc.drop([i])
spl = [int(s) for s in row['shoe_size']]
for j in range(spl[0],spl[1] 1):
replicate = row.copy()
replicate['shoe_size'] = str(j)
clothes_acc = clothes_acc.append(replicate)
clothes_acc.reset_index(drop=True,inplace=True)
Could anyone please suggest an improvement?
CodePudding user response:
Convert the string range to a list of integer sizes and call explode()
:
df['shoe_size'] = df.apply(lambda x:
[i for i in range(int(x['shoe_size'].split('-')[0]), int(x['shoe_size'].split('-')[1]) 1)],
axis=1)
df = df.explode(column='shoe_size')
For example, if df is:
df = pd.DataFrame({
'shoe_size': ['37-38', '43-45', '39-42', '43-45', '37-39']
})
... this will give the following result:
shoe_size
0 37
0 38
1 43
1 44
1 45
2 39
2 40
2 41
2 42
3 43
3 44
3 45
4 37
4 38
4 39
CodePudding user response:
One option (more memory intensive) is to extract the bounds of the ranges, merge on all possible values and then filter to where the merged value is between the range. This will work okay when the shoe_sizes overlap for many of the products so that the cross join isn't insanely huge.
import pandas as pd
# Bring ranges over to df
ranges = (clothes_acc['shoe_size'].str.split('-', expand=True)
.apply(pd.to_numeric)
.rename(columns={0: 'lower', 1: 'upper'}))
clothes_acc = pd.concat([clothes_acc, ranges], axis=1)
# index shoe_size lower upper
#0 134 37-38 37 38
#1 963 43-45 43 45
#2 968 39-42 39 42
#3 969 43-45 43 45
#4 970 37-39 37 39
vals = pd.DataFrame({'shoe_size': np.arange(clothes_acc.lower.min(),
clothes_acc.upper.max() 1)})
res = (clothes_acc.drop(columns='shoe_size')
.merge(vals, how='cross')
.query('lower <= shoe_size <= upper')
.drop(columns=['lower', 'upper']))
print(res)
index shoe_size
0 134 37
1 134 38
15 963 43
16 963 44
17 963 45
20 968 39
21 968 40
22 968 41
23 968 42
33 969 43
34 969 44
35 969 45
36 970 37
37 970 38
38 970 39