I have a task where I need to change multiple times data in my data frame. I wrote the answer in Jupyter notebook, using loops and it's take around 2,5min to run.
However, when I rewrite my code to pycharm using modules and definitions it takes around 20min and I do not know where I made a mistake.
Here is explanation of my task and my idea which was written in Jupyter, maybe you will have some ideas how I could write it better.
I have a data frame with weekly qty of sold toys in factory when 0w last week.
ID 0w 1w 2w 3w 4w 5w 6w 7w 8w 9w 10w 11w 12w 13w
0 0 1 0 0 5 1 65 2 62 1 1 2 1 60
1 0 0 1 5 16 0 2 0 0 40 0 100 0 0
2 0 3 0 0 0 0 0 40 0 0 20 0 0 0
3 0 5 6 0 0 0 0 0 0 0 0 0 0 0
4 0 1 0 0 0 0 0 0 0 0 0 0 0 0
First step is to save every row from my df to a list of lists 'week_qty':
week_qty = []
lenOfRows = len(copiedData)
for i in range(0, lenOfRows):
week_qty.append(weeksQtyEXTdata.iloc[i])
week_qty[0] = [0 1 0 0 5 1 65 2 62 1 1 2 1 60]
Second step is to take 90% and 10% value of each row and compare with it each value of the list, so for the first row 90% = 61.4 and 10% = 0. If the value in cell is lower than p10 I change it to the value of p10 and if it' higher than p90 I change it with the value of p90.
def CalcPercenatage(week_qty,oneWeek):
p10=np.percentile(weekDemand,10)
p90=np.percentile(weekDemand,90)
if (oneWeek < p10):
return p10
elif(oneWeek > p90):
return p90
else:
return oneWeek
CalcPercenatage(week_qty[0]) = [60, 1, 2, 1, 1, 61.4, 2, 61.4, 1, 5, 0, 0, 1, 0]
Last step is to create a matrix of those values and do it for every row for each of 14 cells in a row:
for i in range(0, lenOfRows):
Rows = []
for j in range(0, 14):
Rows.append(CalcPercenatage(week_qty[i], week_qty[i][j]))
MatrixBetweenWeeks.append(Rows)
I would like to make it faster, for 31000 data in pycharm it is working too long.
CodePudding user response:
You can use clip
:
p10, p90 = np.percentile(df.iloc[:, 1:], [10, 90], axis=1)
out = df.iloc[:, 1:].clip(p10, p90, axis=0)
out['Average'] = out.mean(axis=1)
out = pd.concat([df.iloc[:, :1], out], axis=1)
Output:
>>> out
ID 0w 1w 2w 3w 4w 5w 6w 7w 8w 9w 10w 11w 12w 13w Average
0 0 0 1.0 0.0 0 5 1 61.4 2.0 61.4 1.0 1.0 2.0 1 60 14.057143
1 1 0 0.0 1.0 5 16 0 2.0 0.0 0.0 32.8 0.0 32.8 0 0 6.400000
2 2 0 3.0 0.0 0 0 0 0.0 14.9 0.0 0.0 14.9 0.0 0 0 2.342857
3 3 0 3.5 3.5 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.500000
4 4 0 0.0 0.0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000000
Performance
For 31K records:
%timeit myfunc(df)
15.3 ms ± 80 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)