I'm struggling with,what appears to me, a simple problem. I have a pandas dataframe like this:
results = pd.DataFrame([['executing (i) run', '2 (i)', 3],
['sampling (i) run', '3 (i)', 3]],
columns=['operation', 'executions', 'result'])
So, the input is:
In [1]: results
Out[1]:
operation executions result
0 executing (i) run 2 (i) 3
1 sampling (i) run 3 (i) 3
What I want to do is populate the results dataframe given a parameter and update the value of the cells. Lets say i = 4, the desired output is:
In [2]: results_populated
Out[2]:
operation executions result
0 executing (0) run 2 (0) 3
1 executing (1) run 2 (1) 3
2 executing (2) run 2 (2) 3
3 executing (3) run 2 (3) 3
4 sampling (0) run 3 (0) 3
5 sampling (1) run 3 (1) 3
6 sampling (2) run 3 (2) 3
7 sampling (3) run 3 (3) 3
I know I can iterate each row in a for loop or use iter_rows (or similar) but this does not scale well when the table has hundreds of different operations and the "i" can be thousands. I found this to be very optimal:
res_expanded = results.loc[results.index[results['operation'].str
.contains(r"(i)", regex = True)]
.repeat(i)].reset_index(drop=True)
and returns:
operation executions result
0 executing (i) run 2 (i) 3
1 executing (i) run 2 (i) 3
2 executing (i) run 2 (i) 3
3 executing (i) run 2 (i) 3
4 sampling (i) run 3 (i) 3
5 sampling (i) run 3 (i) 3
6 sampling (i) run 3 (i) 3
7 sampling (i) run 3 (i) 3
But I cannot find an optimal (vectorized?) way to perform the update of each cell. Any help will be appreciated.
Many thanks in advance.
CodePudding user response:
step1:
df = pd.DataFrame([['executing (i) run', '2 (i)', 3],
['sampling (i) run', '3 (i)', 3]],
columns=['operation', 'executions', 'result'])
df = df.apply(lambda x: x.repeat(4))
df
operation executions result
0 executing (i) run 2 (i) 3
0 executing (i) run 2 (i) 3
0 executing (i) run 2 (i) 3
0 executing (i) run 2 (i) 3
1 sampling (i) run 3 (i) 3
1 sampling (i) run 3 (i) 3
1 sampling (i) run 3 (i) 3
1 sampling (i) run 3 (i) 3
step2:
df = df.assign(tag=[*range(4)] * df.groupby('operation').ngroups)
df
operation executions result tag
0 executing (i) run 2 (i) 3 0
0 executing (i) run 2 (i) 3 1
0 executing (i) run 2 (i) 3 2
0 executing (i) run 2 (i) 3 3
1 sampling (i) run 3 (i) 3 0
1 sampling (i) run 3 (i) 3 1
1 sampling (i) run 3 (i) 3 2
1 sampling (i) run 3 (i) 3 3
step3:
df.apply(lambda ser: ser.map(lambda x: x.replace('(i)', f'({ser.tag})') if isinstance(x, str) else x), axis=1)
operation executions result tag
0 executing (0) run 2 (0) 3 0
0 executing (1) run 2 (1) 3 1
0 executing (2) run 2 (2) 3 2
0 executing (3) run 2 (3) 3 3
1 sampling (0) run 3 (0) 3 0
1 sampling (1) run 3 (1) 3 1
1 sampling (2) run 3 (2) 3 2
1 sampling (3) run 3 (3) 3 3
Done!