Home > Mobile >  How to populate pandas dataframe given a parameter
How to populate pandas dataframe given a parameter

Time:06-01

I'm struggling with,what appears to me, a simple problem. I have a pandas dataframe like this:

results = pd.DataFrame([['executing (i) run', '2 (i)', 3],
                        ['sampling (i) run', '3 (i)', 3]],
                columns=['operation', 'executions', 'result'])

So, the input is:

In [1]: results
Out[1]: 
           operation executions  result
0  executing (i) run      2 (i)       3
1   sampling (i) run      3 (i)       3

What I want to do is populate the results dataframe given a parameter and update the value of the cells. Lets say i = 4, the desired output is:

In [2]: results_populated
Out[2]: 
           operation executions  result
0  executing (0) run      2 (0)       3
1  executing (1) run      2 (1)       3
2  executing (2) run      2 (2)       3
3  executing (3) run      2 (3)       3
4   sampling (0) run      3 (0)       3
5   sampling (1) run      3 (1)       3
6   sampling (2) run      3 (2)       3
7   sampling (3) run      3 (3)       3

I know I can iterate each row in a for loop or use iter_rows (or similar) but this does not scale well when the table has hundreds of different operations and the "i" can be thousands. I found this to be very optimal:

res_expanded = results.loc[results.index[results['operation'].str
                                  .contains(r"(i)", regex = True)]
                                  .repeat(i)].reset_index(drop=True)

and returns:

           operation executions  result
0  executing (i) run      2 (i)       3
1  executing (i) run      2 (i)       3
2  executing (i) run      2 (i)       3
3  executing (i) run      2 (i)       3
4   sampling (i) run      3 (i)       3
5   sampling (i) run      3 (i)       3
6   sampling (i) run      3 (i)       3
7   sampling (i) run      3 (i)       3

But I cannot find an optimal (vectorized?) way to perform the update of each cell. Any help will be appreciated.

Many thanks in advance.

CodePudding user response:

step1:

df = pd.DataFrame([['executing (i) run', '2 (i)', 3],
                        ['sampling (i) run', '3 (i)', 3]],
                columns=['operation', 'executions', 'result'])

df = df.apply(lambda x: x.repeat(4))
df
           operation executions  result
0  executing (i) run      2 (i)       3
0  executing (i) run      2 (i)       3
0  executing (i) run      2 (i)       3
0  executing (i) run      2 (i)       3
1   sampling (i) run      3 (i)       3
1   sampling (i) run      3 (i)       3
1   sampling (i) run      3 (i)       3
1   sampling (i) run      3 (i)       3

step2:

df = df.assign(tag=[*range(4)] * df.groupby('operation').ngroups)
df
           operation executions  result  tag
0  executing (i) run      2 (i)       3    0
0  executing (i) run      2 (i)       3    1
0  executing (i) run      2 (i)       3    2
0  executing (i) run      2 (i)       3    3
1   sampling (i) run      3 (i)       3    0
1   sampling (i) run      3 (i)       3    1
1   sampling (i) run      3 (i)       3    2
1   sampling (i) run      3 (i)       3    3

step3:

df.apply(lambda ser: ser.map(lambda x: x.replace('(i)', f'({ser.tag})') if isinstance(x, str) else x), axis=1)
           operation executions  result  tag
0  executing (0) run      2 (0)       3    0
0  executing (1) run      2 (1)       3    1
0  executing (2) run      2 (2)       3    2
0  executing (3) run      2 (3)       3    3
1   sampling (0) run      3 (0)       3    0
1   sampling (1) run      3 (1)       3    1
1   sampling (2) run      3 (2)       3    2
1   sampling (3) run      3 (3)       3    3

Done!

  • Related