I have a multiindex dataframe like the one below:
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame(
{
'trial': [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
't': [0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,0,0,0,0,1,1,1,1],
'context':[0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3],
'x' :np.random.rand(40),
'y' : np.random.rand(40),
'z' : np.random.rand(40),
'inferred_context': [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1],
}
)
df = df.set_index(["trial", "t","context"])
df= df.unstack('context')
df.columns.set_names(['vals','context'],inplace=True)
df.swaplevel(0,1)
So it has a two-level row (trial,t) and column (vals, context) index.
I want to create a column called "x_inferred_context" which contains the values of "x" for the context specified in "inferred_context" for each trial and time point t.
For example if at row (trial = 0, t = 3), "inferred_context"=0 and "x" at (val = x, context=0) is 0.204452, the "x_inferred_context" column should contain 0.204452 at row (trial=0, t=3). Corresponidngly if at row (trial=1,t=0) the "inferred_context" is 2 and the value of "x" for context 2 is 0.140387, the column "x_inferred_context" at (1,0) should contain 0.140387.
I have achieved this by creating a mask for each possible context and then multiplying the masks by df['x'] and summing.
nc = np.unique((df.columns.get_level_values(level=1))).size
mask = pd.DataFrame( data = [(df['inferred_context'] == c).iloc[:,c] for c in range(nc)]).T
df['x_inferred_context'] = (mask*df['x']).sum(axis=1)
I am new to pandas and want to ask what is the correct way to do this? Is there a less verbose, easier to read and more pandas-like way to do this?
Cheers!
CodePudding user response:
Use a double indexing lookup:
idx, col = pd.factorize(df.index.get_level_values('t'))
idx2 = df['inferred_context'].reindex(col, axis=1).to_numpy()[np.arange(len(df)), idx]
df[('x_inferred_context', None)] = df['x'].reindex(col, axis=1).to_numpy()[np.arange(len(df)), idx2]
Output:
vals x y \
context 0.0 1.0 2.0 3.0 0.0 1.0 2.0
trial t
0 0 0.417022 0.720324 0.000114 0.302333 0.988861 0.748166 0.280444
1 0.146756 0.092339 0.186260 0.345561 0.103226 0.447894 0.908596
2 0.396767 0.538817 0.419195 0.685220 0.287775 0.130029 0.019367
3 0.204452 0.878117 0.027388 0.670468 0.211628 0.265547 0.491573
1 0 0.417305 0.558690 0.140387 0.198101 0.574118 0.146729 0.589306
1 0.800745 0.968262 0.313424 0.692323 0.102334 0.414056 0.694400
2 0.876389 0.894607 0.085044 0.039055 0.049953 0.535896 0.663795
3 0.169830 0.878143 0.098347 0.421108 0.944595 0.586555 0.903402
2 0 0.957890 0.533165 0.691877 0.315516 0.139276 0.807391 0.397677
1 0.686501 0.834626 0.018288 0.750144 0.927509 0.347766 0.750812
vals z inferred_context \
context 3.0 0.0 1.0 2.0 3.0 0.0
trial t
0 0 0.789279 0.883306 0.623672 0.750942 0.348898 0
1 0.293614 0.269928 0.895886 0.428091 0.964840 0
2 0.678836 0.663441 0.621696 0.114746 0.949489 0
3 0.053363 0.449912 0.578390 0.408137 0.237027 0
1 0 0.699758 0.903380 0.573679 0.002870 0.617145 2
1 0.414179 0.326645 0.527058 0.885942 0.357270 2
2 0.514889 0.908535 0.623360 0.015821 0.929437 2
3 0.137475 0.690897 0.997323 0.172341 0.137136 2
2 0 0.165354 0.932595 0.696818 0.066000 0.755463 1
1 0.725998 0.753876 0.923025 0.711525 0.124271 1
vals x_inferred_context
context 1.0 2.0 3.0 NaN
trial t
0 0 0 0 0 0.417022
1 0 0 0 0.146756
2 0 0 0 0.396767
3 0 0 0 0.204452
1 0 2 2 2 0.140387
1 2 2 2 0.313424
2 2 2 2 0.085044
3 2 2 2 0.098347
2 0 1 1 1 0.533165
1 1 1 1 0.834626