Home > Software engineering >  Pandas - Add a new calculated column to a MultiIndex column dataframe
Pandas - Add a new calculated column to a MultiIndex column dataframe

Time:01-17

I have a Dataframe with the following structure:

np.random.seed(1)
mi = pd.MultiIndex.from_product([[3, 5], ["X","Y","V","T"]], names=["Node", "Parameter"])
df = pd.DataFrame(index=pd.DatetimeIndex(['2022-07-07 12:00:00', '2022-07-07 13:00:00',
               '2022-07-07 14:00:00', '2022-07-07 15:00:00',
               '2022-07-07 16:00:00'],
             dtype='datetime64[ns]', name='Date', freq=None), columns=mi, data=np.random.rand(5,8))

print(df)
Node                        3                                       5
Parameter                   X         Y         V         T         X         Y         V         T
Date
2022-07-07 12:00:00  0.417022  0.720324  0.000114  0.302333  0.146756  0.092339  0.186260  0.345561
2022-07-07 13:00:00  0.396767  0.538817  0.419195  0.685220  0.204452  0.878117  0.027388  0.670468
2022-07-07 14:00:00  0.417305  0.558690  0.140387  0.198101  0.800745  0.968262  0.313424  0.692323
2022-07-07 15:00:00  0.876389  0.894607  0.085044  0.039055  0.169830  0.878143  0.098347  0.421108
2022-07-07 16:00:00  0.957890  0.533165  0.691877  0.315516  0.686501  0.834626  0.018288  0.750144

I would like to add a new calculated column "Z" for each Node, based on the value "X" ** 2 "Y" ** 2.

The following achieves the desired result:

x = df.loc[:,(slice(None),"X")]
y = df.loc[:,(slice(None),"Y")]
z = (x**2).rename(columns={"X":"Z"})   (y ** 2).rename(columns={"Y":"Z"})
result = df.join(z).sort_index(axis=1)

Is there a more straightforward way to achieve this? For example, using df.xs to select the desired column data e.g. df.xs("X", axis=1, level=1) **2 df.xs("X", axis=1, level=1) ** 2, how can I then assign the result to the original dataframe?

CodePudding user response:

You can use groupby.apply:

(df.groupby(level='Node', axis=1)
   .apply(lambda g: g.droplevel('Node', axis=1).eval('Z = X**2   Y**2'))
 )

Or, with xs and drop_level=False on one of the values:

(df.join((df.xs('X', axis=1, level=1, drop_level=False)**2
          df.xs('Y', axis=1, level=1)**2
         ).rename(columns={'X': 'Z'}, level=1)
        )
   .sort_index(axis=1, level=0, sort_remaining=False)
 ) 

Output:

Node                        3                                                 5                                        
Parameter                   X         Y         V         T         Z         X         Y         V         T         Z
Date                                                                                                                   
2022-07-07 12:00:00  0.417022  0.720324  0.000114  0.302333  0.692775  0.146756  0.092339  0.186260  0.345561  0.030064
2022-07-07 13:00:00  0.396767  0.538817  0.419195  0.685220  0.447748  0.204452  0.878117  0.027388  0.670468  0.812891
2022-07-07 14:00:00  0.417305  0.558690  0.140387  0.198101  0.486278  0.800745  0.968262  0.313424  0.692323  1.578722
2022-07-07 15:00:00  0.876389  0.894607  0.085044  0.039055  1.568379  0.169830  0.878143  0.098347  0.421108  0.799977
2022-07-07 16:00:00  0.957890  0.533165  0.691877  0.315516  1.201818  0.686501  0.834626  0.018288  0.750144  1.167884

CodePudding user response:

One option is with pd.xs:

out = df.xs('X',axis=1,level=1).pow(2).add(df.xs('Y',axis=1,level=1).pow(2))
out.columns = [out.columns, np.repeat(['Z'], 2)]
pd.concat([df, out], axis = 1).sort_index(axis=1)
Node                        3                                                 5                                        
Parameter                   T         V         X         Y         Z         T         V         X         Y         Z
Date                                                                                                                   
2022-07-07 12:00:00  0.302333  0.000114  0.417022  0.720324  0.692775  0.345561  0.186260  0.146756  0.092339  0.030064
2022-07-07 13:00:00  0.685220  0.419195  0.396767  0.538817  0.447748  0.670468  0.027388  0.204452  0.878117  0.812891
2022-07-07 14:00:00  0.198101  0.140387  0.417305  0.558690  0.486278  0.692323  0.313424  0.800745  0.968262  1.578722
2022-07-07 15:00:00  0.039055  0.085044  0.876389  0.894607  1.568379  0.421108  0.098347  0.169830  0.878143  0.799977
2022-07-07 16:00:00  0.315516  0.691877  0.957890  0.533165  1.201818  0.750144  0.018288  0.686501  0.834626  1.167884

Another option, is to select all the columns, run pow across all the columns in one go, before grouping and concatenating:

out = (df
      .loc(axis=1)[:, ['X','Y']]
      .pow(2)
      .groupby(level='Node', axis=1)
      .agg(np.add.reduce,axis=1))
out.columns = [out.columns, np.repeat(['Z'], 2)]
pd.concat([df, out], axis = 1).sort_index(axis=1)
  • Related