I have two pandas DataFrames: one containing floats (df1) and one containing the corresponding dates (df2). I am trying to 'pivot' df1 using the dates in df2, summing along the rows whenever a date occurs more than once. I have tried approaches using pandas and NumPy, both together and separately. The results are correct, but the code is slow (about 1.2 ms per loop iteration over the DataFrame). Does anyone know how the same task can be achieved with more efficient code, so that a 1,000,000-row dataset could be passed through it?
I have got close with this code:
import pandas as pd

l = []
for i in range(len(df1)):
    # Build a one-row frame whose columns are that row's dates,
    # then sum any duplicate date columns together
    df_g = pd.DataFrame([df1.loc[i].values], columns=df2.loc[i].values)
    k = df_g.groupby(df_g.columns, axis=1).sum()
    l.append(k)
df_concatenated = pd.concat(l)
Example Data:
df1
floats
0 1 2 3
0 170.590870 55.003686 18.500252 34.174595
1 170.590870 66.790190 12.377866 34.174595
2 140.922893 49.581099 18.500252 34.174595
3 170.590870 49.581099 18.500252 34.174595
df2
0 1 2 3
0 30/06/2023 31/03/2023 25/10/2022 31/12/2023
1 30/06/2024 31/12/2022 25/10/2022 25/10/2022
2 31/12/2022 31/03/2023 25/10/2022 31/12/2023
3 31/12/2022 31/03/2023 25/10/2022 31/12/2023
Expected Output
25/10/2022 31/12/2022 31/03/2023 30/06/2023 31/12/2023 30/06/2024
0 18.500252 NaN 55.003686 170.590870 34.174595 NaN
1 46.552461 66.79019 NaN NaN NaN 170.59087
2 18.500252 140.922893 49.581099 NaN 34.174595 NaN
3 18.500252 170.59087 49.581099 NaN 34.174595 NaN
CodePudding user response:
Use concat:
pd.concat([df1, df2], axis=1, keys=['value', 'date'])
output:
value date
0 1 2 3 0 1 2 3
0 170.590870 55.003686 18.500252 34.174595 30/06/2023 31/03/2023 25/10/2022 31/12/2023
1 170.590870 66.790190 12.377866 34.174595 30/06/2024 31/12/2022 25/10/2022 25/10/2022
2 140.922893 49.581099 18.500252 34.174595 31/12/2022 31/03/2023 25/10/2022 31/12/2023
3 170.590870 49.581099 18.500252 34.174595 31/12/2022 31/03/2023 25/10/2022 31/12/2023
Then unpivot the output above with stack and pivot again:
df = pd.concat([df1, df2], axis=1, keys=['value', 'date']).stack().reset_index()
out = (df.pivot_table('value', index='level_0', columns='date', aggfunc='sum')
.rename_axis('').sort_index(axis=1, key=lambda x: pd.to_datetime(x)))
out
date 25/10/2022 31/12/2022 31/03/2023 30/06/2023 31/12/2023 30/06/2024
0 18.500252 NaN 55.003686 170.59087 34.174595 NaN
1 46.552461 66.790190 NaN NaN NaN 170.59087
2 18.500252 140.922893 49.581099 NaN 34.174595 NaN
3 18.500252 170.590870 49.581099 NaN 34.174595 NaN
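For large inputs, a loop-free variant of the same idea may also help (a sketch using the question's example data; `long_df` is a name I introduce here, not from the original code). It ravels both frames into one long table and pivots once, avoiding the per-row Python loop that made the original code slow:

```python
import numpy as np
import pandas as pd

# Example data from the question
df1 = pd.DataFrame([[170.590870, 55.003686, 18.500252, 34.174595],
                    [170.590870, 66.790190, 12.377866, 34.174595],
                    [140.922893, 49.581099, 18.500252, 34.174595],
                    [170.590870, 49.581099, 18.500252, 34.174595]])
df2 = pd.DataFrame([['30/06/2023', '31/03/2023', '25/10/2022', '31/12/2023'],
                    ['30/06/2024', '31/12/2022', '25/10/2022', '25/10/2022'],
                    ['31/12/2022', '31/03/2023', '25/10/2022', '31/12/2023'],
                    ['31/12/2022', '31/03/2023', '25/10/2022', '31/12/2023']])

# Flatten both frames row-major into one long table: each record is
# (row label, date, value). No Python-level loop over rows.
long_df = pd.DataFrame({
    'row': np.repeat(df1.index.to_numpy(), df1.shape[1]),
    'date': df2.to_numpy().ravel(),
    'value': df1.to_numpy().ravel(),
})

# One pivot with aggfunc='sum' handles duplicate dates within a row
out = long_df.pivot_table('value', index='row', columns='date', aggfunc='sum')
```

Note the columns here sort lexicographically as strings; the sort_index trick from the answer above (or converting the dates to datetime first) is still needed for chronological order.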
If the dtype of df2 is datetime, you don't need the sort_index step.
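For example (a sketch assuming the question's frames; `dayfirst=True` is my assumption because the strings look like dd/mm/yyyy), converting df2 up front makes the pivoted columns come out in chronological order on their own:

```python
import pandas as pd

df1 = pd.DataFrame([[170.590870, 55.003686, 18.500252, 34.174595],
                    [170.590870, 66.790190, 12.377866, 34.174595],
                    [140.922893, 49.581099, 18.500252, 34.174595],
                    [170.590870, 49.581099, 18.500252, 34.174595]])
df2 = pd.DataFrame([['30/06/2023', '31/03/2023', '25/10/2022', '31/12/2023'],
                    ['30/06/2024', '31/12/2022', '25/10/2022', '25/10/2022'],
                    ['31/12/2022', '31/03/2023', '25/10/2022', '31/12/2023'],
                    ['31/12/2022', '31/03/2023', '25/10/2022', '31/12/2023']])

# Parse every column of df2 as datetime (dd/mm/yyyy strings, hence dayfirst)
df2 = df2.apply(pd.to_datetime, dayfirst=True)

df = pd.concat([df1, df2], axis=1, keys=['value', 'date']).stack().reset_index()

# pivot_table sorts the datetime column labels chronologically by default,
# so no sort_index(key=...) step is needed
out = df.pivot_table('value', index='level_0', columns='date', aggfunc='sum')
```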