I have two pandas DataFrames: one containing floats (df1) and one containing the corresponding dates (df2). I am trying to 'pivot' df1 using the dates in df2, summing along the rows whenever a date occurs more than once. I have tried approaches using pandas and NumPy, both together and separately. The results are correct, but the code is slow (about 1.2 ms per loop iteration over the DataFrame). Does anyone know how the same task can be achieved with more efficient code, so that a 1,000,000-row dataset could be passed through it?
I have got close with this code:
import pandas as pd

l = []
for i in range(len(df1)):
    # Build a one-row frame whose columns are that row's dates,
    # then sum any duplicate date columns together
    df_g = pd.DataFrame([df1.loc[i].values], columns=df2.loc[i].values)
    k = df_g.groupby(df_g.columns, axis=1).sum()
    l.append(k)
df_concatenated = pd.concat(l)
Example Data:
df1
floats
0 1 2 3
0 170.590870 55.003686 18.500252 34.174595
1 170.590870 66.790190 12.377866 34.174595
2 140.922893 49.581099 18.500252 34.174595
3 170.590870 49.581099 18.500252 34.174595
df2
0 1 2 3
0 30/06/2023 31/03/2023 25/10/2022 31/12/2023
1 30/06/2024 31/12/2022 25/10/2022 25/10/2022
2 31/12/2022 31/03/2023 25/10/2022 31/12/2023
3 31/12/2022 31/03/2023 25/10/2022 31/12/2023
Expected Output
25/10/2022 31/12/2022 31/03/2023 30/06/2023 31/12/2023 30/06/2024
0 18.500252 NaN 55.003686 170.590870 34.174595 NaN
1 46.552461 66.79019 NaN NaN NaN 170.59087
2 18.500252 140.922893 49.581099 NaN 34.174595 NaN
3 18.500252 170.59087 49.581099 NaN 34.174595 NaN
CodePudding user response:
Use concat:
pd.concat([df1, df2], axis=1, keys=['value', 'date'])
output:
value date
0 1 2 3 0 1 2 3
0 170.590870 55.003686 18.500252 34.174595 30/06/2023 31/03/2023 25/10/2022 31/12/2023
1 170.590870 66.790190 12.377866 34.174595 30/06/2024 31/12/2022 25/10/2022 25/10/2022
2 140.922893 49.581099 18.500252 34.174595 31/12/2022 31/03/2023 25/10/2022 31/12/2023
3 170.590870 49.581099 18.500252 34.174595 31/12/2022 31/03/2023 25/10/2022 31/12/2023
Then unpivot the output above with stack and pivot again:
df = pd.concat([df1, df2], axis=1, keys=['value', 'date']).stack().reset_index()
out = (df.pivot_table('value', index='level_0', columns='date', aggfunc='sum')
.rename_axis('').sort_index(axis=1, key=lambda x: pd.to_datetime(x)))
out
date 25/10/2022 31/12/2022 31/03/2023 30/06/2023 31/12/2023 30/06/2024
0 18.500252 NaN 55.003686 170.59087 34.174595 NaN
1 46.552461 66.790190 NaN NaN NaN 170.59087
2 18.500252 140.922893 49.581099 NaN 34.174595 NaN
3 18.500252 170.590870 49.581099 NaN 34.174595 NaN
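For large inputs, a loop-free variant of the same idea may also help (a sketch using the question's example data; `long_df` is a name I introduce here, not from the original code). It ravels both frames into one long table and pivots once, avoiding the per-row Python loop that made the original code slow:

```python
import numpy as np
import pandas as pd

# Example data from the question
df1 = pd.DataFrame([[170.590870, 55.003686, 18.500252, 34.174595],
                    [170.590870, 66.790190, 12.377866, 34.174595],
                    [140.922893, 49.581099, 18.500252, 34.174595],
                    [170.590870, 49.581099, 18.500252, 34.174595]])
df2 = pd.DataFrame([['30/06/2023', '31/03/2023', '25/10/2022', '31/12/2023'],
                    ['30/06/2024', '31/12/2022', '25/10/2022', '25/10/2022'],
                    ['31/12/2022', '31/03/2023', '25/10/2022', '31/12/2023'],
                    ['31/12/2022', '31/03/2023', '25/10/2022', '31/12/2023']])

# Flatten both frames row-major into one long table: each record is
# (row label, date, value). No Python-level loop over rows.
long_df = pd.DataFrame({
    'row': np.repeat(df1.index.to_numpy(), df1.shape[1]),
    'date': df2.to_numpy().ravel(),
    'value': df1.to_numpy().ravel(),
})

# One pivot with aggfunc='sum' handles duplicate dates within a row
out = long_df.pivot_table('value', index='row', columns='date', aggfunc='sum')
```

Note the columns here sort lexicographically as strings; the sort_index trick from the answer above (or converting the dates to datetime first) is still needed for chronological order.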
If the dtype of df2 is datetime, you don't need the sort_index step.
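For example (a sketch assuming the question's frames; `dayfirst=True` is my assumption because the strings look like dd/mm/yyyy), converting df2 up front makes the pivoted columns come out in chronological order on their own:

```python
import pandas as pd

df1 = pd.DataFrame([[170.590870, 55.003686, 18.500252, 34.174595],
                    [170.590870, 66.790190, 12.377866, 34.174595],
                    [140.922893, 49.581099, 18.500252, 34.174595],
                    [170.590870, 49.581099, 18.500252, 34.174595]])
df2 = pd.DataFrame([['30/06/2023', '31/03/2023', '25/10/2022', '31/12/2023'],
                    ['30/06/2024', '31/12/2022', '25/10/2022', '25/10/2022'],
                    ['31/12/2022', '31/03/2023', '25/10/2022', '31/12/2023'],
                    ['31/12/2022', '31/03/2023', '25/10/2022', '31/12/2023']])

# Parse every column of df2 as datetime (dd/mm/yyyy strings, hence dayfirst)
df2 = df2.apply(pd.to_datetime, dayfirst=True)

df = pd.concat([df1, df2], axis=1, keys=['value', 'date']).stack().reset_index()

# pivot_table sorts the datetime column labels chronologically by default,
# so no sort_index(key=...) step is needed
out = df.pivot_table('value', index='level_0', columns='date', aggfunc='sum')
```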