I would like to know if there is a way to calculate a sum over different dataframe rows when they have different rows' number due to missing timestamps. I have the following example:
import pandas as pd
data1 = {'produced': [19.7, 39.1, 86.4, 167.1]}
data2 = {'produced': [22.4, 95, 144.3, 300.2]}
data3 = {'produced': [15.1, 44.1, 80, 302.5]}
df1 = pd.DataFrame(data1, index = ['01/06/2021 09:35', '01/06/2021 09:40', '01/06/2021 09:45', '01/06/2021 09:50'])
df2 = pd.DataFrame(data2, index = ['01/06/2021 09:35', '01/06/2021 09:45', '01/06/2021 09:50', '01/06/2021 09:55'])
df3 = pd.DataFrame(data3, index = ['01/06/2021 09:35', '01/06/2021 09:40', '01/06/2021 09:45', '01/06/2021 09:55'])
As you can see, I have some dfs with distinct row numbers because there are a few "blank" values. What I would like to do is to calculate the sum, over rows, for each 5min. interval and simply skip the blanks. If I sum it over the rows as it is, I would be incorrectly summing different timestamps.
Also, wouldbe possible to store the missing timestamps in an array or something similar? Thank you in advance!
CodePudding user response:
concat then sum:
pd.concat([df1,df2,df3]).sum(level=0)
# pd.concat([df1,df2,df3]).groupby(level=0).sum()
Output:
produced
01/06/2021 09:35 57.2
01/06/2021 09:40 83.2
01/06/2021 09:45 261.4
01/06/2021 09:50 311.4
01/06/2021 09:55 602.7
CodePudding user response:
You can also merge the dataframes on indices and sum across rows. That way, you can see the dataframes that had a missing values.
df_total = df1.add_suffix("_df1")
for i, df in enumerate([df2, df3], start=2):
df = df.add_suffix("_df{}".format(i))
df_total = df_total.merge(df, right_index=True, left_index=True, how="outer")
df_total["sum"] = df_total.sum(axis=1)
print(df_total)
Output:
produced_df1 produced_df2 produced_df3 sum
01/06/2021 09:35 19.7 22.4 15.1 57.2
01/06/2021 09:40 39.1 NaN 44.1 83.2
01/06/2021 09:45 86.4 95.0 80.0 261.4
01/06/2021 09:50 167.1 144.3 NaN 311.4
01/06/2021 09:55 NaN 300.2 302.5 602.7