I have list of dfs (in the code below named "Input") which can vary in its length (regarding the amount of dfs) and I want to sum specific sections of each column which refers to the year in the first column of each df. In the end I want my code to give me back the exact same list of dfs but with the summed values in each column (in the code below named "output"). So, I want to get from "Input" to "output".
import pandas as pd
# Input
l1 = [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023]
l2 = [1,2,3,4,5,6,7,8]
l3 = [2,3,4,5,6,7,8,9]
l4 = [1,2,3,4,5,6,7,8]
l5 = [2,3,4,5,6,7,8,9]
l6 = [1,2,3,4,5,6,7,8]
df00_in = pd.concat([pd.DataFrame(l1), pd.DataFrame(l2),
pd.DataFrame(l3), pd.DataFrame(l4), pd.DataFrame(l5), pd.DataFrame(l6)], axis = 1)
df01_in = pd.concat([pd.DataFrame(l1), pd.DataFrame(l2),
pd.DataFrame(l3), pd.DataFrame(l4), pd.DataFrame(l5), pd.DataFrame(l6)], axis = 1)
df02_in = pd.concat([pd.DataFrame(l1), pd.DataFrame(l2),
pd.DataFrame(l3), pd.DataFrame(l4), pd.DataFrame(l5), pd.DataFrame(l6)], axis = 1)
# list of input dfs
Input = [df00_in, df01_in, df02_in]
###################################
# Output
l7 = [2022, 2023]
l8 = [sum(l2[:4]), sum(l2[4:])]
l9 = [sum(l3[:4]), sum(l3[4:])]
l10 = [sum(l4[:4]), sum(l4[4:])]
l11 = [sum(l5[:4]), sum(l5[4:])]
l12 = [sum(l6[:4]), sum(l6[4:])]
df00_out = pd.concat([pd.DataFrame(l7), pd.DataFrame(l8),
pd.DataFrame(l9), pd.DataFrame(l10), pd.DataFrame(l11), pd.DataFrame(l12)], axis = 1)
df01_out = pd.concat([pd.DataFrame(l7), pd.DataFrame(l8),
pd.DataFrame(l9), pd.DataFrame(l10), pd.DataFrame(l11), pd.DataFrame(l12)], axis = 1)
df02_out = pd.concat([pd.DataFrame(l7), pd.DataFrame(l8),
pd.DataFrame(l9), pd.DataFrame(l10), pd.DataFrame(l11), pd.DataFrame(l12)], axis = 1)
# list of output dfs
output = [df00_out, df01_out, df02_out]
In the example above I added each year only four times (quarterly), but it can happen that every year will appear 12 times (so monthly and the dfs will expand in the same way). So, my idea is some kind of "SumIf: years are the same". Any ideas on that? THX!
CodePudding user response:
I you're looking for an equivalent to SumIf, take a look at groupby operations.
You also probably want to define your dataframe with column names; pd.concat
is typically used if you're operating with already existing dataframes.
It is not clear what difference you expect between the dataframes as all inputs (and all outputs) in your example are equal; the following code should give you your answer.
import pandas as pd
# Input
l1 = [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023]
l2 = [1, 2, 3, 4, 5, 6, 7, 8]
l3 = [2, 3, 4, 5, 6, 7, 8, 9]
l4 = [1, 2, 3, 4, 5, 6, 7, 8]
l5 = [2, 3, 4, 5, 6, 7, 8, 9]
l6 = [1, 2, 3, 4, 5, 6, 7, 8]
df = pd.DataFrame({"year": l1, "l2": l2, "l3": l3, "l4": l4, "l5": l5, "l6": l6})
df.groupby("year", as_index=False).sum()
If you need to apply this to multiple dataframes, you can use
output_dfs = [df.groupby("year", as_index=False).sum() for df in input_dfs]
CodePudding user response:
You have a very strange way of defining dataframe, you can simply give a list of list (or list of dict, or dict of list, or dict of dict, etc.). You should check the doc on that.
For your question, you can simply use groupby (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) With groupby you can group your dataframe by one or several columns (put together if they have the same value). And there is even a sum method you can directly apply on it.
Like that :
import pandas as pd
# Input
l1 = [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023]
l2 = [1,2,3,4,5,6,7,8]
l3 = [2,3,4,5,6,7,8,9]
l4 = [1,2,3,4,5,6,7,8]
l5 = [2,3,4,5,6,7,8,9]
l6 = [1,2,3,4,5,6,7,8]
df00_in = pd.DataFrame([l1,l2,l3,l4,l5,l6]).T
df01_in = pd.DataFrame([l1,l2,l3,l4,l5,l6]).T
df02_in = pd.DataFrame([l1,l2,l3,l4,l5,l6]).T
# list of input dfs
Input = [df00_in, df01_in, df02_in]
###################################
output1 = [df.groupby(by=[0]).sum().reset_index() for df in Input]
note that I added reset_index() to have the same exact output as you, but it's definitely not necessary.