I have got 1 huge df (over 5 million rows) and 5 columns in jupyter notebook. 'Name' column has 20 unique values and 'lot' column has 10 unique values. So 200 unique combinations of these two columns. I need to subset the df based on the unique combinations of these two columns, make some calculations and return some parts of the calculations in a final df. Final df will have 200 rows, one for each iteration/subset/combination.
For example (with only 2 names and 3 lots = 6 combinations):
dfhuge
index Name lot col3 col4 col5
123 delta 1 786 10 1
657 delta 2 787 11 2
567 delta 2 777 13 4
456 bravo 3 775 12 3
789 bravo 3 772 14 5
For 1 of the 6 iterations/combinations, I could use
df1outof6 = dfhuge.loc[(dfhuge["Name"] == "delta") & (dfhuge["lot"] == 2)]
df1outof6
index Name lot col3 col4 col5
657 delta 2 787 11 2
567 delta 2 777 13 4
mean = df1outof6["col4"].mean()
sum = df1outof6["col5"].sum()
...
I want the above operation repeated for all the 6 subsets using a loop.
Final df should be:
finaldf
newcol col4mean col5sum
combination1(delta and 2) 12 6
combination2(delta and 1) 10 1
combination3(delta and 3) 0 0
combination4(bravo and 1) 0 0
combination5(bravo and 2) 0 0
combination6(bravo and 3) 13 8
I need a loop, result of which will be the finaldf. I can't use df.loc to subset each combination because I originally have 200 of them.
CodePudding user response:
Since you did not provide how to create your test df, I made my own and changed column names. Hope it's ok
import numpy as np
import pandas as pd
from itertools import product
df = pd.DataFrame(np.random.randint(0,10,size=(100, 4)), columns=list('ABCD'))
rows = []
unique_a = list(df["A"].unique())
unique_b = list(df["B"].unique())
unique_a.sort()
unique_b.sort()
key_error=0
for a_index, b_index in product(unique_a, unique_b):
df_subset = df.loc[(df["A"] == a_index) & (df["B"] == b_index),:]
calc0 = 0
calc1 = 0
if len(df_subset):
calc0 = df_subset["C"].sum()
calc1 = df_subset["D"].mean()
rows.append(((a_index, b_index), calc0, calc1))
final_df = pd.DataFrame(
rows,
columns = ["index combination", "calc1", "calc2"]
).set_index("index combination")
final_df
Initial df
A B C D
0 1 7 3 4
1 6 1 3 9
2 7 9 1 4
3 8 7 4 6
4 0 3 3 3
... ... ... ... ...
95 7 5 5 7
96 0 4 9 0
97 0 3 2 1
98 4 2 9 2
99 6 5 6 6
Final df
calc1 calc2
index combination
(0, 0) 0 0.0
(0, 1) 21 5.0
(0, 2) 1 9.0
(0, 3) 20 3.0
(0, 4) 9 0.0
... ... ...
(9, 5) 1 0.0
(9, 6) 5 7.0
(9, 7) 0 0.0
(9, 8) 0 0.0
(9, 9) 0 0.0
CodePudding user response:
You can create a GroupBy
object to calculate mean and sum for groups based on name and lot. These can then be concatenated to create your final dataframe.
grp = df.groupby(['Name', 'lot'])
finaldf = pd.concat({"col4mean":grp["col4"].mean(),
"col5sum":grp["col5"].sum()}, axis=1).reset_index()
print(finaldf)
Output
Name lot col4mean col5sum
0 bravo 3 13.0 8
1 delta 1 10.0 1
2 delta 2 12.0 6
If you want to include name,lot combinations not included in the data sample, you could prefill the final dataframe with an index and add the columns from there.
grp = df.groupby(['Name', 'lot'])
finaldf = pd.DataFrame(index=pd.MultiIndex.from_product(
[df["Name"].unique(), df["lot"].unique()]))
finaldf["col4mean"] = grp["col4"].mean()
finaldf["col5sum"] = grp["col5"].sum()
finaldf = finaldf.fillna(0)
finaldf["col5sum"] = finaldf["col5sum"].astype(int)
Output
col4mean col5sum
delta 1 10.0 1
2 12.0 6
3 0.0 0
bravo 1 0.0 0
2 0.0 0
3 13.0 8