Home > database >  Subset df in Python Pandas using two columns in a loop
Subset df in Python Pandas using two columns in a loop

Time:10-18

I have got 1 huge df (over 5 million rows) and 5 columns in jupyter notebook. 'Name' column has 20 unique values and 'lot' column has 10 unique values. So 200 unique combinations of these two columns. I need to subset the df based on the unique combinations of these two columns, make some calculations and return some parts of the calculations in a final df. Final df will have 200 rows, one for each iteration/subset/combination.

For example (with only 2 names and 3 lots = 6 combinations):

    dfhuge
    index      Name   lot   col3 col4 col5
    123        delta   1    786  10   1
    657        delta   2    787  11   2
    567        delta   2    777  13   4
    456        bravo   3    775  12   3
    789        bravo   3    772  14   5

For 1 of the 6 iterations/combinations, I could use

df1outof6 = dfhuge.loc[(dfhuge["Name"] == "delta") & (dfhuge["lot"] == 2)]
df1outof6
   index       Name   lot   col3 col4 col5
    657        delta   2    787  11   2
    567        delta   2    777  13   4
mean = df1outof6["col4"].mean()
sum = df1outof6["col5"].sum()
... 

I want the above operation repeated for all the 6 subsets using a loop.

Final df should be:

    finaldf
    newcol                     col4mean  col5sum  
    combination1(delta and 2)      12     6   
    combination2(delta and 1)      10     1    
    combination3(delta and 3)      0      0   
    combination4(bravo and 1)      0      0    
    combination5(bravo and 2)      0      0
    combination6(bravo and 3)      13     8  

I need a loop, result of which will be the finaldf. I can't use df.loc to subset each combination because I originally have 200 of them.

CodePudding user response:

Since you did not provide how to create your test df, I made my own and changed column names. Hope it's ok

import numpy as np
import pandas as pd
from itertools import product
df = pd.DataFrame(np.random.randint(0,10,size=(100, 4)), columns=list('ABCD'))

rows = []
unique_a = list(df["A"].unique())
unique_b = list(df["B"].unique())
unique_a.sort()
unique_b.sort()
key_error=0
for a_index, b_index in product(unique_a, unique_b):
    df_subset =  df.loc[(df["A"] == a_index) & (df["B"] == b_index),:]
    calc0 = 0
    calc1 = 0
    if len(df_subset):
        calc0 = df_subset["C"].sum()
        calc1 = df_subset["D"].mean()
    rows.append(((a_index, b_index), calc0, calc1))
        
    
final_df = pd.DataFrame(
    rows,  
    columns = ["index combination", "calc1", "calc2"]
).set_index("index combination")

final_df

Initial df

    A   B   C   D
0   1   7   3   4
1   6   1   3   9
2   7   9   1   4
3   8   7   4   6
4   0   3   3   3
... ... ... ... ...
95  7   5   5   7
96  0   4   9   0
97  0   3   2   1
98  4   2   9   2
99  6   5   6   6

Final df

    calc1   calc2
index combination       
(0, 0)  0   0.0
(0, 1)  21  5.0
(0, 2)  1   9.0
(0, 3)  20  3.0
(0, 4)  9   0.0
... ... ...
(9, 5)  1   0.0
(9, 6)  5   7.0
(9, 7)  0   0.0
(9, 8)  0   0.0
(9, 9)  0   0.0

CodePudding user response:

You can create a GroupBy object to calculate mean and sum for groups based on name and lot. These can then be concatenated to create your final dataframe.

grp = df.groupby(['Name', 'lot'])
finaldf = pd.concat({"col4mean":grp["col4"].mean(),
    "col5sum":grp["col5"].sum()}, axis=1).reset_index()
print(finaldf)

Output

    Name  lot  col4mean  col5sum
0  bravo    3      13.0        8
1  delta    1      10.0        1
2  delta    2      12.0        6

If you want to include name,lot combinations not included in the data sample, you could prefill the final dataframe with an index and add the columns from there.

grp = df.groupby(['Name', 'lot'])
finaldf = pd.DataFrame(index=pd.MultiIndex.from_product(
         [df["Name"].unique(), df["lot"].unique()]))         
finaldf["col4mean"] = grp["col4"].mean()
finaldf["col5sum"] = grp["col5"].sum()
finaldf = finaldf.fillna(0)
finaldf["col5sum"] = finaldf["col5sum"].astype(int)

Output

         col4mean  col5sum
delta 1      10.0        1
      2      12.0        6
      3       0.0        0
bravo 1       0.0        0
      2       0.0        0
      3      13.0        8
  • Related