Sample dataframe:

import pandas as pd
df1 = pd.DataFrame({'date': ['2022-02-25', '2022-02-26', '2022-02-25'],
                    'attempt_1': [1, 0, 0],
                    'attempt_2': [0, 0, 1],
                    'attempt_3': [1, 1, 0]})
df1
The following code takes about 1 hour 30 minutes to process roughly 252k rows and 75 columns. Unfortunately, that is far too long, and I don't know of any method that could replace it to reduce the processing time.
dategrp = set(df1["date"])
dfs_to_dategrp = []
for trial in df1.columns[1:]:  # skip the "date" column itself
    save_sf = {k: {"success": 0, "fail": 0} for k in dategrp}
    for c in dategrp:
        save_sf[c]["success"] = len(df1[(df1["date"] == c) & (df1[trial] == 1)])
        save_sf[c]["fail"] = len(df1[(df1["date"] == c) & (df1[trial] == 0)])
    new_df1 = pd.DataFrame(save_sf)
    new_df1 = new_df1.T
    dfs_to_dategrp.append(new_df1)
result = pd.concat(dfs_to_dategrp, axis=1)
result
So in the dataframe, the columns are populated randomly with 1s (success) and 0s (fail). This code goes through every row and column (which is what makes it so slow) to group and total the 1s and 0s by the date of occurrence.
Here is the output
Any help given will be appreciated. Thanks.
CodePudding user response:
Just a hunch: if your dataframe df1 looks somehow like

df1 = pd.DataFrame(
    {"date": ["2022-03-09", "2022-03-09", "2022-03-10"],
     "t_1": [1, 0, 1], "t_2": [1, 1, 1], "t_3": [0, 0, 0]}
)
df1.date = pd.to_datetime(df1.date)
date t_1 t_2 t_3
0 2022-03-09 1 1 0
1 2022-03-09 0 1 0
2 2022-03-10 1 1 0
then you could use .groupby and .agg to do something like
def success(ser): return ser.sum()
def failure(ser): return ser.eq(0).sum()

result = df1.groupby("date").agg(**{
    **{f"{col}_success": (col, success) for col in df1.columns if col != "date"},
    **{f"{col}_failure": (col, failure) for col in df1.columns if col != "date"}
})
to get a result like
t_1_success t_2_success t_3_success t_1_failure t_2_failure \
date
2022-03-09 1 2 0 1 0
2022-03-10 1 1 0 0 0
t_3_failure
date
2022-03-09 2
2022-03-10 1
or
result = df1.groupby("date").agg(
    **dict(col
           for pair in [[(f"{c}_success", (c, success)), (f"{c}_failure", (c, failure))]
                        for c in df1.columns if c != "date"]
           for col in pair)
)
to get
t_1_success t_1_failure t_2_success t_2_failure t_3_success \
date
2022-03-09 1 1 2 0 0
2022-03-10 1 0 1 0 0
t_3_failure
date
2022-03-09 2
2022-03-10 1
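If the interleaved per-trial ordering is the only goal, a plain sort of the column labels gets close without the nested comprehension (note that alphabetically "_failure" sorts before "_success" within each trial) — a minimal sketch reusing the sample df1 and the helpers from above:

```python
import pandas as pd

df1 = pd.DataFrame(
    {"date": ["2022-03-09", "2022-03-09", "2022-03-10"],
     "t_1": [1, 0, 1], "t_2": [1, 1, 1], "t_3": [0, 0, 0]}
)

def success(ser): return ser.sum()
def failure(ser): return ser.eq(0).sum()

trials = [c for c in df1.columns if c != "date"]

# Build the "all successes first" layout, then sort the column labels so
# the columns for each trial end up next to each other.
result = df1.groupby("date").agg(**{
    **{f"{c}_success": (c, success) for c in trials},
    **{f"{c}_failure": (c, failure) for c in trials},
}).sort_index(axis=1)
```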
Another option:

# "length" counts the rows per date; subtracting each summed column from it
# turns success counts into failure counts.
result_1 = df1.assign(length=1).groupby("date").sum()
result_2 = result_1.rsub(result_1["length"], axis=0)
result = pd.concat(
    [result_1
         .drop(columns="length")
         .rename(columns={c: f"{c}_success" for c in result_1.columns}),
     result_2
         .drop(columns="length")
         .rename(columns={c: f"{c}_failure" for c in result_2.columns})],
    axis=1
)
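Since the trial columns only ever hold 0 or 1, the same counts can also be sketched with a plain groupby sum and count, with no helper column at all (column names as in the sample above; this assumes the real data has no NaNs):

```python
import pandas as pd

df1 = pd.DataFrame(
    {"date": ["2022-03-09", "2022-03-09", "2022-03-10"],
     "t_1": [1, 0, 1], "t_2": [1, 1, 1], "t_3": [0, 0, 0]}
)

g = df1.groupby("date")
success = g.sum()              # values are 0/1, so the sum is the success count
failure = g.count() - success  # rows per group minus successes = failures

result = pd.concat(
    [success.add_suffix("_success"), failure.add_suffix("_failure")],
    axis=1,
)
```

Like the first variant, this puts all the `_success` columns before the `_failure` ones; a `result.sort_index(axis=1)` would interleave them per trial.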