Create New Columns Using Multiple Conditions And Time Difference-CodePudding

I have the following dataframe with a tricky problem:

df = pd.DataFrame({'weight': [[200, 190, 188, 180, 170], 
                              [181, 175, 172, 165, 150]],
           'days_since_gym': [[0, 87, 174, 205, 279], 
                              [43, 171, 241, 273, 300]]})

print(df)
              weight               days_since_gym
0  [200, 190, 188, 180, 170]    [0, 91, 174, 205, 279]
1  [181, 175, 172, 165, 150]  [93, 171, 241, 273, 300]

I have to make 4 columns (0-90 days, 91-180 days, 181-270 days, 270-360 days) based on the following conditions:

1) If there are multiple weights in a specific time duration, get the maximum weight in that time duration column.

2) If no weight is present in that time duration, the value for that duration would be 0.

Desired output:

             weight                 days_since_gym     0-90   91-180   181-270  271-360

0  [200, 190, 188, 180, 170]    [0, 87, 174, 205, 279] 200      188       180       170
1  [181, 175, 172, 165, 150]  [93, 171, 241, 273, 300]  0       181       172       165

What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!

CodePudding user response：

You can write a custom function that takes in a list of weights, start day, end day — then apply this function rowwise to create each of your new columns using the pandas apply function. If you haven't used apply before, the basic structure is something like: df.apply(lambda x: custom_function(...), axis=1). The argument axis=1 ensures your custom function is applied rowwise.

Since the names of your new columns are also the start and end days, you can loop through these start and end day ranges.

Also I noticed that in your question there seem to be some mismatches between the DataFrame you created and your desired output, so I am taking the the desired output to be the DataFrame.

import numpy as np
import pandas as pd

df = pd.DataFrame({'weight': [[200, 190, 188, 180, 170], 
                              [181, 175, 172, 165, 150]],
           'days_since_gym': [[0, 87, 174, 205, 279], 
                              [93, 171, 241, 273, 300]]})

def return_max_weight(weights, days, start_day, end_day):
    ## get the indices where weights are between start and end days
    days = np.array(days)
    weights_idx = list(np.where((days >= start_day) & (days <= end_day))[0])
    if len(weights_idx) == 0:
        return 0
    else:
        weight_between_start_and_end = [weights[idx] for idx in weights_idx]
        print(weight_between_start_and_end)
        return max(weight_between_start_and_end)

for start_day, end_day in zip([0, 91, 181, 271],[90, 180, 270, 360]):
    col_name = f"{start_day}-{end_day}"
    df[col_name] = df[['weight','days_since_gym']].apply(
        lambda x: return_max_weight(x['weight'], x['days_since_gym'], start_day, end_day),
        axis=1
    )

Output:

>>> df
                      weight            days_since_gym  0-90  91-180  181-270  271-360
0  [200, 190, 188, 180, 170]    [0, 87, 174, 205, 279]   200     188      180      170
1  [181, 175, 172, 165, 150]  [93, 171, 241, 273, 300]     0     181      172      165