Generate a new column based on other columns' value

Time:09-13

Here is my sample input (df) and the expected output (df1):

import pandas as pd

df = pd.DataFrame({'A_flag': [1, 1, 1], 'B_flag': [1, 1, 0], 'C_flag': [0, 1, 0], 'A_value': [5, 3, 7], 'B_value': [2, 7, 4], 'C_value': [4, 2, 5]})

df1=pd.DataFrame({'A_flag': [1, 1,1], 'B_flag': [1, 1,0],'C_flag': [0, 1,0],'A_value': [5, 3,7], 'B_value': [2, 7,4],'C_value': [4, 2,5], 'Final':[3.5,3,7]})

I want to generate another column called 'Final' conditional on A_flag, B_flag and C_flag:

(a) If all three flags equal 1, then 'Final' = median of (A_value, B_value, C_value)

(b) If exactly two flags equal 1, then 'Final' = mean of the two corresponding values

(c) If exactly one flag equals 1, then 'Final' = the corresponding value

For example, in row 1, A_flag = 1 and B_flag = 1, so 'Final' = (A_value + B_value)/2 = (5 + 2)/2 = 3.5. In row 2, all three flags are 1, so 'Final' = median of (3, 7, 2) = 3. In row 3, only A_flag = 1, so 'Final' = A_value = 7.
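The three rules can be restated as a small plain-Python reference function, which may help when checking any vectorized solution against the expected output. This is only a sketch: `final_value` is a hypothetical helper name, and returning NaN when no flag is set is an assumption, since the question does not cover that case.

```python
from statistics import mean, median

def final_value(flags, values):
    """Apply rules (a)-(c) to a single row (hypothetical helper)."""
    selected = [v for f, v in zip(flags, values) if f == 1]
    if len(selected) == 3:
        return median(selected)   # rule (a): all three flags set
    if len(selected) == 2:
        return mean(selected)     # rule (b): exactly two flags set
    if len(selected) == 1:
        return selected[0]        # rule (c): exactly one flag set
    return float("nan")           # assumption: no flag set is not defined in the question

# The three sample rows as (flags, values) pairs
rows = [
    ((1, 1, 0), (5, 2, 4)),
    ((1, 1, 1), (3, 7, 2)),
    ((1, 0, 0), (7, 4, 5)),
]
results = [final_value(f, v) for f, v in rows]
print(results)  # [3.5, 3, 7]
```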

I tried the following:

df.loc[df[['A_flag','B_flag','C_flag']].eq(1).sum(axis=1)==3, "Final"]= df[['A_flag','B_flag','C_flag']].median(axis=1)

df.loc[df[['A_flag','B_flag','C_flag']].eq(1).sum(axis=1)==2, "Final"]=
df.loc[df[['A_flag','B_flag','C_flag']].eq(1).sum(axis=1)==1, "Final"]=  

I don't know how to select the right value columns for the second and third scenarios.

CodePudding user response:

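Assuming the flag and value columns appear in the same order, you can filter out the flag-like and value-like columns, mask each value whose flag is 0, and then take the median along axis=1. A single median covers all three cases, because the median of two values is their mean and the median of one value is the value itself.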

flag = df.filter(like='_flag')
value = df.filter(like='_value')

df['median'] = value.mask(flag.eq(0).to_numpy()).median(1)

   A_flag  B_flag  C_flag  A_value  B_value  C_value  median
0       1       1       0        5        2        4     3.5
1       1       1       1        3        7        2     3.0
2       1       0       0        7        4        5     7.0
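For anyone who wants to run this end to end, here is a self-contained sketch of the same idea on the question's sample data. It uses `where` (keep where the flag is 1) rather than `mask`, which should be equivalent here, and it relies on the flag and value columns being in matching order because `to_numpy()` drops the column labels. The `kept` and `two_flags` names are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A_flag": [1, 1, 1], "B_flag": [1, 1, 0], "C_flag": [0, 1, 0],
    "A_value": [5, 3, 7], "B_value": [2, 7, 4], "C_value": [4, 2, 5],
})

flag = df.filter(like="_flag")
value = df.filter(like="_value")

# Keep a value only where its flag is 1; everything else becomes NaN.
# to_numpy() makes the alignment positional instead of by column label.
kept = value.where(flag.eq(1).to_numpy())

# median(axis=1) skips NaN, so it only sees the flagged values.
df["Final"] = kept.median(axis=1)
print(df["Final"].tolist())  # [3.5, 3.0, 7.0]

# For rows with exactly two flags, the median and the mean coincide.
two_flags = flag.sum(axis=1).eq(2)
print((kept.mean(axis=1)[two_flags] == df["Final"][two_flags]).all())  # True
```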

CodePudding user response:

With numpy:

import numpy as np

flags = df[["A_flag", "B_flag", "C_flag"]].to_numpy()
values = df[["A_value", "B_value", "C_value"]].to_numpy()

# Sort each row so that the 0 flags appear first
index = np.argsort(flags)
flags = np.take_along_axis(flags, index, axis=1)
# Rearrange the values to match the flags
values = np.take_along_axis(values, index, axis=1)

# Result: np.select picks the first condition that matches in each row
df["Final"] = np.select(
    [
        flags[:, 0] == 1, # when all flags are 1
        flags[:, 1] == 1, # when two flags are 1
        flags[:, 2] == 1, # when one flag is 1
    ],
    [
        np.quantile(values, 0.5, axis=1), # median all of 3 values
        np.mean(values[:, -2:], axis=1),  # mean of the two 1-flag
        values[:, 2],                     # value of the 1-flag
    ],
    default=np.nan
)
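To see what the sorting step does, it may help to print the intermediate arrays for the sample data. This sketch builds the arrays directly instead of going through the DataFrame, and it passes `kind="stable"` to `argsort`, which is my addition so that the order of tied flags is deterministic; the result does not depend on it.

```python
import numpy as np

flags = np.array([[1, 1, 0], [1, 1, 1], [1, 0, 0]])
values = np.array([[5, 2, 4], [3, 7, 2], [7, 4, 5]])

# Sort each row so the 0 flags come first, then reorder the values the same way
index = np.argsort(flags, axis=1, kind="stable")
sorted_flags = np.take_along_axis(flags, index, axis=1)
sorted_values = np.take_along_axis(values, index, axis=1)

print(sorted_flags.tolist())   # [[0, 1, 1], [1, 1, 1], [0, 0, 1]]
print(sorted_values.tolist())  # [[4, 5, 2], [3, 7, 2], [4, 5, 7]]

# After sorting, the flagged values always sit in the rightmost columns
final = np.select(
    [sorted_flags[:, 0] == 1, sorted_flags[:, 1] == 1, sorted_flags[:, 2] == 1],
    [
        np.median(sorted_values, axis=1),        # all three flagged
        sorted_values[:, -2:].mean(axis=1),      # last two flagged
        sorted_values[:, 2],                     # only the last flagged
    ],
    default=np.nan,
)
print(final.tolist())  # [3.5, 3.0, 7.0]
```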

CodePudding user response:

When applying custom logic to a DataFrame, the simplest approach is often to define a function and apply it row by row (or column by column). In your case something like this might work:

import pandas as pd

df = pd.DataFrame(
    {
        "A_flag": [1, 1, 1],
        "B_flag": [1, 1, 0],
        "C_flag": [0, 1, 0],
        "A_value": [5, 3, 7],
        "B_value": [2, 7, 4],
        "C_value": [4, 2, 5],
    }
)

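from statistics import median

def make_final_column(row):
    pairs = [(row['A_flag'], row['A_value']), (row['B_flag'], row['B_value']), (row['C_flag'], row['C_value'])]
    # Keep only the values whose flag is 1
    met_condition = [value for flag, value in pairs if flag == 1]
    # The median of one value is that value and the median of two is their mean, so it covers all three rules
    return median(met_condition) if met_condition else float("nan")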


df["Final"] = df.apply(make_final_column, axis=1)
df

CodePudding user response:

There are some interesting solutions here already. I used a masked-array approach.

Explanation: since the flags are already given, multiplying each value by its flag zeroes out the values that should be ignored. Then mask the zeros in each row and take the median along axis 1. Note that this assumes no flagged value is itself 0, since such a value would be masked as well.

>>> import numpy as np 
>>> t_arr = np.array((df.A_flag * df.A_value, df.B_flag * df.B_value, df.C_flag * df.C_value)).T

>>> maskArr = np.ma.masked_array(t_arr, mask=t_arr == 0)

>>> df["Final"] = np.ma.median(maskArr, axis=1)

>>> df

   A_flag  B_flag  C_flag  A_value  B_value  C_value  Final
0       1       1       0        5        2        4    3.5
1       1       1       1        3        7        2    3.0
2       1       0       0        7        4        5    7.0
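One thing to watch with the product trick is a flagged value that is itself 0, because it gets masked along with the unflagged ones. A possible variant is to build the mask from the flags directly. The sketch below compares the two on made-up data (A_value in the first row is changed to 0 purely for illustration; it is not the question's data), and the names `by_product` and `by_flag` are mine.

```python
import numpy as np
import pandas as pd

# Hypothetical data: same flags as the question, but the first A_value is 0
df = pd.DataFrame({
    "A_flag": [1, 1, 1], "B_flag": [1, 1, 0], "C_flag": [0, 1, 0],
    "A_value": [0, 3, 7], "B_value": [2, 7, 4], "C_value": [4, 2, 5],
})

flags = df[["A_flag", "B_flag", "C_flag"]].to_numpy()
values = df[["A_value", "B_value", "C_value"]].to_numpy()

# Product-based mask: the flagged 0 in row 0 is masked too
product = values * flags
by_product = np.ma.median(np.ma.masked_array(product, mask=product == 0), axis=1)

# Flag-based mask: only values whose flag is 0 are masked
by_flag = np.ma.median(np.ma.masked_array(values, mask=flags == 0), axis=1)

print(by_product.tolist())  # [2.0, 3.0, 7.0]  (row 0 ignores the flagged 0)
print(by_flag.tolist())     # [1.0, 3.0, 7.0]  (row 0 is the mean of 0 and 2)

df["Final"] = np.ma.getdata(by_flag)
```

A row with no flag set would be fully masked in both versions, so it would need separate handling.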