Home > Software design >  Pandas - Creating new column based on dynamic conditions from lists
Pandas - Creating new column based on dynamic conditions from lists

Time:09-22

I have two lists to start with:

delta = ['1','5']
taxa = ['2','3','4']

My dataframe will look like :

data = { 'id': [101,102,103,104,105],
       '1_srcA': ['a', 'b','c', 'd', 'g'] ,
       '1_srcB': ['a', 'b','c', 'd', 'e'] ,
       '2_srcA': ['g', 'b','f', 'd', 'e'] ,
       '2_srcB': ['a', 'b','c', 'd', 'e'] ,
       '3_srcA': ['a', 'b','c', 'd', 'e'] ,
       '3_srcB': ['a', 'b','1', 'd', 'm'] ,
       '4_srcA': ['a', 'b','c', 'd', 'e'] ,
       '4_srcB': ['a', 'b','c', 'd', 'e'] ,
       '5_srcA': ['a', 'b','c', 'd', 'e'] ,
       '5_srcB': ['m', 'b','c', 'd', 'e'] }
df = pd.DataFrame(data)
df

enter image description here

I have to do two types of checks on this dataframe. Say, Delta check and Taxa checks. For Delta checks, based on list delta = ['1','5'] I have to compare 1_srcA vs 1_srcB and 5_srcA vs 5_srcB since '1' is in 1_srcA ,1_srcB and '5' is in 5_srcA, 5_srcB . If the values differ, I have to populate 2. For tax checks (based on values from taxa list), it should be 1. If no difference, it is 0.

So, this comparison has to happen on all the rows. df is generated based on merge of two dataframes. so, there will be only two cols which has '1' in it, two cols which has '2' in it and so on.

Conditions I have to check:

  1. I need to check if columns containing values from delta list differs. If yes, I will populate 2.
  2. need to check if columns containing values from taxa list differs. If yes, I will populate 1.
  3. If condition 1 and condition 2 are satisfied, then populate 2.
  4. If none of the conditions satisfied, then 0.

So, my output should look like:

Expected Output

The code I tried:

df_cols_ = df.columns.tolist()[1:]
conditions = []
res = {}
for i,col in enumerate(df_cols_):
    if (i == 0) or (i%2 == 0) :
        continue
    var = 'cond_' str(i)
    for del_col in delta:
        if del_col in col:
            var = var   '_F'
            break
    print (var)
    cond = f"df.iloc[:, {i}] != df.iloc[:, {i 1}]"
    res[var] = cond
    conditions.append(cond)

The res dict will look like the below. But how can i use the condition to populate?

enter image description here

Is the any optimal solution the resultant dataframe can be derived? Thanks.

CodePudding user response:

Create helper function for filter values by DataFrame.filter and compare them for not equal, then use np.logical_or.reduce for processing list of boolean masks to one mask and pass to numpy.select:

delta = ['1','5']
taxa = ['2','3','4']

def f(x):
    df1 = df.filter(like=x)
    return df1.iloc[:, 0].ne(df1.iloc[:, 1])

d = np.logical_or.reduce([f(x) for x in delta])
print (d)
[ True False False False  True]

t = np.logical_or.reduce([f(x) for x in taxa])
print (t)
[ True False  True False  True]


df['res'] = np.select([d, t], [2, 1], default=0)
print (df)
    id 1_srcA 1_srcB 2_srcA 2_srcB 3_srcA 3_srcB 4_srcA 4_srcB 5_srcA 5_srcB  \
0  101      a      a      g      a      a      a      a      a      a      m   
1  102      b      b      b      b      b      b      b      b      b      b   
2  103      c      c      f      c      c      1      c      c      c      c   
3  104      d      d      d      d      d      d      d      d      d      d   
4  105      g      e      e      e      e      m      e      e      e      e   

   res  
0    2  
1    0  
2    1  
3    0  
4    2 
  • Related