Home > Software design >  Create a binary table with presence/absence from groups in pandas
Create a binary table with presence/absence from groups in pandas

Time:12-28

I have a dataframe such as:

Groups Name             
G1     BLOC1_Lgth
G1     Canis_lupus
G1     Rattus_rattus
G2     Apis_mellifera
G2     Apis_terrestris
G3     Canis_lupus
G4     Bombus_bobomus
G4     Rattus_rattus
G4     Canis_lupus
G4     Apis_mellifera
G5     BLOC4_Lgth
G5     Apis_terrestris
G6     BLOC5_Lgth
G7     BLOC2_Lgth
G7     Apis_mellifera
G7     Apis_terrestris 
G7     Canis_lupus
G7     Rattus_rattus
G8     BLOC3_Lgth
G9     BLOC3_Lgth
G9     BLOC1_Lgth
G9     Canis_lupus
G9     Apis_terrestris 

and a list such as :

A_list=['BLOC1_Lgth','BLOC2_Lgth','BLOC3_Lgth','BLOC4_Lgth','BLOC5_Lgth','BLOC6_Lgth','BLOC7_Lgth']]

And I would like to create a binary table where for each elements in A_list , if a Name is present in the same Group, I fill a column by 1 value.

For instance, we see that BLOC1_Lgth is in G1, where I found two different Name values :

Canis_lupus and Rattus_rattus, so I add the column and add 1 to both such as:

A_list_values    Groups  Canis_lupus Rattus_rattus Apis_mellifera Apis_terrestris 
BLOC1_Lgth       G1      1           1             0              0
 

and zero for the other Name not present in that Group (Apis_mellifera and Apis_terrestris)

Expected results:

A_list_values    Groups  Canis_lupus Rattus_rattus Apis_mellifera Apis_terrestris 
BLOC1_Lgth       G1      1           1             0              0
BLOC2_Lgth       G7      1           1             1              1
BLOC3_Lgth       G8      0           0             0              0
BLOC4_Lgth       G5      0           0             0              1
BLOC5_Lgth       G6      0           0             0              0
BLOC6_Lgth       G9      1           0             0              1
BLOC7_Lgth       G9      1           0             0              1

Here is that dataframe in dict format if it can helps :

{'Groups': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G2', 4: 'G2', 5: 'G3', 6: 'G4', 7: 'G4', 8: 'G4', 9: 'G4', 10: 'G5', 11: 'G5', 12: 'G6', 13: 'G7', 14: 'G7', 15: 'G7', 16: 'G7', 17: 'G7', 18: 'G8', 19: 'G9', 20: 'G9', 21: 'G9', 22: 'G9'}, 'Name': {0: 'BLOC1_Lgth', 1: 'Canis_lupus', 2: 'Rattus_rattus', 3: 'Apis_mellifera', 4: 'Apis_terrestris', 5: 'Canis_lupus', 6: 'Bombus_bobomus', 7: 'Rattus_rattus', 8: 'Canis_lupus', 9: 'Apis_mellifera', 10: 'BLOC4_Lgth', 11: 'Apis_terrestris', 12: 'BLOC5_Lgth', 13: 'BLOC2_Lgth', 14: 'Apis_mellifera', 15: 'Apis_terrestris', 16: 'Canis_lupus', 17: 'Rattus_rattus', 18: 'BLOC3_Lgth', 19: 'BLOC3_Lgth', 20: 'BLOC1_Lgth', 21: 'Canis_lupus', 22: 'Apis_terrestris'}}

CodePudding user response:

Use:

#typo in Apis_terrestris - removed last space
df = pd.DataFrame({'Groups': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G2', 4: 'G2', 5: 'G3', 6: 'G4', 7: 'G4', 8: 'G4', 9: 'G4', 10: 'G5', 11: 'G5', 12: 'G6', 13: 'G7', 14: 'G7', 15: 'G7', 16: 'G7', 17: 'G7', 18: 'G8'}, 'Name  ': {0: 'BLOC1_Lgth', 1: 'Canis_lupus', 2: 'Rattus_rattus', 3: 'Apis_mellifera', 4: 'Apis_terrestris', 5: 'Canis_lupus', 6: 'Bombus_bobomus', 7: 'Rattus_rattus', 8: 'Canis_lupus', 9: 'Apis_mellifera', 10: 'BLOC4_Lgth', 11: 'Apis_terrestris', 12: 'BLOC5_Lgth', 13: 'BLOC2_Lgth', 14: 'Apis_mellifera', 15: 'Apis_terrestris', 16: 'Canis_lupus', 17: 'Rattus_rattus', 18: 'BLOC3_Lgth'}})
  
A_list=['BLOC1_Lgth','BLOC2_Lgth','BLOC3_Lgth','BLOC4_Lgth','BLOC5_Lgth']

You can create A_list_values column by filter groups with A_list and forward missing values per groups by GroupBy.ffill, then remove NaNs groups by DataFrame.dropna and create indicator columns by get_dummies with aggregate max, last remove BLOC columns:

m = df['Name  '].isin(A_list)
df['A_list_values'] = df['Name  '].where(m).groupby(df['Groups']).ffill()
df = (pd.get_dummies(df.dropna(subset=['A_list_values']), 
                     columns=['Name  '], 
                     prefix='', 
                     prefix_sep='')
        .groupby(['A_list_values','Groups'], as_index=False)
        .max())
df = df.loc[:, ~df.columns.str.startswith('BLOC')]
print (df)
  A_list_values Groups  Apis_mellifera  Apis_terrestris  Canis_lupus  \
0    BLOC1_Lgth     G1               0                0            1   
1    BLOC2_Lgth     G7               1                1            1   
2    BLOC3_Lgth     G8               0                0            0   
3    BLOC4_Lgth     G5               0                1            0   
4    BLOC5_Lgth     G6               0                0            0   

   Rattus_rattus  
0              1  
1              1  
2              0  
3              0  
4              0  

EDIT: You can create lists from matched values of list A_list and then use DataFrame.explode:

A_list=['BLOC1_Lgth','BLOC2_Lgth','BLOC3_Lgth','BLOC4_Lgth','BLOC5_Lgth']

m = df['Name  '].isin(A_list)

f = lambda x: [y for y in x if y]
df['A_list_values'] = df['Groups'].map(df['Name  '].where(m, '').groupby(df['Groups']).agg(f))

df = df.explode('A_list_values')

df = (pd.get_dummies(df.dropna(subset=['A_list_values']), 
                     columns=['Name  '], 
                     prefix='', 
                     prefix_sep='')
        .groupby(['A_list_values','Groups'], as_index=False)
        .max())
df = df.loc[:, ~df.columns.str.startswith('BLOC')]

print (df)
  A_list_values Groups  Apis_mellifera  Apis_terrestris  Canis_lupus  \
0    BLOC1_Lgth     G1               0                0            1   
1    BLOC1_Lgth     G9               0                1            1   
2    BLOC2_Lgth     G7               1                1            1   
3    BLOC3_Lgth     G8               0                0            0   
4    BLOC3_Lgth     G9               0                1            1   
5    BLOC4_Lgth     G5               0                1            0   
6    BLOC5_Lgth     G6               0                0            0   

   Rattus_rattus  
0              1  
1              0  
2              1  
3              0  
4              0  
5              0  
6              0  
  • Related