I have a dataframe such as:
Groups Name
G1 BLOC1_Lgth
G1 Canis_lupus
G1 Rattus_rattus
G2 Apis_mellifera
G2 Apis_terrestris
G3 Canis_lupus
G4 Bombus_bobomus
G4 Rattus_rattus
G4 Canis_lupus
G4 Apis_mellifera
G5 BLOC4_Lgth
G5 Apis_terrestris
G6 BLOC5_Lgth
G7 BLOC2_Lgth
G7 Apis_mellifera
G7 Apis_terrestris
G7 Canis_lupus
G7 Rattus_rattus
G8 BLOC3_Lgth
G9 BLOC3_Lgth
G9 BLOC1_Lgth
G9 Canis_lupus
G9 Apis_terrestris
and a list such as :
A_list=['BLOC1_Lgth','BLOC2_Lgth','BLOC3_Lgth','BLOC4_Lgth','BLOC5_Lgth','BLOC6_Lgth','BLOC7_Lgth']]
And I would like to create a binary table where for each elements in A_list
, if a Name is present in the same Group, I fill a column by 1 value.
For instance, we see that BLOC1_Lgth
is in G1
, where I found two different Name
values :
Canis_lupus
and Rattus_rattus
, so I add the column and add 1 to both such as:
A_list_values Groups Canis_lupus Rattus_rattus Apis_mellifera Apis_terrestris
BLOC1_Lgth G1 1 1 0 0
and zero for the other Name not present in that Group (Apis_mellifera
and Apis_terrestris
)
Expected results:
A_list_values Groups Canis_lupus Rattus_rattus Apis_mellifera Apis_terrestris
BLOC1_Lgth G1 1 1 0 0
BLOC2_Lgth G7 1 1 1 1
BLOC3_Lgth G8 0 0 0 0
BLOC4_Lgth G5 0 0 0 1
BLOC5_Lgth G6 0 0 0 0
BLOC6_Lgth G9 1 0 0 1
BLOC7_Lgth G9 1 0 0 1
Here is that dataframe in dict format if it can helps :
{'Groups': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G2', 4: 'G2', 5: 'G3', 6: 'G4', 7: 'G4', 8: 'G4', 9: 'G4', 10: 'G5', 11: 'G5', 12: 'G6', 13: 'G7', 14: 'G7', 15: 'G7', 16: 'G7', 17: 'G7', 18: 'G8', 19: 'G9', 20: 'G9', 21: 'G9', 22: 'G9'}, 'Name': {0: 'BLOC1_Lgth', 1: 'Canis_lupus', 2: 'Rattus_rattus', 3: 'Apis_mellifera', 4: 'Apis_terrestris', 5: 'Canis_lupus', 6: 'Bombus_bobomus', 7: 'Rattus_rattus', 8: 'Canis_lupus', 9: 'Apis_mellifera', 10: 'BLOC4_Lgth', 11: 'Apis_terrestris', 12: 'BLOC5_Lgth', 13: 'BLOC2_Lgth', 14: 'Apis_mellifera', 15: 'Apis_terrestris', 16: 'Canis_lupus', 17: 'Rattus_rattus', 18: 'BLOC3_Lgth', 19: 'BLOC3_Lgth', 20: 'BLOC1_Lgth', 21: 'Canis_lupus', 22: 'Apis_terrestris'}}
CodePudding user response:
Use:
#typo in Apis_terrestris - removed last space
df = pd.DataFrame({'Groups': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G2', 4: 'G2', 5: 'G3', 6: 'G4', 7: 'G4', 8: 'G4', 9: 'G4', 10: 'G5', 11: 'G5', 12: 'G6', 13: 'G7', 14: 'G7', 15: 'G7', 16: 'G7', 17: 'G7', 18: 'G8'}, 'Name ': {0: 'BLOC1_Lgth', 1: 'Canis_lupus', 2: 'Rattus_rattus', 3: 'Apis_mellifera', 4: 'Apis_terrestris', 5: 'Canis_lupus', 6: 'Bombus_bobomus', 7: 'Rattus_rattus', 8: 'Canis_lupus', 9: 'Apis_mellifera', 10: 'BLOC4_Lgth', 11: 'Apis_terrestris', 12: 'BLOC5_Lgth', 13: 'BLOC2_Lgth', 14: 'Apis_mellifera', 15: 'Apis_terrestris', 16: 'Canis_lupus', 17: 'Rattus_rattus', 18: 'BLOC3_Lgth'}})
A_list=['BLOC1_Lgth','BLOC2_Lgth','BLOC3_Lgth','BLOC4_Lgth','BLOC5_Lgth']
You can create A_list_values
column by filter groups with A_list
and forward missing values per groups by GroupBy.ffill
, then remove NaN
s groups by DataFrame.dropna
and create indicator columns by get_dummies
with aggregate max
, last remove BLOC
columns:
m = df['Name '].isin(A_list)
df['A_list_values'] = df['Name '].where(m).groupby(df['Groups']).ffill()
df = (pd.get_dummies(df.dropna(subset=['A_list_values']),
columns=['Name '],
prefix='',
prefix_sep='')
.groupby(['A_list_values','Groups'], as_index=False)
.max())
df = df.loc[:, ~df.columns.str.startswith('BLOC')]
print (df)
A_list_values Groups Apis_mellifera Apis_terrestris Canis_lupus \
0 BLOC1_Lgth G1 0 0 1
1 BLOC2_Lgth G7 1 1 1
2 BLOC3_Lgth G8 0 0 0
3 BLOC4_Lgth G5 0 1 0
4 BLOC5_Lgth G6 0 0 0
Rattus_rattus
0 1
1 1
2 0
3 0
4 0
EDIT: You can create lists from matched values of list A_list
and then use DataFrame.explode
:
A_list=['BLOC1_Lgth','BLOC2_Lgth','BLOC3_Lgth','BLOC4_Lgth','BLOC5_Lgth']
m = df['Name '].isin(A_list)
f = lambda x: [y for y in x if y]
df['A_list_values'] = df['Groups'].map(df['Name '].where(m, '').groupby(df['Groups']).agg(f))
df = df.explode('A_list_values')
df = (pd.get_dummies(df.dropna(subset=['A_list_values']),
columns=['Name '],
prefix='',
prefix_sep='')
.groupby(['A_list_values','Groups'], as_index=False)
.max())
df = df.loc[:, ~df.columns.str.startswith('BLOC')]
print (df)
A_list_values Groups Apis_mellifera Apis_terrestris Canis_lupus \
0 BLOC1_Lgth G1 0 0 1
1 BLOC1_Lgth G9 0 1 1
2 BLOC2_Lgth G7 1 1 1
3 BLOC3_Lgth G8 0 0 0
4 BLOC3_Lgth G9 0 1 1
5 BLOC4_Lgth G5 0 1 0
6 BLOC5_Lgth G6 0 0 0
Rattus_rattus
0 1
1 0
2 1
3 0
4 0
5 0
6 0