I have a pandas dataframe that contains some data as shown below.
Sample | Protein Name | Peptide | Calc. Mass | Score sample_1 | Score sample_2 | Score sample_n |
---|---|---|---|---|---|---|
sample_1 | Name_1 | KIITHPNFNGNTLDNDIMLIKL | 2283.1802 | 128.3 | ||
sample_2 | Name_1 | KIITHPNFNGNTLDNDIMLIKL | 2283.1802 | 202.5 | ||
sample_1 | Name_2 | KVQQYRV | 693.3678 | 52.2 | ||
sample_1 | Name_3 | RLCSGNPDSVETLEQPAVPILSALPKQ | 2635.3647 | 305.2 | ||
sample_2 | Name_3 | RLCSGNPDSVETLEQPAVPILSALPKQ | 2635.3647 | 450.6 | ||
sample_N | Name_3 | RLCSGNPDSVETLEQPAVPILSALPKQ | 2635.3647 | 362.7 |
I want to copy sample scores of matching peptides into one row and remove the duplicates. I wrote this set of loops and it works but I'm wondering if there's a better way. One caveat is that the number of samples run will be different on each day so I need to be able to adapt to different numbers of columns.
def finalize(final_df):
for item in list_pep:
Row = (final_df.index[final_df["Peptide"] == str(item)].tolist())
for i in Row:
try:
if final_df.iloc[i 1, 2] == final_df.iloc[i, 2]:
for item in sam_col:
value = final_df.at[i, item]
isNaN = np.isnan(value)
if isNaN == True:
pass
else:
final_df.at[i 1, item] = final_df.at[i, item]
else:
pass
except IndexError:
pass
final_df = final_df.drop_duplicates("Peptide", keep='last')
return final_df
Here is my desired output based on the above table. I didn't include the part to delete the sample column.
Protein Name | Peptide | Calc. Mass | Score sample_1 | Score sample_2 | Score sample_n |
---|---|---|---|---|---|
Name_1 | KIITHPNFNGNTLDNDIMLIKL | 2283.1802 | 128.3 | 202.5 | |
Name_2 | KVQQYRV | 693.3678 | 52.2 | ||
Name_3 | RLCSGNPDSVETLEQPAVPILSALPKQ | 2635.3647 | 305.2 | 450.6 | 362.7 |
CodePudding user response:
Don't use a complicated custom function, simply use:
final_df = df.groupby(['Protein Name'], as_index=False).agg('first')
output:
Protein Name Sample Peptide Calc. Mass Score sample_1 Score sample_2 Score sample_n
0 Name_1 sample_1 KIITHPNFNGNTLDNDIMLIKL 2283.1802 128.3 202.5 NaN
1 Name_2 sample_1 KVQQYRV 693.3678 52.2 NaN NaN
2 Name_3 sample_1 RLCSGNPDSVETLEQPAVPILSALPKQ 2635.3647 305.2 450.6 362.7