Is there a better way to merge rows and delete duplicates in a pandas dataframe, possibly using grou-CodePudding

I have a pandas dataframe that contains some data as shown below.

Sample	Protein Name	Peptide	Calc. Mass	Score sample_1	Score sample_2	Score sample_n
sample_1	Name_1	KIITHPNFNGNTLDNDIMLIKL	2283.1802	128.3
sample_2	Name_1	KIITHPNFNGNTLDNDIMLIKL	2283.1802		202.5
sample_1	Name_2	KVQQYRV	693.3678	52.2
sample_1	Name_3	RLCSGNPDSVETLEQPAVPILSALPKQ	2635.3647	305.2
sample_2	Name_3	RLCSGNPDSVETLEQPAVPILSALPKQ	2635.3647		450.6
sample_N	Name_3	RLCSGNPDSVETLEQPAVPILSALPKQ	2635.3647			362.7

I want to copy sample scores of matching peptides into one row and remove the duplicates. I wrote this set of loops and it works but I'm wondering if there's a better way. One caveat is that the number of samples run will be different on each day so I need to be able to adapt to different numbers of columns.

def finalize(final_df):
    for item in list_pep:
        Row = (final_df.index[final_df["Peptide"] == str(item)].tolist())
        for i in Row:
            try:
                if final_df.iloc[i 1, 2] == final_df.iloc[i, 2]:
                    for item in sam_col:
                        value = final_df.at[i, item]
                        isNaN = np.isnan(value)
                        if isNaN == True:
                            pass
                        else:
                            final_df.at[i 1, item] = final_df.at[i, item]
                else:
                    pass
            except IndexError:
                pass
    final_df = final_df.drop_duplicates("Peptide", keep='last')
    return final_df

Here is my desired output based on the above table. I didn't include the part to delete the sample column.

Protein Name	Peptide	Calc. Mass	Score sample_1	Score sample_2	Score sample_n
Name_1	KIITHPNFNGNTLDNDIMLIKL	2283.1802	128.3	202.5
Name_2	KVQQYRV	693.3678	52.2
Name_3	RLCSGNPDSVETLEQPAVPILSALPKQ	2635.3647	305.2	450.6	362.7

CodePudding user response：

Don't use a complicated custom function, simply use:

final_df = df.groupby(['Protein Name'], as_index=False).agg('first')

output:

  Protein Name    Sample                      Peptide  Calc. Mass Score sample_1 Score sample_2  Score sample_n
0       Name_1  sample_1       KIITHPNFNGNTLDNDIMLIKL   2283.1802          128.3          202.5             NaN
1       Name_2  sample_1                      KVQQYRV    693.3678           52.2            NaN             NaN
2       Name_3  sample_1  RLCSGNPDSVETLEQPAVPILSALPKQ   2635.3647          305.2          450.6           362.7