Home > Blockchain >  Is there a better way to merge rows and delete duplicates in a pandas dataframe, possibly using grou
Is there a better way to merge rows and delete duplicates in a pandas dataframe, possibly using grou

Time:06-07

I have a pandas dataframe that contains some data as shown below.

Sample Protein Name Peptide Calc. Mass Score sample_1 Score sample_2 Score sample_n
sample_1 Name_1 KIITHPNFNGNTLDNDIMLIKL 2283.1802 128.3
sample_2 Name_1 KIITHPNFNGNTLDNDIMLIKL 2283.1802 202.5
sample_1 Name_2 KVQQYRV 693.3678 52.2
sample_1 Name_3 RLCSGNPDSVETLEQPAVPILSALPKQ 2635.3647 305.2
sample_2 Name_3 RLCSGNPDSVETLEQPAVPILSALPKQ 2635.3647 450.6
sample_N Name_3 RLCSGNPDSVETLEQPAVPILSALPKQ 2635.3647 362.7

I want to copy sample scores of matching peptides into one row and remove the duplicates. I wrote this set of loops and it works but I'm wondering if there's a better way. One caveat is that the number of samples run will be different on each day so I need to be able to adapt to different numbers of columns.

def finalize(final_df):
    for item in list_pep:
        Row = (final_df.index[final_df["Peptide"] == str(item)].tolist())
        for i in Row:
            try:
                if final_df.iloc[i 1, 2] == final_df.iloc[i, 2]:
                    for item in sam_col:
                        value = final_df.at[i, item]
                        isNaN = np.isnan(value)
                        if isNaN == True:
                            pass
                        else:
                            final_df.at[i 1, item] = final_df.at[i, item]
                else:
                    pass
            except IndexError:
                pass
    final_df = final_df.drop_duplicates("Peptide", keep='last')
    return final_df

Here is my desired output based on the above table. I didn't include the part to delete the sample column.

Protein Name Peptide Calc. Mass Score sample_1 Score sample_2 Score sample_n
Name_1 KIITHPNFNGNTLDNDIMLIKL 2283.1802 128.3 202.5
Name_2 KVQQYRV 693.3678 52.2
Name_3 RLCSGNPDSVETLEQPAVPILSALPKQ 2635.3647 305.2 450.6 362.7

CodePudding user response:

Don't use a complicated custom function, simply use:

final_df = df.groupby(['Protein Name'], as_index=False).agg('first')

output:

  Protein Name    Sample                      Peptide  Calc. Mass Score sample_1 Score sample_2  Score sample_n
0       Name_1  sample_1       KIITHPNFNGNTLDNDIMLIKL   2283.1802          128.3          202.5             NaN
1       Name_2  sample_1                      KVQQYRV    693.3678           52.2            NaN             NaN
2       Name_3  sample_1  RLCSGNPDSVETLEQPAVPILSALPKQ   2635.3647          305.2          450.6           362.7
  • Related