pandas merge dataframes: keep all rows when one of the columns matches another dataframe-CodePudding

I would want to find a way in python to merge the files on 'seq' but return all the ones with the same id, in this example only the lines with id 2 would be removed.

File one:

seq,id
CSVGPPNNEQFF,0
CTVGPPNNEQFF,0
CTVGPPNNERFF,0
CASRGEAAGFYEQYF,1
RASRGEAAGFYEQYF,1
CASRGGAAGFYEQYF,1
CASSDLILYYEQYF,2
CASSDLILYYTQYF,2
CASSGSYEQYF,3
CASSGSYEQYY,3

File two:

seq
CSVGPPNNEQFF
CASRGEAAGFYEQYF
CASSGSYEQYY

Output:

seq,id
CSVGPPNNEQFF,0
CTVGPPNNEQFF,0
CTVGPPNNERFF,0
CASRGEAAGFYEQYF,1
RASRGEAAGFYEQYF,1
CASRGGAAGFYEQYF,1
CASSGSYEQYF,3
CASSGSYEQYY,3

I have tried:

df3 = df1.merge(df2.groupby('seq',as_index=False)[['seq']].agg(','.join),how='right')

output:

seq,id
CASRGEAAGFYEQYF,1
CASSGSYEQYY,3
CSVGPPNNEQFF,0

Does anyone have any advice how to solve this?

CodePudding user response：

You can use the isin() pandas method, code shall looks as follow :

df1.loc[df1['seq'].isin(df2['seq'])]

Assuming, both objects are pandas dataframe and 'seq' is a column.

CodePudding user response：

Do you want to merge two dataframes, or just take subset of the first dataframe according to which id is included in the second dataframe (by seq)? Anyway, this gives the required result.

df1 = pd.DataFrame({
    'seq': [
        'CSVGPPNNEQFF',
        'CTVGPPNNEQFF',
        'CTVGPPNNERFF',
        'CASRGEAAGFYEQYF',
        'RASRGEAAGFYEQYF',
        'CASRGGAAGFYEQYF',
        'CASSDLILYYEQYF',
        'CASSDLILYYTQYF',
        'CASSGSYEQYF',
        'CASSGSYEQYY'
    ],
    'id': [0, 0, 0, 1, 1, 1, 2, 2, 3, 3]
})

df2 = pd.DataFrame({
    'seq': [
        'CSVGPPNNEQFF',
        'CASRGEAAGFYEQYF',
        'CASSGSYEQYY'
    ]
})

df3 = df1.loc[df1['id'].isin(df1['id'][df1['seq'].isin(df2['seq'])])]

Explanation: df1['id'][df1['seq'].isin(df2['seq'])] takes those values of id from df1 that contain at least one seq that is included in df2. Then all rows with those values of id are taken from df1.