I would want to find a way in python to merge the files on 'seq' but return all the ones with the same id, in this example only the lines with id 2 would be removed.
File one:
seq,id
CSVGPPNNEQFF,0
CTVGPPNNEQFF,0
CTVGPPNNERFF,0
CASRGEAAGFYEQYF,1
RASRGEAAGFYEQYF,1
CASRGGAAGFYEQYF,1
CASSDLILYYEQYF,2
CASSDLILYYTQYF,2
CASSGSYEQYF,3
CASSGSYEQYY,3
File two:
seq
CSVGPPNNEQFF
CASRGEAAGFYEQYF
CASSGSYEQYY
Output:
seq,id
CSVGPPNNEQFF,0
CTVGPPNNEQFF,0
CTVGPPNNERFF,0
CASRGEAAGFYEQYF,1
RASRGEAAGFYEQYF,1
CASRGGAAGFYEQYF,1
CASSGSYEQYF,3
CASSGSYEQYY,3
I have tried:
df3 = df1.merge(df2.groupby('seq',as_index=False)[['seq']].agg(','.join),how='right')
output:
seq,id
CASRGEAAGFYEQYF,1
CASSGSYEQYY,3
CSVGPPNNEQFF,0
Does anyone have any advice how to solve this?
CodePudding user response:
You can use the isin() pandas method, code shall looks as follow :
df1.loc[df1['seq'].isin(df2['seq'])]
Assuming, both objects are pandas dataframe and 'seq' is a column.
CodePudding user response:
Do you want to merge two dataframes, or just take subset of the first dataframe according to which id
is included in the second dataframe (by seq
)? Anyway, this gives the required result.
df1 = pd.DataFrame({
'seq': [
'CSVGPPNNEQFF',
'CTVGPPNNEQFF',
'CTVGPPNNERFF',
'CASRGEAAGFYEQYF',
'RASRGEAAGFYEQYF',
'CASRGGAAGFYEQYF',
'CASSDLILYYEQYF',
'CASSDLILYYTQYF',
'CASSGSYEQYF',
'CASSGSYEQYY'
],
'id': [0, 0, 0, 1, 1, 1, 2, 2, 3, 3]
})
df2 = pd.DataFrame({
'seq': [
'CSVGPPNNEQFF',
'CASRGEAAGFYEQYF',
'CASSGSYEQYY'
]
})
df3 = df1.loc[df1['id'].isin(df1['id'][df1['seq'].isin(df2['seq'])])]
Explanation: df1['id'][df1['seq'].isin(df2['seq'])]
takes those values of id
from df1
that contain at least one seq
that is included in df2
. Then all rows with those values of id
are taken from df1
.