I currently have two large dataframes, which I have condensed for the purpose of this question. Dataframe 1 has a list of probesets and transcripts. I must match the transcripts to the corresponding transcripts in dataframe 2 and get the data for each subject, as you can see below:
Probeset Transcript
0 1554784_at ENST00000547702
1 NaN ENST00000547849
2 212983_at ENST00000311189
3 NaN ENST00000397596
4 1566643_a_at ENST00000587894
and then the following dataframe where I need to match the transcripts:
transcript_id phchp230v2 phchp273v3 phchp367v3 phchp201v2
0 ENST00000547702 0.000000 0.000000 0.000000 0.000000
1 ENST00000547849 0.000000 0.000000 0.000000 0.000000
2 ENST00000311189 0.336418 0.044721 0.155847 1.676620
3 ENST00000397596 0.027106 0.016806 0.014509 0.022015
4 ENST00000587894 0.048200 0.089618 0.046528 0.000000
What I need to do is match the transcripts in dataframe 1 with the transcripts in dataframe 2 and get the data for each transcript under the subject columns at the top of dataframe 2. However, there is a lot of data in each of these, so I would have to search for the transcripts and their corresponding data, as they are not necessarily in the order I have shown here. The expected output is as follows:
Probeset Transcript phchp230v2 phchp273v3 phchp367v3 phchp201v2
0 1554784_at ENST00000547702 0.000000 0.000000 0.000000 0.000000
1 NaN ENST00000547849 0.000000 0.000000 0.000000 0.000000
2 212983_at ENST00000311189 0.336418 0.044721 0.155847 1.676620
3 NaN ENST00000397596 0.027106 0.016806 0.014509 0.022015
4 1566643_a_at ENST00000587894 0.048200 0.089618 0.046528 0.000000
I'm not sure how to go about finding the transcripts and then placing the data found under the correct subject headers. Thank you all in advance!
CodePudding user response:
You can merge them with .merge(), as follows (assuming the first and second dataframes are called df1 and df2 respectively):
df_out = df1.merge(df2.rename({'transcript_id': 'Transcript'}, axis=1), on='Transcript', how='left')
Result:
print(df_out)
Probeset Transcript phchp230v2 phchp273v3 phchp367v3 phchp201v2
0 1554784_at ENST00000547702 0.000000 0.000000 0.000000 0.000000
1 NaN ENST00000547849 0.000000 0.000000 0.000000 0.000000
2 212983_at ENST00000311189 0.336418 0.044721 0.155847 1.676620
3 NaN ENST00000397596 0.027106 0.016806 0.014509 0.022015
4 1566643_a_at ENST00000587894 0.048200 0.089618 0.046528 0.000000
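Since the question mentions that the real data may not be in the same row order, here is a minimal self-contained sketch of the same left merge on a shuffled copy of the second dataframe. The sample values are copied from the question; the names df2_shuffled and unmatched, the use of sample() to shuffle, and the validate/indicator arguments are additions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Small stand-ins for the two dataframes shown in the question
df1 = pd.DataFrame({
    "Probeset": ["1554784_at", np.nan, "212983_at", np.nan, "1566643_a_at"],
    "Transcript": ["ENST00000547702", "ENST00000547849", "ENST00000311189",
                   "ENST00000397596", "ENST00000587894"],
})
df2 = pd.DataFrame({
    "transcript_id": ["ENST00000547702", "ENST00000547849", "ENST00000311189",
                      "ENST00000397596", "ENST00000587894"],
    "phchp230v2": [0.000000, 0.000000, 0.336418, 0.027106, 0.048200],
    "phchp273v3": [0.000000, 0.000000, 0.044721, 0.016806, 0.089618],
    "phchp367v3": [0.000000, 0.000000, 0.155847, 0.014509, 0.046528],
    "phchp201v2": [0.000000, 0.000000, 1.676620, 0.022015, 0.000000],
})

# Shuffle df2 (hypothetical step) to mimic real data that is not in matching order
df2_shuffled = df2.sample(frac=1, random_state=0).reset_index(drop=True)

# Left merge matches rows by key value, not by position.
# validate="many_to_one" raises if a transcript appears more than once in df2;
# indicator=True adds a "_merge" column showing where each row was found.
df_out = df1.merge(
    df2_shuffled.rename(columns={"transcript_id": "Transcript"}),
    on="Transcript",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Transcripts from df1 that had no match in df2 (empty for this sample data)
unmatched = df_out[df_out["_merge"] == "left_only"]

# Drop the helper column once the check is done
df_out = df_out.drop(columns="_merge")
print(df_out)
```

With how='left', the rows keep the order of df1, and any transcript missing from df2 would simply get NaN in the subject columns rather than being dropped.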