I have several small datasets from a databse displaying genes in different biological pathways. My end goal is to find what are the genes showing up in different datasets. For this reason, i tried to make multilevel dataframes from each dataset and merge them on a single column. However, it looks like it is getting nowhere.
Test samples: https://www.mediafire.com/file/bks9i9unfci0h1f/sample.rar/file
Making multilevel columns:
import pandas as pd
df1 = pd.read_csv("Bacterial invasion of epithelial cells.csv")
df2 = pd.read_csv("C-type lectin receptor signaling pathway.csv")
df3 = pd.read_csv("Endocytosis.csv")
title1 = "Bacterial invasion of epithelial cells"
title2 = "C-type lectin receptor signaling pathway"
title3 = "Endocytosis"
final1 = pd.concat({title1: df1}, axis = 1)
final2 = pd.concat({title2: df2}, axis = 1)
final3 = pd.concat({title3: df3}, axis = 1)
I tried to use pandas.merge() to merge the dataframes on "User ID" column:
pd.merge(final1, final2, on = "User ID", how = "outer")
But i get an error. I can not use droplevel(), because i need the title on top. So, i can see which dataset each sample belongs to. Any sugesstion?
CodePudding user response:
Seeing as you want to see which genes appear in different datasets, it sounds like an inner join might be more useful? With User ID as just a single row index.
df1 = pd.read_csv("Bacterial invasion of epithelial cells.csv").set_index('User ID')
df2 = pd.read_csv("C-type lectin receptor signaling pathway.csv").set_index('User ID')
df3 = pd.read_csv("Endocytosis.csv").set_index('User ID')
final1 = pd.concat({"Bacterial invasion of epithelial cells": df1}, axis = 1)
final2 = pd.concat({"C-type lectin receptor signaling pathway": df2}, axis = 1)
final3 = pd.concat({"Endocytosis": df3}, axis = 1)
final1.merge(final3, left_index=True, right_index=True)#.merge(final2, left_index=True, right_index=True)
Output:
Bacterial invasion of epithelial cells Endocytosis
Gene Symbol Gene Name Entrez Gene Score Gene Symbol Gene Name Entrez Gene Score
User ID
P51636 CAV2 caveolin 2 858 1.3911 CAV2 caveolin 2 858 1.3911
Q03135 CAV1 caveolin 1 857 1.5935 CAV1 caveolin 1 857 1.5935
(I've commented out the second merge operation with final2
as there aren't any overlapping genes between it and the other two, but you can repeat that process with as many datasets as you like.)