How to Merge Multilevel Column Dataframes on a Low Level Column-CodePudding

I have several small datasets from a databse displaying genes in different biological pathways. My end goal is to find what are the genes showing up in different datasets. For this reason, i tried to make multilevel dataframes from each dataset and merge them on a single column. However, it looks like it is getting nowhere.

Test samples: https://www.mediafire.com/file/bks9i9unfci0h1f/sample.rar/file

Making multilevel columns:

import pandas as pd

df1 = pd.read_csv("Bacterial invasion of epithelial cells.csv")
df2 = pd.read_csv("C-type lectin receptor signaling pathway.csv")
df3 = pd.read_csv("Endocytosis.csv")

title1 = "Bacterial invasion of epithelial cells"
title2 = "C-type lectin receptor signaling pathway"
title3 = "Endocytosis"

final1 = pd.concat({title1: df1}, axis = 1)
final2 = pd.concat({title2: df2}, axis = 1)
final3 = pd.concat({title3: df3}, axis = 1)

I tried to use pandas.merge() to merge the dataframes on "User ID" column:

pd.merge(final1, final2, on = "User ID", how = "outer")

But i get an error. I can not use droplevel(), because i need the title on top. So, i can see which dataset each sample belongs to. Any sugesstion?

CodePudding user response：

Seeing as you want to see which genes appear in different datasets, it sounds like an inner join might be more useful? With User ID as just a single row index.

df1 = pd.read_csv("Bacterial invasion of epithelial cells.csv").set_index('User ID')
df2 = pd.read_csv("C-type lectin receptor signaling pathway.csv").set_index('User ID')
df3 = pd.read_csv("Endocytosis.csv").set_index('User ID')

final1 = pd.concat({"Bacterial invasion of epithelial cells": df1}, axis = 1)
final2 = pd.concat({"C-type lectin receptor signaling pathway": df2}, axis = 1)
final3 = pd.concat({"Endocytosis": df3}, axis = 1)

final1.merge(final3, left_index=True, right_index=True)#.merge(final2, left_index=True, right_index=True)

Output:

    Bacterial invasion of epithelial cells  Endocytosis
    Gene Symbol     Gene Name   Entrez Gene     Score   Gene Symbol     Gene Name   Entrez Gene     Score
User ID                                 
P51636  CAV2    caveolin 2  858     1.3911  CAV2    caveolin 2  858     1.3911
Q03135  CAV1    caveolin 1  857     1.5935  CAV1    caveolin 1  857     1.5935

(I've commented out the second merge operation with final2 as there aren't any overlapping genes between it and the other two, but you can repeat that process with as many datasets as you like.)