How do I subset multiple dataframes based on a shared column?-CodePudding

I want to read into three omics datasets - methylation, cnv, and mRNA expression.

First, I removed duplicated columns and sort by index for all 3 dataframes mrna, meth, and cna separately. These dataframes are then concatenated to form one dataframe common.

Eventually, I want the common dataframe to only contain "Hugo_Symbol" column that are the same in all three dataframes mrna, meth, and cna.

import re

dfs = [mrna, meth, cna]

common = pd.concat(dfs, join='inner')

Here, the common dataframe has a 48625 rows × 348 columns dimension.

Now, I only want to keep the rows in common if the row value is the same across all the three original dataframes mrna, cna, and meth.

keep_col = common.drop_duplicates(keep="first")["Hugo_Symbol"]
common = common[common.set_index(["Hugo_Symbol"]).index.isin(keep_col)]

I'm expecting the newly subsetted common dataframe to have # of rows equal to the length of keep_col

len(keep_col)
48618

However, the subsetted common dataframe instead has a 48625 rows × 348 columns dimension, meaning that none of the rows were removed/subsetted.

Example dataframes:

mrna

	Hugo_Symbol	TCGA-1	TCGA-2	TCGA-3
0	ABC	123	456	789
1	DEF	678	187	456
2	MNO	742	147	147
3	VWX	184	195	268

cna

	Hugo_Symbol	TCGA-1	TCGA-2	TCGA-3
0	DEF	456	123	321
1	ABC	123	456	789
2	GHI	694	284	270
3	JKL	384	843	147

meth

	Hugo_Symbol	TCGA-1	TCGA-2	TCGA-3
0	DEF	456	123	432
1	PQR	285	226	732
2	STU	225	632	532
3	ABC	123	456	789

In this example, the rows with Hugo_Symbol ABC and DEF are retained because all three dataframes have these Hugo_Symbols. Hence, the common dataframe, as (previously) sorted by index, would be: common

	Hugo_Symbol	TCGA-1	TCGA-2	TCGA-3
0	ABC	123	456	789
1	DEF	456	123	321

CodePudding user response：

What exactly would the ideal output look like, is this what you're trying to do? I feel like there's some confusion between columns and rows...

dfs = [mrna, meth]

def do_stuff(df):
    df = pd.DataFrame(df).T
    df.columns = df.iloc[0]
    df = df[1:]
    return df

mrna, meth = map(do_stuff, dfs)

print(pd.concat([mrna, meth], axis=1).dropna())

Output:

Hugo_Symbol        AACS   FSTL1   ELMO2 CREB3L1   RPS11   PNMA1    MMP2  \
TCGA-02-0001-01 -0.5909 -0.9099 -2.3351  0.2216  0.6798   -2.48  0.7912   
TCGA-02-0003-01 -0.5154  0.0896   -1.17  0.1255  0.2374 -3.2629  1.2846   
TCGA-02-0007-01  0.6932 -5.6511 -2.8365  2.0026 -0.6326 -1.3741  -3.437   

Hugo_Symbol      SAMD4A SMARCD3   A4GNT     MEOX2     COX8C     IMPA2  \
TCGA-02-0001-01 -1.4578 -3.8009  3.4868  0.041451  0.981055   0.01858   
TCGA-02-0003-01  -1.474 -2.9891 -0.1511   0.10389  0.989163  0.127843   
TCGA-02-0007-01  -1.047  -4.185  2.1816  0.025519  0.985666  0.018484   

Hugo_Symbol          TTC8   TMEM186    RETSAT     TULP1     TAF15   CCDC88B  \
TCGA-02-0001-01  0.037853  0.858835  0.028956  0.895705  0.024163   0.64503   
TCGA-02-0003-01  0.061488  0.718749  0.037052  0.875983   0.02025  0.338662   
TCGA-02-0007-01  0.084045  0.862945  0.029969  0.930821   0.02238  0.319357   

Hugo_Symbol       EPB41L3  
TCGA-02-0001-01  0.025546  
TCGA-02-0003-01  0.879701  
TCGA-02-0007-01  0.019343

CodePudding user response：

Given that the entire row must be equal, as by my first answer below returns:

import pandas as pd

mrna = {
'Hugo_Symbol': ['ABC','DEF','MNO','VWX'],
'TCGA-1': [123, 678, 742, 184,],
'TCGA-2': [456, 187, 147, 195,],
'TCGA-3': [789, 456, 147, 268,],
}

cna = {
'Hugo_Symbol': ['DEF','ABC','GHI','JKL',],
'TCGA-1': [456, 123, 694, 384,],
'TCGA-2': [123, 456, 284, 843,],
'TCGA-3': [321, 789, 270, 147,],
}


meth = {
'Hugo_Symbol': ['DEF','PQR','STU','ABC',],
'TCGA-1': [456, 285, 225, 123,],
'TCGA-2': [123, 226, 632, 456,],
'TCGA-3': [432, 732, 532, 789,],
}



dfs = [pd.DataFrame.from_dict(mrna), pd.DataFrame.from_dict(cna), pd.DataFrame.from_dict(meth)]
common = pd.DataFrame(set.intersection(*[set([tuple(l) for l in df.values.tolist()])\
                                         for df in dfs]), columns=dfs[0].columns)
print(f"common: \n{common}")

"""
common:
  Hugo_Symbol  TCGA-1  TCGA-2  TCGA-3
0         ABC     123     456     789
"""

Given that only 'Hugo_Symbol' column is relevant, below returns: import pandas as pd

mrna = {
'Hugo_Symbol': ['ABC','DEF','MNO','VWX'],
'TCGA-1': [123, 678, 742, 184,],
'TCGA-2': [456, 187, 147, 195,],
'TCGA-3': [789, 456, 147, 268,],
}

cna = {
'Hugo_Symbol': ['DEF','ABC','GHI','JKL',],
'TCGA-1': [456, 123, 694, 384,],
'TCGA-2': [123, 456, 284, 843,],
'TCGA-3': [321, 789, 270, 147,],
}


meth = {
'Hugo_Symbol': ['DEF','PQR','STU','ABC',],
'TCGA-1': [456, 285, 225, 123,],
'TCGA-2': [123, 226, 632, 456,],
'TCGA-3': [432, 732, 532, 789,],
}



dfs = [pd.DataFrame.from_dict(mrna), pd.DataFrame.from_dict(cna), pd.DataFrame.from_dict(meth)]

hugoFilter = set.intersection(*[set(df.loc[:, 'Hugo_Symbol'].values.tolist()) for df in dfs])
common = pd.concat(dfs)
common = common[common.Hugo_Symbol.isin(hugoFilter)].drop_duplicates(keep="first", subset='Hugo_Symbol')
print(f"common: \n{common}")

"""
common:
  Hugo_Symbol  TCGA-1  TCGA-2  TCGA-3
0         ABC     123     456     789
1         DEF     678     187     456
NOTE: make sure the DEF from the corrrect DF is returned, values differ.
"""