How do I create a new dataframe that keeps only the rows that are the same in multiple dataframes?-CodePudding

I want to read into three omics datasets - methylation, cnv, and mRNA expression.

First, I removed duplicated columns and sort by index for all 3 dataframes mrna, meth, and cna separately. These dataframes are then concatenated to form one dataframe common.

Eventually, I want the common dataframe to only contain rows that are the same in all three dataframes mrna, meth, and cna.

import re

dfs = [mrna, meth, cna]

common = pd.concat(dfs, join='inner')

Now, I only want to keep the rows in common if the row value is the same across all the three original dataframes mrna, cna, and meth. Using the answer provided here (pandas - filter dataframe by another dataframe by row elements), this is my attempt:

# Keep only rows that are common in all three dataframes
mrna_keys = list(mrna.columns.values)
cna_keys = list(cna.columns.values)
meth_keys = list(meth.columns.values)

i_common_mrna = common.set_index(mrna_keys).index 
i_common_cna = common.set_index(cna_keys).index
i_common_meth = common.set_index(meth_keys).index

i_mrna_common = mrna.set_index(mrna_keys).index
i_cna_common = cna.set_index(cna_keys).index
i_meth_common = meth.set_index(meth_keys).index

common = common[~i_common_mrna.isin(i_mrna_common) & ~i_common_cna.isin(i_cna_common) & ~i_common_meth.isin(i_meth_common)]
common

Traceback:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-215-c824a7d76e25> in <module>()
     15 meth_keys = list(meth.columns.values)
     16 
---> 17 i_common_mrna = common.set_index(mrna_keys).index
     18 i_common_cna = common.set_index(cna_keys).index
     19 i_common_meth = common.set_index(meth_keys).index

1 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in set_index(self, keys, drop, append, inplace, verify_integrity)
   5449 
   5450         if missing:
-> 5451             raise KeyError(f"None of {missing} are in the columns")
   5452 
   5453         if inplace:

Second attempt: Based on @lmielke's suggestion, I've tried the following alternative.

import re

dfs = [mrna, meth, cna]

# Keep only rows that are common in all three dataframes
common = pd.DataFrame(set.intersection(*[set([tuple(l) for l in df.values.tolist()]) for df in dfs]), columns=dfs[0].columns)
common

However, it returns 0 rows.

Example of the first few rows/columns of mrna dataframe (as a dictionary):

{'Hugo_Symbol': {0: 'AACS',
  1: 'FSTL1',
  2: 'ELMO2',
  3: 'CREB3L1',
  4: 'RPS11',
  5: 'PNMA1',
  6: 'MMP2',
  7: 'SAMD4A',
  8: 'SMARCD3',
  9: 'A4GNT'},
 'TCGA-02-0001-01': {0: -0.5909,
  1: -0.9099,
  2: -2.3351,
  3: 0.2216,
  4: 0.6798,
  5: -2.48,
  6: 0.7912,
  7: -1.4578,
  8: -3.8009,
  9: 3.4868},
 'TCGA-02-0003-01': {0: -0.5154,
  1: 0.0896,
  2: -1.17,
  3: 0.1255,
  4: 0.2374,
  5: -3.2629,
  6: 1.2846,
  7: -1.474,
  8: -2.9891,
  9: -0.1511},
 'TCGA-02-0004-01': {0: 1.0314,
  1: 1.923,
  2: 0.0792,
  3: 2.5117,
  4: -0.741,
  5: -0.4146,
  6: 2.5185,
  7: 0.0058,
  8: 0.032,
  9: -1.7115},
 'TCGA-02-0007-01': {0: 0.6932,
  1: -5.6511,
  2: -2.8365,
  3: 2.0026,
  4: -0.6326,
  5: -1.3741,
  6: -3.437,
  7: -1.047,
  8: -4.185,
  9: 2.1816}

Example of the first few rows/columns of cna dataframe (as a dictionary):

{'Hugo_Symbol': {0: 'ACAP3',
  1: 'ACTRT2',
  2: 'AGRN',
  3: 'ANKRD65',
  4: 'ATAD3A',
  5: 'ATAD3B',
  6: 'ATAD3C',
  7: 'AURKAIP1',
  8: 'B3GALT6',
  9: 'C1orf159'},
 'TCGA-02-0001-01': {0: 0.242,
  1: 0.242,
  2: 0.242,
  3: 0.242,
  4: 0.242,
  5: 0.242,
  6: 0.242,
  7: 0.242,
  8: 0.242,
  9: 0.242},
 'TCGA-02-0003-01': {0: 0.007,
  1: 0.007,
  2: 0.007,
  3: 0.007,
  4: 0.007,
  5: 0.007,
  6: 0.007,
  7: 0.007,
  8: 0.007,
  9: 0.007},
 'TCGA-02-0006-01': {0: 0.005,
  1: 0.005,
  2: 0.005,
  3: 0.005,
  4: 0.005,
  5: 0.005,
  6: 0.005,
  7: 0.005,
  8: 0.005,
  9: 0.005},
 'TCGA-02-0007-01': {0: -0.072,
  1: -0.072,
  2: -0.072,
  3: -0.072,
  4: -0.072,
  5: -0.072,
  6: -0.072,
  7: -0.072,
  8: -0.072,
  9: -0.072}}

Example of the first few rows/columns of meth dataframe (as a dictionary):

{'Hugo_Symbol': {0: 'MEOX2',
  1: 'COX8C',
  2: 'IMPA2',
  3: 'TTC8',
  4: 'TMEM186',
  5: 'RETSAT',
  6: 'TULP1',
  7: 'TAF15',
  8: 'CCDC88B',
  9: 'EPB41L3'},
 'TCGA-02-0001-01': {0: 0.0414512366035448,
  1: 0.981055336789946,
  2: 0.0185800932177329,
  3: 0.0378532148581555,
  4: 0.85883476246704,
  5: 0.0289562030712872,
  6: 0.895705096206116,
  7: 0.0241634415603257,
  8: 0.645029762467676,
  9: 0.0255462091070292},
 'TCGA-02-0003-01': {0: 0.103889979051244,
  1: 0.989162923386702,
  2: 0.127842916524696,
  3: 0.0614877612251458,
  4: 0.718748599381741,
  5: 0.0370521775128637,
  6: 0.875983369162448,
  7: 0.0202500492785966,
  8: 0.338662158729623,
  9: 0.87970085305538},
 'TCGA-02-0006-01': {0: 0.0607886115184704,
  1: 0.979538805487233,
  2: 0.0196988790233788,
  3: 0.0617298630873257,
  4: 0.887325606857566,
  5: 0.0287770828033724,
  6: 0.936288398598192,
  7: 0.0212839350099898,
  8: 0.540677138952204,
  9: 0.167646943077876},
 'TCGA-02-0007-01': {0: 0.0255189003468858,
  1: 0.985665537780805,
  2: 0.0184844499535323,
  3: 0.08404493340112,
  4: 0.862945344666792,
  5: 0.0299693670464541,
  6: 0.930821357939767,
  7: 0.0223801816866076,
  8: 0.319356999493038,
  9: 0.0193433290278056}}

CodePudding user response：

Given that your dfs contain hashable values, you might try something like this:

dfs = [pd.DataFrame(np.random.randint(low=100, high=103, size=(5, 2)), columns=['A', 'B']) for _ in range(3)]
common = pd.DataFrame(set.intersection(*[set([tuple(l) for l in df.values.tolist()])\
                                         for df in dfs]), columns=dfs[0].columns)
print(f"common: \n{common}")

CodePudding user response：

What exactly would the ideal output look like, is this what you're trying to do? I feel like there's some confusion between columns and rows...

dfs = [mrna, meth]

def do_stuff(df):
    df = pd.DataFrame(df).T
    df.columns = df.iloc[0]
    df = df[1:]
    return df

mrna, meth = map(do_stuff, dfs)

print(pd.concat([mrna, meth], axis=1).dropna())

Output:

Hugo_Symbol        AACS   FSTL1   ELMO2 CREB3L1   RPS11   PNMA1    MMP2  \
TCGA-02-0001-01 -0.5909 -0.9099 -2.3351  0.2216  0.6798   -2.48  0.7912   
TCGA-02-0003-01 -0.5154  0.0896   -1.17  0.1255  0.2374 -3.2629  1.2846   
TCGA-02-0007-01  0.6932 -5.6511 -2.8365  2.0026 -0.6326 -1.3741  -3.437   

Hugo_Symbol      SAMD4A SMARCD3   A4GNT     MEOX2     COX8C     IMPA2  \
TCGA-02-0001-01 -1.4578 -3.8009  3.4868  0.041451  0.981055   0.01858   
TCGA-02-0003-01  -1.474 -2.9891 -0.1511   0.10389  0.989163  0.127843   
TCGA-02-0007-01  -1.047  -4.185  2.1816  0.025519  0.985666  0.018484   

Hugo_Symbol          TTC8   TMEM186    RETSAT     TULP1     TAF15   CCDC88B  \
TCGA-02-0001-01  0.037853  0.858835  0.028956  0.895705  0.024163   0.64503   
TCGA-02-0003-01  0.061488  0.718749  0.037052  0.875983   0.02025  0.338662   
TCGA-02-0007-01  0.084045  0.862945  0.029969  0.930821   0.02238  0.319357   

Hugo_Symbol       EPB41L3  
TCGA-02-0001-01  0.025546  
TCGA-02-0003-01  0.879701  
TCGA-02-0007-01  0.019343