I want to read into three omics datasets - methylation, cnv, and mRNA expression.
First, I removed duplicated columns and sort by index for all 3 dataframes mrna
, meth
, and cna
separately. These dataframes are then concatenated to form one dataframe common
.
Eventually, I want the common
dataframe to only contain rows that are the same in all three dataframes mrna
, meth
, and cna
.
import re
dfs = [mrna, meth, cna]
common = pd.concat(dfs, join='inner')
Now, I only want to keep the rows in common
if the row value is the same across all the three original dataframes mrna
, cna
, and meth
. Using the answer provided here (pandas - filter dataframe by another dataframe by row elements), this is my attempt:
# Keep only rows that are common in all three dataframes
mrna_keys = list(mrna.columns.values)
cna_keys = list(cna.columns.values)
meth_keys = list(meth.columns.values)
i_common_mrna = common.set_index(mrna_keys).index
i_common_cna = common.set_index(cna_keys).index
i_common_meth = common.set_index(meth_keys).index
i_mrna_common = mrna.set_index(mrna_keys).index
i_cna_common = cna.set_index(cna_keys).index
i_meth_common = meth.set_index(meth_keys).index
common = common[~i_common_mrna.isin(i_mrna_common) & ~i_common_cna.isin(i_cna_common) & ~i_common_meth.isin(i_meth_common)]
common
Traceback:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-215-c824a7d76e25> in <module>()
15 meth_keys = list(meth.columns.values)
16
---> 17 i_common_mrna = common.set_index(mrna_keys).index
18 i_common_cna = common.set_index(cna_keys).index
19 i_common_meth = common.set_index(meth_keys).index
1 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in set_index(self, keys, drop, append, inplace, verify_integrity)
5449
5450 if missing:
-> 5451 raise KeyError(f"None of {missing} are in the columns")
5452
5453 if inplace:
Second attempt: Based on @lmielke's suggestion, I've tried the following alternative.
import re
dfs = [mrna, meth, cna]
# Keep only rows that are common in all three dataframes
common = pd.DataFrame(set.intersection(*[set([tuple(l) for l in df.values.tolist()]) for df in dfs]), columns=dfs[0].columns)
common
However, it returns 0 rows.
Example of the first few rows/columns of mrna
dataframe (as a dictionary):
{'Hugo_Symbol': {0: 'AACS',
1: 'FSTL1',
2: 'ELMO2',
3: 'CREB3L1',
4: 'RPS11',
5: 'PNMA1',
6: 'MMP2',
7: 'SAMD4A',
8: 'SMARCD3',
9: 'A4GNT'},
'TCGA-02-0001-01': {0: -0.5909,
1: -0.9099,
2: -2.3351,
3: 0.2216,
4: 0.6798,
5: -2.48,
6: 0.7912,
7: -1.4578,
8: -3.8009,
9: 3.4868},
'TCGA-02-0003-01': {0: -0.5154,
1: 0.0896,
2: -1.17,
3: 0.1255,
4: 0.2374,
5: -3.2629,
6: 1.2846,
7: -1.474,
8: -2.9891,
9: -0.1511},
'TCGA-02-0004-01': {0: 1.0314,
1: 1.923,
2: 0.0792,
3: 2.5117,
4: -0.741,
5: -0.4146,
6: 2.5185,
7: 0.0058,
8: 0.032,
9: -1.7115},
'TCGA-02-0007-01': {0: 0.6932,
1: -5.6511,
2: -2.8365,
3: 2.0026,
4: -0.6326,
5: -1.3741,
6: -3.437,
7: -1.047,
8: -4.185,
9: 2.1816}
Example of the first few rows/columns of cna
dataframe (as a dictionary):
{'Hugo_Symbol': {0: 'ACAP3',
1: 'ACTRT2',
2: 'AGRN',
3: 'ANKRD65',
4: 'ATAD3A',
5: 'ATAD3B',
6: 'ATAD3C',
7: 'AURKAIP1',
8: 'B3GALT6',
9: 'C1orf159'},
'TCGA-02-0001-01': {0: 0.242,
1: 0.242,
2: 0.242,
3: 0.242,
4: 0.242,
5: 0.242,
6: 0.242,
7: 0.242,
8: 0.242,
9: 0.242},
'TCGA-02-0003-01': {0: 0.007,
1: 0.007,
2: 0.007,
3: 0.007,
4: 0.007,
5: 0.007,
6: 0.007,
7: 0.007,
8: 0.007,
9: 0.007},
'TCGA-02-0006-01': {0: 0.005,
1: 0.005,
2: 0.005,
3: 0.005,
4: 0.005,
5: 0.005,
6: 0.005,
7: 0.005,
8: 0.005,
9: 0.005},
'TCGA-02-0007-01': {0: -0.072,
1: -0.072,
2: -0.072,
3: -0.072,
4: -0.072,
5: -0.072,
6: -0.072,
7: -0.072,
8: -0.072,
9: -0.072}}
Example of the first few rows/columns of meth
dataframe (as a dictionary):
{'Hugo_Symbol': {0: 'MEOX2',
1: 'COX8C',
2: 'IMPA2',
3: 'TTC8',
4: 'TMEM186',
5: 'RETSAT',
6: 'TULP1',
7: 'TAF15',
8: 'CCDC88B',
9: 'EPB41L3'},
'TCGA-02-0001-01': {0: 0.0414512366035448,
1: 0.981055336789946,
2: 0.0185800932177329,
3: 0.0378532148581555,
4: 0.85883476246704,
5: 0.0289562030712872,
6: 0.895705096206116,
7: 0.0241634415603257,
8: 0.645029762467676,
9: 0.0255462091070292},
'TCGA-02-0003-01': {0: 0.103889979051244,
1: 0.989162923386702,
2: 0.127842916524696,
3: 0.0614877612251458,
4: 0.718748599381741,
5: 0.0370521775128637,
6: 0.875983369162448,
7: 0.0202500492785966,
8: 0.338662158729623,
9: 0.87970085305538},
'TCGA-02-0006-01': {0: 0.0607886115184704,
1: 0.979538805487233,
2: 0.0196988790233788,
3: 0.0617298630873257,
4: 0.887325606857566,
5: 0.0287770828033724,
6: 0.936288398598192,
7: 0.0212839350099898,
8: 0.540677138952204,
9: 0.167646943077876},
'TCGA-02-0007-01': {0: 0.0255189003468858,
1: 0.985665537780805,
2: 0.0184844499535323,
3: 0.08404493340112,
4: 0.862945344666792,
5: 0.0299693670464541,
6: 0.930821357939767,
7: 0.0223801816866076,
8: 0.319356999493038,
9: 0.0193433290278056}}
CodePudding user response:
Given that your dfs contain hashable values, you might try something like this:
dfs = [pd.DataFrame(np.random.randint(low=100, high=103, size=(5, 2)), columns=['A', 'B']) for _ in range(3)]
common = pd.DataFrame(set.intersection(*[set([tuple(l) for l in df.values.tolist()])\
for df in dfs]), columns=dfs[0].columns)
print(f"common: \n{common}")
CodePudding user response:
What exactly would the ideal output look like, is this what you're trying to do? I feel like there's some confusion between columns and rows...
dfs = [mrna, meth]
def do_stuff(df):
df = pd.DataFrame(df).T
df.columns = df.iloc[0]
df = df[1:]
return df
mrna, meth = map(do_stuff, dfs)
print(pd.concat([mrna, meth], axis=1).dropna())
Output:
Hugo_Symbol AACS FSTL1 ELMO2 CREB3L1 RPS11 PNMA1 MMP2 \
TCGA-02-0001-01 -0.5909 -0.9099 -2.3351 0.2216 0.6798 -2.48 0.7912
TCGA-02-0003-01 -0.5154 0.0896 -1.17 0.1255 0.2374 -3.2629 1.2846
TCGA-02-0007-01 0.6932 -5.6511 -2.8365 2.0026 -0.6326 -1.3741 -3.437
Hugo_Symbol SAMD4A SMARCD3 A4GNT MEOX2 COX8C IMPA2 \
TCGA-02-0001-01 -1.4578 -3.8009 3.4868 0.041451 0.981055 0.01858
TCGA-02-0003-01 -1.474 -2.9891 -0.1511 0.10389 0.989163 0.127843
TCGA-02-0007-01 -1.047 -4.185 2.1816 0.025519 0.985666 0.018484
Hugo_Symbol TTC8 TMEM186 RETSAT TULP1 TAF15 CCDC88B \
TCGA-02-0001-01 0.037853 0.858835 0.028956 0.895705 0.024163 0.64503
TCGA-02-0003-01 0.061488 0.718749 0.037052 0.875983 0.02025 0.338662
TCGA-02-0007-01 0.084045 0.862945 0.029969 0.930821 0.02238 0.319357
Hugo_Symbol EPB41L3
TCGA-02-0001-01 0.025546
TCGA-02-0003-01 0.879701
TCGA-02-0007-01 0.019343